[Issue 1] Re: [Ietf-calsify] draft-ietf-calsify-rfc2445bis-01.txt / UTF-8

Bernard Desruisseaux bernard.desruisseaux at oracle.com
Wed Aug 23 08:06:13 PDT 2006


Hi Bill,

The question is "Do we actually care?".

I would say that as long as the iCalendar stream is a valid
UTF-8 document, we are OK.
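
To make the distinction concrete, here is a minimal sketch in Python of
folding that breaks only between code points. It is an illustration
under stated assumptions, not text from the draft: the names fold,
unfold and is_valid_utf8 are hypothetical, the limit is taken as 75
octets excluding the line break, and the leading space of a
continuation line is assumed to count toward that limit.

```python
import re

LIMIT = 75  # assumed limit: octets per physical line, excluding CRLF


def fold(line: str, limit: int = LIMIT) -> bytes:
    """Fold one content line, breaking only between code points.

    A fold may land between a base character and a combining
    character, but never inside a UTF-8 sequence.
    """
    out = bytearray()
    used = 0  # octets already on the current physical line
    for ch in line:
        enc = ch.encode("utf-8")
        if used + len(enc) > limit:
            out += b"\r\n "  # CRLF plus one space marks a fold
            used = 1         # assumption: the space counts toward the limit
        out += enc
        used += len(enc)
    return bytes(out)


def unfold(data: bytes) -> bytes:
    """Remove every CRLF that is followed by a space or a tab."""
    return re.sub(rb"\r\n[ \t]", b"", data)


def is_valid_utf8(data: bytes) -> bool:
    """Report whether the octets decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "e" followed by U+0301 (combining acute accent), repeated 40 times.
text = "SUMMARY:" + "e\u0301" * 40
folded = fold(text)
lines = folded.split(b"\r\n")
```

With this input the fold happens to land between an "e" and its
combining accent, yet the folded stream still decodes as UTF-8, and
unfolding restores the original octets. Cutting a run of precomposed
characters at a fixed octet offset, by contrast, can leave an
incomplete UTF-8 sequence at the end of a line.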

Cheers,
Bernard

Bill McQuillan wrote:
> I think Bernard has pointed out one issue that occurred to me also--since
> the information is UTF-8 text, an application that does not understand an
> iCalendar object could still read it, for instance a text editor. The
> question becomes how it would react to a line break in the middle of a
> composed character.
> 
> After browsing the Unicode standard, I found this sentence in Annex #14
> - Line Breaking Properties:
> 
>    Combining character sequences are treated as units for the purpose of
>    line breaking.
> 
> If the text editor assumes this property, there will likely be some loss of
> information.
> 
> On Tue, 2006-08-22, Bernard Desruisseaux wrote:
>> [
>>    For those not familiar with "combining character sequence" here's
>>    how it is defined by Unicode: A character sequence consisting of
>>    either a base character followed by a sequence of one or more
>>    combining characters, or a sequence of one or more combining
>>    characters.
>> ]
> 
>> Let me try to put this another way. We need to decide and justify:
> 
>> 1- Whether we want to allow "multi-octet characters" to be split
>>     across lines.
> 
>> Answer: No.
>> Why   : Otherwise the resulting text would end up being invalid
>>          in the specified encoding.
> 
>> 2- Whether we want to allow "combining character sequences" to be
>>     split across lines.
> 
>> Answer: Yes.
>> Why   : (1) I'm assuming that it is valid for a "combining
>>          character" to be preceded by the LF character (but I
>>          don't know this for a fact...), and thus the resulting
>>          text would still be valid in the specified encoding
>>          (but would sure "look" different).
> 
>>          (2) A "combining character sequence" could probably be
>>          longer than 75 octets in *theory*! But I'm sure we would
>>          never see this in practice...
> 
>> With this approach:
> 
>> - You could open an iCalendar object specified in the charset 'X'
>>    in any application that supports 'X' without errors.
> 
>> - You would still need to unfold the iCalendar object to be able
>>    to "interpret" all the characters properly.
> 
>> What do you think?
> 
>> Cheers,
>> Bernard
> 
>> Mark Crispin wrote:
>>> Unfortunately, your answers are circular.
>>>
>>> I understand that you assert
>>>  (a) it is not alright to fold in the middle of a UTF-8 sequence
>>>      ("multi-octet sequence" is ambiguous and imprecise)
>>> but that
>>>  (b) it is alright to fold between a character and a combining character.
>>>
>>> However, you also give assertion (a) as the answer to the "why"
>>> question for (a) and (b).
>>>
>>> Why is it not alright to fold in the middle of a UTF-8 sequence?
>>>
>>> Why is it alright to fold between a character and a combining character?
>>>
>>>
>>> What is wrong with the assertion:
>>>     A proper interpretation of the text is impossible until
>>>     all folding is removed and the strings are catenated.
>>>     Therefore, folding may appear anywhere, even in the
>>>     middle of a UTF-8 sequence.
>>> or, alternatively:
>>>     A proper interpretation of a subtext is impossible unless
>>>     all UTF-8 sequences and combining characters appear in
>>>     that subtext.  Therefore, folding may not occur in the
>>>     middle of a UTF-8 sequence, nor may it separate the UTF-8
>>>     sequences of any (and all) combining characters from the
>>>     character being combined.
>>>
>>> Why is one, or the other, of the above two assertions inferior to your
>>> pair of assertions?
>>>
>>> I'm sorry for being such a troublemaker, and to be honest I really don't
>>> know which of these is best.  But someone's got to do it.  Whatever 
>>> decision is made, we need to justify why that decision and not the 
>>> alternatives.
>>>
>>> -- Mark --
>>>
>>> http://staff.washington.edu/mrc
>>> Science does not emerge from voting, party politics, or public debate.
>>> Si vis pacem, para bellum.
> 

