[Ietf-calsify] Status of simplification work

Mark Crispin MRC at CAC.Washington.EDU
Wed May 24 15:49:26 PDT 2006


On Wed, 24 May 2006, George Sexton wrote:
> I always tend to think of characters as the smallest atomic part that a 
> string can be broken into. If I'm doing character based I/O, then a character 
> is the smallest component I can write to the stream (all hail recursive 
> definitions). I rarely, if ever concern myself with typesetting issues.

Unicode breaks that assumption.

For example, consider the German glyph commonly known as "umlaut-a". 
That can be either the Unicode character U+00E4 LATIN SMALL LETTER A WITH 
DIAERESIS or it can be *two* Unicode characters: U+0061 LATIN SMALL LETTER 
A and U+0308 COMBINING DIAERESIS.

In UTF-8, the former is the octet string
 	0xc3 0xa4
and the latter is
 	0x61 0xcc 0x88
(assuming my notepad calculations are correct).  Of course, if you were 
using good old ISO 8859, it would just be
 	0xe4
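
To double-check those octet strings, here is a minimal sketch in Python 3.8 or later. The variable names are only illustrative, and "iso-8859-1" is assumed as the specific ISO 8859 part in question:

```python
# The same glyph, written two different ways in Unicode.
precomposed = "\u00e4"   # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # U+0061 LATIN SMALL LETTER A + U+0308 COMBINING DIAERESIS

utf8_precomposed = precomposed.encode("utf-8")
utf8_decomposed = decomposed.encode("utf-8")
latin1_precomposed = precomposed.encode("iso-8859-1")  # assumed ISO 8859 part

print(utf8_precomposed.hex(" "))    # c3 a4
print(utf8_decomposed.hex(" "))     # 61 cc 88
print(latin1_precomposed.hex(" "))  # e4

# The two spellings render alike but do not compare equal as strings.
print(precomposed == decomposed)    # False
```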

The point is that you can't make any sort of assumptions about
characters or bytes.

Some glyphs (which are what most people mean when they say "character") 
weigh in as being more than a dozen Unicode characters!

Thus, you have to think of "octets", "Unicode characters", and "glyphs".
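
A rough sketch of those three counts in Python follows. The helper approx_glyph_count is hypothetical and deliberately crude: it only skips code points with a nonzero canonical combining class, which is an assumption that works for simple cases like umlaut-a. Proper glyph (grapheme cluster) segmentation follows the rules in Unicode Standard Annex #29 and needs more than this.

```python
import unicodedata


def approx_glyph_count(text: str) -> int:
    """Crude glyph estimate: count code points that are not combining marks.

    Hypothetical helper for illustration only; it does not implement
    full grapheme cluster segmentation.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


umlaut_a = "a\u0308"  # LATIN SMALL LETTER A + COMBINING DIAERESIS

octets = len(umlaut_a.encode("utf-8"))   # bytes on the wire
code_points = len(umlaut_a)              # Unicode characters
glyphs = approx_glyph_count(umlaut_a)    # what a reader sees (approximately)

print(octets, code_points, glyphs)       # 3 2 1
```

Three different answers for one "character", depending on which layer you are counting at.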

Now, it is true that many people (myself included) make simplifying 
assumptions for the purpose of their application.  But they have to keep 
track of these assumptions, since when (not if) an assumption is broken 
they have to know where to go back and fix it....  :-(

-- Mark --

http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum

