[Ietf-calsify] Status of simplification work
Mark Crispin
MRC at CAC.Washington.EDU
Wed May 24 15:49:26 PDT 2006
On Wed, 24 May 2006, George Sexton wrote:
> I always tend to think of characters as the smallest atomic part that a
> string can be broken into. If I'm doing character based I/O, then a character
> is the smallest component I can write to the stream (all hail recursive
> definitions). I rarely, if ever concern myself with typesetting issues.
Unicode breaks that assumption.
For example, consider the German glyph commonly known as "umlaut-a".
That can be either the Unicode character U+00E4 LATIN SMALL LETTER A WITH
DIAERESIS or it can be *two* Unicode characters: U+0061 LATIN SMALL LETTER
A and U+0308 COMBINING DIAERESIS.
In UTF-8, the former is the octet string
0xc3 0xa4
and the latter is:
0x61 0xcc 0x88
(assuming my notepad calculations are correct). Of course, if you were
using good old ISO 8859, it would just be
0xe4
The point is that you can't make any sort of assumptions about
characters or bytes.
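The octet strings above can be double-checked with a few lines of Python. This is only a minimal sketch: the helper name hex_octets is made up for illustration, and "iso-8859-1" is assumed as the particular ISO 8859 part in question.

```python
# Two spellings of "umlaut-a" as Unicode character sequences.
precomposed = "\u00e4"        # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "\u0061\u0308"   # U+0061 LATIN SMALL LETTER A + U+0308 COMBINING DIAERESIS


def hex_octets(data: bytes) -> str:
    """Format a byte string as space-separated hex octets (hypothetical helper)."""
    return " ".join(f"0x{b:02x}" for b in data)


# Encode each spelling and format the resulting octets.
utf8_precomposed = hex_octets(precomposed.encode("utf-8"))
utf8_decomposed = hex_octets(decomposed.encode("utf-8"))
latin1_precomposed = hex_octets(precomposed.encode("iso-8859-1"))

print(utf8_precomposed)    # 0xc3 0xa4
print(utf8_decomposed)     # 0x61 0xcc 0x88
print(latin1_precomposed)  # 0xe4
```

Note that the decomposed form cannot be encoded in ISO 8859-1 at all, since U+0308 has no counterpart in that character set.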
Some glyphs (which are what most people mean when they say "character")
weigh in as being more than a dozen Unicode characters!
Thus, you have to think of "octets", "Unicode characters", and "glyphs".
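To make the three levels concrete, here is a rough Python sketch that counts each one for the decomposed "umlaut-a". The glyph count is a crude approximation that simply skips combining marks; it is an assumption made for illustration, not real grapheme segmentation, which the standard library does not provide.

```python
import unicodedata

text = "a\u0308"  # U+0061 followed by U+0308 COMBINING DIAERESIS

# Level 1: octets in the UTF-8 encoding.
octet_count = len(text.encode("utf-8"))

# Level 2: Unicode characters (code points).
char_count = len(text)

# Level 3: glyphs, approximated by counting only characters whose
# canonical combining class is 0 (that is, not combining marks).
# This is a simplification and will miscount many real-world strings.
glyph_count = sum(1 for ch in text if unicodedata.combining(ch) == 0)

# Normalizing to NFC composes the pair into the single character U+00E4.
composed = unicodedata.normalize("NFC", text)

print(octet_count, char_count, glyph_count)  # 3 2 1
print(composed == "\u00e4")                  # True
```

Normalizing both sides to the same form before comparing is one way to keep the "one glyph, two spellings" problem from leaking into string equality.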
Now, it is true that many people (myself included) make simplifying
assumptions for the purpose of their application. But they have to keep
track of these assumptions, since when (not if) an assumption is broken
they have to know where to go back and fix it.... :-(
-- Mark --
http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum