[Chandler-dev] Unicode and letting u be u (umlaut)

Grant Baillie grant at osafoundation.org
Wed Apr 12 15:49:17 PDT 2006


I've run across a couple of cases of specifying unicode characters in  
Python code that were a little fishy, so I thought I'd send out a  
long, rambly email to the list.

The 10-second summary is: If you want to specify a non-ASCII
character in a unicode string, the Python \uxxxx escape is your
friend. With anything else, you're playing with fire.

So, to cut a short story long, I was looking at a test case in  
Chandler, where we were trying to come up with a non-ASCII path to  
use in a Chandler profile directory:

TestCrypto.py:13:        u = u"profileDir_(\xc3\xbc)" # u umlaut

This actually succeeds in setting u to be a non-ASCII string, except
that it doesn't contain a "u umlaut". When you specify a u"..." style
string in Python, you're telling the interpreter to assume each
character in the string is a unicode code point, so \xc3 means the
code point U+00C3 rather than the byte 0xC3. Looking at the list in

<http://www.unicode.org/Public/UNIDATA/NamesList.txt>

you can determine that "u umlaut" is the Unicode character(*)

     00FC    LATIN SMALL LETTER U WITH DIAERESIS

but in the above, the \xc3 and \xbc are interpreted as:

    00C3    LATIN CAPITAL LETTER A WITH TILDE
    00BC    VULGAR FRACTION ONE QUARTER

Clearly, we don't want any vulgarity in our paths, now do we :) ?
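
One quick way to check what a literal really contains is to ask the
unicodedata module for each character's name. This is only a sketch,
assuming a Python 3 interpreter (where the u prefix is optional but
still accepted); the helper name describe is made up for illustration.

```python
import unicodedata


def describe(text):
    """Return (code point, official name) for each character in text."""
    return [("%04X" % ord(ch), unicodedata.name(ch)) for ch in text]


broken = u"\xc3\xbc"   # two code points, U+00C3 and U+00BC
fixed = u"\u00fc"      # one code point, U+00FC

for label, value in (("broken", broken), ("fixed", fixed)):
    for code, name in describe(value):
        print(label, code, name)
```

The names printed for the first literal should match the two
NamesList.txt entries quoted above, and the second literal should come
back as a single character.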

It turns out that the author of the above code was having trouble  
entering u umlaut (in a console, or a code editor). As mentioned  
above, the easiest and most portable way to do this kind of thing is  
to use the \u escape, viz:

      u = u"profileDir_(\u00fc)" # u umlaut

In the case of source files, Python has some handy conventions for
specifying what the character encoding of a source file is (see
<http://docs.python.org/ref/encodings.html#encodings>). Unfortunately,
it turns out that there's no single convention that's adopted by many
editors. Possibly this is a reason to require everyone to use emacs,
or vim, but the resulting religious war would take us well past
Chandler 1.0 :).
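
For reference, the source-encoding convention is a magic comment on
the first or second line of the file. The sketch below assumes Python 3
(where compile() accepts raw bytes and honors the declaration) and
feeds the same one-line assignment to the interpreter in two different
encodings. The helper name load_u and the two example "files" are
hypothetical.

```python
def load_u(source_bytes):
    """Compile raw source bytes and return the value bound to the name u."""
    namespace = {}
    exec(compile(source_bytes, "<example>", "exec"), namespace)
    return namespace["u"]


# The same line of code, as saved by two editors using different encodings.
# Each file declares its encoding, so the raw bytes are decoded correctly.
latin1_file = b'# -*- coding: latin-1 -*-\nu = u"profileDir_(\xfc)"\n'
utf8_file = b'# -*- coding: utf-8 -*-\nu = u"profileDir_(\xc3\xbc)"\n'

print(load_u(latin1_file) == load_u(utf8_file) == u"profileDir_(\u00fc)")
```

The declaration only helps if the editor really saves the file in the
encoding it names, which is why the \u escape remains the safer choice.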

In the case of entering text in an interactive session, you're
somewhat at the mercy of your terminal program, as well as your
locale. To continue the story, the characters \xc3\xbc above (which
are the UTF-8 encoding of \u00fc) did not come from nowhere. The
developer mentioned earlier copied and pasted them from the following
bit of text in the I18n Busy Developers Guide:

 >>> exampleInstance.exampleText = u"This is some unicode with non-ascii character: ü"
 >>> exampleInstance.exampleText
u"This is some unicode with non-ascii character: \xc3\xbc"

As we determined above, the printed-out value does not end with ü. In
fact, what happened above was that the terminal program was using
UTF-8, but Python had no idea that that was the case, and converted
each raw UTF-8 byte to its own unicode character.
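
The mix-up can be reproduced on purpose. This is a rough sketch,
assuming Python 3 and assuming that the byte-per-character conversion
behaves like a Latin-1 decode (which is what the printed value above
suggests); the variable names are made up for illustration.

```python
correct = u"\u00fc"                     # the intended u umlaut
utf8_bytes = correct.encode("utf-8")    # what a UTF-8 terminal sends

# Treating each byte as its own code point is what Latin-1 decoding does.
garbled = utf8_bytes.decode("latin-1")

# The damage is reversible as long as nothing else touched the string.
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(utf8_bytes), ascii(garbled), ascii(repaired))
```

The garbled value here is the same two-character string as the one in
TestCrypto.py, which is how the copied text ended up in the test.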

--Grant

(*) It's also representable as the sequence of two characters

    0075    LATIN SMALL LETTER U
    0308    COMBINING DIAERESIS (Dialytika)
            = double dot above, umlaut
            = Greek dialytika
            = double derivative
            x (diaeresis - 00A8)

but that's a whole different can of fish, er crosstown bus.
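
For the curious, the two spellings in this footnote can be compared
with unicodedata.normalize. A minimal sketch, assuming Python 3; the
variable names are only illustrative.

```python
import unicodedata

precomposed = u"\u00fc"    # LATIN SMALL LETTER U WITH DIAERESIS
decomposed = u"u\u0308"    # LATIN SMALL LETTER U + COMBINING DIAERESIS

# Plain comparison looks at code points, not at what the text looks like.
same_code_points = precomposed == decomposed

# Normalizing both sides to one form makes the comparison meaningful.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
same_after_nfd = unicodedata.normalize("NFD", precomposed) == decomposed

print(same_code_points, same_after_nfc, same_after_nfd)
```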






