[Chandler-dev] Unicode and letting u be u (umlaut)

Brian Kirsch bkirsch at osafoundation.org
Wed Apr 12 16:46:50 PDT 2006


Yes, Python Unicode is fun to work with, isn't it :)

Grant raises a good point. The \u syntax is the best way to ensure that 
the character you intended is the one that actually ends up in the string.

Python provides a means to specify the source character set encoding at 
the top of a Python file.
If you do that, the text of the Unicode literals in that file will be 
converted to Unicode from that character set.

For example:

# -*- coding: utf-8 -*- 

exampleText = u"This is some Unicode with non-ascii character: ü"
print repr(exampleText)
# prints: u'This is some Unicode with non-ascii character: \xfc'

However, in the command line interpreter, Python does not know what the 
source encoding is, so it must be specified explicitly.

 >>> exampleText = unicode("This is some Unicode with non-ascii character: ü", "utf8")
 >>> exampleText
 u'This is some Unicode with non-ascii character: \xfc'
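
The explicit conversion above can also be sketched in Python 3 terms. This
is only an illustration under an assumption: the examples in this mail are
Python 2, while the sketch below uses Python 3, where bytes plays the role
of the raw string typed at the terminal and str is already Unicode. The
variable names are made up for the sketch.

```python
# Python 3 sketch (assumption: the mail's own examples are Python 2).
# raw_bytes stands in for what a UTF-8 terminal hands to Python for "ü".
raw_bytes = b"non-ascii character: \xc3\xbc"

# Naming the codec explicitly turns the two UTF-8 bytes into one character.
decoded = raw_bytes.decode("utf-8")

last_char = decoded[-1]
print(hex(ord(last_char)))            # code point of the final character
print(len(raw_bytes), len(decoded))   # byte count vs. character count
```

The decoded text is one item shorter than the byte string, because the two
bytes \xc3\xbc collapse into the single code point U+00FC.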


My example in the i18n Busy Developer Guide should have been the above 
and not:


 >>> exampleInstance.exampleText = u"This is some Unicode with non-ascii character: ü"
 >>> exampleInstance.exampleText
u"This is some Unicode with non-ascii character: \xc3\xbc"

I have updated the guide to correct the error.
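
For anyone curious how the old guide output came about, here is a small
sketch of the failure. It is written for Python 3 (an assumption, since the
snippets in this mail are Python 2) and uses the latin-1 codec to stand in
for "treat every byte as its own code point", which is my reading of what
happened and not something taken from the interpreter source. The variable
names are hypothetical.

```python
import unicodedata

# The UTF-8 encoding of u umlaut (U+00FC) is the two bytes C3 BC.
utf8_bytes = "\u00fc".encode("utf-8")

# Wrong: each byte is mapped to the code point with the same number.
wrong = utf8_bytes.decode("latin-1")
# Right: the two bytes are decoded together as UTF-8.
right = utf8_bytes.decode("utf-8")

for label, text in (("wrong", wrong), ("right", right)):
    print(label, [unicodedata.name(ch) for ch in text])

# The damage is reversible as long as no bytes were lost along the way.
repaired = wrong.encode("latin-1").decode("utf-8")
print(repaired == right)
```

The "wrong" branch produces the two characters Grant looked up in the
Unicode names list, and the "right" branch produces the single u umlaut.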


Using the \u syntax is a better choice because no encoding needs to be 
explicitly specified in the file or the terminal.

For example:

exampleText = u"This is some Unicode with non-ascii character: \u00FC"


Of course, if your terminal uses the ASCII character set, then the ü will 
not render correctly :)
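
A rough sketch of the ASCII terminal case, again in Python 3 (an
assumption; the mail itself uses Python 2). Encoding to ASCII is used here
as a stand-in for writing to an ASCII-only terminal, and the
"backslashreplace" error handler is just one possible fallback, not
something Chandler is known to use.

```python
text = "non-ascii character: \u00fc"

# A strict ASCII encode fails because U+00FC has no ASCII representation.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# One fallback: replace unencodable characters with a backslash escape.
fallback = text.encode("ascii", "backslashreplace").decode("ascii")

print(ascii_ok)
print(fallback)
```

The string itself is still correct in memory; only the rendering step
fails, so the escape in the source file remains the safe way to write it.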


--Brian








Brian Kirsch -  Cosmo Developer / Chandler Internationalization Engineer
Open Source Applications Foundation
543 Howard St. 5th Floor
San Francisco, CA 94105
http://www.osafoundation.org



Grant Baillie wrote:

> I've run across a couple of cases of specifying unicode characters in  
> Python code that were a little fishy, so I thought I'd send out a  
> long, rambly email to the list.
>
> The 10-second summary is: If you want to specify a non-ASCII  
> character in a unicode string, the python \uxxxx escape is your  
> friend. With anything else, you're playing with fire.
>
> So, to cut a short story long, I was looking at a test case in  
> Chandler, where we were trying to come up with a non-ASCII path to  
> use in a Chandler profile directory:
>
> TestCrypto.py:13:        u = u"profileDir_(\xc3\xbc)" # u umlaut
>
> This actually succeeds in setting u to be a non-ASCII string, except  
> that it doesn't contain a "u umlaut". When you specify a u"..." style  
> string in Python, you're telling the interpreter to assume each  
> character in the string is a unicode code point. Looking at the list in
>
> <http://www.unicode.org/Public/UNIDATA/NamesList.txt>
>
> you can determine that "u umlaut" is the Unicode character(*)
>
>     00FC    LATIN SMALL LETTER U WITH DIAERESIS
>
> but in the above, the \xc3 and \xbc are interpreted as:
>
>    00C3    LATIN CAPITAL LETTER A WITH TILDE
>    00BC    VULGAR FRACTION ONE QUARTER
>
> Clearly, we don't want any vulgarity in our paths, now do we :) ?
>
> It turns out that the author of the above code was having trouble  
> entering u umlaut (in a console, or a code editor). As mentioned  
> above, the easiest and most portable way to do this kind of thing is  
> to use the \u escape, viz:
>
>      u = u"profileDir_(\u00fc)" # u umlaut
>
> In the case of source files, Python has some handy conventions for  
> specifying what the character encoding of a source file is (see  
> <http://docs.python.org/ref/encodings.html#encodings>). Unfortunately,  
> it turns out that there's no convention that's adopted by many editors.  
> Possibly this is a reason to require everyone to use emacs, or vim,  
> but the resulting religious war would take us well past Chandler 1.0 :).
>
> In the case of entering text in an interactive session, you're  
> somewhat at the mercy of your terminal program, as well as your  
> locale. To continue the story, the characters \xc3\xbc above (which  
> are the UTF-8 encoding of \u00fc), did not come from nowhere. The  
> developer mentioned earlier copy-and-pasted them from the following  
> bit of text in the I18n Busy Developers Guide:
>
> >>> exampleInstance.exampleText = u"This is some unicode with non-ascii character: ü"
> >>> exampleInstance.exampleText
> u"This is some unicode with non-ascii character: \xc3\xbc"
>
> As we determined above, the printed-out value does not end with ü. In  
> fact, what happened above was that the terminal program was using UTF-8,  
> but Python had no idea that that was the case, and converted each raw  
> UTF-8 byte to a separate unicode character.
>
> --Grant
>
> (*) It's also representable as the sequence of two characters
>
>    0075    LATIN SMALL LETTER U
>    0308    COMBINING DIAERESIS (Dialytika)
>            = double dot above, umlaut
>            = Greek dialytika
>            = double derivative
>            x (diaeresis - 00A8)
>
> but that's a whole different can of fish, er crosstown bus.


