[Chandler-dev] Unicode and letting u be u (umlaut)
Brian Kirsch
bkirsch at osafoundation.org
Wed Apr 12 16:46:50 PDT 2006
Yes, Python Unicode is fun to work with, isn't it :)
Grant raises a good point. The \u syntax is the best way to ensure that
what you intended to render actually is correct.
Python provides a means to specify the source character set encoding at
the top of a Python file.
If one does that, then text in the file will be converted to Unicode
from that character set.
For example:
# -*- coding: utf-8 -*-
exampleText = u"This is some Unicode with non-ascii character: ü"
print repr(exampleText)
u'This is some Unicode with non-ascii character: \xfc'
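To illustrate how the declaration drives the decoding, here is a minimal sketch that compiles the same UTF-8 bytes under two different coding lines. The names SOURCE_TEMPLATE and load_text and the "<example>" filename are made up for illustration, and it assumes an interpreter that accepts the u"" prefix (Python 2, or Python 3.3 and later).

```python
# Sketch: the coding declaration decides how the bytes of a literal are decoded.
# All names here (SOURCE_TEMPLATE, load_text) are hypothetical.

SOURCE_TEMPLATE = u'# -*- coding: %s -*-\ntext = u"\u00fc"\n'


def load_text(declared_encoding):
    """Compile UTF-8 encoded source that claims to be in declared_encoding."""
    # The "file" always contains the UTF-8 bytes \xc3\xbc for the umlaut.
    source_bytes = (SOURCE_TEMPLATE % declared_encoding).encode("utf-8")
    namespace = {}
    exec(compile(source_bytes, "<example>", "exec"), namespace)
    return namespace["text"]


declared_right = load_text("utf-8")    # bytes decoded as UTF-8
declared_wrong = load_text("latin-1")  # same bytes, decoded byte by byte
print(repr(declared_right), repr(declared_wrong))
```

When the declaration matches the real encoding, the literal holds a single character; when it does not, each byte becomes its own character.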
However, in the command line interpreter Python does not know what the
source encoding is, so it must be specified explicitly.
>>> exampleText = unicode("This is some Unicode with non-ascii character: ü", "utf8")
>>> exampleText
u'This is some Unicode with non-ascii character: \xfc'
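The same conversion can be written with decode(), which may make the step easier to see. This is only a sketch: raw_bytes stands in for what a UTF-8 terminal would hand to Python, and the variable names are mine, not from the guide.

```python
# Bytes as a UTF-8 terminal would deliver them: the umlaut arrives as two bytes.
raw_bytes = b"non-ascii character: \xc3\xbc"

# Naming the encoding explicitly turns those two bytes into one code point.
example_text = raw_bytes.decode("utf-8")

last_char = example_text[-1]
print(repr(last_char))
```

The decoded text is one element shorter than the byte string, because the two-byte sequence collapses into the single character U+00FC.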
My example in the i18n Busy Developer Guide should have been the above
and not:
>>> exampleInstance.exampleText = u"This is some Unicode with non-ascii character: ü"
>>> exampleInstance.exampleText
u"This is some Unicode with non-ascii character: \xc3\xbc"
I have updated the guide to correct the error.
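For anyone who has already pasted the bad value somewhere, here is a rough sketch of how to check what a suspicious string really contains and how to undo the damage. It assumes the damage is exactly "UTF-8 bytes treated as one character each", as in the old guide example; the variable names are illustrative only.

```python
import unicodedata

broken = u"\xc3\xbc"  # the tail of the value printed in the old guide example

# Each "character" is really one byte of the UTF-8 encoding of U+00FC.
names = [unicodedata.name(ch) for ch in broken]

# Latin-1 maps code points 0-255 straight back to the same byte values,
# so encoding with it recovers the original bytes, which then decode as UTF-8.
repaired = broken.encode("latin-1").decode("utf-8")

print(names)
print(unicodedata.name(repaired))
```

The round trip through Latin-1 only works while every character in the string is below U+0100, which holds for this kind of mix-up.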
Using the \u syntax is a better choice because no encoding needs to be
explicitly specified in the file or the terminal.
For example:
exampleText = u"This is some Unicode with non-ascii character: \u00FC"
Of course, if your terminal uses the ASCII character set, then the ü
will not render correctly :)
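A small sketch of both halves of that point: the escape gives the right code point without any encoding declaration, and an ASCII-only output cannot represent it. The "backslashreplace" fallback is just one option for ASCII terminals, not something the guide prescribes, and the variable names are mine.

```python
example_text = u"non-ascii character: \u00FC"

# The escape always yields exactly one code point, whatever the file encoding.
code_point = ord(example_text[-1])

# An ASCII-only terminal cannot show it; a strict encode fails outright...
try:
    example_text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# ...while "backslashreplace" degrades to a readable escape instead.
fallback = example_text.encode("ascii", "backslashreplace")
print(hex(code_point), ascii_ok, fallback)
```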
--Brian
Brian Kirsch - Cosmo Developer / Chandler Internationalization Engineer
Open Source Applications Foundation
543 Howard St. 5th Floor
San Francisco, CA 94105
http://www.osafoundation.org
Grant Baillie wrote:
> I've run across a couple of cases of specifying unicode characters in
> Python code that were a little fishy, so I thought I'd send out a
> long, rambly email to the list.
>
> The 10-second summary is: If you want to specify a non-ASCII
> character in a unicode string, the python \uxxxx escape is your
> friend. With anything else, you're playing with fire.
>
> So, to cut a short story long, I was looking at a test case in
> Chandler, where we were trying to come up with a non-ASCII path to
> use in a Chandler profile directory:
>
> TestCrypto.py:13: u = u"profileDir_(\xc3\xbc)" # u umlaut
>
> This actually succeeds in setting u to be a non-ASCII string, except
> that it doesn't contain a "u umlaut". When you specify a u"..." style
> string in Python, you're telling the interpreter to assume each
> character in the string is a unicode code point. Looking at the list in
>
> <http://www.unicode.org/Public/UNIDATA/NamesList.txt>
>
> you can determine that "u umlaut" is the Unicode character(*)
>
> 00FC LATIN SMALL LETTER U WITH DIAERESIS
>
> but in the above, the \xc3 and \xbc are interpreted as:
>
> 00C3 LATIN CAPITAL LETTER A WITH TILDE
> 00BC VULGAR FRACTION ONE QUARTER
>
> Clearly, we don't want any vulgarity in our paths, now do we :) ?
>
> It turns out that the author of the above code was having trouble
> entering u umlaut (in a console, or a code editor). As mentioned
> above, the easiest and most portable way to do this kind of thing is
> to use the \u escape, viz:
>
> u = u"profileDir_(\u00fc)" # u umlaut
>
> In the case of source files, Python has some handy conventions for
> specifying what the character encoding of a source file is (see
> <http://docs.python.org/ref/encodings.html#encodings>). Unfortunately, it
> turns out that there's no convention that's adopted by many editors.
> Possibly this is a reason to require everyone to use emacs, or vim,
> but the resulting religious war would take us well past Chandler 1.0 :).
>
> In the case of entering text in an interactive session, you're
> somewhat at the mercy of your terminal program, as well as your
> locale. To continue the story, the characters \xc3\xbc above (which
> are the UTF-8 encoding of \u00fc), did not come from nowhere. The
> developer mentioned earlier copy-and-pasted them from the following
> bit of text in the I18n Busy Developers Guide:
>
> >>> exampleInstance.exampleText = u"This is some unicode with non-ascii character: ü"
> >>> exampleInstance.exampleText
> u"This is some unicode with non-ascii character: \xc3\xbc"
>
> As we determined above, the printed-out value does not end with ü. In
> fact, what happened above was the terminal program was using UTF-8,
> but Python had no idea that that was the case, and converted the raw
> UTF-8 bytes to unicode characters.
>
> --Grant
>
> (*) It's also representable as the sequence of two characters
>
> 0075 LATIN SMALL LETTER U
> 0308 COMBINING DIAERESIS (Dialytika)
> = double dot above, umlaut
> = Greek dialytika
> = double derivative
> x (diaeresis - 00A8)
>
> but that's a whole different can of fish, er crosstown bus.