[Dev] Re: Chandler Internationalization .6 Specification is ready for review

Brian Kirsch bkirsch at osafoundation.org
Wed Jul 20 14:39:31 PDT 2005


Thanks for the feedback, Ken. Please see my comments inline.

Ken Krugler wrote:

> Hi Brian,
>
> Sorry for not doing a quick, full review. Some issues I thought of 
> while quickly reading over the Wiki page:
>
> 1. Lucene has support (tokenizers, stemmers, etc) for various 
> languages, but you'd need to be able to include these (as needed), and 
> also "know" which language is being processed to decide which 
> language-specific plugins to apply.

Any specific under-the-hood Lucene questions would need to be answered 
by Andi Vajda, who is the owner of PyLucene.


>
> 2. Related is the issue of using ICU to do searches inside of text, 
> versus indexed queries. I thought that was something you were going to 
> support in Chandler, right? Like I've got an email open, and I search 
> on some word.
>
> If you're doing this, then you want language-specific, folded (e.g. 
> case insensitive) searching. ICU supports this, but it would require 
> additional work I think, similar to Lucene.


I believe that as long as the attributes have indexText=True, PyLucene 
will handle this case without a problem. I have sent mail to Andi to 
confirm my assumption.

>
> 3. So along these lines, how do you "pick" the language, if it's not 
> specified? Sometimes you know the language from meta info (like on web 
> pages), but otherwise it seems like you'll probably just want to use 
> the user's OS language setting. There are other approaches that try to 
> detect the language, similar to charset detection, but that typically 
> isn't warranted for a general-purpose app like Chandler. Anyway I 
> think this should be called out as a design decision.
>
Yes, the locale set will come from the operating system. Although this 
was already mentioned briefly in the spec, I have added an explicit 
section detailing how the locale set is determined. Thanks for the 
suggestion.
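
For illustration only, here is a minimal sketch of deriving a locale set
from the operating system using Python's standard `locale` module. The
helper names `expand_locale` and `os_locale_set`, and the choice of "en"
as the last-resort entry, are assumptions made for this example and are
not taken from the spec or from Chandler's I18nManager.

```python
import locale


def expand_locale(tag):
    """Return a most-specific-first chain, e.g. 'fr_CA.UTF-8' -> ['fr_CA', 'fr']."""
    tag = tag.split(".")[0].split("@")[0]  # drop encoding and modifier
    parts = tag.split("_")
    return ["_".join(parts[:i]) for i in range(len(parts), 0, -1)]


def os_locale_set(default="en"):
    """Build a locale set from the current OS-derived locale (hypothetical helper)."""
    try:
        # Reads the current LC_CTYPE setting, which normally reflects
        # the user's environment.
        lang, _encoding = locale.getlocale()
    except ValueError:
        lang = None
    chain = expand_locale(lang) if lang else []
    if default not in chain:
        chain.append(default)  # assumed last-resort language
    return chain
```

The result is an ordered list, so a caller can try each entry in turn
until a translation is found.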


> 4. To ensure smooth interoperability with ICU, I assume that 
> Chandler's Python will always be built using UTF-16, not UTF-32, 
> right? Otherwise it seems like you won't be able to leverage direct 
> copying of data between Python and ICU strings.

In the swig code for PyICU, Andi checks the Python unicode object's type 
(UCS-2 or UCS-4) when converting to and from ICU UnicodeStrings.
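
The distinction can be seen from the Python side with a small sketch.
This is not the PyICU swig code, which is not shown in this thread; the
function names are made up for the example. Note that Python 3.3 and
later no longer have separate narrow and wide builds, so the first check
always reports a wide build there.

```python
import sys


def is_wide_build():
    """True when this interpreter can index code points above U+FFFF directly."""
    return sys.maxunicode > 0xFFFF


def utf16_code_units(text):
    """Number of 16-bit code units needed to hold `text` as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the Basic Multilingual
# Plane, so UTF-16 stores it as a surrogate pair of two code units.
clef = "\U0001D11E"
units = utf16_code_units(clef)
```

A character outside the Basic Multilingual Plane is where a UCS-4
string and a UTF-16 string differ in length, which is why the
conversion has to know which internal form it is given.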


>
> 5. We'd talked about how big ICU code/data can be, and the need to 
> support installations of different language sets. Was that covered?
>
Yes, it is big. I added a note to the spec that the ICU size can be 
reduced significantly by removing locale data files, such as those for 
Hebrew and Arabic, which will not be supported in the Chandler 1.0 
release.


> 6. I think somebody commented about the problems that can be caused by 
> translators messing up strings. You'd responded w/info about the ICU 
> message format. We'd talked about being able to do a consistency 
> check, comparing English to language X and validating that the 
> abstract structure of the message (number/type of parameters) hadn't 
> changed. Might be worth mentioning.
>
I added the consistency checker to the spec.
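
A rough sketch of such a check is below. It assumes ICU-style
placeholders such as {0} or {1,number} and compares the set of
(argument, type) pairs found in the English string with the set found in
the translation. The function names are hypothetical, and the regular
expression is a simplification: it does not handle nested plural or
select sub-messages or apostrophe quoting.

```python
import re

# Matches the start of an argument: "{0", "{1,number", "{name, date".
_ARG = re.compile(r"\{\s*(\w+)\s*(?:,\s*(\w+))?")


def message_signature(pattern):
    """Return the set of (argument, type) pairs used by a message pattern."""
    return {(name, kind or "string") for name, kind in _ARG.findall(pattern)}


def check_translation(source, translated):
    """Return (missing, extra) argument lists relative to the source string."""
    src = message_signature(source)
    dst = message_signature(translated)
    return sorted(src - dst), sorted(dst - src)


english = "{0} sent you {1,number} messages"
broken = "{0} vous a envoyé {2} messages"
missing, extra = check_translation(english, broken)
```

Two empty lists mean the abstract structure of the message is
unchanged; anything else points at a parameter the translator dropped,
renumbered, or retyped.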


> 7. For doing a programmatic localization, you mentioned "Potential 
> tests are double the size of the LocalizableString text or insert in 
> each LocalizableString translation a non-8bit surrogate character 
> pair". I'm not sure what you mean by a non-8bit surrogate character pair.
>
> Some tests you can do are:
>
> a. Replace vowels with vowel + umlaut (Motley Crüe localization). 
> Other substitutions are possible as well (C -> Ç, etc)
>
> b. Replace ASCII with full-width ASCII


I updated the .6 spec to be clearer. When I wrote "non-8 bit surrogate 
character pair", what I really meant was a case where a single 
displayable glyph is represented by two or more Unicode code points 
(strictly speaking a combining character sequence rather than a 
surrogate pair), such as your Motley Crüe example above, where ü can be 
encoded as u + a combining umlaut.
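
The two substitutions Ken suggests (vowel + umlaut, and full-width
ASCII) can be sketched with the Python standard library alone. The
function names are made up for this example, and it is not the test
described in the spec.

```python
import unicodedata

COMBINING_DIAERESIS = "\u0308"


def add_umlauts(text):
    """Append a combining diaeresis to every ASCII vowel, then compose (NFC)."""
    out = []
    for ch in text:
        out.append(ch)
        if ch in "aeiouAEIOU":
            out.append(COMBINING_DIAERESIS)
    return unicodedata.normalize("NFC", "".join(out))


def to_fullwidth(text):
    """Map printable ASCII to its full-width form (U+FF01..U+FF5E)."""
    table = {code: code + 0xFEE0 for code in range(0x21, 0x7F)}
    table[0x20] = 0x3000  # ideographic space
    return text.translate(table)


# NFD splits the precomposed u-umlaut into 'u' plus U+0308:
# one glyph, two code points.
decomposed = unicodedata.normalize("NFD", "\u00fc")
```

The umlaut pass checks that non-ASCII text survives the whole pipeline,
while the full-width pass also stresses layout because every character
becomes wider.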

> equivalents. so "help" becomes "ｈｅｌｐ", which also tests expanding 
> the width of text.
>
> 8. Do you mention the issue of making gettext use native OS fallback 
> settings?
>

The OS locale set will be determined by the Chandler I18nManager. The 
gettext API has built-in fallback support; passing it a locale set array 
is all gettext needs to perform the correct fallback behavior.
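
As a sketch of what this looks like with Python's standard `gettext`
module: `gettext.translation` accepts a `languages` list and tries each
entry in order. The domain name "chandler", the "locale" directory, and
the `load_translations` wrapper are assumptions for this example only.

```python
import gettext


def load_translations(locale_set, domain="chandler", localedir="locale"):
    """Return a translations object for the first locale with a catalog.

    With fallback=True, gettext returns a NullTranslations instance
    (which echoes the original string) when no catalog is found,
    instead of raising an error.
    """
    return gettext.translation(
        domain,
        localedir=localedir,
        languages=locale_set,
        fallback=True,
    )


translations = load_translations(["fr_CA", "fr", "en"])
_ = translations.gettext
```

With no matching .mo file on disk, `_("Hello")` simply returns
"Hello", so the application still runs untranslated.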

> Related to this might be noting that using .po files might preclude 
> some Mac OS X localization customization by end users, since the 
> file/structure won't match what's standard for Mac apps.


Added a footnote to the spec addressing this point.

>
> Anyway, it's 9pm so I'm off to put my daughter to bed. Hope this helps...
>
> -- Ken
>
>-- 
>  
>
> Ken Krugler
> TransPac Software, Inc.
> <http://www.transpac.com>
> +1 530-470-9200



-- 
Brian Kirsch - Email Framework Engineer
Open Source Applications Foundation
543 Howard St. 5th Floor
San Francisco, CA 94105
(415) 946-3056
http://www.osafoundation.org


