[Dev] Index issues

Grant Baillie grant at osafoundation.org
Wed Jan 25 09:57:23 PST 2006


On Jan 25, 2006, at 09:25, Reid Ellis wrote:

> Most apps that deal with this have a "default encoding" preference  
> which the user can set to whatever they like, since they might know  
> what encoding most of their email is in.

I think that if you know the encoding (charset) most of your email  
is in, you're probably a geek :).

Note that "encoding" != "locale". Knowing a default encoding doesn't  
tell you how to sort (e.g. iso-8859-1 could be used by German, French  
or Norwegian users, all of whom have different sorting rules).  
Similarly for tokenizing (unicode) strings for indexing. Andi made  
some good suggestions for guessing locale earlier in the thread.
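
To make the "encoding" != "locale" point concrete, here is a toy sketch (not Chandler code). It sorts the same iso-8859-1-representable names under two hand-written rule sets. The key functions and the NORWEGIAN_TAIL table are simplified assumptions for illustration; real code would use a collator built from locale data, as PyICU does.

```python
import unicodedata

# Toy collation rules for illustration only; real code would use a collator
# built from locale data (e.g. via PyICU) rather than hand-written tables.

def german_key(s):
    # Simplified German dictionary-style rule: an accented letter sorts
    # like its base letter, so strip combining marks after decomposition.
    decomposed = unicodedata.normalize("NFD", s.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# The Norwegian alphabet ends ... x y z æ ø å, so map those three letters
# to code points just past "z" ("{", "|", "}") to push them to the end.
NORWEGIAN_TAIL = {"æ": "{", "ø": "|", "å": "}"}

def norwegian_key(s):
    composed = unicodedata.normalize("NFC", s.lower())
    return "".join(NORWEGIAN_TAIL.get(c, c) for c in composed)

names = ["Åse", "Zoe", "Anna", "Olga"]
german_order = sorted(names, key=german_key)
norwegian_order = sorted(names, key=norwegian_key)
```

With these toy rules the German-style order is Anna, Åse, Olga, Zoe, while the Norwegian-style order is Anna, Olga, Zoe, Åse: same bytes, same charset, different sort.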

In a way, this would get worse if the email world made the (sensible)  
switch to UTF-8 everywhere, since the charset would then give no hint  
about language (unless clients simultaneously started setting the  
Content-Language MIME header).

In general, using the user's default locale in the case where you  
don't know a LOB's language will work for the majority of users. The  
people it won't work for are multilingual: it's not that uncommon to  
see people working at English-speaking US companies who conduct a lot  
of personal email in languages other than English.
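
A rough sketch of that fallback chain, using only Python's standard email module: try the Content-Language header, then the charset and sender-domain clues suggested earlier in the thread, and finally the user's default locale. The function name guess_locale and both hint tables are hypothetical and deliberately tiny; this is not how Chandler does it, just one way the idea could look.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical hint tables (assumptions for illustration). Note that
# iso-8859-1 is deliberately absent: it is shared by too many languages.
CHARSET_HINTS = {"iso-2022-jp": "ja", "shift_jis": "ja",
                 "euc-kr": "ko", "koi8-r": "ru"}
TLD_HINTS = {"de": "de", "fr": "fr", "no": "nb", "jp": "ja"}

def guess_locale(msg, user_locale):
    """Return a best-guess language/locale tag for an email message."""
    # 1. An explicit Content-Language header wins when present.
    lang = msg.get("Content-Language")
    if lang:
        return lang.split(",")[0].strip()
    # 2. Some charsets strongly imply a language.
    charset = msg.get_content_charset()
    if charset in CHARSET_HINTS:
        return CHARSET_HINTS[charset]
    # 3. The sender's top-level domain is a weak clue.
    _, addr = parseaddr(msg.get("From", ""))
    tld = addr.rsplit(".", 1)[-1].lower() if "." in addr else ""
    # 4. Otherwise fall back to the user's default locale.
    return TLD_HINTS.get(tld, user_locale)

msg = message_from_string(
    "From: Hans <hans@example.de>\n"
    "Content-Type: text/plain; charset=iso-8859-1\n"
    "\n"
    "Hallo"
)
guessed = guess_locale(msg, "en_US")
```

Here the charset says nothing useful, so the guess comes from the .de sender domain; a message from a .com address with no other clues would simply get the user's locale.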

(BTW, I agree with Andi's comment below that sorting should use the  
user's locale).

--Grant


> I assume that Chandler's locale is derived from the OS's locale?
>
> Reid
>
> On Tue Jan 24 2006, at 21:21, Andi Vajda wrote:
>> On Tue, 24 Jan 2006, Brian Kirsch wrote:
>>> Andi,
>>> What do you recommend doing in the case where a locale is not known  
>>> for the text?
>>>
>>> Email is a great example, in most cases no language (locale)  
>>> headers are supplied.
>>
>> When no locale is supplied, the supplied encoding could provide  
>> clues for a set of heuristics that help 'guess' a locale. In the case  
>> of email, for example, the domain of the sender may also provide a  
>> clue.
>> That guess may be better than nothing but not by much...
>>
>> A good guess at this is important for full text indexing.
>>
>> When sorting email addresses, however, I'd think that the Chandler  
>> user's locale would prevail over the potential locale of the data  
>> being sorted.
>>
>> Andi..
>>
>>>
>>> -Brian
>>>
>>> Brian Kirsch - Email Framework Engineer
>>> Open Source Applications Foundation
>>> 543 Howard St. 5th Floor
>>> San Francisco, CA 94105
>>> (415) 946-3056
>>> http://www.osafoundation.org
>>>
>>>
>>>
>>> Andi Vajda wrote:
>>>> On Tue, 24 Jan 2006, Brian Kirsch wrote:
>>>>> One issue to remember: if we are sorting on the name of the  
>>>>> user, e.g. Brian Kirsch <bkirsch at osafoundation.org>, then the  
>>>>> sort order will need to be localized with PyICU.
>>>> Last year, I added a new index class called StringIndex. It  
>>>> understands locale and uses PyICU's collator support for  
>>>> comparing strings.
>>>> Similarly, I realized recently that for full text indexing's  
>>>> sake, LOBs (at least, if not all attributes) should also have a  
>>>> locale aspect so that when full text indexing (and queries) are  
>>>> run, an analyzer that is appropriate for the language of the  
>>>> locale is used to break up the text (or queries) into tokens.
>>>> Andi..
>>>
>
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>
> Open Source Applications Foundation "Dev" mailing list
> http://lists.osafoundation.org/mailman/listinfo/dev


