[pylucene-dev] Document encoding?

Jarek Zgoda jarek.zgoda at sensisoft.com
Thu Mar 8 00:13:29 PST 2007


Andi Vajda wrote:

>> It seems that I can't properly store UTF-8 encoded documents using
>> PyLucene (by "properly" I mean the documents are searchable and can be
>> returned in the form in which they were stored). Should I use only
>> unicode objects in my search/indexing machinery code, as PyLucene
>> returns search result fields as unicode objects?
> 
> PyLucene wraps Java Lucene by compiling it with gcj. Java only uses 
> Unicode.
> If you pass utf-8 strings to PyLucene APIs, they are converted to 
> Unicode before being passed to the wrapped Java Lucene APIs because 
> that's all they understand.

Conversion from byte strings to unicode requires knowing the source
encoding, so I expect this to be a source of problems (not counting
confusion like mine)...
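
To illustrate the point, here is a minimal sketch in plain Python (written for Python 3, with no PyLucene calls). The helper name `to_text` and the sample string are only assumptions for illustration. It shows that the same bytes give different text depending on the encoding assumed, which is why decoding explicitly before indexing seems safer than relying on an implicit conversion.

```python
# Sample text containing non-ASCII characters, encoded as UTF-8 bytes.
original = "Zażółć gęślą jaźń"
raw = original.encode("utf-8")


def to_text(value, encoding="utf-8"):
    """Return text, decoding byte strings with an explicitly named encoding.

    Hypothetical helper: call it on every value before handing it to the
    indexing code, so that no implicit conversion ever takes place.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Decoding with the encoding the bytes were actually written in
# gives back the original text.
correct = to_text(raw)

# Decoding the same bytes under a wrong assumption "succeeds" silently
# but produces different (garbled) text.
wrong = raw.decode("latin-1")

# A stricter wrong assumption fails loudly instead.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```

With this approach everything the indexing and search code sees is already text, so what comes back from a search can be compared directly with what was stored.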

Thank you all.

-- 
Jarek Zgoda

"We read Knuth so you don't have to."

