[pylucene-dev] some interesting problem

Andi Vajda vajda at osafoundation.org
Wed Dec 27 12:31:48 PST 2006


On Wed, 27 Dec 2006, Yura Smolsky wrote:

> s = u"some unicode"
> summary = s.encode("utf8")
>
> That's what I do for all strings. I have indexed strings this way without
> problems for a long time, but for this particular data below it raises InvalidArgsError.

Why not pass unicode to PyLucene directly?
If you pass a regular, utf-8 encoded string, PyLucene has to convert it back
to unicode anyway, since that's all Java uses.
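
A minimal sketch of that suggestion, with a small diagnostic added. It is written for Python 3, where `bytes` plays the role of the regular Python 2 string in this thread and `str` plays the role of `unicode`. The helper names `to_text` and `non_bmp_chars` are hypothetical, not part of PyLucene. The bytes used are only the tail of the string shown in the traceback. Whether the character above U+FFFF is what actually triggers the InvalidArgsError is an assumption that has not been confirmed here.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a Unicode string, decoding byte strings strictly."""
    if isinstance(value, bytes):
        # Strict decoding raises UnicodeDecodeError on malformed input,
        # so a bad byte sequence is caught before it reaches the indexer.
        return value.decode(encoding)
    return value


def non_bmp_chars(text):
    """List (index, code point) pairs for characters above U+FFFF."""
    return [(i, hex(ord(ch))) for i, ch in enumerate(text) if ord(ch) > 0xFFFF]


# Last seven bytes of the 'summary' value from the traceback.
tail = b"\xe3\x80\x8f\xf0\x95\xbe\xb9"
text = to_text(tail)

# Decoding succeeds, so the bytes are well-formed utf-8; this reports
# any character that lies outside the Basic Multilingual Plane.
print(non_bmp_chars(text))
```

The decoded text, rather than the encoded bytes, is what would then be passed as the field value.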

Andi..

>
>>> Sometimes I receive a weird exception for unicode data, which is
>>> Japanese text. Here is an entry from the log:
>>>
>>> 2006-12-27 11:01:18,541 ERROR
>>> Traceback (most recent call last):
>>>  File "/home/search/lib/index/Index.py", line 91, in indexDocument
>>>    doc.add(Field("summary", fields['summary'], Field.Store.YES, Field.Index.TOKENIZED))
>>> InvalidArgsError: (<type 'PyLucene.Field'>, '__init__', ('summary', '\xe7\xb4\xa0\xe6\x95\xb5\xe3\x81\xaa\xe3\x82\xaf\xe3\x83\
>>> xaa\xe3\x82\xb9\xe3\x83\x9e\xe3\x82\xb9\xe3\x83\x97\xe3\x83\xac\xe3\x82\xbc\xe3\x83\xb3\xe3\x83\x88\xe3\x81\x8c\xe5\xb1\x8a\xe
>>> 3\x81\x8d\xe3\x81\xbe\xe3\x81\x97\xe3\x81\x9f\xef\xa3\xa6 \xe3\x83\xab\xe3\x82\xa4\xe3\x82\xb5\xe3\x83\xb3\xe3\x82\xbf\xe3\x81
>>> \x95\xe3\x82\x93\xe3\x81\x8b\xe3\x82\x89\xef\xa6\xa8 \xe3\x81\x84\xe3\x81\x88\xe3\x81\x84\xe3\x81\x88\xe3\x80\x82\xef\xbc\x91\
>>> xe5\xb9\xb4\xe9\xa0\x91\xe5\xbc\xb5\xe3\x81\xa3\xe3\x81\x9f\xe3\x80\x8c\xe8\x87\xaa\xe5\x88\x86\xe3\x80\x8d\xe3\x81\x8b\xe3\x8
>>> 2\x89\xe3\x80\x8c\xe8\x87\xaa\xe5\x88\x86\xe3\x80\x8d\xe3\x81\xab\xe3\x80\x82\xe3\x81\xa7\xe3\x81\x99\xe3\x80\x82\xef\xbc\x88\
>>> xe7\xac\x91\xef\xbc\x89 \xe3\x81\x9d\xe3\x82\x8c\xe3\x82\x82\xe3\x80\x8e\xe8\xa6\xaa\xe3\x81\xb0\xe3\x81\x8b\xe3\x82\xb0\xe3\x
>>> 83\x83\xe3\x82\xba\xe3\x80\x8f\xf0\x95\xbe\xb9', <Field_Store: YES>, <Field_Index: TOKENIZED>))
>>>
>>> What is actually wrong with the parameters?
>
> AV> Dunno, it could be a problem with converting to Unicode ?
>
> AV> It looks like the argument is a regular python string instance, not a unicode
> AV> string instance. Because Java uses only unicode strings, regular python
> AV> strings are converted to Unicode by assuming they're utf-8 encoded. Is that
> AV> the case with this string ?
>
> AV> A way around the problem is to convert the string to Unicode yourself before
> AV> passing it to PyLucene.
>
> AV> If you send in a piece of code that reproduces the problem, I can be more
> AV> helpful.
>
> AV> Andi..
>
>
>
>
> --
> Yura Smolsky
>
>
>

