[Dev] pylucene fsdirectory patch and unicode issue

Andi Vajda vajda at osafoundation.org
Sat May 1 16:16:48 PDT 2004


I just integrated your changes. Here is what I did:

  - The patches were relative to the 'old' build sources. Our build is in
    flux: we are about to move to a 'new' build, and there has been quite a
    bit of shifting around in our CVS repository. The actively maintained
    PyLucene sources are in internal/PyLucene.
    For more information on the 'new' build infrastructure, please see:
        http://wiki.osafoundation.org/bin/view/Jungle/NewBuildInstructions
    I manually added FSDirectory to PyLucene.i and to the makefiles, and
    regenerated the rest.

  - The problem with the unicode test was that you were passing a unicode
    string to an InputStreamReader. As in Java, where I borrowed this idea
    from, input streams are for bytes and readers are for unicode chars.
    If you want to read unicode chars from a unicode string, you can either:
      - encode it as utf-8 bytes and pass it to an InputStreamReader, which is
        a little wasteful since the InputStreamReader then has to decode
        those bytes back into unicode characters
      - or pass the unicode string to a StringReader, a class I added to
        your tester and to repository/util/Streams.py. Your tester is
        also checked into the new internal/PyLucene/test directory.
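
    The difference between the two options above can be sketched with the
    Python standard library alone. This is only an analogy, not the PyLucene
    code: io.BytesIO plus io.TextIOWrapper stands in for a byte stream
    wrapped in an InputStreamReader, and io.StringIO stands in for a
    StringReader. The variable names and the sample text are made up for
    illustration.

```python
import io

# Sample unicode text with a non-ASCII character (illustrative only).
text = u"sample text \u00e9 " * 20

# Option 1: encode to utf-8 bytes, then let a decoding reader turn the
# bytes back into characters. It works, but it is a round trip.
byte_stream = io.BytesIO(text.encode("utf-8"))
decoding_reader = io.TextIOWrapper(byte_stream, encoding="utf-8")
via_bytes = decoding_reader.read()

# Option 2: read characters straight from the string, no encoding step.
string_reader = io.StringIO(text)
via_string = string_reader.read()

# Both routes yield the same characters.
assert via_bytes == via_string == text
```

    Both routes return the same characters; the second simply skips the
    encode/decode round trip.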

Thanks !

Andi..

On Thu, 29 Apr 2004, Kapil Thangavelu wrote:

> hi folks,
>
> attached is a patch against cvs head to add lucene's standard
> fsdirectory store to PyLucene. swig files were regenerated with swig
> 1.3.21
>
> also attached is a unittest file, with one failing test (prefix XXX)
> which attempts to index unicode with pylucene, using a copy of the input
> stream reader from repository.utils.Streams which does string encoding.
> i was wondering if anyone had any idea as to the cause of this error,
> because afaics they should return the same value because the encoding by
> input stream reader amounts to the following
>
> unicode(u'sample text'*20).encode('utf-8')
> unicode('sample text'*20).encode('utf-8')
>
> and the return values are both of type str and have the same value.
>
> i've attached the traceback from the unit test as well.
>
> cheers,
>
> -kapil
>


