[pylucene-dev] Question on design of a long-running BSD-based reader/writer class

Andi Vajda vajda at osafoundation.org
Thu Jan 4 15:47:05 PST 2007


On Thu, 4 Jan 2007, Terry Jones wrote:

> I have a fairly simple question about trying to write a long-running
> BSD-based reader/writer class. I've written something that works, but it's
> extremely slow (0.4 seconds each time I add a tiny document).
>
> In summary, I want to have a class that looks something like this:
>
>    class Indexer:
>        def __init__(self): pass
>        def addDoc(self): pass
>        def search(self): pass
>        def close(self): pass
>
> I'll have a program make itself an instance i of this class and then I want
> to periodically call i.addDoc() and i.search(), in no particular order. A
> search should of course be able to find anything previously added by addDoc.
> Before I shut down I'll call the close method.
>
> I'm using the BSD variant of PyLucene because I want transactions.
>
> Without going into too much detail, here's what I currently do.
>
> 1) __init__ creates two new DB instances, and calls db.open on them. These
>    are stored as self.file1 and self.file2. This is done inside a transaction.
>
> 2) addDoc makes a new Document (doc) and calls doc.add to add two fields
>    (with just a few chars of data in them). Then it does
>
>        txn = None
>        try:
>            txn = self.env.txn_begin(None)
>            directory = DbDirectory(txn, self.file1, self.file2, 0)
>            writer = IndexWriter(directory, self.analyzer, self.createIndex)
>            writer.addDocument(doc)
>            writer.close()
>        except:
>            if txn is not None:
>                txn.abort()
>            raise
>        else:
>            txn.commit()
>
>
> This addDoc function is very slow. Sorry I'm not reporting exactly what is
> slow: the profiling data from profile or hotshot doesn't show me, and I
> haven't timed the individual PyLucene calls.
>
> I don't see a way around this. I want to use transactions, the DbDirectory
> needs to be passed an open transaction, and IndexWriter must be passed the
> newly created directory.  So it doesn't look like I can store a writer in
> self. I could think about opening a transaction in __init__ but I'd still
> need to commit it at some point and open another, so that doesn't seem to
> help.
>
> I'm wondering if I am doing something wrong here.
>
> If not, is the slowness due to using the Berkeley directory? I.e., would
> the problem go away if I used a normal FSDirectory or a RAMDirectory?
>

While transactions and opening and closing an index for every addition do
carry some overhead, I also noticed a fair amount of thrashing in the Lucene
directory I/O. I got things to be considerably faster by batching all
updates, doing them in a RAMDirectory, and then adding the RAMDirectory
contents to the DbDirectory via the addIndexes API.

For an example, see the code around line 485 at:
http://svn.osafoundation.org/chandler/trunk/chandler/repository/persistence/FileContainer.py
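
The batching idea can be sketched roughly as follows. This is a minimal
sketch under stated assumptions, not the Chandler code: the class name
BatchingIndexer, the batch_size parameter, and the three callables
(new_ram_directory, open_writer, open_db_directory) are hypothetical hooks
the caller would supply, and passing a list of directories to addIndexes is
an assumption about the binding. Only txn_begin, commit, abort, addDocument,
addIndexes and close are taken from the thread.

```python
class BatchingIndexer:
    """Buffer documents in memory; merge them into the durable index in batches.

    The three callables are hypothetical hooks supplied by the caller,
    not PyLucene APIs:
      new_ram_directory()            -> an in-memory directory
      open_writer(directory, create) -> an index writer on that directory
      open_db_directory(txn)         -> the Berkeley DB backed directory
    """

    def __init__(self, env, new_ram_directory, open_writer,
                 open_db_directory, batch_size=100):
        self.env = env
        self.new_ram_directory = new_ram_directory
        self.open_writer = open_writer
        self.open_db_directory = open_db_directory
        self.batch_size = batch_size
        self.create_index = True  # assumption: the first flush creates the index
        self._start_batch()

    def _start_batch(self):
        # Fresh in-memory index for the next batch of documents.
        self.ram_directory = self.new_ram_directory()
        self.ram_writer = self.open_writer(self.ram_directory, True)
        self.pending = 0

    def addDoc(self, doc):
        # No transaction and no Berkeley DB I/O on this path.
        self.ram_writer.addDocument(doc)
        self.pending += 1
        if self.pending >= self.batch_size:
            self.flush()

    def flush(self):
        """Merge the pending batch into the durable index in one transaction."""
        if self.pending == 0:
            return
        self.ram_writer.close()
        txn = self.env.txn_begin(None)
        try:
            directory = self.open_db_directory(txn)
            writer = self.open_writer(directory, self.create_index)
            # Assumption: addIndexes accepts a list of directories.
            writer.addIndexes([self.ram_directory])
            writer.close()
        except Exception:
            txn.abort()
            raise
        txn.commit()
        self.create_index = False
        self._start_batch()
```

With the names from the question, open_db_directory might be something like
lambda txn: DbDirectory(txn, self.file1, self.file2, 0). For a search to see
everything added so far, search() would presumably call flush() first, and
close() would do the same before shutting down. Recovery after a failed
flush is left out of this sketch.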

Andi..

