[pylucene-dev] Question on design of a long-running BSD-based
reader/writer class
Andi Vajda
vajda at osafoundation.org
Thu Jan 4 15:47:05 PST 2007
On Thu, 4 Jan 2007, Terry Jones wrote:
> I have a fairly simple question about trying to write a long-running
> BSD-based reader/writer class. I've written something that works, but it's
> extremely slow (0.4 seconds each time I add a tiny document).
>
> In summary, I want to have a class that looks something like this:
>
> class Indexer:
>     def __init__(self): pass
>     def addDoc(self): pass
>     def search(self): pass
>     def close(self): pass
>
> I'll have a program make itself an instance i of this class and then I want
> to periodically call i.addDoc() and i.search(), in no particular order. A
> search should of course be able to find anything previously added by addDoc.
> Before I shut down I'll call the close method.
>
> I'm using the BSD variant of PyLucene because I want transactions.
>
> Without going into too much detail, here's what I currently do.
>
> 1) __init__ creates two new DB instances, and calls db.open on them. These
> are stored as self.file1 and self.file2. This is done inside a transaction.
>
> 2) addDoc makes a new Document (doc) and calls doc.add to add two fields
> (with just a few chars of data in them). Then it does
>
>     txn = None
>     try:
>         txn = self.env.txn_begin(None)
>         directory = DbDirectory(txn, self.file1, self.file2, 0)
>         writer = IndexWriter(directory, self.analyzer, self.createIndex)
>         writer.addDocument(doc)
>         writer.close()
>     except:
>         if txn is not None:
>             txn.abort()
>         raise
>     else:
>         txn.commit()
>
>
> This addDoc function is very slow. Sorry I'm not reporting exactly what is
> slow: the profiling data from profile or hotshot doesn't show it, and I
> haven't yet timed the individual PyLucene calls.
>
> I don't see a way around this. I want to use transactions, the DbDirectory
> needs to be passed an open transaction, and IndexWriter must be passed the
> newly created directory. So it doesn't look like I can store a writer in
> self. I could think about opening a transaction in __init__ but I'd still
> need to commit it at some point and open another, so that doesn't seem to
> help.
>
> I'm wondering if I am doing something wrong here.
>
> If not, is the slowness due to using the Berkeley directory? I.e., would
> the problem go away if I used a normal FSDirectory or a RAMDirectory?
>
Transactions, and opening and closing an index for every addition, do carry
a certain overhead. Beyond that, I noticed a fair amount of thrashing in the
Lucene directory I/O, and got things to be considerably faster by batching
all updates in a RAMDirectory and then adding the RAMDirectory contents to
the DbDirectory via the addIndexes API.
For an example, see the code around line 485 at:
http://svn.osafoundation.org/chandler/trunk/chandler/repository/persistence/FileContainer.py
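
The batching idea can be sketched roughly as follows. This is not the code at the URL above, only a minimal illustration of the pattern. The class name BatchingIndexer, the two factory callables and batch_size are hypothetical names introduced here; the PyLucene objects are passed in from outside so the sketch itself makes no assumptions beyond the addDocument, addIndexes, close and txn_begin/commit/abort calls already mentioned in this thread.

```python
class BatchingIndexer:
    """Buffer documents in memory and merge them in one transaction.

    Hypothetical sketch: open_ram_writer() must return a pair
    (ram_directory, writer) and open_db_writer(txn) must return a
    writer on the transactional directory. Both are supplied by the
    caller; neither is a PyLucene function.
    """

    def __init__(self, env, open_ram_writer, open_db_writer, batch_size=100):
        self.env = env
        self.open_ram_writer = open_ram_writer
        self.open_db_writer = open_db_writer
        self.batch_size = batch_size
        self.ram_dir = None      # in-memory directory currently being filled
        self.ram_writer = None   # writer on that directory
        self.pending = 0         # documents in the current in-memory batch
        self.sealed = []         # closed in-memory batches awaiting a merge

    def add_doc(self, doc):
        # Cheap path: no transaction, no on-disk index opened.
        if self.ram_writer is None:
            self.ram_dir, self.ram_writer = self.open_ram_writer()
        self.ram_writer.addDocument(doc)
        self.pending += 1
        if self.pending >= self.batch_size:
            self.flush()

    def flush(self):
        # Seal the current in-memory batch, if there is one.
        if self.ram_writer is not None:
            self.ram_writer.close()
            self.sealed.append(self.ram_dir)
            self.ram_dir = self.ram_writer = None
            self.pending = 0
        if not self.sealed:
            return
        # One transaction and one writer open/close for the whole batch.
        txn = self.env.txn_begin(None)
        try:
            writer = self.open_db_writer(txn)
            writer.addIndexes(list(self.sealed))
            writer.close()
        except BaseException:
            # Sealed batches are kept, so a later flush() can retry.
            txn.abort()
            raise
        else:
            txn.commit()
            self.sealed = []

    def close(self):
        self.flush()
```

With PyLucene, open_ram_writer would presumably create a RAMDirectory and an IndexWriter on it, and open_db_writer would build the DbDirectory and IndexWriter exactly as in the addDoc code quoted above. Note that documents still sitting in the in-memory batch are not visible to a search on the DbDirectory, so a search method would need to call flush() first (or search the in-memory directory as well).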
Andi..