[pylucene-dev] indexing performance
Pete
pfein at pobox.com
Wed Jul 4 06:19:26 PDT 2007
On Wednesday July 4 2007 12:52 am, Filip de Waard wrote:
> Hello,
>
> Until today, I've never had a single worry about performance in my
> short but exciting Python experience. However, now I'm trying to
> index over six million books from a MySQL database using PyLucene and
> I'd like to speed it up.
You're pretty much just using Python here to move bits between a pair of
libraries. There's not much optimization to be done there. FWIW, this is
the sort of thing Python is good for, being the 2nd Best Language For
Everything.
> Tomorrow I'll start playing with a profiler, but in the meantime:
> does anyone have any recommendations as to how to be most efficient
> in regard to the Python code, database interaction and of course the
> PyLucene indexing process? Or maybe I'm doing something horribly
> wrong in my script?
Yup, several problems.
1. You need an ORDER BY clause with SELECT ... LIMIT, otherwise the row order
is unspecified and each query can hand you an arbitrary set of rows. Well, at
least every other database on earth works like this, but then again, we're
talking about MySQL.
2. You need to configure your IndexWriter for bulk loading. See the
setMaxBufferedDocs, setMaxMergeDocs & setMergeFactor methods at
http://lucene.apache.org/java/2_0_0/api/org/apache/lucene/index/IndexWriter.html
IIRC, these have changed slightly in 2.1. Basically, you'll want to increase
those values as much as possible without causing your script to swap, and run
an optimize() at the end.
3. Make sure you have sufficient hardware. More RAM will allow you to bump
the above values. A fast disk / properly configured RAID array will help
too, but RAM's the bottleneck.
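
To make points 1 and 2 concrete, here's a rough sketch, not a drop-in fix for
your script. Everything specific in it is an assumption on my part: the books
table, its id and title columns (with positive integer ids), the make_doc
callback that turns a row into a Lucene Document, and the numbers passed to
the setters. It uses sqlite3 only so it runs without a MySQL server; with
MySQLdb the placeholders would be %s instead of ?. The writer argument is
whatever IndexWriter you've already opened.

```python
import sqlite3  # stand-in for MySQLdb so the sketch runs anywhere


def iter_rows(conn, batch_size=1000):
    """Yield (id, title) rows in id order, one query per batch."""
    last_id = 0  # assumes ids are positive integers
    cur = conn.cursor()
    while True:
        # ORDER BY makes each LIMIT window deterministic. Filtering on the
        # last id seen replaces OFFSET, so no rows are re-read or skipped.
        cur.execute(
            "SELECT id, title FROM books WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        )
        rows = cur.fetchall()
        if not rows:
            break
        for row in rows:
            yield row
        last_id = rows[-1][0]


def bulk_index(conn, writer, make_doc, batch_size=1000):
    """Add every row to the index, then optimize once at the end."""
    # Illustrative values only; raise them until just short of swapping.
    writer.setMaxBufferedDocs(10000)
    writer.setMergeFactor(50)
    count = 0
    for row in iter_rows(conn, batch_size):
        writer.addDocument(make_doc(row))
        count += 1
    writer.optimize()  # one optimize after the load, not one per batch
    writer.close()
    return count
```

Paging on "id > last seen id" rather than a growing OFFSET is my choice here,
not something your script has to do; either way the ORDER BY is what makes the
pages line up.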
> Any pointer would be most appreciated.
Good luck *searching* an index of this size. I dunno what an acceptable
response time is for your application, but you're gonna need a pretty beefy
box with *lots* of RAM to query 6 million documents quickly.
--Pete
--
Peter Fein || 773-575-0694 || pfein at pobox.com
http://www.pobox.com/~pfein/ || PGP: 0xCCF6AE6B
irc: pfein at freenode.net || jabber: peter.fein at gmail.com