[pylucene-dev] Lucene 2.3.0 JCC - Search Optimization Solutions

Andi Vajda vajda at osafoundation.org
Sat Feb 23 16:06:54 PST 2008


On Sat, 23 Feb 2008, João Rodrigues wrote:

> I've finally got round to setting up Lucene 2.3.0 on my two production boxes
> (Ubuntu 7.10 and Windows XP), after quite some trouble with the JCC
> compilation. Now I have my application up and running and... it's damn
> slow :(
>
> I'll be brief: I have a 6.6GB index, with more than 5,000,000 biomedical
> abstracts indexed. Each document has two fields: an integer, which I want
> to retrieve upon search (the ID of the document, sort of), and an 80-word,
> stored, tokenized string, which is what gets searched. So, I enter
> the query (say, foo bar), and the script first builds a sort of "boolean
> query" with a format such as: 'foo' AND 'bar'. Then it parses it and spits
> out the results.
> Problem is, unlike most of the posts I've read, I don't want the first 10 or
> 100 results. I want the first 10,000, or even all of them. I've read that a
> HitCollector is meant for this task, but my first search on Google got me an
> emphatic "HitCollector is too slow on PyLucene", so I kind of ruled out
> that option. As it is right now, it takes minutes to get the results I need.
> I'll post the code on pastebin and link it for those who feel in a good
> mood to read n00b's code and help (see below). I've tracked down the problem
> to the doc.get("PMID") call in the Searcher function.

Ah yes, I remember now. I haven't timed the 
crossing-the-barrier-back-into-Python time in jcc-PyLucene. If it's as 
expensive as before, as it likely is, then having it in a tight loop such as 
calling into a Python HitCollector is not going to perform well. The good 
news is that writing a Java HitCollector implementation and adding it to 
your build isn't too hard. Move the tight loop code into Java since that is 
where the action is.
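
As a rough illustration of what such a collector could look like, here is a 
minimal sketch. The class name DocIdCollector and its methods are 
hypothetical, and the small abstract HitCollector at the top is only a 
stand-in that mirrors the shape of Lucene's 
org.apache.lucene.search.HitCollector (a single collect(int doc, float score) 
method) so the sketch compiles on its own. In a real build you would extend 
the Lucene class instead.

```java
// Stand-in for org.apache.lucene.search.HitCollector, included only so this
// sketch compiles without the Lucene jar. In a real build, delete it and
// extend the Lucene class instead.
abstract class HitCollector {
    public abstract void collect(int doc, float score);
}

// Hypothetical collector: gathers the ids of all matching documents on the
// Java side, so the per-hit loop never crosses back into Python.
// Assumption: for JCC to wrap it, it would need to be a public class in its
// own source file that is added to the build.
class DocIdCollector extends HitCollector {
    private int[] docs;
    private int count = 0;

    DocIdCollector(int initialCapacity) {
        docs = new int[Math.max(1, initialCapacity)];
    }

    // Called once per matching document; scores are ignored in this sketch.
    public void collect(int doc, float score) {
        if (count == docs.length) {
            // Grow the backing array when it is full.
            docs = java.util.Arrays.copyOf(docs, docs.length * 2);
        }
        docs[count++] = doc;
    }

    // Number of hits collected so far.
    public int size() {
        return count;
    }

    // Returns a copy of the collected document ids, in collection order.
    public int[] getDocIds() {
        return java.util.Arrays.copyOf(docs, count);
    }
}
```

The idea is that Python would hand one instance to the searcher's 
search(query, collector) call and then fetch the whole array with a single 
getDocIds() call, instead of being called back once per hit. The ids are 
Lucene's internal document numbers, so looking up a stored field such as PMID 
for each one is still a separate cost.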

> My question is: how can I make my search faster? My index wasn't optimized
> because it was huge and it was built with GCC. By now, it is probably
> optimized (I left an optimizer running last night), so that is taken care
> of. I've considered threading as well, since I'll perform three different
> searches per "round". Thing is, I'm pretty green when it comes to
> programming (I'm a biologist) and I've never really understood how
> threading works. If someone can point me to the right tutorial or
> documentation, I'd be glad enough to hack it up myself. Another option
> suggested to me was to use an implementation of Lucene written in either C#
> or C++. However, Lucene.net isn't up to date, and neither is CLucene.
>
> So, if you think you can give me a tip on how to make my script run faster,
> I'd be more than grateful. It would be a shame if my project failed because
> of this technical handicap :(

That is where the java-user at lucene.apache.org mailing list can be helpful.
How you tune an index and queries to yield fast searches is all Lucene; 
there is nothing special about PyLucene there.

Andi..

