[pylucene-dev] Lucene 2.3.0 JCC - Search Optimization Solutions

Andi Vajda vajda at osafoundation.org
Sat Feb 23 15:55:02 PST 2008


On Sat, 23 Feb 2008, João Rodrigues wrote:

> I've finally got round to setting up Lucene 2.3.0 on my two production boxes
> (Ubuntu 7.10 and Windows XP), after quite some trouble with the JCC
> compilation. Now I have my application up and running and... it's damn
> slow :(
>
> I'll be brief: I have a 6.6GB index, with more than 5,000,000 biomedical
> abstracts indexed. Each document has two fields: an integer, which I want
> to retrieve upon search (the ID of the document, sort of), and a stored,
> tokenized string of 80 words, which is what gets searched. When I enter a
> query (say, foo bar), the application first builds a sort of "boolean
> query" of the form 'foo' AND 'bar'. Then it parses it and spits out the
> results.
>
> Problem is, unlike most of the posts I've read, I don't want the first 10 or
> 100 results. I want the first 10,000, or even all of them. I've read that a
> HitCollector is meant for this task, but my first search on Google turned up
> an emphatic "HitCollector is too slow on PyLucene", so I more or less ruled
> out that option. As it is right now, it takes minutes to get the results I
> need. I'll post the code on pastebin and link it for those who feel in a
> good mood to read n00b's code and help (see below). I've tracked down the
> problem to the doc.get("PMID") call in the Searcher function.
>
> My question is: how can I make my search faster? My index wasn't optimized
> because it was huge and it was built with GCC. By now, it is probably
> optimized (I left an optimizer running last night), so that is taken care
> of. I've considered threading as well, since I'll perform three different
> searches per round. Thing is, I'm pretty green when it comes to
> programming (I'm a biologist) and I've never really understood how
> threading works. If someone can point me to the right tutorial or
> documentation, I'd be glad to hack it up myself. Another option I've
> been given was to use an implementation of Lucene written in either C# or
> C++. However, Lucene.net isn't up to date, and neither is CLucene.
>
> So, if you think you can give me a tip on how to make my script run faster,
> I'd be very grateful. It would be a shame if my project failed because of
> this technical handicap :(
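
On the threading question above, a minimal sketch of running the three
searches of one round in parallel threads might look like the following. It
is plain Python, not PyLucene code: run_search, search_in_threads and the toy
index are hypothetical stand-ins for whatever actually performs the Lucene
search.

```python
import threading


def run_search(query, index):
    """Hypothetical stand-in for one Lucene search.

    Returns the sorted IDs of documents whose text contains every query term.
    """
    terms = query.lower().split()
    return sorted(
        doc_id
        for doc_id, text in index.items()
        if all(term in text.lower().split() for term in terms)
    )


def search_in_threads(queries, index):
    """Run one search per thread and gather the results by query string."""
    results = {}
    lock = threading.Lock()

    def worker(query):
        hits = run_search(query, index)
        with lock:  # guard the shared dict while storing this thread's hits
            results[query] = hits

    threads = [threading.Thread(target=worker, args=(q,)) for q in queries]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every search has finished
    return results


# Toy data, invented for illustration only.
toy_index = {
    101: "foo bar baz",
    102: "foo qux",
    103: "bar foo",
}
round_results = search_in_threads(["foo bar", "qux", "baz foo"], toy_index)
```

With PyLucene under JCC, a thread started from Python probably has to be
attached to the Java VM before it calls into Lucene (recalled as
lucene.getVMEnv().attachCurrentThread(); treat that as an assumption and
check the PyLucene README). Whether threads help at all depends on where the
time is really spent.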

Have you considered asking the Lucene user list, java-user at lucene.apache.org?
After all, PyLucene is just a Python/C++ wrapper around Java Lucene.
http://lucene.apache.org/java/docs/mailinglists.html#Java%20User%20List

Andi..

>
> LINKS:
>
> http://pastebin.com/m6c384ede -> Main Code
> http://pastebin.com/m3484ebfc -> Searcher Functions
>
> Best regards to you all,
>
> João Rodrigues
>
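
Since the slowdown was traced to the per-hit doc.get("PMID") lookup, one
common approach is to gather only document numbers during the search and
resolve them against an array of IDs that is loaded once. Java Lucene 2.3 has
FieldCache for this kind of per-document array, provided the ID field is
indexed as a single untokenized term; whether that holds for this index is
not known from the post. The sketch below is plain Python with invented data,
not PyLucene code, and load_id_array and collect_hits are hypothetical names.

```python
def load_id_array(stored_ids):
    """Hypothetical stand-in for a one-time load of every document's ID.

    Position i of the returned list holds the ID of document number i.
    """
    return list(stored_ids)


def collect_hits(matching_doc_numbers, id_array, limit=None):
    """Map matching document numbers to IDs with plain list lookups."""
    ids = []
    for doc_number in matching_doc_numbers:
        if limit is not None and len(ids) >= limit:
            break  # stop once the requested number of results is reached
        ids.append(id_array[doc_number])
    return ids


# Toy data, invented for illustration: five documents and their IDs.
id_array = load_id_array([9001, 9002, 9003, 9004, 9005])
# Pretend the search matched documents 0, 2 and 4.
all_ids = collect_hits([0, 2, 4], id_array)
first_two = collect_hits([0, 2, 4], id_array, limit=2)
```

The one-time load costs memory proportional to the number of documents, in
exchange for not reading a stored document for every hit.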

