[pylucene-dev] HitCollector in PyLucene extremely slow

Andi Vajda vajda at osafoundation.org
Mon Sep 3 20:49:35 PDT 2007


On Mon, 3 Sep 2007, John Kleven wrote:

> When using a HitCollector via PyLucene (i.e.,
> overriding the collect() API) has anybody else noticed
> a massive slowdown?
>
> Even if I set my collector to return immediately in
> the collect(doc_id, score) callback, so not even
> touching any of the ids or scores, on a collection of
> 540,000 documents, I get an avg search time of 0.11
> seconds.
>
> If I go through the standard IndexSearcher.search,
> which still uses a hit collector on the Java backside
> (TopDocCollector.java if interested), I get avg search
> times of 0.0104 -- and it is actually doing something
> (namely, tracking the highest scored docs in a
> priority queue up to size 100).
>
> Is this order-of-magnitude slowdown something that I
> can expect just because of the Java->Python callback
> via the collect() function?
>
> To get this up to speed, is my only option to code my
> collector in Java, add in the hooks, then compile a
> custom (gulp) PyLucene version?

I don't know enough about what you're trying to do to have much of an opinion.
It seems to me, though, that you're comparing apples and oranges. In the
Python case you're using a Python HitCollector customization that returns
immediately, and in the Java case you're using a TopDocCollector that actually
does something.
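
One way to bring the two measurements closer together would be to have the
Python side do the same kind of work as the Java collector. Below is a minimal
sketch, assuming a plain Python class with the collect(doc_id, score) signature
mentioned in the question. The class name TopNCollector and its top() method
are hypothetical, and how the object is handed to PyLucene is not shown here.

```python
import heapq


class TopNCollector:
    """Hypothetical collector: keeps the n highest-scoring hits seen so far."""

    def __init__(self, n=100):
        self.n = n
        self._heap = []  # min-heap of (score, doc_id); lowest kept hit at index 0

    def collect(self, doc_id, score):
        entry = (score, doc_id)
        if len(self._heap) < self.n:
            # Still room: keep every hit.
            heapq.heappush(self._heap, entry)
        elif entry > self._heap[0]:
            # Full: replace the lowest kept hit if this one ranks higher.
            heapq.heapreplace(self._heap, entry)

    def top(self):
        # Best hits first.
        return sorted(self._heap, reverse=True)
```

Timing this collector against the no-op one would show how much of the search
time goes to the work done in collect() and how much to the call itself.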

If indeed it turns out that calling into Python from Java is the culprit, then
your best bet is what you're suggesting. I doubt it, though. The only possibly
expensive call apart from your Python code is the acquiring of the Python GIL
(Global Interpreter Lock). If there is no contention for the GIL, acquiring it
should be really fast.

The rest of the Java->Python boundary crossing is the marshalling of Java 
objects into Python ones, the call to your method itself and the reverse 
marshalling of the return value.
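
To see where the time goes, it may help to turn the totals into a per-call
figure and compare it with what a bare Python call costs. The sketch below is
only an illustration: the helper names are made up, and the loop measures a
pure-Python call, so it leaves out the GIL acquisition and the marshalling
described above.

```python
import time


def per_call_seconds(total_seconds, calls):
    """Average cost of one collect() call, given a total and a call count."""
    return total_seconds / calls


def time_pure_python_calls(collect, calls):
    """Time `calls` invocations of collect() made from Python itself.

    This is a baseline only: no Java->Python crossing is involved.
    """
    start = time.perf_counter()
    for doc_id in range(calls):
        collect(doc_id, 1.0)
    return time.perf_counter() - start


def noop_collect(doc_id, score):
    # Mirrors the collector in the question: returns immediately.
    return None
```

For example, per_call_seconds(0.11, 540000) comes to roughly 0.2 microseconds,
but only under the assumption that every one of the 540,000 documents matched.
The number of hits is not given in the question, and with fewer hits the
per-call cost would be higher. Comparing that figure with
time_pure_python_calls(noop_collect, n) / n for the real hit count n gives a
rough idea of how much the boundary crossing adds.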

Andi..

