[pylucene-dev] HitCollector in PyLucene extremely slow
Andi Vajda
vajda at osafoundation.org
Mon Sep 3 20:49:35 PDT 2007
On Mon, 3 Sep 2007, John Kleven wrote:
> When using a HitCollector via PyLucene (i.e.,
> overriding the collect() API) has anybody else noticed
> a massive slowdown?
>
> Even if I set my collector to return immediately in
> the collect(doc_id, score) callback, so not even
> touching any of the ids or scores, on a collection of
> 540,000 documents, I get an avg search time of 0.11
> seconds.
>
> If I go through the standard IndexSearcher.search,
> which still uses a hit collector on the Java backside
> (TopDocCollector.java if interested), I get avg search
> times of 0.0104 -- and it is actually doing something
> (namely, tracking the highest scored docs in a
> priority queue up to size 100).
>
> Is this order-of-magnitude slowdown something that I
> can expect just because of the java->python callback
> via the collect() function?
>
> To get this up to speed, is my only option to code my
> collector in Java, add in the hooks, then compile a
> custom (gulp) PyLucene version?
I don't know enough about what you're trying to do to have much of an opinion.
It seems to me, though, that you're comparing apples and oranges. In the
Python case you're using a Python HitCollector customization that does
nothing, and in the Java case you're using a TopDocCollector that actually
does something.
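
For a like-for-like comparison, both collectors should be driven the same
way and do comparable work. What follows is only a rough pure-Python
sketch of that idea, not PyLucene code: NoOpCollector, TopNCollector and
feed() are hypothetical names, the hits are synthetic, and no Java
boundary is crossed, so it only shows how much of the time could be spent
inside collect() itself.

```python
import heapq
import time


class NoOpCollector:
    """Returns immediately, like the collector described in the question."""

    def collect(self, doc, score):
        pass


class TopNCollector:
    """Keeps the n highest-scoring docs in a bounded min-heap."""

    def __init__(self, n=100):
        self.n = n
        self.heap = []  # (score, doc) pairs; lowest kept score is heap[0]

    def collect(self, doc, score):
        if len(self.heap) < self.n:
            heapq.heappush(self.heap, (score, doc))
        elif score > self.heap[0][0]:
            # Evict the current lowest score in favor of the new hit.
            heapq.heapreplace(self.heap, (score, doc))

    def top_docs(self):
        # Highest score first.
        return sorted(self.heap, reverse=True)


def feed(collector, hits):
    """Stand-in for the search loop: one collect() call per hit."""
    start = time.perf_counter()
    for doc, score in hits:
        collector.collect(doc, score)
    return time.perf_counter() - start


# Synthetic hits (an assumption): made-up doc ids with deterministic scores.
hits = [(doc, (doc * 7919 % 10007) / 10007.0) for doc in range(540000)]

noop_seconds = feed(NoOpCollector(), hits)
top = TopNCollector(100)
topn_seconds = feed(top, hits)
print("no-op: %.4fs  top-100: %.4fs" % (noop_seconds, topn_seconds))
```

Timings vary by machine. The point is only that the two numbers come from
the same loop, so the difference between them is the work done in
collect() and nothing else.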
If indeed it turns out that calling into Python from Java is the culprit, then
your best bet is what you're suggesting.
I doubt it, though. Apart from your Python code, the only potentially
expensive call is acquiring the Python GIL (Global Interpreter Lock). If there
is no contention for the GIL, acquiring it should be very fast.
The rest of the Java->Python boundary crossing is the marshalling of Java
objects into Python ones, the call to your method itself and the reverse
marshalling of the return value.
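
To put the reported numbers in per-call terms, here is a small
back-of-the-envelope sketch. The two timings and the collection size come
from the question. The number of hits per query was not given, so the hit
counts below are assumptions, and per_call_overhead() is a hypothetical
helper.

```python
# Figures quoted in the question.
PYTHON_COLLECTOR_SECONDS = 0.11   # avg search time with the Python collect()
JAVA_COLLECTOR_SECONDS = 0.0104   # avg search time with TopDocCollector
COLLECTION_SIZE = 540000          # documents in the index


def per_call_overhead(hits_per_query):
    """Extra seconds per collect() call, assuming one call per hit."""
    extra = PYTHON_COLLECTOR_SECONDS - JAVA_COLLECTOR_SECONDS
    return extra / hits_per_query


# Assumed hit counts: every document matching is the upper bound on calls.
for assumed_hits in (COLLECTION_SIZE, COLLECTION_SIZE // 10):
    nanos = per_call_overhead(assumed_hits) * 1e9
    print("%d hits -> about %.0f ns extra per call" % (assumed_hits, nanos))
```

If every document matched, the extra cost would come to a fraction of a
microsecond per callback. With fewer hits per query, the per-call figure
grows in proportion.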
Andi..