Re: [pylucene-dev] HighFreqTerms from org.apache.lucene.misc

Andi Vajda vajda at osafoundation.org
Tue Apr 1 10:35:58 PDT 2008


On Tue, 1 Apr 2008, Dirk Rothe wrote:

> On Wed, 26 Mar 2008 16:13:12 +0100, Andi Vajda <vajda at osafoundation.org> 
> wrote:
>
>> 
>> On Mar 26, 2008, at 2:16, "Dirk Rothe" <d.rothe at semantics.de> wrote:
>> 
>>> I cannot find the HighFreqTerms Class from [1] in the flattened lucene 
>>> Namespace. Any obvious reasons why?
>> 
>> Probably because it's in a contrib jar file not currently on the list of 
>> jar files in the PyLucene build. Adding the jar file to the list in 
>> Makefile and rebuilding PyLucene should be enough to resolve the issue.
>> 
>> Andi..
>
> Ok, but after inspecting the java code, this was pretty trivial to implement in 
> Python. Just out of curiosity: do you think the java version would be 
> (significantly) faster? I'm not sure I understand the performance 
> implications of the jcc bridge.

I don't know. How about measuring it?
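
One way to measure it, as a rough sketch only: time each candidate with the 
standard timeit module and compare the best runs. The helper names best_time 
and compare are made up for this illustration, and the workloads in the 
usage comment are placeholders, not actual PyLucene calls.

```python
import timeit


def best_time(fn, repeat=3, number=1):
    """Return the fastest total time (seconds) over `repeat` runs,
    where each run calls fn() `number` times."""
    timer = timeit.Timer(fn)
    return min(timer.repeat(repeat=repeat, number=number))


def compare(candidates, repeat=3, number=1):
    """Time each named callable; return (name, seconds) pairs, fastest first."""
    timings = [(name, best_time(fn, repeat, number))
               for name, fn in candidates.items()]
    return sorted(timings, key=lambda pair: pair[1])


# Hypothetical usage: wrap each implementation in a zero-argument callable,
# e.g. lambda: getHighFreqTerms(indexPath, fieldName, 10), and pass both
# the Python version and the Java-backed version to compare().
```

Taking the minimum of several runs reduces noise from other activity on the 
machine; it says nothing about where the time is spent.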

The jcc bridge involves converting some literals from java to python (such 
as strings), releasing the GIL (global interpreter lock) when leaving python 
and reacquiring it when returning.

The jcc bridge also keeps track of the java objects returned to python so 
that they don't get garbage collected until python no longer uses them. This 
is implemented via a C++ multimap.
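
To give a feel for that bookkeeping, here is a toy model in Python. It is 
not jcc's code: the real thing is C++, and the class RefTracker, its methods 
and the choice of key are all assumptions made for illustration. The idea is 
only that an object stays pinned while at least one python-side user remains.

```python
class RefTracker(object):
    """Toy model of keeping foreign objects alive while still in use."""

    def __init__(self):
        # key -> [object, use count]; a plain dict stands in for the multimap
        self._entries = {}

    def acquire(self, key, obj):
        """Register one more user of the object stored under key."""
        entry = self._entries.get(key)
        if entry is None:
            entry = [obj, 0]
            self._entries[key] = entry
        entry[1] += 1
        return entry[0]

    def release(self, key):
        """Drop one user; forget the object when nobody uses it anymore."""
        entry = self._entries[key]
        entry[1] -= 1
        if entry[1] == 0:
            del self._entries[key]

    def is_tracked(self, key):
        return key in self._entries
```

Every acquire and release is extra work per object crossing the bridge, 
which is one reason many small calls cost more than a few large ones.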

It's been shown before that using a python HitCollector (used in a very 
tight loop by the Lucene core) is significantly slower than using the java 
equivalent [1].

Andi..

[1] http://www.google.com/search?q=python+hitcollector&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a

>
>
> def getHighFreqTerms(indexPath,fieldName,topN):
>   ''' get top n terms from field given by fieldName '''
>   reader = IndexReader.open(indexPath)
>   terms = reader.terms()
>   result = []
>   while terms.next():
>       if terms.term().field() == fieldName:
>           result.append((terms.docFreq(),unicode(terms.term())))
>   terms.close()
>   reader.close()
>
>   result.sort(reverse=True)
>   return result[:topN]

