Re: [pylucene-dev] HighFreqTerms from org.apache.lucene.misc
Andi Vajda
vajda at osafoundation.org
Tue Apr 1 10:35:58 PDT 2008
On Tue, 1 Apr 2008, Dirk Rothe wrote:
> On Wed, 26 Mar 2008 16:13:12 +0100, Andi Vajda <vajda at osafoundation.org>
> wrote:
>
>>
>> On Mar 26, 2008, at 2:16, "Dirk Rothe" <d.rothe at semantics.de> wrote:
>>
>>> I cannot find the HighFreqTerms Class from [1] in the flattened lucene
>>> Namespace. Any obvious reasons why?
>>
>> Probably because it's in a contrib jar file not currently on the list of
>> jar files in the PyLucene build. Adding the jar file to the list in
>> Makefile and rebuilding PyLucene should be enough to resolve the issue.
>>
>> Andi..
>
> Ok, but after inspecting the java code, this was pretty trivial to implement in
> Python. Just out of curiosity: do you think the java version would be
> (significantly) faster? I'm not sure I understand the performance
> implications of the jcc bridge.
I don't know. How about measuring it?
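A minimal sketch of such a measurement, using only the standard timeit
module. The names time_call and python_side_work are hypothetical;
python_side_work is just a stand-in, to be replaced with a call to the Python
or the java version of the term loop against a real index.

```python
import timeit


def time_call(func, repeat=5, number=10):
    """Return the best observed time per call of func(), in seconds."""
    timer = timeit.Timer(func)
    # Take the minimum over several runs to reduce noise from other processes.
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number


def python_side_work():
    # Hypothetical stand-in for the work being measured: sort some numbers
    # and keep the ten largest. Swap in the real function here.
    return sorted(range(1000), reverse=True)[:10]


if __name__ == "__main__":
    print("%.6f s per call" % time_call(python_side_work))
```

Timing both variants on the same index with the same arguments gives a
direct comparison; the absolute numbers will depend on the index size.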
The jcc bridge involves converting some literals from java to python (such
as strings), releasing the GIL (global interpreter lock) when leaving python
and reacquiring it when returning.
The jcc bridge also keeps track of the java objects returned to python so
that they don't get garbage collected until python no longer uses them. This
is implemented via a C++ multimap.
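As a rough analogy only (this is not jcc's actual code), the bookkeeping can
be pictured as a table of counted references. The class RefRegistry, its
method names and the key function below are all hypothetical; a dict of
lists stands in for the C++ multimap, since several objects may share a key.

```python
class RefRegistry(object):
    """Toy table of counted references, keyed by a hash of object identity."""

    def __init__(self):
        # key -> list of [obj, count]; one list per key emulates a multimap.
        self._refs = {}

    def _key(self, obj):
        # Hypothetical key function; collisions are expected and handled.
        return id(obj) % 1024

    def acquire(self, obj):
        """Record one more use of obj and return its new count."""
        bucket = self._refs.setdefault(self._key(obj), [])
        for entry in bucket:
            if entry[0] is obj:
                entry[1] += 1
                return entry[1]
        bucket.append([obj, 1])
        return 1

    def release(self, obj):
        """Drop one use of obj; forget it entirely when the count hits zero."""
        key = self._key(obj)
        bucket = self._refs.get(key, [])
        for i, entry in enumerate(bucket):
            if entry[0] is obj:
                entry[1] -= 1
                if entry[1] == 0:
                    del bucket[i]
                    if not bucket:
                        del self._refs[key]
                return entry[1]
        raise KeyError("object was never acquired")

    def is_tracked(self, obj):
        """Return True while at least one use of obj is recorded."""
        bucket = self._refs.get(self._key(obj), [])
        return any(entry[0] is obj for entry in bucket)
```

While an object is tracked, the table holds a reference to it, so it cannot
be collected; once the last use is released, the entry goes away.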
It's been shown before that using a python HitCollector (used in a very
tight loop by the Lucene core) is significantly slower than using the java
equivalent [1].
Andi..
[1] http://www.google.com/search?q=python+hitcollector&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a
>
>
> from lucene import IndexReader
>
> def getHighFreqTerms(indexPath, fieldName, topN):
>     ''' get top n terms from field given by fieldName '''
>     reader = IndexReader.open(indexPath)
>     terms = reader.terms()
>     result = []
>     # terms.next() advances the enumeration, so call it once per iteration
>     while terms.next():
>         if terms.term().field() == fieldName:
>             result.append((terms.docFreq(), unicode(terms.term())))
>     reader.close()
>
>     # highest document frequency first
>     result.sort(reverse=True)
>     return result[:topN]
>
>
>
> _______________________________________________
> pylucene-dev mailing list
> pylucene-dev at osafoundation.org
> http://lists.osafoundation.org/mailman/listinfo/pylucene-dev