[pylucene-dev] TermDocs.read() method
Martin Bachwerk
bachwerk at i5.informatik.rwth-aachen.de
Tue Sep 9 09:52:29 PDT 2008
Oh, sure.
I've been iterating over all terms in one document and counting the
total number of occurrences of each term across all documents:

    tf = ireader.getTermFreqVector(docID, 'content')

1. no leak, but 5-10x slower

    for word, freq in zip(tf.getTerms(), tf.getTermFrequencies()):
        docs = ireader.termDocs(Term('content', word))
        while docs.next():
            totalHits += docs.freq()

2. quick, but fills up memory

    for word, freq in zip(tf.getTerms(), tf.getTermFrequencies()):
        td = ireader.termDocs(Term('content', word))
        ptd = PythonTermDocs(td)
        values = ptd.read(docNum)
        totalHits = sum(values[len(values)/2:])
        ptd.close()
        td.close()

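
For reference, the flat list that read() hands back in the second variant
can be split into its two halves with a small helper. This is only a
sketch, assuming the layout Andi describes further down (document numbers
in the first half, frequencies in the second). The name split_docs_freqs
is made up for illustration and is not part of PyLucene, and the sample
list is invented data standing in for a real read() result.

```python
def split_docs_freqs(values):
    """Split a flat [docs..., freqs...] list into (docs, freqs).

    Assumption: the first half holds document numbers and the second
    half holds the matching frequencies, as described in this thread.
    """
    half = len(values) // 2  # floor division, same result on Python 2 and 3
    return values[:half], values[half:]


# Invented sample standing in for what PythonTermDocs(td).read(n) returns.
sample = [3, 7, 9, 1, 2, 4]
docs, freqs = split_docs_freqs(sample)
total_hits = sum(freqs)  # total occurrences of the term in these documents
```

Summing only the second half is the same thing the one-liner in variant 2
does; the helper just keeps the document numbers around as well.
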
Hope this helps,
Martin
>
> On Sep 9, 2008, at 3:39, Martin Bachwerk
> <bachwerk at i5.informatik.rwth-aachen.de> wrote:
>
>> Hey again,
>>
>> I'm honestly very poor with memory allocation and stuff, but when
>> using this .read() method instead of an iteration over all termDocs
>> with next() I get a huge memory leak.. it just goes up and up and up
>> and never down.. I've tried using del on the values list, td.close()
>> and running gc.collect() at times, but nothing seems to make any
>> difference.
>
> Can you please send a small piece of code that I can run to reproduce
> this leak ?
>
> Thanks !
>
> Andi..
>
>>
>>
>> I'm running Python 2.4 atm and can't change to 2.5 yet for different
>> reasons, so I would really appreciate some help here. I will test it
>> on a 2.5 machine though, just to see if it's the same there or better.
>>
>> Thanks!
>> Martin
>>>
>>> On Mon, 8 Sep 2008, Martin Bachwerk wrote:
>>>
>>>> I've been trying to use the read() method on TermDocs as described for
>>>> PyLucene (with an int to specify the number of documents to read in).
>>>> However, I've been getting an error that suggests the call is
>>>> actually trying to run the Java API version of the method (with
>>>> 2 arrays as arguments and an integer n as return value). This
>>>> actually works too, but only as far as the integer; I can't find a
>>>> way to fill the two arrays.. :(
>>>>
>>>> Error trace:
>>>> docs, freqs = td.read(10)
>>>> InvalidArgsError: (<type 'TermDocs'>, 'read', (10,))
>>>>
>>>> Could someone please help! I'm using PyLucene 2.3.1.
>>>
>>> The docs are out of date here, sorry.
>>>
>>> In the new PyLucene (the one built with JCC, the one you're
>>> running), the docs should say that a PythonTermDocs instance should
>>> be wrapped around the TermDocs instance as follows: (also see
>>> SpecialsFilter.py sample)
>>>
>>> values = PythonTermDocs(td).read(10)
>>> docs=values[:len(values)/2]
>>> freqs=values[len(values)/2:]
>>>
>>> Yes, this is quite ugly and I intend to change the way arrays are
>>> handled in JCC before I release version 2.0 so that this kind of
>>> kludge is no longer necessary.
>>>
>>> Andi..
>>>
>>>>
>>>> Thanks,
>>>>
>>>> Martin
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> pylucene-dev mailing list
>>>> pylucene-dev at osafoundation.org
>>>> http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
>>>>