[pylucene-dev] TermDocs.read() method

Martin Bachwerk bachwerk at i5.informatik.rwth-aachen.de
Tue Sep 9 09:52:29 PDT 2008


Oh, sure.

I've been iterating over all terms in one document and counting the
total number of occurrences of each term across all documents:

tf = ireader.getTermFreqVector(docID, 'content')

1. no leak, but 5-10x slower
for word, freq in zip(tf.getTerms(), tf.getTermFrequencies()):
  totalHits = 0  # reset per term, as in variant 2
  docs = ireader.termDocs(Term('content', word))
  while docs.next():
    totalHits += docs.freq()

2. quick, but fills up memory
for word, freq in zip(tf.getTerms(), tf.getTermFrequencies()):
  td = ireader.termDocs(Term('content', word))
  ptd = PythonTermDocs(td)
  values = ptd.read(docNum)
  # second half of the flat list holds the frequencies
  totalHits = sum(values[len(values) // 2:])
  ptd.close()
  td.close()
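
For reference, a minimal sketch of the batch-and-split logic behind
variant 2, written in plain Python so that it runs without an index.
The helper names and the read_batch callable are hypothetical stand-ins;
the only thing taken from this thread is that read() returns one flat
list with the document numbers in the first half and the frequencies in
the second. That read() returns an empty list once the enumeration is
exhausted is an assumption, not something confirmed here.

```python
def split_docs_freqs(values):
    """Split a flat [doc ids..., freqs...] sequence into two lists."""
    half = len(values) // 2
    return list(values[:half]), list(values[half:])


def total_freq(read_batch, batch_size=64):
    """Sum term frequencies over successive batches.

    read_batch is a hypothetical callable standing in for
    PythonTermDocs(td).read. It is ASSUMED to return an empty
    sequence once there is nothing left to read.
    """
    total = 0
    while True:
        values = read_batch(batch_size)
        if len(values) == 0:
            break
        docs, freqs = split_docs_freqs(values)
        total += sum(freqs)
    return total


def make_fake_reader(docs, freqs):
    """Build a stand-in for read() over in-memory lists (illustration only)."""
    pos = [0]  # mutable cell so the closure can advance the cursor

    def read_batch(n):
        start = pos[0]
        end = min(start + n, len(docs))
        pos[0] = end
        # Same flat layout as described in the thread: docs first, then freqs.
        return docs[start:end] + freqs[start:end]

    return read_batch


reader = make_fake_reader([1, 4, 7, 9, 12], [2, 1, 3, 1, 5])
print(total_freq(reader, batch_size=2))
```

Reading in fixed-size batches keeps each returned list small, which may
or may not affect the memory growth described above; that part is
untested.
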

Hope this helps,

Martin
>
> On Sep 9, 2008, at 3:39, Martin Bachwerk 
> <bachwerk at i5.informatik.rwth-aachen.de> wrote:
>
>> Hey again,
>>
>> I'm honestly very poor with memory allocation and stuff, but when 
>> using this .read() method instead of an iteration over all termDocs 
>> with next() I get a huge memory leak.. it just goes up and up and up 
>> and never down.. I've tried using del on the values list, td.close() 
>> and running gc.collect() at times, but nothing seems to make any 
>> difference.
>
> Can you please send a small piece of code that I can run to reproduce 
> this leak?
>
> Thanks !
>
> Andi..
>
>>
>>
>> I'm running Python 2.4 atm and can't change to 2.5 yet for different 
>> reasons, so I would really appreciate some help here. I will test it 
>> on a 2.5 machine though, just to see if it's the same there or better.
>>
>> Thanks!
>> Martin
>>>
>>> On Mon, 8 Sep 2008, Martin Bachwerk wrote:
>>>
>>>> I've been trying to use the read() method on TermDocs as described for
>>>> PyLucene (with an int to specify the number of documents to read in).
>>>> However, I've been getting an error that suggests the call is actually
>>>> trying to run the Java API version of the method (with two arrays as
>>>> arguments and an integer n as return value). This actually works too,
>>>> but only as far as the integer; I can't find a way to fill the two
>>>> arrays. :(
>>>>
>>>> Error trace:
>>>> docs, freqs = td.read(10)
>>>> InvalidArgsError: (<type 'TermDocs'>, 'read', (10,))
>>>>
>>>> Could someone please help! I'm using PyLucene 2.3.1.
>>>
>>> The docs are out of date here, sorry.
>>>
>>> In the new PyLucene (the one built with JCC, the one you're 
>>> running), the docs should say that a PythonTermDocs instance should 
>>> be wrapped around the TermDocs instance as follows: (also see 
>>> SpecialsFilter.py sample)
>>>
>>>  values = PythonTermDocs(td).read(10)
>>>  docs = values[:len(values)/2]
>>>  freqs = values[len(values)/2:]
>>>
>>> Yes, this is quite ugly and I intend to change the way arrays are 
>>> handled in JCC before I release version 2.0 so that this kind of 
>>> kludge is no longer necessary.
>>>
>>> Andi..
>>>
>>>>
>>>> Thanks,
>>>>
>>>> Martin
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> pylucene-dev mailing list
>>>> pylucene-dev at osafoundation.org
>>>> http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
>>>>
>>>
>>>
>>
>
>



