[pylucene-dev] TermDocs.read() method
Martin Bachwerk
bachwerk at i5.informatik.rwth-aachen.de
Tue Sep 9 10:03:48 PDT 2008
The index is about 500 MB (328419 documents).
maxheap is set to 512m.
Martin
>
>
> On Sep 9, 2008, at 9:52, Martin Bachwerk
> <bachwerk at i5.informatik.rwth-aachen.de> wrote:
>
>> Oh, sure.
>>
>> I've been iterating over all terms in one document.. and counting the
>> total number of occurrences of this term in all documents:
>>
>> tf = ireader.getTermFreqVector(docID, 'content')
>>
>> 1. no leak but 5-10x slower
>> for word, freq in zip(tf.getTerms(), tf.getTermFrequencies()):
>>     docs = ireader.termDocs(Term('content', word))
>>     while docs.next():
>>         totalHits += docs.freq()
>>
>> 2. quick, but fills up memory
>> for word, freq in zip(tf.getTerms(), tf.getTermFrequencies()):
>>     td = ireader.termDocs(Term('content', word))
>>     ptd = PythonTermDocs(td)
>>     values = ptd.read(docNum)
>>     totalHits = sum(values[len(values)/2:])
>>     ptd.close()
>>     td.close()
>
> How big is your index ?
> How much java memory did you give your process (via initVM) ?
>
> Andi..
>
>>
>>
>> Hope this helps,
>>
>> Martin
>>>
>>> On Sep 9, 2008, at 3:39, Martin Bachwerk
>>> <bachwerk at i5.informatik.rwth-aachen.de> wrote:
>>>
>>>> Hey again,
>>>>
>>>> I'm honestly very poor with memory allocation and stuff, but when
>>>> using this .read() method instead of an iteration over all termDocs
>>>> with next() I get a huge memory leak.. it just goes up and up and
>>>> up and never down.. I've tried using del on the values list,
>>>> td.close() and running gc.collect() at times, but nothing seems to
>>>> make any difference.
>>>
>>> Can you please send a small piece of code that I can run to
>>> reproduce this leak ?
>>>
>>> Thanks !
>>>
>>> Andi..
>>>
>>>>
>>>>
>>>> I'm running Python 2.4 atm and can't change to 2.5 yet for
>>>> different reasons, so I would really appreciate some help here. I
>>>> will test it on a 2.5 machine though, just to see if it's the same
>>>> there or better.
>>>>
>>>> Thanks!
>>>> Martin
>>>>>
>>>>> On Mon, 8 Sep 2008, Martin Bachwerk wrote:
>>>>>
>>>>>> I've been trying to use the read() method on TermDocs as described for
>>>>>> PyLucene (with an int to specify the number of documents to read in).
>>>>>> However, I've been getting an error that suggests the call is actually
>>>>>> trying to run the Java API version of the method (with 2 arrays as
>>>>>> arguments and an integer n as return value). This actually works too,
>>>>>> but only as far as the integer; I can't find a way to fill the two
>>>>>> arrays.. :(
>>>>>>
>>>>>> Error trace:
>>>>>> docs, freqs = td.read(10)
>>>>>> InvalidArgsError: (<type 'TermDocs'>, 'read', (10,))
>>>>>>
>>>>>> Could someone please help! I'm using PyLucene 2.3.1.
>>>>>
>>>>> The docs are out of date here, sorry.
>>>>>
>>>>> In the new PyLucene (the one built with JCC, the one you're
>>>>> running), the docs should say that a PythonTermDocs instance
>>>>> should be wrapped around the TermDocs instance as follows: (also
>>>>> see SpecialsFilter.py sample)
>>>>>
>>>>> values = PythonTermDocs(td).read(10)
>>>>> docs = values[:len(values)/2]
>>>>> freqs = values[len(values)/2:]
>>>>>
>>>>> Yes, this is quite ugly and I intend to change the way arrays are
>>>>> handled in JCC before I release version 2.0 so that this kind of
>>>>> kludge is no longer necessary.
>>>>>
>>>>> Andi..
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Martin
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> pylucene-dev mailing list
>>>>>> pylucene-dev at osafoundation.org
>>>>>> http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>
>
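
For reference, the slicing described in the thread above can be sketched in
plain Python. This is only an illustration and assumes, as Andi's message
states, that PythonTermDocs(td).read(n) returns one flat list with the
document ids in the first half and the matching frequencies in the second
half. The helper name split_docs_freqs and the sample values are made up for
this sketch and are not part of PyLucene.

```python
def split_docs_freqs(values):
    """Split a flat [doc ids..., freqs...] list into (docs, freqs).

    Hypothetical helper, not a PyLucene API. Assumes the first half of
    `values` holds document ids and the second half the frequencies.
    """
    half = len(values) // 2  # floor division, same result on Python 2 and 3
    return list(values[:half]), list(values[half:])


# Made-up sample standing in for the result of PythonTermDocs(td).read(3)
values = [3, 7, 12, 2, 1, 5]
docs, freqs = split_docs_freqs(values)
total_hits = sum(freqs)  # total occurrences of the term across these docs
```

Using // instead of / keeps the index an integer on both Python 2 and
Python 3, and computing the midpoint once avoids misplacing a parenthesis
inside the slice.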