[pylucene-dev] memory leak status

Andi Vajda vajda at osafoundation.org
Wed Jan 16 22:26:25 PST 2008

On Thu, 17 Jan 2008, Brian Merrell wrote:

> The docs are pretty standard English language documents.  I can try to find
> some decent spam (I hesitate to give out the actual documents for privacy
> reasons) but I imagine anything you have would reproduce the problem.  I am
> using the basic MeetLucene Indexer.py with the following Analyzer:

It would be helpful if all I had to do to reproduce the problem was to 
type in a one-liner in a shell. With what you sent me, I still have to do 
work to figure out how to include your code into the Indexer.py program. 
It's not rocket science but since you've already done it, it's less work for 
you to send this via email than for me to reconstruct it.

Maybe my assumptions are wrong and it's more work for you too?

Now, that being said, looking at your code, it's quite possible that the 
leak is with the BrianFilter instances. You can verify this by checking the 
number of PythonTokenFilter instances left in the env after each document is 
indexed. I'm assuming, possibly wrongly, that BrianAnalyzer's tokenStream() 
method is called once for every document. If that's indeed the case, you are 
going to leak BrianFilter instances and it's going to be necessary to 
finalize() these instances manually.

  1. to verify how many PythonTokenFilter instances you have in env:
       print env._dumpRefs('class org.osafoundation.lucene.analysis.PythonTokenFilter', 0)
     where env is the result of the call to initVM() and can also be obtained
     by calling getVMEnv()

  2. to call finalize() on your BrianFilter instances, you need to first keep
     track of them. Then, after each document is indexed, call finalize() on
     the accumulated objects. Below is a modification of BrianAnalyzer to
     support this (assuming one single thread of execution per analyzer):

       class BrianAnalyzer(PythonAnalyzer):
           def __init__(self):
               super(BrianAnalyzer, self).__init__()
               # filters handed out since the last finalizeFilters() call
               self._filters = []
           def tokenStream(self, fieldName, reader):
               filter = BrianFilter(LowerCaseFilter(StandardFilter(StandardTokenizer(reader))))
               # keep track of the filter so it can be finalized later
               self._filters.append(filter)
               return filter
           def finalizeFilters(self):
               for filter in self._filters:
                   filter.finalize()
               del self._filters[:]

     after each document is indexed, call brianAnalyzer.finalizeFilters()
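
To illustrate the track-then-finalize pattern from step 2 without needing a
Lucene install, here is a minimal sketch that uses stand-ins only. StubFilter,
TrackingAnalyzer and index_documents are hypothetical names made up for this
example, not PyLucene classes, and the live counter merely plays the role of
the count that _dumpRefs() would report. It assumes one tokenStream() call per
field and a single thread per analyzer.

```python
# Stand-ins only: nothing here imports or calls PyLucene. StubFilter and
# TrackingAnalyzer are hypothetical and only mimic the pattern described above.

class StubFilter(object):
    """Pretends to be a filter holding a reference that must be released."""
    live = 0  # number of instances not yet finalized

    def __init__(self):
        self.finalized = False
        StubFilter.live += 1

    def finalize(self):
        # Release the reference once, even if finalize() is called again.
        if not self.finalized:
            self.finalized = True
            StubFilter.live -= 1


class TrackingAnalyzer(object):
    def __init__(self):
        self._filters = []  # filters handed out since the last cleanup

    def tokenStream(self, fieldName, reader):
        f = StubFilter()
        self._filters.append(f)  # remember it so it can be finalized later
        return f

    def finalizeFilters(self):
        for f in self._filters:
            f.finalize()
        del self._filters[:]


def index_documents(analyzer, documents):
    """Return the live-filter count observed after each document."""
    counts = []
    for doc in documents:
        for fieldName, text in doc.items():
            analyzer.tokenStream(fieldName, text)
        analyzer.finalizeFilters()  # cleanup after each document
        counts.append(StubFilter.live)
    return counts


docs = [{"contents": "first", "title": "a"}, {"contents": "second"}]
print(index_documents(TrackingAnalyzer(), docs))
```

If the finalizeFilters() call in the loop is removed, the counter grows with
every field analyzed, which is the kind of growth the _dumpRefs() check in
step 1 is meant to reveal.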

