[pylucene-dev] Re: lucene.JavaError: java.lang.OutOfMemoryError: Java heap space

Brian Merrell brian at merrells.org
Tue Jan 8 18:26:00 PST 2008


Andi,

Thanks for the quick reply.  I haven't used Java in years, so my apologies if
I am not able to provide useful debugging info without some guidance.

Memory does seem to be running low when it crashes.  According to top,
python is using almost all of the 4GB when it bails.

I don't know what Java VM I am using.  How do I determine this?

I will try calling gc.collect() periodically and running optimize() to see if
that helps.  Any suggestions on how to debug this with _dumpRefs?
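
Here is a rough sketch of what I plan to try between batches of documents
(just a sketch; I am assuming the module-level lucene._dumpRefs() you
mentioned is the right call, and "writer" below is my IndexWriter):

import gc
import lucene

def checkpoint(writer, batch_number):
    # Drop any Python-side wrappers I am no longer holding.
    gc.collect()
    # Merge segments; this should also flush anything still buffered.
    writer.optimize()
    # Assumption: _dumpRefs() is the JCC helper you referred to; I will just
    # print whatever it returns and watch whether it keeps growing.
    print 'after batch %d:' % batch_number, lucene._dumpRefs()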

Thanks for your help,

-brian

P.S.  My filter is implemented in Python.  In fact, here is the code:

from lucene import (PythonTokenFilter, PythonAnalyzer, Token,
                    StandardTokenizer, StandardFilter, LowerCaseFilter)

class BrianFilter(PythonTokenFilter):
    def __init__(self, tokenStream):
        super(BrianFilter, self).__init__(tokenStream)
        self.input = tokenStream
        self.return_bigram = False
        self.previous = None
        self.newtoken = None

    def next(self):
        if self.return_bigram:
            # Emit the bigram built from the previous and current tokens.
            bigram = self.previous.termText() + "|" + self.newtoken.termText()
            self.return_bigram = False
            return Token(bigram, 0, len(bigram))
        else:
            # Pass the underlying token through and remember it so the next
            # call can emit the bigram.
            self.previous = self.newtoken
            self.newtoken = self.input.next()
            if self.previous and self.newtoken:
                self.return_bigram = True
            return self.newtoken

class BrianAnalyzer(PythonAnalyzer):
    def tokenStream(self, fieldName, reader):
        return BrianFilter(
            LowerCaseFilter(StandardFilter(StandardTokenizer(reader))))

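For reference, this is roughly how I wire the analyzer into the indexing
side (simplified; the index path, field name, and document text below are
placeholders for what I actually use):

import lucene
lucene.initVM(lucene.CLASSPATH, maxheap='2048m')

store = lucene.FSDirectory.getDirectory('/tmp/bigram-index', True)
writer = lucene.IndexWriter(store, BrianAnalyzer(), True)
writer.setMergeFactor(10)   # I have tried values from 10 up to 10,000

doc = lucene.Document()
doc.add(lucene.Field('body', 'some text to index',
                     lucene.Field.Store.NO, lucene.Field.Index.TOKENIZED))
writer.addDocument(doc)

writer.optimize()
writer.close()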

On 1/8/08, Brian Merrell <brian at merrells.org> wrote:
>
> I get an OutOfMemoryError: Java heap space after indexing less than 40,000
> documents.  Here are the details:
> PyLucene-2.2.0-2 JCC
> Ubuntu 7.10 64bit running on 4GB Core 2 Duo
> Python 2.5.1
>
> I am starting Lucene with the following:
> lucene.initVM(lucene.CLASSPATH, maxheap='2048m')
> MergeFactor (I've tried everything from 10 to 10,000)
> MaxMergeDocs and MaxBufferedDocs are at their defaults
>
> I believe the problem somehow stems from a filter I've written that turns
> tokens into bigrams (for each input token the filter emits two tokens: the
> original token and a new token created by concatenating the text of the
> current and previous tokens).  These bigrams add a lot of unique tokens, but
> I didn't think that would be a problem (aren't they all flushed out to disk?).
>
> Any ideas or suggestions would be greatly appreciated.
>
> -brian
>