[pylucene-dev] lucene.JavaError: java.lang.OutOfMemoryError: Java heap space

Andi Vajda vajda at osafoundation.org
Tue Jan 8 17:44:40 PST 2008


On Tue, 8 Jan 2008, Brian Merrell wrote:

> I get an OutOfMemoryError: Java heap space after indexing less than 40,000
> documents.  Here are the details.
> PyLucene-2.2.0-2 JCC
> Ubuntu 7.10 64bit running on 4GB Core 2 Duo
> Python 2.5.1
>
> I am starting Lucene with the following:
> lucene.initVM(lucene.CLASSPATH, maxheap='2048m')
> Mergefactor (I've tried everything from 10 - 10,000)
> MaxMergeDocs and MaxBufferedDocs are at their defaults
>
> I believe the problem somehow stems from a filter I've written that turns
> tokens into bigrams (each token returns two tokens, the original token and a
> new token created from concatenating the text of the current and previous
> token).  These bigrams add a lot of unique tokens but I didn't think that
> would be a problem (aren't they all flushed out to disk?)
>
> Any ideas or suggestions would be greatly appreciated.

Well, there is always the possibility of a memory leak (that is, a bug) in 
the JCC Python/Java interface. I don't know of one at the moment but it's a 
clear possibility.

Is your filter implemented in Python ?
If so, have you tried forcing Python garbage collection at times using 
gc.collect() ? Have you tried forcing GCs in Java with System.gc() ?
What are the Python memory stats when Java is running out of RAM ?
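
As a minimal sketch of the two Python-side checks above (forcing gc.collect() 
at intervals and recording memory stats), something like the following could 
wrap the indexing loop. The names index_with_periodic_gc, add_document and 
peak_rss_kb are hypothetical, and the resource module is assumed to be 
available (Unix only). The reported figure is for the whole process, so it 
includes the embedded JVM.

```python
import gc
import resource


def peak_rss_kb():
    # Peak resident set size of the whole process, embedded JVM included.
    # Units are kilobytes on Linux (bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def index_with_periodic_gc(docs, add_document, every=1000):
    """Call add_document(doc) for each doc, forcing a Python GC every
    `every` documents.

    Returns a list of (docs_indexed, unreachable_found, peak_rss) samples.
    """
    samples = []
    for count, doc in enumerate(docs, 1):
        add_document(doc)
        if count % every == 0:
            # gc.collect() returns the number of unreachable objects found.
            unreachable = gc.collect()
            samples.append((count, unreachable, peak_rss_kb()))
    return samples
```

With PyLucene, add_document would presumably be the writer's addDocument 
method. A peak RSS that keeps climbing while gc.collect() finds little to 
free would suggest the memory is not being held by Python garbage.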

There is a little-known method on the JCCEnv object returned by initVM() 
called _dumpRefs(). That method returns a dict filled with all the Java 
objects thought to be currently held by Python. There is a table, a C++ 
multimap, that holds all the Java objects returned from the JVM to Python. 
Until Python releases all references to these objects, this table prevents 
these Java objects from being GC'ed. If, for some reason, objects were not 
released from the table as expected, Java couldn't GC these objects.
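
One way to use this, sketched under assumptions: take a snapshot of the 
table before and after indexing a batch and compare the two. The helpers 
new_refs and report_growth below are hypothetical and only assume that each 
snapshot is a dict. What the keys and values of the _dumpRefs() result 
actually represent is not stated here, so the comparison looks only at sizes 
and at which keys are new.

```python
def new_refs(before, after):
    """Return the keys present in `after` but not in `before`."""
    return [key for key in after if key not in before]


def report_growth(before, after):
    """Summarize how the reference table changed between two snapshots."""
    return {
        "before": len(before),
        "after": len(after),
        "added": len(new_refs(before, after)),
    }


# Hypothetical usage with PyLucene (not run here):
#   env = lucene.initVM(lucene.CLASSPATH, maxheap='2048m')
#   before = dict(env._dumpRefs())
#   ... index a batch of documents ...
#   after = dict(env._dumpRefs())
#   print(report_growth(before, after))
```

If the table keeps growing from batch to batch even after gc.collect(), that 
would point at references not being released on the Python side.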

If your filter is not implemented in Python, could you try the same in Java 
with regular Java Lucene ?

Also, whose Java VM are you running ? You say you're on 64-bit Ubuntu but 
not which VM. Could it be that you're running a 64-bit libgcj, if there is 
such a thing ? If so, could you try Sun's Java 6 ?

Do you have the same problems on a 32-bit system ?

Maybe it's a Lucene issue ? Have you tried optimizing the index from time to 
time ?

Andi..

