[pylucene-dev] Large index files: Sort leads to "GC Warning: Repeated allocation of very large block"

Marc Weeber marc at weeber.net
Tue Dec 18 14:34:20 PST 2007


Hi all,

I am using the following setup:
- Debian etch linux
- PyLucene GCC, latest from the GCC trunk
- gcc 4.2.1 with -DLARGE_CONFIG added to the source
- a large index: 17 GB, 50M documents

In this index, I want to look for the co-occurrence of two words. For
this, I use a BooleanQuery:

import PyLucene

# Both terms must match, so only documents containing both are returned.
q = PyLucene.BooleanQuery()
q.add(PyLucene.TermQuery(PyLucene.Term('profile', 'umls/C0086418')),
      PyLucene.BooleanClause.Occur.MUST)
q.add(PyLucene.TermQuery(PyLucene.Term('profile', 'umls/C0003062')),
      PyLucene.BooleanClause.Occur.MUST)

In this case, the two terms co-occur in about 30,000 documents.
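
To make the intent concrete: a query with two MUST clauses amounts to
intersecting the document lists of the two terms. The sketch below is a
toy illustration in plain Python, not PyLucene code; the function name
and the document ids are hypothetical.

```python
# Toy illustration only: hypothetical posting lists, not the PyLucene API.
def cooccurrence_count(postings, term_a, term_b):
    """Count documents that contain both terms (AND of two term queries)."""
    docs_a = postings.get(term_a, set())
    docs_b = postings.get(term_b, set())
    return len(docs_a & docs_b)

# Hypothetical document ids for the two 'profile' terms.
postings = {
    'umls/C0086418': {1, 4, 7, 9},
    'umls/C0003062': {2, 4, 9, 12},
}
count = cooccurrence_count(postings, 'umls/C0086418', 'umls/C0003062')
```

A term that is missing from the index simply yields an empty set, so the
count is 0.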

This all goes fine for a plain search, which uses about 120 MB of
memory. However, if I sort on another field using PyLucene.Sort('date',
False), I get the warning "GC Warning: Repeated allocation of very large
block", and the process uses about 500 MB of memory.

Interestingly, if I use a query term that does not occur in the index
(so the co-occurrence is 0), the sorted search still uses 500 MB of
memory. Also, before I compiled with -DLARGE_CONFIG, memory use was
lower, but the warning was still there.
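
One possible reading of this, stated as an assumption rather than a
diagnosis: if sorting on a field builds a cache with one entry per
document in the whole index, its size would depend on the 50M documents
and not on the number of hits. The back-of-the-envelope sketch below
only does that arithmetic; the entry widths of 4 and 8 bytes are guesses,
and the function name is hypothetical.

```python
def sort_cache_bytes(num_docs, bytes_per_entry):
    """Rough size of a per-document sort array: one entry for every document."""
    return num_docs * bytes_per_entry

NUM_DOCS = 50 * 1000 * 1000  # index size mentioned above

# Assumed entry widths (4 or 8 bytes per document); the real width is unknown.
for width in (4, 8):
    megabytes = sort_cache_bytes(NUM_DOCS, width) / 1e6
    print("%d bytes/entry: about %.0f MB" % (width, megabytes))
```

Under these assumptions a single array of that size would also be a
"very large block" from the collector's point of view.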

Is there a way to a) use less memory when sorting, or b) get the
co-occurrence info in another, more memory-efficient way (and without
the warnings)?

Thanks in advance for any insights,

best,

Marc


