[pylucene-dev] Large index files: Sort leads to "GC Warning:
Repeated allocation of very large block"
Aaron Lav
asl2 at pobox.com
Wed Dec 19 10:04:42 PST 2007
On Wed, Dec 19, 2007 at 10:16:36AM +0100, Marc Weeber wrote:
> Hi Andi and others,
>
> I downloaded and installed the jcc version (man, that was a positively
> different experience!), and changed my test script accordingly. The
> problem is still there: the sort asks for a humongous amount of
> memory. I have to provide a maxheap='470m' or it will die with an out
> of memory error.
I think the problem is that (if I'm reading the code correctly) Lucene
caches in memory the values of the fields on which you sort, so it
doesn't have to go back to the underlying documents; that is also why
you can sort on fields that are indexed but not stored. See
org.apache.lucene.search.FieldSortedHitQueue.java, which calls
FieldCache.java, which is implemented in FieldCacheImpl.java.
The caches are keyed by reader, and are arrays indexed by document
number, so their length is proportional to the total number of
documents in the index. Thus, if you have a lot of documents, sorting
by fields can be memory-intensive, especially if the fields are
lengthy strings. So for your 50M document store, if your per-field
data for sorting is ~8 bytes per document, that comes to roughly
50M x 8 bytes = 400 MB, which might explain your ~400 MB of additional
memory usage.
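
To make the arithmetic concrete, here is a rough back-of-the-envelope
sketch in Python. The function name is made up for illustration, and
the figures are only the estimates from this thread (about 50M
documents, about 8 bytes of sort data each); it does not measure
anything in Lucene itself.

```python
def estimate_sort_cache_bytes(max_doc, bytes_per_doc):
    """Rough size of a sort cache that holds one entry per document number.

    Hypothetical helper: it only multiplies the two estimates together.
    """
    return max_doc * bytes_per_doc


# Estimates from this thread, not measured values.
estimate = estimate_sort_cache_bytes(50 * 1000 * 1000, 8)
print("%.0f MB" % (estimate / 1e6))
```

Longer string fields would push bytes_per_doc, and so the total,
correspondingly higher.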
If you have a lot of dead space (reader.maxDoc() >> reader.numDocs()),
optimizing the index should decrease memory usage, since the caches
are sized by maxDoc() and optimizing drops the deleted documents.
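
A quick way to check for dead space might look like the sketch below.
It assumes only that the reader object exposes maxDoc() and numDocs(),
as mentioned above; dead_space_fraction and FakeReader are names
invented here so the example runs without an index, and you would pass
your real reader instead.

```python
def dead_space_fraction(reader):
    """Fraction of document numbers taken up by deleted documents."""
    max_doc = reader.maxDoc()
    if max_doc == 0:
        return 0.0
    return (max_doc - reader.numDocs()) / float(max_doc)


class FakeReader(object):
    """Hypothetical stand-in for an index reader, for illustration only."""

    def __init__(self, max_doc, num_docs):
        self._max_doc = max_doc
        self._num_docs = num_docs

    def maxDoc(self):
        return self._max_doc

    def numDocs(self):
        return self._num_docs


# Made-up numbers: 60M document slots, of which 50M are live documents.
reader = FakeReader(60 * 1000 * 1000, 50 * 1000 * 1000)
print("%.1f%% dead space" % (100 * dead_space_fraction(reader)))
```

If the fraction is close to zero, optimizing is unlikely to help much
with the sort memory.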
As Andi says, java-users at lucene.apache.org is more likely to be
helpful here.
Aaron Lav (asl2 at pobox.com / http://www.pobox.com/~asl2)