[pylucene-dev] Large index files: Sort leads to "GC Warning: Repeated allocation of very large block"

Marc Weeber marc at weeber.net
Thu Dec 20 06:47:42 PST 2007


hi all,


On 19 dec 2007, at 19:04, Aaron Lav wrote:

> On Wed, Dec 19, 2007 at 10:16:36AM +0100, Marc Weeber wrote:
>> Hi Andi and others,
>>
>> I downloaded and installed the jcc version (man, that was a positively
>> different experience!), and changed my test script accordingly. The
>> problem is still there: the sort asks for a humongous amount of
>> memory. I have to provide a maxheap='470m' or it will die with an out
>> of memory error.
>
> I think the problem is that (if I'm reading the code correctly) Lucene
> caches in-memory the fields on which you sort, so it doesn't have to
> go back to the underlying documents, and so you can sort on indexed
> but not stored fields.  See
> org.apache.lucene.search.FieldSortedHitQueue.java, which calls
> FieldCache.java, which is implemented in FieldCacheImpl.java.
>
> The caches are indexed by reader, and are arrays indexed by document
> number, so their length is proportional to the total number of
> documents in the index.  Thus, if you have a lot of documents, sorting
> by fields can be memory-intensive, especially if the fields are
> lengthy strings.  So for your 50M document store, if your per-field
> data for sorting is ~8 bytes, that might explain your ~400M additional
> memory usage.

I think you're right. The field I sort on is a date field in the string
format YYYY-MM-DD. I had indeed started looking into the Java sorting
code, and the memory load no longer surprises me much. The good news is
that after the first search+sort, it is *really* fast: a co-occurrence
search (two terms per doc in a boolean query) together with a sort on
date in the 50M collection takes between 50ms and 200ms (timed in
Python, before and after the search), with no real difference between
the jcc and gcc scripts.
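
For what it's worth, Aaron's estimate can be checked with a bit of
arithmetic. The sketch below is only a back-of-the-envelope calculation,
not a measurement: the helper names are made up for illustration, and the
8 bytes per document is the assumed figure from his message, not something
read from Lucene.

```python
def sort_cache_bytes(max_doc, bytes_per_doc):
    """Rough size of a sort cache holding one entry per document slot."""
    return max_doc * bytes_per_doc


def to_mib(n_bytes):
    """Convert a byte count to mebibytes."""
    return n_bytes / (1024.0 * 1024.0)


# Assumed inputs: 50M documents, ~8 bytes of sort data per document.
estimate = sort_cache_bytes(50 * 1000 * 1000, 8)
print("%d bytes, about %.0f MiB" % (estimate, to_mib(estimate)))
```

That comes to 400,000,000 bytes, which is in the same range as the extra
heap I had to allow for the sort.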

>
>
> If you have a lot of dead space (reader.maxDoc() >> reader.numDocs()),
> optimizing should decrease memory usage.
Do you mean calling .optimize() on the index? I have already done that.
Or do you mean something different?
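
To make the "dead space" condition concrete, here is a small sketch of
how I read it. The function name is hypothetical; the two inputs are
meant to be the values returned by reader.maxDoc() and reader.numDocs(),
passed in as plain integers so the snippet runs without an index.

```python
def dead_space_fraction(max_doc, num_docs):
    """Fraction of document slots that belong to deleted documents.

    max_doc  -- value of reader.maxDoc()  (document slots, deleted included)
    num_docs -- value of reader.numDocs() (live documents only)
    """
    if max_doc <= 0:
        return 0.0
    return (max_doc - num_docs) / float(max_doc)


# Hypothetical numbers: a quarter of the slots are deleted documents.
print(dead_space_fraction(100, 75))
# With no deleted documents the two counts are equal.
print(dead_space_fraction(50, 50))
```

If the two counts are already equal, as they should be on an optimized
index, there is no dead space left to reclaim this way.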
>
>
> As Andi says, java-users at lucene.apache.org is more likely to be
> helpful here.

I have posted there and will wait for a reply.

Thanks for your help,

Marc


>
>
>    Aaron Lav (asl2 at pobox.com / http://www.pobox.com/~asl2)
> _______________________________________________
> pylucene-dev mailing list
> pylucene-dev at osafoundation.org
> http://lists.osafoundation.org/mailman/listinfo/pylucene-dev


