[pylucene-dev] pylucene and recommendations for RAM

David Pratt fairwinds at eastlink.ca
Thu Apr 5 08:43:50 PDT 2007


Hi Pete. Many thanks for this advice. It seems a cluster would best 
solve this, with the index spread over some number of lower-end servers. 
From what I have read on large-scale indexing, this seems to be the usual 
approach (but with as much RAM as possible per server). I am watching 
costs, so lower-end servers with 2 GB of RAM are attractive; I would 
just use more of them.
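
As a rough sizing sketch only: taking the figure of about 2 million 
documents per dedicated 2 GB search box from Pete's reply (quoted below) 
as an assumption, the number of servers for a 20 to 50 million document 
index can be estimated as follows. The function name and constant are 
hypothetical, and real capacity depends on document size.

```python
import math

# Assumption: ~2 million documents per dedicated 2 GB search box,
# the rough figure from Pete's reply; not a measured value.
DOCS_PER_SERVER = 2_000_000


def servers_needed(total_docs, docs_per_server=DOCS_PER_SERVER):
    """Return the minimum number of search servers for total_docs."""
    return math.ceil(total_docs / docs_per_server)


# Estimate both ends of the 20 - 50 million document range.
for total in (20_000_000, 50_000_000):
    print(f"{total:,} docs -> {servers_needed(total)} servers")
```

Under that assumption the range works out to roughly 10 to 25 servers.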

I have only used pylucene for tests on smaller indexes. Is a cluster 
arrangement possible using pylucene? I am not a Java programmer, so I 
would like to stay with what I know. Many thanks.

Regards,
David

Pete wrote:
> On Thursday April 5 2007 9:33 am, David Pratt wrote:
>> I realize that the amount of RAM needed will be based on the size of the
>> index, how many documents and what you are storing in the index itself -
>> but some anecdotal information would be helpful. I am looking at an
>> index that could reach 20 - 50 million documents. Will a commodity
>> server with 2 GB be enough?
> 
> IIRC, it's more a function of how quickly you're adding data than total size. 
> Though this may be incorrect when merging segments (aka optimizing).  A fast 
> disk helps quite a lot too. 
> 
> You'll want to configure the IndexWriter for bulk loading.  The relevant items 
> are setMergeFactor, which controls how often segments are merged on disk, and 
> setMaxBufferedDocs, which controls how many docs are held in RAM before being 
> written out.  A higher value for both will be faster, though be aware that an 
> index built with a high merge factor is slower to query, so you'd probably 
> want to optimize() at the end.  On our indexing server, with ~4 KB documents, 
> setMaxBufferedDocs(200) uses about 700MB of RAM.  See the Javadocs & Lucene 
> In Action for more details.
> 
> On the searching front, a dedicated commodity box w/ 2 GB can probably serve 
> around 2 million documents (again, depending on document size).  Multiple 
> CPUs will let you serve more simultaneous queries.
> 
>> I guess it is possible to build a test index with sample data to
>> determine this also. Many thanks.
> 
> You should probably ask the Lucene list, but please report any test results 
> here as well (you could put them on the wiki too).
> 
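
The bulk-loading settings Pete describes above can be sketched as 
follows. This is a minimal sketch, not a tested pylucene recipe: it 
assumes `writer` is an already-open Lucene IndexWriter created elsewhere 
(for example through pylucene), the helper name `bulk_load` is 
hypothetical, and the merge factor of 50 is purely illustrative. Only 
the value 200 for setMaxBufferedDocs comes from the thread.

```python
def bulk_load(writer, documents, merge_factor=50, max_buffered_docs=200):
    """Add documents through an IndexWriter tuned for bulk loading.

    `writer` is assumed to expose the Lucene IndexWriter methods named
    in the thread: setMergeFactor, setMaxBufferedDocs and optimize,
    plus addDocument and close.
    """
    # Higher values trade RAM and query speed for indexing throughput.
    writer.setMergeFactor(merge_factor)
    writer.setMaxBufferedDocs(max_buffered_docs)

    count = 0
    try:
        for doc in documents:
            writer.addDocument(doc)
            count += 1
        # Merge segments at the end so the finished index is faster to query.
        writer.optimize()
    finally:
        # Always release the index lock, even if indexing fails.
        writer.close()
    return count
```

Passing the writer in keeps the sketch independent of how the index 
directory and analyzer are set up, which varies between Lucene versions.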

