[pylucene-dev] pylucene and recommendations for RAM
Pete
pfein at pobox.com
Thu Apr 5 07:59:26 PDT 2007
On Thursday April 5 2007 9:33 am, David Pratt wrote:
> I realize that the amount of RAM needed will be based on the size of the
> index, how many documents and what you are storing in the index itself -
> but some anecdotal information would be helpful. I am looking at an
> index that could reach 20 - 50 million documents. Will a commodity
> server with 2Gb be enough?
IIRC, RAM use while indexing is more a function of how quickly you're adding
data than of total index size, though this may not hold while merging
segments (aka optimizing). A fast disk helps quite a lot too.
You'll want to configure the IndexWriter for bulk loading. The relevant items
are setMergeFactor, which controls how often segments are merged on disk, and
setMaxBufferedDocs, which controls how many docs are held in RAM before being
written out. A higher value for both will make indexing faster, though be
aware that an index built with a high merge factor is slower to query, so
you'd probably want to optimize() at the end. On our indexing server, with
~4 KB documents, setMaxBufferedDocs(200) uses about 700 MB of RAM. See the
Javadocs & Lucene In Action for more details.
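For illustration, here is a rough sketch of the bulk-loading setup described
above. It assumes `writer` is an already-open object that behaves like a
Lucene 2.x IndexWriter (as exposed by PyLucene). The helper names
`configure_for_bulk_load` and `bulk_index` and the default values are made up
for this example, not tuned recommendations.

```python
# Sketch only: `writer` is assumed to expose the Lucene 2.x IndexWriter
# methods setMergeFactor, setMaxBufferedDocs, addDocument, optimize and close.
# The helper names and default numbers below are illustrative assumptions.

def configure_for_bulk_load(writer, merge_factor=50, max_buffered_docs=200):
    """Apply the two bulk-loading settings and return the writer."""
    # Higher merge factor: segments are merged on disk less often.
    writer.setMergeFactor(merge_factor)
    # Higher buffered-doc count: more docs held in RAM before being written.
    writer.setMaxBufferedDocs(max_buffered_docs)
    return writer


def bulk_index(writer, docs, merge_factor=50, max_buffered_docs=200):
    """Add every document, optimize once at the end, and close the writer."""
    configure_for_bulk_load(writer, merge_factor, max_buffered_docs)
    count = 0
    try:
        for doc in docs:
            writer.addDocument(doc)
            count += 1
        # Merge segments so the finished index is faster to query.
        writer.optimize()
    finally:
        # Always release the writer, even if indexing failed partway.
        writer.close()
    return count
```

You'd call bulk_index(writer, docs) with a real IndexWriter and an iterable of
Document objects. Watch the process's memory while raising max_buffered_docs,
since RAM use grows with it.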
On the searching front, a dedicated commodity box w/ 2 GB can probably serve
around 2 million documents (again, depending on document size). Multiple
CPUs will let you serve more simultaneous queries.
> I guess it is possible to build a test index with sample data to
> determine this also. Many thanks.
You should probably ask the Lucene list, but please report any test results
here as well (you could put them on the wiki too).
--
Peter Fein || 773-575-0694 || pfein at pobox.com
http://www.pobox.com/~pfein/ || PGP: 0xCCF6AE6B
irc: pfein at freenode.net || jabber: peter.fein at gmail.com