[pylucene-dev] pylucene and recommendations for RAM
Pete
pfein at pobox.com
Thu Apr 5 09:12:38 PDT 2007
On Thursday April 5 2007 10:43 am, David Pratt wrote:
> Hi Pete. Many thanks for this advice. It would seem that perhaps a
> cluster would best solve this and then spread over some number of lower
> end servers. From what I read on large indexing, this seems to be the
> approach (but with as much RAM as possible per server). I am looking at
> costs so the lower end 2GB RAM servers are attractive but just use more
> of them.
>
> I have only used pylucene for tests on smaller indexes. Is a cluster
> arrangement possible using pylucene? I am not a java programmer so would
> like to stay with what I know. Many thanks.
For indexing? Not really sure how that'd work. If you want to serve all
searches for all of the documents off one box, you're gonna have to move all
of the indexes together at some point. It's possible to use multiple servers
to create indexes, ship them to a single box and then merge.
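
To make the build-separately-then-merge idea concrete, here's a toy sketch in
plain Python. It is not the Lucene or PyLucene API: each "server" builds a tiny
inverted index (a dict of term to document ids) and one box merges them. The
names build_index and merge_indexes are made up for illustration, and the
sketch assumes document ids are unique across servers.

```python
from collections import defaultdict


def build_index(docs):
    """Build a toy inverted index: term -> set of doc ids.

    docs is a dict of doc_id -> text. Hypothetical helper, not a Lucene call.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def merge_indexes(indexes):
    """Merge per-server indexes into one. Assumes doc ids are globally unique."""
    merged = defaultdict(set)
    for index in indexes:
        for term, doc_ids in index.items():
            merged[term] |= doc_ids
    return dict(merged)


# Two "servers" index disjoint halves of the collection.
part_a = build_index({"a1": "lucene index merge", "a2": "python bindings"})
part_b = build_index({"b1": "merge on one box", "b2": "lucene search"})

# The single search box merges whatever was shipped to it.
merged = merge_indexes([part_a, part_b])
```

With real Lucene indexes the merge step is what IndexWriter's addIndexes
method is for; check its exact signature for the Lucene version you're on.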
As for searching a collection this large, your options are either Big Iron or
distribution. Google's pretty convincingly demonstrated that the latter is
the way to go. Hadoop (http://lucene.apache.org/hadoop/about.html) is a
lucene-based platform for doing exactly this, but it's a) Java b) nowhere
near done. I believe http://hyperestraier.sourceforge.net/ has support for
distribution (and Python bindings) but I haven't tried it.
The short version: if you can partition your index into logically distinct
chunks and have no need to perform searches across these chunks, distribution
is pretty straightforward - it's really just setting up a bunch of small
servers. If you can't partition your data this way, the problem is much
harder. AFAIK (and I've done quite a lot of research), there is no mature
OSS package to do this in any language (and certainly not Python). There are
a number of commercial solutions, including http://www.dieselpoint.com/
(Java, but interoperable).
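
A rough sketch of the difference between the two cases, again in plain Python
rather than any real search library. Everything here is hypothetical: shards
are just dicts of doc id to text, the "score" is a raw term count, and
shard_for, search_partitioned and search_all are names I made up. The easy
case asks one shard; the hard case has to ask all of them and merge.

```python
import heapq
import zlib


def shard_for(partition_key, num_shards):
    """Pick a shard from a stable hash (crc32, not Python's salted hash())."""
    return zlib.crc32(partition_key.encode("utf-8")) % num_shards


def search_shard(shard, term):
    """Toy per-shard search: (score, doc_id) pairs, score = term count."""
    hits = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(term)
        if score:
            hits.append((score, doc_id))
    return hits


def search_partitioned(shards, partition_key, term):
    """Easy case: the query names its partition, so only one shard is asked."""
    shard = shards[shard_for(partition_key, len(shards))]
    return sorted(search_shard(shard, term), reverse=True)


def search_all(shards, term, k):
    """Hard case: ask every shard, then merge the hits into a global top k."""
    all_hits = (hit for shard in shards for hit in search_shard(shard, term))
    return heapq.nlargest(k, all_hits)


# Place each document on the shard chosen by its owner (the partition key).
shards = [{}, {}, {}]
docs = [
    ("alice", "a1", "lucene lucene index"),
    ("bob", "b1", "lucene search"),
    ("alice", "a2", "python"),
]
for owner, doc_id, text in docs:
    shards[shard_for(owner, len(shards))][doc_id] = text

alice_hits = search_partitioned(shards, "alice", "lucene")
global_hits = search_all(shards, "lucene", 2)
```

The toy makes search_all look easy because a term count means the same thing
on every shard. Real relevance scores depend on per-index term statistics, so
scores from separate indexes aren't directly comparable, which is a good part
of why the unpartitioned case is hard.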
See my message titled "Distributed Indexes, Pycon, was Re: [pylucene-dev] Is
there PyNutch?" from February 19 in the archives for a discussion of some of
these issues.
--
Peter Fein || 773-575-0694 || pfein at pobox.com
http://www.pobox.com/~pfein/ || PGP: 0xCCF6AE6B
irc: pfein at freenode.net || jabber: peter.fein at gmail.com