[pylucene-dev] pylucene and recommendations for RAM
David Pratt
fairwinds at eastlink.ca
Thu Apr 5 11:35:50 PDT 2007
I was reading about scaling Lucene with Remote Parallel Multisearcher. I
have not tried this beast yet and would be interested in hearing from
anyone who has attempted to use it. I see that there have been some
previous posts about it a couple of years back. If something like this
works, a cluster arrangement with pylucene may be possible.
Regards,
David
Pete wrote:
> On Thursday April 5 2007 10:43 am, David Pratt wrote:
>> Hi Pete. Many thanks for this advice. It would seem that perhaps a
>> cluster would best solve this and then spread over some number of lower
>> end servers. From what I have read on large indexing, this seems to be the
>> approach (but with as much RAM as possible per server). I am looking at
>> costs, so lower-end 2GB RAM servers are attractive, just using more
>> of them.
>>
>> I have only used pylucene for tests on smaller indexes. Is a cluster
>> arrangement possible using pylucene? I am not a java programmer so would
>> like to stay with what I know. Many thanks.
>
> For indexing? Not really sure how that'd work. If you want to serve all
> searches for all of the documents off one box, you're gonna have to move all
> of the indexes together at some point. It's possible to use multiple servers
> to create indexes, ship them to a single box and then merge.
>
> As for searching a collection this large, your options are either Big Iron or
> distribution. Google's pretty convincingly demonstrated that the latter is
> the way to go. Hadoop (http://lucene.apache.org/hadoop/about.html) is a
> lucene-based platform for doing exactly this, but it's a) Java b) nowhere
> near done. I believe http://hyperestraier.sourceforge.net/ has support for
> distribution (and Python bindings) but I haven't tried it.
>
> The short version: if you can partition your index into logically distinct
> chunks and have no need to perform searches across these chunks, distribution
> is pretty straightforward - it's really just setting up a bunch of small
> servers. If you can't partition your data this way, the problem is much
> harder. AFAIK (and I've done quite a lot of research), there is no mature
> OSS package to do this in any language (and certainly not Python). There are
> a number of commercial solutions, including http://www.dieselpoint.com/
> (Java, but interoperable).
>
> See my message titled "Distributed Indexes, Pycon, was Re: [pylucene-dev] Is
> there PyNutch?" from February 19 in the archives for a discussion of some of
> these issues.
>
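The two cases in Pete's short version above can be sketched in plain
Python. This is only an illustration under assumptions: Shard,
search_partition and search_all are hypothetical names, each shard is a
toy in-memory index standing in for one small server, and the score is a
raw term count. It does not use the pylucene API.

```python
import heapq
from collections import defaultdict


class Shard:
    """Toy in-memory stand-in for one small search server (hypothetical)."""

    def __init__(self, name):
        self.name = name
        # term -> {doc_id: number of times the term occurs in that document}
        self.postings = defaultdict(dict)

    def add(self, doc_id, text):
        for term in text.lower().split():
            counts = self.postings[term]
            counts[doc_id] = counts.get(doc_id, 0) + 1

    def search(self, term, k):
        """Return up to k (score, doc_id) pairs, best first."""
        hits = self.postings.get(term.lower(), {})
        return heapq.nlargest(k, ((count, doc_id) for doc_id, count in hits.items()))


def search_partition(shards, partition, term, k=10):
    """Easy case: the query only concerns one logically distinct chunk."""
    return shards[partition].search(term, k)


def search_all(shards, term, k=10):
    """Harder case: ask every shard, then merge the per-shard top-k lists."""
    per_shard = (shard.search(term, k) for shard in shards.values())
    return heapq.nlargest(k, (hit for hits in per_shard for hit in hits))
```

In the partitioned case only one server is touched per query. In the
cross-chunk case every server is queried and the results are merged, and
the merge is only meaningful if scores from different shards are
comparable. Raw term counts are comparable in this toy; scores that
depend on per-shard statistics generally are not, which is part of what
makes the unpartitioned problem harder.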