[pylucene-dev] pylucene and recommendations for RAM

David Pratt fairwinds at eastlink.ca
Thu Apr 5 11:35:50 PDT 2007


I was reading about scaling in Lucene with Remote Parallel Multisearcher. I
have not tried this beast yet and would be interested in hearing from
anyone who has attempted to use it. I see that there were some
previous posts about it a couple of years back. I think if something
like this could work, a cluster arrangement with pylucene may be possible.
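
For illustration, the fan-out-and-merge idea behind a parallel
multisearcher can be sketched in plain Python. This is not the Lucene
or pylucene API: the in-memory shards, the toy term-count scorer, and
the names search_shard and parallel_search are all hypothetical
stand-ins for real remote index servers.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "shards": each maps a doc id to its text.
# In a real cluster each shard would be an index on its own server.
SHARDS = [
    {"a1": "lucene index scaling", "a2": "python bindings"},
    {"b1": "scaling lucene across servers", "b2": "ram recommendations"},
]

def search_shard(shard, term, limit):
    """Toy scorer: score is how many times the term occurs in the text."""
    hits = []
    for doc_id, text in shard.items():
        score = text.split().count(term)
        if score > 0:
            hits.append((score, doc_id))
    # Each shard only returns its own best `limit` hits.
    return heapq.nlargest(limit, hits)

def parallel_search(shards, term, limit=10):
    """Query every shard concurrently, then merge the per-shard top hits."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        per_shard = list(pool.map(lambda s: search_shard(s, term, limit), shards))
    merged = [hit for hits in per_shard for hit in hits]
    return heapq.nlargest(limit, merged)
```

Calling parallel_search(SHARDS, "lucene") returns (score, doc id) pairs
drawn from both shards. The merge is only meaningful if scores are
comparable across shards, which is one of the harder parts of doing
this against real indexes.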

Regards,
David


Pete wrote:
> On Thursday April 5 2007 10:43 am, David Pratt wrote:
>> Hi Pete. Many thanks for this advice. It would seem that perhaps a
>> cluster would best solve this and then spread over some number of lower
>> end servers. From what I read on large indexing, this seems to be the
>> approach (but with as much RAM as possible per server). I am looking at
>> costs so the lower end 2GB RAM servers are attractive but just use more
>> of them.
>>
>> I have only used pylucene for tests on smaller indexes. Is a cluster
>> arrangement possible using pylucene? I am not a java programmer so would
>> like to stay with what I know. Many thanks.
> 
> For indexing?  Not really sure how that'd work.  If you want to serve all 
> searches for all of the documents off one box, you're gonna have to move all 
> of the indexes together at some point.  It's possible to use multiple servers 
> to create indexes, ship them to a single box and then merge.
> 
> As for searching a collection this large, your options are either Big Iron or 
> distribution.  Google's pretty convincingly demonstrated that the latter is 
> the way to go.  Hadoop (http://lucene.apache.org/hadoop/about.html) is a 
> lucene-based platform for doing exactly this, but it's a) Java b) nowhere 
> near done.  I believe http://hyperestraier.sourceforge.net/ has support for 
> distribution (and Python bindings) but I haven't tried it.
> 
> The short version: if you can partition your index into logically distinct 
> chunks and have no need to perform searches across these chunks, distribution 
> is pretty straightforward - it's really just setting up a bunch of small 
> servers.  If you can't partition your data this way, the problem is much 
> harder.  AFAIK (and I've done quite a lot of research), there is no mature 
> OSS package to do this in any language (and certainly not Python).  There are 
> a number of commercial solutions, including http://www.dieselpoint.com/ 
> (Java, but interoperable).
> 
> See my message titled "Distributed Indexes, Pycon, was Re: [pylucene-dev] Is 
> there PyNutch?" from February 19 in the archives for a discussion of some of 
> these issues. 
> 
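
The partitioned case described above, where the index splits into
logically distinct chunks that never need to be searched together,
comes down to routing each query to the one server that owns its
chunk. A minimal sketch, assuming a hypothetical list of server
addresses and a hypothetical partition key (for example a collection
or customer name), neither of which comes from the thread:

```python
import zlib

# Hypothetical server addresses, one small box per partition group.
SERVERS = ["search1:8080", "search2:8080", "search3:8080"]

def server_for(partition_key, servers=SERVERS):
    """Map a partition key to exactly one server using a stable hash.

    zlib.crc32 is used instead of the built-in hash() because string
    hashes can differ between Python processes, and the indexer and
    the search front end must agree on the mapping.
    """
    digest = zlib.crc32(partition_key.encode("utf-8"))
    return servers[digest % len(servers)]
```

Both the indexing side and the search side would call server_for with
the same key, so documents and the queries for them land on the same
box. Note that changing the number of servers remaps most keys, so a
real deployment would need a plan for moving indexes when it grows.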

