Distributed Indexes, Pycon, was Re: [pylucene-dev] Is there PyNutch?
Pete
pfein at pobox.com
Mon Feb 19 15:45:24 PST 2007
On Wednesday February 14 2007 1:06 pm, Jack L wrote:
> Hello Brian,
>
> Thanks for the reply. (I'm not sure if this discussion is interesting
> to PyLucene dev list. If it's considered OT, I shall take the next
> email offline.)
I consider this on-topic, largely because I'm interested & there's nowhere else
to discuss it. ;)
> I looked at the first link you sent. It's not actually what I'm
> looking for. In our set up, we have multiple crawler/indexer/searcher
> boxes talking to one merger/web server front-end using Nutch IPC.
> The front-end box sends queries to multiple back-end searchers and
> merge the results it has received, and presents them in a web page.
> I'm hoping to find a way to replace the front-end Java implementation
> with Python. So, the piece I'm looking for does not touch the
> segments. Instead, it speaks Nutch IPC and parses the query
> strings, issues queries to the back-end, and merges results and puts
> them in a web page.
I've been kicking this sort of idea around with my coworkers recently.
While I haven't used Nutch/Solr, we've used techniques from the latter.
Some background:
We're a Python shop [0]. In general, we're working with relatively small data
sets, but running rather complex queries and pre-indexing analysis. We use
an in-house spider and a Python webserver. The front-end runs on a local
copy of a PyLucene index updated via the Solr in-process technique [1].
We're starting to push up against the capacity limits of querying on a single
server and are thinking about how to partition the index to multiple boxes. In
Java, this appears to be done using a MultiSearcher and RemoteSearchable.
The latter is implemented on Java RMI [2], which is not in PyLucene [3].
Here, the fun begins. As best I can tell, MultiSearcher/RemoteSearchable
require multiple calls to slave machines per query. The general thought would
be to re-implement such a thing in Python, using something like Perspective
Broker [4].
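To make the shape of the problem concrete, here is a minimal sketch of the
scatter/gather step such a re-implementation would need: send the query to
every partition, collect each partition's best hits, and merge them into one
ranked list. It is an illustration only, written against the modern Python
standard library rather than PyLucene or Perspective Broker. The names
merge_top_k, scatter_gather and make_fake_shard are hypothetical, each "shard"
is a plain callable standing in for a remote searcher, and scores are assumed
to be directly comparable across partitions, which probably requires shared
term statistics in practice and may be one reason a single call per slave is
not enough.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def merge_top_k(shard_hits, k):
    """Merge per-shard hit lists into one global top-k list.

    Each hit is a (score, doc_id) tuple. Scores are ASSUMED to be
    comparable across shards; this sketch does nothing to ensure that.
    """
    all_hits = (hit for hits in shard_hits for hit in hits)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])


def scatter_gather(shards, query, k=10):
    """Send `query` to every shard in parallel, then merge the results.

    `shards` is a list of callables taking (query, k) and returning a
    list of (score, doc_id) tuples. In a real system each callable
    would wrap a network call to a remote searcher (hypothetical here).
    """
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        shard_hits = list(pool.map(lambda search: search(query, k), shards))
    return merge_top_k(shard_hits, k)


def make_fake_shard(hits):
    """Build a toy in-memory shard that ignores the query text."""
    def search(query, k):
        return sorted(hits, reverse=True)[:k]
    return search


shards = [
    make_fake_shard([(0.9, "a1"), (0.2, "a2")]),
    make_fake_shard([(0.7, "b1"), (0.5, "b2")]),
]
top = scatter_gather(shards, "python", k=3)
```

Each shard only needs to return its own top k hits, since no document outside
a shard's local top k can appear in the global top k.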
I don't really want to do this, however, as it just doesn't sound like my idea
of a good time. I'm starting to formulate some thoughts for alternate
approaches, but haven't totally sorted it out. So, the question on
everyone's mind is:
For all you folks using PyLucene for *queries*, how do you scale beyond a
single machine?
Anyone going to PyCon? Want to have a Birds of a Feather on Lucene/Text
Search/Distributed Computing? [5]
--Pete
[0] Interested? We're hiring. This message only hints at the sorts of
problems we're trying to solve. Contact me off-list.
[1] http://wiki.apache.org/solr/CollectionDistribution . This was not the
easiest thing in the world to get working acceptably with PyLucene, though
that appears to have more to do with Boehm GC. It also requires about 2x the
RAM during the switchover period and beats on the disk.
[2]
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/MultiSearcher.html
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/RemoteSearchable.html
http://java.sun.com/j2se/1.4.2/docs/api/java/rmi/server/UnicastRemoteObject.html
[3] http://www.archivesat.com/pylucene_developers/thread323504.htm
[4] http://twistedmatrix.com/projects/core/documentation/howto/pb-intro.html
[5] http://us.pycon.org/TX2007/BoF