[pylucene-dev] Clustering with PyLucene

Andi Vajda vajda at osafoundation.org
Fri Mar 7 08:48:41 PST 2008

On Mar 7, 2008, at 7:28, Sebastian Steins <sebastian.steins at gmail.com> wrote:

> Hi there,
> first of all, I want to introduce myself, because I have not posted  
> here before. My name is Sebastian and I am currently working as web  
> application developer. Search and Information Retrieval are not part  
> of my current work, but I am interested in those fields as a hobby.
> I was inspired by the book "Programming Collective Intelligence",
> which describes very complex algorithms like clustering in a very
> easy way and shows the solutions in plain Python code with SQLite
> bindings.
> I was very taken with the solutions in the book, so I tried to modify
> them for some experiments and I wanted to use Lucene instead of  
> PyLucene.

Did you mean to say "Lucene instead of SQLite"?

> For now, I have a simple script which inserts articles from an RSS
> feed into a Lucene index using PyLucene.
> An article also has outgoing links, which I store this way:
> #### Code ####
> for link in params['links']:
>     doc.add(self.Lucene.Field("linksto", link,
>                               self.Lucene.Field.Store.YES,
>                               self.Lucene.Field.Index.UN_TOKENIZED))
> #### /Code ####
> Is that a good way? Or is there another way in Lucene to store  
> "relational" data? How would it be possible to retrieve the document  
> with the most incoming links? Or the document with the greatest  
> number of outgoing links?

Lucene is not a relational database, but I assume you know that. It
indexes text tokens and returns documents that contain them.
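
To illustrate the point about tokens: if every outgoing link is indexed as one untokenized "linksto" value, the number of incoming links to a URL is simply the number of documents containing that value. Below is a minimal plain-Python sketch of that idea. It does not use PyLucene at all; the sample data and the function names (build_link_index, most_linked_to, most_outgoing) are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample data: article URL -> list of outgoing links.
# In the real script these would be the values stored in "linksto".
articles = {
    "a.example/1": ["b.example/2", "c.example/3"],
    "b.example/2": ["c.example/3"],
    "c.example/3": [],
}

def build_link_index(articles):
    """Map each link target to the set of articles that link to it.

    This mimics an inverted index over an untokenized field:
    one term per link, one posting per linking document.
    """
    index = defaultdict(set)
    for url, links in articles.items():
        for link in links:
            index[link].add(url)
    return index

def most_linked_to(articles):
    """Return the link target contained in the most documents."""
    index = build_link_index(articles)
    return max(index, key=lambda target: len(index[target]))

def most_outgoing(articles):
    """Return the article with the most distinct outgoing links."""
    return max(articles, key=lambda url: len(set(articles[url])))
```

With an actual index, the per-link document count would come from the index itself rather than from a Python dict, so the loop over all documents is only needed for the outgoing-link count.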

> Additionally, I want to calculate the similarity between documents  
> with my script, using k-means, dendrograms and other things (mostly
> described in the book mentioned above). Therefore, I would have to  
> compare a recently found (crawled) article, which is to be written  
> to the index with all articles in the Lucene index. How can that be  
> achieved in a more elegant way than doing a for-loop from 0 to  
> numDocs()? Is there a cheaper (in terms of computing resources) way?

Lucene can do that for you if you index your documents with term  
vectors. The "Lucene in Action" book (recommended reading) has an  
example on how to implement this. The sample code (called  
MoreLikeThis) is available in PyLucene, in Python.
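
For a rough idea of what term vectors make possible, here is a plain-Python sketch of comparing two documents by the cosine of their term-frequency vectors. This is not the MoreLikeThis sample and it uses no PyLucene calls; the whitespace tokenizer and the function names (term_vector, cosine_similarity) are assumptions made for illustration only.

```python
import math
from collections import Counter

def term_vector(text):
    """Build a term -> frequency mapping from whitespace-separated,
    lowercased tokens (a stand-in for a real analyzer)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors.

    Returns 0.0 when either vector is empty.
    """
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A new article would be turned into a vector once and then compared only against candidate documents that share terms with it, instead of against every document in the index.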

> Unfortunately, I am not very familiar with Java, so my research into
> the above questions on the sites around the Lucene community did not
> really help.

Try harder.
The java-user at lucene.apache.org list has a large community of users that
can help you with Lucene how-to questions that are independent of the  
eventual implementation language.

Lucene is implemented in Java; PyLucene just wraps it for use from
Python. Familiarity with Java code and docs can be helpful. The
pylucene-dev list you just wrote to is about specific issues related  
to that, not general Lucene how-to topics.

> I found Mahout, a Java program for k-means and other similar
> algorithms for Lucene. However, this didn't help, because I want to  
> implement my experiments in Python, not Java.

Get the theory questions answered first with the book, the web and the  
java-user list. Then, take a look at the many Python samples that ship
with PyLucene to see how you could apply the solutions you found with  


> Thank you very much for your help!
> Sebastian
> _______________________________________________
> pylucene-dev mailing list
> pylucene-dev at osafoundation.org
> http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
