[pylucene-dev] Clustering with PyLucene

Sebastian Steins sebastian.steins at gmail.com
Fri Mar 7 07:28:05 PST 2008


Hi there,

First of all, I want to introduce myself, because I have not posted
here before. My name is Sebastian and I am currently working as a web
application developer. Search and information retrieval are not part
of my current work, but I am interested in those fields as a hobby.
My interest was sparked by the book "Programming Collective Intelligence",
which explains complex algorithms such as clustering in a very accessible
way and shows the solutions in plain Python code with SQLite bindings.

I was very taken with the solutions in the book, so I tried to modify
them for some experiments, and I wanted to use Lucene (through
PyLucene) instead of SQLite.

For now, I have a simple script which inserts articles from an RSS feed
into a Lucene index using PyLucene.

An article also has outgoing links, which I store this way:
#### Code ####
# Add one stored, untokenized "linksto" field per outgoing link.
for link in params['links']:
    doc.add(self.Lucene.Field("linksto", link,
                              self.Lucene.Field.Store.YES,
                              self.Lucene.Field.Index.UN_TOKENIZED))
#### /Code ####

Is that a good way? Or is there another way in Lucene to store  
"relational" data? How would it be possible to retrieve the document  
with the most incoming links? Or the document with the greatest number  
of outgoing links?
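
To make the question concrete, this is roughly the counting I have in mind, sketched in plain Python without Lucene. The `articles` dictionary, its URLs and the `link_counts` helper are made up for illustration; in my script the links would come from the "linksto" fields.

```python
from collections import Counter

# Hypothetical sample data: article URL -> list of outgoing links.
articles = {
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/c"],
    "http://example.com/c": [],
}

def link_counts(articles):
    """Return (incoming, outgoing) Counters keyed by URL."""
    incoming = Counter()
    outgoing = Counter()
    for url, links in articles.items():
        unique = set(links)  # count each target only once per article
        outgoing[url] = len(unique)
        for target in unique:
            incoming[target] += 1
    return incoming, outgoing

incoming, outgoing = link_counts(articles)
most_linked_to = incoming.most_common(1)[0][0]  # most incoming links
most_linking = outgoing.most_common(1)[0][0]    # most outgoing links
```

This walks over every article once, which is exactly the kind of full scan I hope the index can spare me.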

Additionally, I want to calculate the similarity between documents
with my script, using k-means, dendrograms and other techniques (mostly
described in the book mentioned above). To do that, I would have to
compare a newly crawled article, which is about to be written to
the index, with all articles already in the Lucene index. How can that
be achieved in a more elegant way than a for-loop from 0 to
numDocs()? Is there a cheaper way in terms of computing resources?
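
For reference, the brute-force version I would like to avoid looks roughly like this in plain Python. Simple word counts stand in for whatever term vectors the index could provide, and `word_counts`, `cosine_similarity`, `most_similar` and the sample texts are only illustrative names and data, not anything from PyLucene.

```python
import math
from collections import Counter

def word_counts(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(new_text, indexed_texts):
    """Compare the new article with every indexed article (the full loop)."""
    new_vec = word_counts(new_text)
    best_id, best_score = None, -1.0
    for doc_id, text in enumerate(indexed_texts):
        score = cosine_similarity(new_vec, word_counts(text))
        if score > best_score:
            best_id, best_score = doc_id, score
    return best_id, best_score

# Hypothetical stand-in for the articles already in the index.
indexed = [
    "lucene is a search library written in java",
    "python bindings for lucene are called pylucene",
    "k-means groups similar documents into clusters",
]
best_id, best_score = most_similar(
    "clustering similar documents with k-means", indexed)
```

Every new article costs one comparison per indexed article here, so the work grows with the size of the index.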



Unfortunately, I am not very familiar with Java, so my research on
the above questions on the sites around the Lucene community did not
really help. I found Mahout, a Java program for k-means and other
similar algorithms for Lucene. However, this did not help, because I
want to implement my experiments in Python, not Java.


Thank you very much for your help!



Sebastian

