[Chandler-dev] Multivariate Analysis: the first weekly report
Xun Luo
sherwoodluo at gmail.com
Fri Jun 9 20:29:24 PDT 2006
Hi Phillippe,
+ Accomplished tasks
1. I instrumented the feed parcel to make it display a 'tag' in the
detailed view. The data I used is slashdot.net articles as they have
predefined classifications. I downloaded 300 articles and used the
following five classifications as "tag"s:
Interviews/hardware/it/science/linux.
The first 250 articles are used to train PCA and the remaining 50 are
used to test automatic tagging. I am still implementing the experiment.
Question 1: what do you think about the experiment? With this design,
I am simulating personal documents, and I assume that personal
documents will be similar in nature to these feed documents.
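
A minimal sketch of the 250/50 split and of how tagging accuracy on the
held-out articles could be measured. The names split_train_test and
accuracy, and the placeholder article data, are hypothetical and only
illustrate the design; they are not the code in the feed parcel.

```python
def split_train_test(articles, n_train=250):
    """Split an ordered list of (text, tag) pairs into train and test sets."""
    return articles[:n_train], articles[n_train:]


def accuracy(predict, test_set):
    """Fraction of test articles whose predicted tag equals the feed's tag."""
    if not test_set:
        return 0.0
    hits = sum(1 for text, tag in test_set if predict(text) == tag)
    return hits / len(test_set)


TAGS = ["interviews", "hardware", "it", "science", "linux"]

# Placeholder data standing in for the 300 downloaded articles.
articles = [("article %d text" % i, TAGS[i % 5]) for i in range(300)]
train, test = split_train_test(articles)

# A trivial baseline that always answers "linux"; a real tagger should beat it.
baseline = accuracy(lambda text: "linux", test)
```

Comparing the PCA tagger against such a constant baseline gives a quick
check that the learned word-combinations carry real information.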
2. I have a Python PCA implementation.
By mapping all the feed articles with a specific classification
tag to an "eigen-article", I could obtain a map from tag to virtual
article and then compare un-tagged articles to the five
"eigen-article"s. This seems straightforward. However, I am now
thinking about the original space. Ideally, the original space should
be the whole vocabulary space. However, I am having difficulty using
lucene for this: I am trying hard to make use of its
inverse-document-frequency list. For the time being, I am using an
alternative.
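
A rough sketch of the tag-to-"eigen-article" map, under one possible
reading: each eigen-article is taken to be the leading eigenvector of
the (uncentered) matrix X^T X built from the term vectors of the
articles carrying that tag, found here by power iteration. That
reading, the cosine comparison, and all function names are assumptions
for illustration.

```python
import math


def leading_direction(rows, iterations=100):
    """Leading eigenvector of X^T X (uncentered), by power iteration."""
    dim = len(rows[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        # Compute X^T (X v) without forming X^T X explicitly.
        proj = [sum(r[j] * v[j] for j in range(dim)) for r in rows]
        w = [sum(proj[i] * rows[i][j] for i in range(len(rows)))
             for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # all-zero data: keep the current vector
        v = [x / norm for x in w]
    return v


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def build_eigen_articles(tagged_vectors):
    """Map each tag to the leading direction of its articles' vectors."""
    by_tag = {}
    for vec, tag in tagged_vectors:
        by_tag.setdefault(tag, []).append(vec)
    return {tag: leading_direction(rows) for tag, rows in by_tag.items()}


def predict_tag(vec, eigen_articles):
    """Pick the tag whose eigen-article is closest in direction to vec."""
    return max(eigen_articles,
               key=lambda tag: abs(cosine(vec, eigen_articles[tag])))
```

An un-tagged article is first turned into a vector in the same space as
the training articles and then passed to predict_tag.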
I used a reduced original space: a word set expanded from the
classifications. Specifically, I used lucene's function to find the
top 10 words with the highest term frequency in each classification,
and thus I had a 50-word space (10 words by 5 classifications). I
created an extra vector space to hold the information of each article.
By using PCA, I get about 5 word-combinations that could classify
almost all new articles correctly.
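
The reduced space can be sketched as follows. This version counts terms
with Python's collections.Counter instead of lucene, so it only mirrors
the idea (top-k terms per classification, then one count vector per
article); the tokenizer and the function names are assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as words."""
    return re.findall(r"[a-z]+", text.lower())


def top_terms(texts, k=10):
    """The k words with the highest total term frequency in texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(k)]


def build_vocabulary(texts_by_tag, k=10):
    """Union of each tag's top-k words, in a stable order."""
    vocab = []
    for tag in sorted(texts_by_tag):
        for word in top_terms(texts_by_tag[tag], k):
            if word not in vocab:
                vocab.append(word)
    return vocab


def vectorize(text, vocab):
    """Term-count vector of one article over the reduced vocabulary."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]
```

Two caveats: if two classifications share a top word, the space ends up
with fewer than 50 dimensions, and without stop-word filtering raw term
frequency tends to favor very common words.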
Question 2: How to make use of the underlying
inverse-document-frequency data in lucene is now a big obstacle for
me. I am going to ask for help from the java-lucene group. It
would be great if there were an expert at OSAF.
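
Until the lucene data is reachable, the quantity itself can be computed
directly from the article texts. The sketch below uses the textbook
definition idf(t) = log(N / df(t)); lucene's own scoring may use a
smoothed variant, so the numbers are not guaranteed to match what
lucene stores. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def document_frequencies(texts):
    """Number of documents in which each word appears at least once."""
    df = Counter()
    for text in texts:
        df.update(set(re.findall(r"[a-z]+", text.lower())))
    return df


def inverse_document_frequencies(texts):
    """Textbook idf(t) = log(N / df(t)) for every word in the collection."""
    n_docs = len(texts)
    df = document_frequencies(texts)
    return {term: math.log(n_docs / count) for term, count in df.items()}
```

Words that occur in every article get an idf of zero, so weighting the
count vectors by idf would push such words out of the original space.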
3. Sandbox.
Bear assigned a sandbox account to me. I haven't used it yet,
because a lot of the code is still messy at the moment.
I would appreciate any help on lucene from OSAF. It is indeed a
powerful horse for text retrieval yet I have to learn to ride it.
Best,
Xun