[Chandler-dev] Multivariate Analysis: the first weekly report
Philippe Bossut
pbossut at osafoundation.org
Mon Jun 12 13:25:11 PDT 2006
Hi Xun,
Xun Luo wrote:
> Hi Philippe,
>
> + Accomplished tasks
> 1. I instrumented the feed parcel to make it display a 'tag' in the
> detailed view. The data I used is slashdot.net articles as they have
> predefined classifications. I downloaded 300 articles and used the
> following five classifications as "tag"s:
> Interviews/hardware/it/science/linux.
> The first 250 articles are used to train PCA and the remaining 50 are
> used to test automatic tagging. I am still implementing the experiment.
>
> Question 1: what do you think about the experiment? With this design, I
> am simulating personal documents and assuming that personal documents
> will be similar in nature to these feed documents.
I think this is pretty cool! It's good to have an almost infinite set of
existing data to test the algorithm.
As to whether the PIM items will have the same nature as the feed
documents, I think there is quite an overlap (with email in particular)
but PIM data also have strong semantics that are absent from your test.
For instance, if the location field says "Paris" you can be sure that
this is the city, not the heiress to a hotel chain... So the PIM
semantics attached to the field avoid confusion.
It shouldn't be difficult though to introduce this in your analysis (in
the example above, maybe encoding it as "location:Paris" will be
enough) and it's certainly easier to move from semantically
undifferentiated data (as you're doing) to strong semantics than the
other way around.
Note also that we *do* need the text analysis you're doing (as we
already established in a previous discussion on IRC).
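To make the "location:Paris" idea concrete, here is a minimal sketch in
Python. The function name field_tokens and the item layout (a plain dict
mapping a field name to its text) are assumptions made for illustration,
not Chandler's actual item API.

```python
import re

def field_tokens(item):
    """Turn a dict of {field name: text} into field-prefixed tokens.

    Hypothetical helper: the same word in two different fields
    becomes two distinct tokens, so the field semantics survive.
    """
    tokens = []
    for field, text in item.items():
        for word in re.findall(r"\w+", text):
            tokens.append(f"{field}:{word}")
    return tokens

# Made-up event: "Paris" appears both as a place and in the title.
event = {"location": "Paris", "title": "Dinner with Paris"}
tokens = field_tokens(event)
```

With this encoding, "location:Paris" and "title:Paris" end up as two
separate dimensions of the vocabulary space, so the analysis can no
longer confuse them.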
> 2. I had a python PCA
> By mapping all the feed articles with a specific classification
> tag to a "eigen-article", I could obtain a map from tag to virtual
> article and then compare un-tagged articles to the five
> "eigen-article"s. This seems straightforward. However I am now
> thinking about the original space. Ideally, the original space should
> be the whole vocabulary space. However, I am having difficulty using
> lucene; I am trying hard to make use of the
> inverse-document-frequency list. For the time being, I am using an
> alternative.
> I used a reduced original space: a word set expanded from the
> classifications. Specifically, I used lucene's function to find the
> top 10 words from each classification with the highest term-frequency.
You mean: with the highest IDF right?
> and
> thus I had a 50-word space (10 words by 5 classifications). I created
> an extra vector space to hold the information of each article. By
> using PCA, I get about 5 word-combinations that could classify almost
> all new articles correctly.
Do you mean that your experiment is already done? I'm a little
confused here by your use of tenses.
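For reference, here is how I read the reduced-space construction you
describe, as a rough Python sketch. It is only an assumption about your
setup: it uses plain word lists instead of lucene, it ranks words by raw
term frequency within each classification, and the names
build_vocabulary and vectorize are hypothetical.

```python
from collections import Counter

def build_vocabulary(docs_by_tag, top_n=10):
    """Union of the top_n most frequent words of each tag."""
    vocab = []
    for tag in sorted(docs_by_tag):
        # Count every word over all the documents carrying this tag.
        counts = Counter(w for doc in docs_by_tag[tag] for w in doc)
        for word, _ in counts.most_common(top_n):
            if word not in vocab:  # a word shared by two tags is kept once
                vocab.append(word)
    return vocab

def vectorize(doc, vocab):
    """Term-count vector of one document over the reduced vocabulary."""
    counts = Counter(doc)
    return [counts[w] for w in vocab]

# Made-up, already tokenized articles for two of the five tags.
docs_by_tag = {
    "linux": [["kernel", "patch", "kernel"], ["kernel", "driver"]],
    "science": [["galaxy", "telescope", "galaxy"], ["galaxy", "kernel"]],
}
vocab = build_vocabulary(docs_by_tag, top_n=2)
vector = vectorize(["kernel", "galaxy", "galaxy"], vocab)
```

Note that if two classifications share a top word, the space ends up
smaller than 10 words by 5 classifications, so it is worth checking the
actual size of the vocabulary.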
> Question 2: How to make use of the underlying
> inverse-document-frequency data in lucene is now a big obstacle for
> me. I am going to ask for people's help from the java-lucene group. It
> would be great if there could be an expert at OSAF.
Our lucene expert at OSAF is Andi Vajda. For such a specialized use
though, you're likely to get your answer faster on the java-lucene group.
On this whole paragraph, I think your approach is correct (use IDF to
weight the keywords, reduce the vocabulary set using trained data,
etc...). My question is: how do you make sure that the original space
you select discriminates between the different tags? Ideally, you should
run the PCA on the whole set of training data and verify that the
different groups do cluster in some subspace spanned by the first
eigenvectors. I feel like the verification step (making sure that the
representation space will discriminate between the groups) is missing.
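Here is a rough sketch of the kind of verification I have in mind,
assuming NumPy is available. Everything in it is illustrative: the IDF
formula is a common smoothed variant and not necessarily the one lucene
uses, the data is a toy matrix, and separation_ratio is just one
possible (hypothetical) way to measure how well the groups cluster.

```python
import numpy as np

def idf_weights(counts):
    """Smoothed inverse document frequency of each term (column)."""
    n_docs = counts.shape[0]
    doc_freq = np.count_nonzero(counts, axis=0)
    return np.log((1 + n_docs) / (1 + doc_freq)) + 1.0

def pca_project(X, k):
    """Project the rows of X onto the first k principal components."""
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

def separation_ratio(Z, labels):
    """Between-group scatter divided by within-group scatter."""
    labels = np.asarray(labels)
    overall = Z.mean(axis=0)
    between = within = 0.0
    for tag in np.unique(labels):
        group = Z[labels == tag]
        centroid = group.mean(axis=0)
        between += len(group) * np.sum((centroid - overall) ** 2)
        within += np.sum((group - centroid) ** 2)
    return between / within

# Toy term-count matrix: 6 articles x 4 terms, two tags.
counts = np.array([
    [3, 1, 0, 0],
    [2, 2, 0, 1],
    [4, 1, 1, 0],
    [0, 0, 3, 2],
    [1, 0, 2, 3],
    [0, 1, 4, 2],
], dtype=float)
labels = ["linux", "linux", "linux", "science", "science", "science"]

X = counts * idf_weights(counts)   # weight each term by its IDF
Z = pca_project(X, k=2)            # coordinates on the first 2 eigenvectors
print(separation_ratio(Z, labels))
```

A ratio well above 1 suggests that the tags form distinct clusters in
the subspace of the first eigenvectors; a ratio close to or below 1
suggests that the chosen original space does not discriminate between
them and should be revisited.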
> 3. Sandbox.
> Bear assigned the sandbox account for me. I haven't used it yet
> because a lot of the code is messy at the moment.
Don't worry if it's messy: we do not build those sandboxes so you won't
break anything. Use them if it helps you to mark "milestones" in your
own development. Committing often to keep track of what you're doing is
one of the best practices a dev can stick to.
> I would appreciate any help on lucene from OSAF. It is indeed a
> powerful horse for text retrieval yet I have to learn to ride it.
Andi?... :)
Cheers,
- Philippe