[Chandler-dev] Multivariate Analysis: the first weekly report
mmmm at osafoundation.org
Mon Jun 12 14:08:49 PDT 2006
> Hi Xun,
> Xun Luo wrote:
>> Hi Philippe,
>> + Accomplished tasks
>> 1. I instrumented the feed parcel to make it display a 'tag' in the
>> detailed view. The data I used is slashdot.net articles, as they have
>> predefined classifications. I downloaded 300 articles and used the
>> following five classifications as "tag"s:
>> The first 250 articles are used to train the PCA and the remaining
>> 50 are used to test automatic tagging. I am still implementing the
>> experiment.
>> Question 1: what do you think about the experiment? By this design, I
>> am simulating personal documents, assuming that personal documents
>> will be similar in nature to these feed documents.
> I think this is pretty cool! It's good to have an almost infinite set
> of existing data to test the algorithm.
You are probably aware that there are ways to use the data set more
efficiently, the bootstrap and the jackknife to name two, which improve
your chances of obtaining a statistically reliable result from a small
data set. 250 data points may seem like a lot, but it actually is not,
and as the dimensionality of the data space increases it becomes a
really small sample (the curse of dimensionality).
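To make the resampling concrete, here is a minimal bootstrap sketch
(the train_and_score callable is hypothetical, standing in for whatever
PCA tagger Xun ends up with) that estimates the tagging accuracy and
its spread from a 250-article sample:

    import random

    def bootstrap_accuracy(articles, train_and_score, n_boot=200):
        """Out-of-bag bootstrap: resample articles with replacement,
        train on the resample, score on the articles left out."""
        n = len(articles)
        scores = []
        for _ in range(n_boot):
            # Draw n indices with replacement; the articles never
            # drawn (about 37% of them) form a natural test set.
            train = [random.randrange(n) for _ in range(n)]
            held_out = sorted(set(range(n)) - set(train))
            if held_out:
                scores.append(train_and_score(train, held_out))
        mean = sum(scores) / len(scores)
        var = sum((s - mean) ** 2 for s in scores) / (len(scores) - 1)
        return mean, var ** 0.5  # accuracy and its standard error

The point is that a single 250/50 split gives one number; resampling
gives you an error bar on it.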
BTW, are you using standard or robust PCA? To me this looks like an
application that could really benefit from the robust version, and it
is not that much harder to implement...
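"Robust" covers many schemes; as one illustration only (my assumption,
not a recommendation), you can get much of the benefit by
median-centering and trimming the most outlying articles before the
eigendecomposition:

    import numpy as np

    def robust_pca_axes(X, n_components=5, trim=0.1):
        """Crude robust PCA: center on the coordinate-wise median,
        drop the `trim` fraction of rows farthest from it, then
        eigendecompose the covariance of the survivors."""
        X = np.asarray(X, dtype=float)
        centered = X - np.median(X, axis=0)   # median resists outliers
        dist = np.linalg.norm(centered, axis=1)
        keep = dist <= np.percentile(dist, 100.0 * (1.0 - trim))
        cov = np.cov(centered[keep], rowvar=False)
        vals, vecs = np.linalg.eigh(cov)      # ascending eigenvalues
        order = np.argsort(vals)[::-1][:n_components]
        return vecs[:, order]                 # principal axes, columns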
> As to whether the PIM items will have the same nature as the feed
> documents, I think there is quite an overlap (with email in
> particular) but PIM data also have strong semantics that are absent
> from your test.
> For instance, if the location field says "Paris" you can be sure that
> this is the city, not the heir of a hotel chain... So the PIM
> semantics attached to the field avoid confusion.
This is interesting. Where is this semantic understanding of the words
coming from? I did not see it in Xun's previous mail and just assumed
that he was building an eigenspace on words. In real world applications
syntax tends to be manageable, but semantics is very, very hard to
master...
> It shouldn't be difficult though to introduce this in your analysis
> (in the example above, maybe encoding it as "location:Paris" will be
> enough) and it's certainly easier to move from semantically non
> differentiated data (as you're doing) to strong semantics than the
> reverse.
> Note also that we *do* need the text analysis you're doing (as we
> already established in a previous discussion on IRC).
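Philippe's "location:Paris" trick is easy to fold into the term
extraction, by the way. A minimal sketch (the field names here are
made up for illustration, not Chandler's real attribute names):

    def item_tokens(item):
        """Tokenize a PIM item, prefixing field values with the field
        name so "Paris" the location stays distinct from "paris" the
        mere word in the vocabulary space."""
        tokens = [w.lower() for w in item.get("body", "").split()]
        for field in ("location", "who", "title"):  # hypothetical fields
            for word in item.get(field, "").split():
                tokens.append("%s:%s" % (field, word.lower()))
        return tokens

    # item_tokens({"location": "Paris", "body": "Lunch near the Louvre"})
    # -> ['lunch', 'near', 'the', 'louvre', 'location:paris']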
I concur. This kind of feature, if you can make it really work, is
valuable to users. Of course there are lots of existing applications
using ICA and SOM (maybe even TS-SOM), to name a few techniques that
have been used on similar problems. In this case it might even be
interesting to compare results between PCA and ICA, but maybe this is
too much for a summer project... Well, two thumbs up, as I really like
this project...
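If the PCA/ICA comparison ever gets a spare afternoon, the mechanics
are cheap. A sketch using scikit-learn (an assumption on my part; any
PCA and FastICA implementation would do) on a term-weight matrix X
with one row per article:

    import numpy as np
    from sklearn.decomposition import PCA, FastICA

    def compare_subspaces(X, n_components=5):
        """Project the articles onto the leading PCA directions and
        onto the same number of ICA directions, so one can check
        which basis separates the tag groups better."""
        X = np.asarray(X, dtype=float)
        pca_coords = PCA(n_components=n_components).fit_transform(X)
        ica_coords = FastICA(n_components=n_components,
                             random_state=0).fit_transform(X)
        return pca_coords, ica_coords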
>> 2. I had a python PCA working.
>> By mapping all the feed articles with a specific classification
>> tag to an "eigen-article", I could obtain a map from tag to virtual
>> article and then compare un-tagged articles to the five
>> "eigen-article"s. This seems straightforward. However, I am now
>> thinking about the original space. Ideally, the original space should
>> be the whole vocabulary space. However, there is difficulty using
>> lucene; I am trying hard to make use of the
>> inverse-document-frequency list. For the time being, I used a
>> reduced original space: a word set expanded from the
>> classifications. Specifically, I used lucene's function to find the
>> top 10 words from each classification with the highest term-frequency,
> You mean: with the highest IDF right?
>> thus I had a 50-word space (10 words by 5 classifications). I created
>> an extra vector space to hold the information of each article. By
>> using PCA, I get about 5 word-combinations that could classify almost
>> all new articles correctly.
> Do you mean that your experimentation is already done? I'm a little
> confused here with your use of tense.
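For what it's worth, here is how I picture the "eigen-article" scheme
from the description above (a sketch under my own assumptions, not
Xun's actual code): project each tag's training articles into the PCA
subspace, average them into one prototype per tag, and give a new
article the tag of the nearest prototype.

    import numpy as np

    def fit_eigen_articles(X, tags, n_components=5):
        """X: term-weight matrix, one row per training article.
        tags: the classification of each row.
        Returns the PCA basis plus one mean 'eigen-article' per tag."""
        X = np.asarray(X, dtype=float)
        mean = X.mean(axis=0)
        # PCA via SVD of the centered matrix; rows of vt are the axes.
        _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
        basis = vt[:n_components]
        coords = np.dot(X - mean, basis.T)
        protos = {}
        for t in set(tags):
            rows = [i for i, g in enumerate(tags) if g == t]
            protos[t] = coords[rows].mean(axis=0)
        return mean, basis, protos

    def tag_article(x, mean, basis, protos):
        """Assign the tag whose prototype is nearest in the subspace."""
        c = np.dot(np.asarray(x, dtype=float) - mean, basis.T)
        return min(protos, key=lambda t: np.linalg.norm(c - protos[t]))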
>> Question 2: How to make use of the underlying
>> inverse-document-frequency data in lucene is now a big obstacle for
>> me. I am going to ask for help from the java-lucene group. It
>> would be great if there were an expert at OSAF.
> Our lucene expert at OSAF is Andi Vajda. For such a specialized use
> though, you're likely to get your answer faster on the java-lucene group.
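In the meantime, the quantity itself is simple. A pure-Python sketch
of the IDF weights Xun wants out of lucene (just the textbook
definition, not the lucene API, which smooths it slightly), given the
tokenized training articles:

    import math

    def idf_weights(docs):
        """docs: list of token lists, one per article.
        Returns {term: idf} with idf = log(N / document-frequency)."""
        n = len(docs)
        df = {}
        for doc in docs:
            for term in set(doc):  # count each term once per document
                df[term] = df.get(term, 0) + 1
        return dict((term, math.log(float(n) / count))
                    for term, count in df.items())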
> On this whole paragraph, I think your approach is correct (use IDF to
> weight the keywords, reduce the vocabulary set using training data,
> etc...). My question is: how do you make sure that the original space
> you select discriminates between the different tags? Ideally, you
> should run the PCA on the whole set of training data and verify that
> the different groups do cluster in some subspaces of the first
> eigenvectors. I feel like the verification step (making sure that the
> representation space will discriminate between the groups) is missing.
If discrimination is your objective, then there are better approaches
than PCA; see for example LDA, QDA, etc. in discriminant analysis, and
even then you cannot guarantee that the clusters separate. These
methods do need supervised training, but since the first 250 articles
are already tagged, defining good discrimination a priori should be
possible here. Kernel versions might yield better results, but that is
another story...
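To show what I mean, a minimal LDA sketch with scikit-learn (again an
assumption about tooling; the method is what matters): unlike PCA,
which finds the directions of maximal variance, LDA finds the
projections that separate the tag groups directly.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def lda_tagger(X_train, tags_train):
        """Fit LDA on the tagged articles; with 5 tags it projects
        onto at most 4 directions chosen to separate the groups."""
        lda = LinearDiscriminantAnalysis()
        lda.fit(np.asarray(X_train, dtype=float), tags_train)
        return lda  # lda.predict(X_new) tags new articles

    # lda_tagger(X, tags).transform(X) gives the coordinates to plot
    # for the visual verification step Philippe asks for above.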
>> 3. Sandbox.
>> Bear assigned the sandbox account for me. I haven't used it yet
>> because my code is quite messy at the moment.
> Don't worry if it's messy: we do not build from those sandboxes so
> you won't break anything. Use them if it helps you to mark
> "milestones" in your own development. Committing often to keep track
> of what you're doing is one of the best practices a dev can stick to.
>> I would appreciate any help on lucene from OSAF. It is indeed a
>> powerful horse for text retrieval yet I have to learn to ride it.