[Chandler-dev] Multivariate Analysis: the first weekly report
Xun Luo
sherwoodluo at gmail.com
Mon Jun 12 23:45:26 PDT 2006
1. Similarity of test data to PIM data:
(Phillippe)
> > I think this is pretty cool! It's good to have an almost infinite set
> > of existing data to test the algorithm.
(Markku)
> You are probably aware that there are ways to use the data set more
> efficiently, bootstrap and jackknife to name a few, which improve your
> chances of obtaining a statistically reliable result from a small data
> set.
(Xun)
For PIM, it is actually hard to obtain training samples in large
quantities. Within the SoC project time frame, I will most likely set
up the framework on readily available data rather than on real PIM
data, so I would prefer to stick to article feeds.
As future work, we may add an extra layer above the PIM data that
contains fine statistical methods to increase sampling efficiency (such
as the bootstrap and jackknife mentioned by Markku). The assumption is
that:
PIM data + fine stat methods = (or close to) text corpus.
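To make the bootstrap and jackknife idea concrete, here is a minimal
sketch that estimates the standard error of a mean from a small sample.
The sample values and the function names are hypothetical and only
illustrate the resampling step; they are not taken from the project.

```python
import random
import statistics

def bootstrap_se(sample, stat=statistics.mean, n_resamples=2000, seed=0):
    """Standard error of `stat`, estimated by resampling with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = [
        stat([rng.choice(sample) for _ in range(n)])
        for _ in range(n_resamples)
    ]
    return statistics.stdev(estimates)

def jackknife_se(sample, stat=statistics.mean):
    """Standard error of `stat`, estimated by leaving one item out at a time."""
    n = len(sample)
    leave_one_out = [stat(sample[:i] + sample[i + 1:]) for i in range(n)]
    center = sum(leave_one_out) / n
    return ((n - 1) / n * sum((x - center) ** 2 for x in leave_one_out)) ** 0.5

# Hypothetical per-item scores from a small PIM-like data set.
SCORES = [0.62, 0.71, 0.55, 0.68, 0.74, 0.59, 0.66, 0.70]

if __name__ == "__main__":
    print("bootstrap SE:", bootstrap_se(SCORES))
    print("jackknife SE:", jackknife_se(SCORES))
```

For the mean, the jackknife estimate coincides with the usual
s / sqrt(n) formula, which makes it a convenient sanity check before
applying either method to a less tractable statistic.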
2. Semantics:
(Phillippe)
> > For instance, if the location field says "Paris" you can be sure that
> > this is the city, not the heir of a hotel chain... So PIM semantic
> > attached to the field avoids confusion.
(Markku)
> This is interesting. Where is this semantical understanding of the words
> coming from? I did not see it on Xun's previous mail and just assumed
> that he was making an eigenspace on words. In real world applications
> syntax tends to be manageable but semantics is very very hard to master...
> >
(Xun)
I concur with Markku. Semantics are useful but at the same time quite
empirical. Most sophisticated text analysis algorithms are fine-tuned
variants of a common method. I assume the task is mainly one of
differentiating between tags (please DO point out if I am wrong; so far
I do not see a way to do unsupervised tag recommendation, although it
could be done by giving a ranked list of pre-defined tags). If so, a
post-stage refinement can always be applied after we set the framework
up using vanilla PCA.
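As an illustration of the "vanilla PCA plus ranked list of pre-defined
tags" idea, here is a minimal sketch in pure Python. It is not the
project's implementation: the vocabulary, the toy term counts, the
power-iteration PCA, and the nearest-centroid ranking are all
assumptions chosen to keep the example self-contained.

```python
import math

def fit_pca(rows, k, iters=200):
    """Return (means, components) for the top-k principal components,
    found by power iteration with deflation on the covariance matrix."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    centered = [[r[j] - means[j] for j in range(dims)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1)
            for j in range(dims)] for i in range(dims)]
    components = []
    for _component in range(k):
        start = [float(j + 1) for j in range(dims)]
        length = math.sqrt(sum(x * x for x in start))
        v = [x / length for x in start]
        for _step in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * cov[i][j] * v[j]
                         for i in range(dims) for j in range(dims))
        components.append(v)
        # Deflate: remove the variance explained by this component.
        for i in range(dims):
            for j in range(dims):
                cov[i][j] -= eigenvalue * v[i] * v[j]
    return means, components

def project(vec, means, components):
    """Coordinates of `vec` in the reduced (eigen) space."""
    return [sum((vec[j] - means[j]) * c[j] for j in range(len(vec)))
            for c in components]

def rank_tags(doc, labeled, k=2):
    """Rank pre-defined tags by distance from `doc` to each tag centroid."""
    means, comps = fit_pca([vec for _tag, vec in labeled], k)
    groups = {}
    for tag, vec in labeled:
        groups.setdefault(tag, []).append(project(vec, means, comps))
    centroids = {tag: [sum(col) / len(pts) for col in zip(*pts)]
                 for tag, pts in groups.items()}
    point = project(doc, means, comps)
    return sorted(centroids, key=lambda tag: math.dist(point, centroids[tag]))

# Hypothetical term counts over the vocabulary
# ["meeting", "budget", "flight", "hotel"].
LABELED = [
    ("work", [3, 2, 0, 0]), ("work", [2, 3, 0, 1]), ("work", [4, 1, 0, 0]),
    ("travel", [0, 0, 3, 2]), ("travel", [0, 1, 2, 3]), ("travel", [1, 0, 4, 2]),
]

if __name__ == "__main__":
    print(rank_tags([0, 0, 2, 2], LABELED))
```

On this toy data a document that only mentions "flight" and "hotel"
ranks the "travel" tag first. A post-stage refinement would then
operate on the ranked list rather than on the raw vectors.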
3. Stat methods:
(Markku)
> ICA and SOM (maybe even TS-SOM) to name a few that have been used on
> comparable results between PCA and ICA but maybe this is too much for a
> summer project... Well, two thumbs up as I really like this project...
(Xun)
True. I once implemented ICA on my own and found it forbidding. There
is a very good Matlab library for ICA called fastICA, yet importing it
would be a challenge. Again, my opinion is that at this stage an
expandable framework matters more than an excellent classification
result.
I should come up with an architectural document describing the design,
which leaves enough room for future algorithmic improvements. This will
be done this week. There are dozens of fine text analysis algorithms
available and emerging (SIGIR and TREC are still very active and
innovative conferences). I would treat an open framework that leaves
enough room for them as a major accomplishment for SoC.
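One possible shape for such an open framework, purely as a sketch: the
dimension-reduction step sits behind a small interface so that PCA, ICA
or SOM could be swapped in later. All class and method names here are
hypothetical, and KeepFirstK is only a trivial stand-in for a real
reducer.

```python
import math
from abc import ABC, abstractmethod

class Reducer(ABC):
    """Hypothetical plug-in point: maps a raw term vector to a reduced one."""

    @abstractmethod
    def fit(self, vectors):
        """Learn any parameters from the training vectors; return self."""

    @abstractmethod
    def transform(self, vector):
        """Return the reduced representation of one vector."""

class KeepFirstK(Reducer):
    """Trivial stand-in reducer that keeps the first k coordinates."""

    def __init__(self, k):
        self.k = k

    def fit(self, vectors):
        return self

    def transform(self, vector):
        return list(vector[:self.k])

class TagRecommender:
    """Ranks pre-defined tags by distance to per-tag centroids."""

    def __init__(self, reducer):
        self.reducer = reducer
        self.centroids = {}

    def train(self, tagged_vectors):
        # tagged_vectors is a list of (tag, vector) pairs.
        self.reducer.fit([vec for _tag, vec in tagged_vectors])
        groups = {}
        for tag, vec in tagged_vectors:
            groups.setdefault(tag, []).append(self.reducer.transform(vec))
        self.centroids = {tag: [sum(col) / len(rows) for col in zip(*rows)]
                          for tag, rows in groups.items()}
        return self

    def recommend(self, vector):
        point = self.reducer.transform(vector)
        return sorted(self.centroids,
                      key=lambda tag: math.dist(point, self.centroids[tag]))
```

With this split, replacing the algorithm means writing a new Reducer
subclass; the recommender and its callers stay untouched.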
4. Lucene:
(Phillippe)
a) > > You mean: with the highest IDF right?
b) > > Do you mean that your experimentation is already done? I'm a little
> > confused here with your use of tense.
c) > > etc...). My question is: how do you make sure that the original space
> > you select discriminates between the different tags? Ideally, you
> > should run the PCA on the whole set of training data and verify that
> > the different groups do cluster on some subspaces of the first eigen
> > vectors. I feel like the verification step (making sure that the
> > representation space will discriminate between the groups) is missing.
(Xun)
a) Yes, with the highest IDF.
b) I have already run an experiment on a reduced original space. As you
have seen, that experiment is still quite different from the one
envisioned in 1). That's why I said 1) is not finished yet.
c) True. By no means can I claim that an arbitrary choice of reduced
original space will be effective. The right approach is to make use of
the whole vocabulary space rather than to prove that a particular
subset is the right one. This week I have mainly been working on this,
and it took up the majority of the effort. It will be rewarding once
accomplished, though.
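For reference, selecting the terms with the highest IDF can be sketched
as follows. This uses the textbook definition idf = log(N / df); the
formula Lucene applies internally may differ in detail, and the toy
documents and function names are hypothetical.

```python
import math
from collections import Counter

def inverse_document_frequency(docs):
    """Map each term to log(N / df), where df counts documents containing it."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    return {term: math.log(n_docs / count) for term, count in doc_freq.items()}

def highest_idf_terms(docs, n_terms):
    """Return the n_terms rarest terms; ties are broken alphabetically."""
    idf = inverse_document_frequency(docs)
    return sorted(idf, key=lambda term: (-idf[term], term))[:n_terms]

# Hypothetical tokenized documents.
DOCS = [
    ["the", "budget", "meeting"],
    ["the", "flight"],
    ["the", "hotel", "flight"],
]

if __name__ == "__main__":
    print(highest_idf_terms(DOCS, 2))
```

A term that occurs in every document gets an IDF of zero, so a reduced
space built this way drops it first. Using the whole vocabulary instead
amounts to setting n_terms to the vocabulary size.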
5. Sandbox and Milestones
(Xun)
Checking in is truly important. I will check the code in by the end of this week.