[Chandler-dev] [MVA] Next steps after SoC project

Philippe Bossut pbossut at osafoundation.org
Tue Aug 29 15:32:26 PDT 2006


Hi Xun,

I posted my SoC review to Google and I expect the process over there to 
move along swiftly. Thanks to your work, I think we now have several 
interesting things:
- a working parcel to experiment with
- a set of experiments run on Slashdot feeds showing that auto tagging 
after a learning phase is feasible
- a path to incorporate PCA into Chandler (using MDP)

I've been discussing with vikSIT on IRC what the next steps should be. 
Eventually, the goal is to have auto tagging turned on (optionally) 
on the trunk, but we won't see that in Chandler before we have a 
tagging feature in (planned for alpha6). In the meantime, there's a set 
of short-term things we should do:
1- clean up 
http://wiki.osafoundation.org/bin/view/Journal/InternProjectMVA : I 
posted a set of comments on this page back in July but they still 
haven't been properly incorporated (assuming I was right). I think you 
should do that, Xun.
2- update the parcel to run against the 0.7alpha4 trunk code: 
right now, it runs against 0.7alpha1, which is fairly old. I know that 
the egg stuff threw a wrench in your plan but we should get past 
that hurdle. Maybe vikSIT or other contributors (Markku?) could help there.
4- incorporate the MDP library: we need to find a place to park it 
and start using it. Traditionally in Chandler we download tarballs and 
create a specific Makefile for such projects under /external. That's 
what we do for icu, PyLucene, twisted and the like, so it looks like we 
should do the same for MDP. Maybe bear can help / give us some 
advice here.
5- make the relevant changes in the MVA project to call the MDP library: 
the first thing will be to compute the eigenvectors correctly. Right now 
the code simply computes an average vector (cumulative, non-normalized) 
per tag (until we reach a given threshold), then projects the new vectors 
onto them. The threshold for accumulation is not grounded in data, the 
threshold for attribution (when projecting) is not grounded in data 
either, and we run no analysis of which dimensions contribute to the 
variance and which don't. We need to improve all of that and use the MDP 
calls to do so.
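
To make item 5 concrete, here is a minimal Python sketch, not the actual 
MVA parcel code. The first half mimics the current approach as described 
above (one cumulative, non-normalized vector per tag, then projection 
against a threshold). The second half shows the kind of variance analysis 
that is missing, using a hand-rolled power iteration as a stand-in for 
what a PCA library such as MDP would provide. All function names, the 
feature vector layout and the threshold are assumptions made for 
illustration only.

```python
import math


def accumulate_tag_vectors(tagged_vectors):
    """Sum the feature vectors seen for each tag (cumulative, not normalized).

    tagged_vectors is a hypothetical list of (tag, vector) pairs.
    """
    sums = {}
    for tag, vec in tagged_vectors:
        if tag not in sums:
            sums[tag] = list(vec)  # copy so the caller's data is untouched
        else:
            acc = sums[tag]
            for i, value in enumerate(vec):
                acc[i] += value
    return sums


def projection(vec, direction):
    """Length of the projection of vec onto the unit vector along direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        return 0.0
    return sum(v * d for v, d in zip(vec, direction)) / norm


def suggest_tags(vec, sums, threshold):
    """Return the tags whose accumulated vector vec projects onto strongly.

    threshold is an arbitrary cut-off here, which is exactly the weakness
    pointed out in item 5.
    """
    return sorted(t for t, d in sums.items() if projection(vec, d) >= threshold)


def covariance(vectors):
    """Sample covariance matrix of a list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(dim)]
    centered = [[v[i] - mean[i] for i in range(dim)] for v in vectors]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
            for i in range(dim)]


def leading_eigenpair(matrix, iterations=200):
    """Largest eigenvalue and its eigenvector, by plain power iteration.

    Assumes a symmetric matrix and a start vector that is not orthogonal
    to the leading eigenvector; a real PCA routine handles the general case.
    """
    dim = len(matrix)
    vec = [1.0] * dim
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            return 0.0, vec
        vec = [x / norm for x in nxt]
    # Rayleigh quotient of the (unit length) vector gives the eigenvalue.
    eigenvalue = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(dim))
                     for i in range(dim))
    return eigenvalue, vec
```

Dividing the leading eigenvalue by the trace of the covariance matrix 
gives the fraction of the total variance carried by the first principal 
direction, which is one way the two thresholds could eventually be tied 
to the data instead of being picked by hand.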

For the time being, we'll continue to use sandbox/xluo as a repository 
for this code. Eventually, we'll want to move it to chandler/projects 
once we prove that it's worthwhile. For the moment, though, maintaining 
that code on the trunk would be a drag on the project (it would be 
subject to all the QA/build/release engineering constraints we have for 
everything that lands on the trunk), and most of the would-be 
contributors don't have svn trunk privileges anyway. However, working 
with patches from several sandboxes would be annoying, so I'm proposing 
that we consider sandbox/xluo the official repo for MVA.

Also, I propose we continue with the Slashdot experiment you started. 
Clearly, we'll have to grow out of it in a while, but I see no reason 
to do that as long as we don't have a better MVA model.

What do you guys think?
BTW, just as a roll call, who on this list would be interested in 
contributing to this project moving forward?

Cheers,
- Philippe

