[Chandler-dev] [SoC] Multivariate Analysis for PIM Data project

Philippe Bossut pbossut at osafoundation.org
Wed May 24 23:58:26 PDT 2006


Hi Xun,

Congratulation again for being selected as a SoC student. I hope we'll 
have a fruitful summer. For one thing, we do have a great project and 
I'm glad we're going started on this.

Here under are some comments on your project plan (your text in /Italic/).

/Multivariate analysis for search and data exploration in PIM data

'The idea is to use multivariate analysis tools (Principal Component 
Analysis, segmentation) to create a model of the data in a PIM and map 
newly acquired data. The objective is to allow some level of sorting 
(tagging) to be automatically done by the software' (quoted from Dr. 
Bossut's description on OSAF wiki).
/
Left here for other list readers to get some context on what this 
project is about.

/   1. User Data Tagging
      In Chandler 0.6, user data in PIM are described by Content Item 
objects. Content Item is the abstract super-kind for things like 
Contacts, Calendar Events, Tasks, Mail Messages, and Notes. Different 
classes of user data are defined by Kinds, and consequently have 
specific attributes to be recorded and persisted to Repository. Chandler 
regards all the Kinds of user data as interexchangable. Consequently, a 
Mail Message item could be a Task item at the same time. A list named 
Tags specifies all the tags that a Content Item is associated with. As 
there is no explicit UI to assign value to the tags that an Content Item 
is associated with, by my understanding, the aim of the project is about 
fully- or semi-automated value assigning to the Tags list.
/
We already had a private discussion on this that would be hard to rehash 
so, to summarize, I think we agreed that we could:
- use the existing collections as a proxy to tags: the idea is that, 
assuming a big enough data set, we can identify a cluster (using some 
segmentation technique) that corresponds to said collections and, 
therefore, automatically mark incoming data as part of some collections. 
The good news is that since collections are not hierarchical, a false 
positive won't bury an item in a deep tangle of "folders". We will 
however have to identify which collections are relevant: some 
collections (like "All", "Trash", "In", "Out") shouldn't be mapped, 
others are strictly internal. If I'm not mistaken, we should limit this 
to user defined SmartCollection (i.e. all the SmartCollection minus the 
OOTB collections).
- experiment with some auto tagging: you had some ideas on this
- use the result of clustering to provide some "find similar" feature

/   2. Model Analysis By PCA
      All the attributes of a Content Item together form a vector. By 
grouping these vectors that are close to each other and divide these 
vectors that are far apart, the user data in Chandler could be 
clustered/segmented. Each segmented subset user data could be assigned a 
tag or a set of tags, in this way automated tagging is done. Whenever 
new user data are acquired, they are assigned tags based on their 
similarities with the existing data segments. By using PCA, principle 
components which describe the majority of the sample variation are 
figured out. These principle components construted a subspace of the 
original attribute vector space. Every attribute vector is then 
projected into the subspace that best encodes the variation of known 
attribute vectors. By dividing projected vectors in the subspace along 
the principle componets axes, we are hopeful to get segmentation results 
of good quality.
      It is worth noting that once Content Items are abstracted into 
attribute vectors, there are also other segmentation techniques that 
could be applied, include the classic k-means, and hyper-plane 
separation methods such as Support Vector Machine (SVM). The 
implementation effort of k-means is as same as PCA, and there is 
off-the-shelf SVM library in C that could be used in a SWIG way.
/
This is the gist of it. One thing that worries me is the storage of the 
vector per item. That could be quite a load on our repository. I 
understand that you need the vectors to compute the distances 
(similarities). Another approach though is to model the clusters with 
few cluster specific data (center of gravity and ellipsoid dimension in 
a reduced PCA space, this is where PCA comes handy really). But may be 
I'm jumping the gun here. We need to have real data on what is the cost 
of this vector.

/   3. Tagging Dynamics
      Tagging of Content Items is initially done over a given set of 
user data. With the elapse of time, old user data may be modified or 
deleted, and new user data maybe inserted. Under some cases, meta-data 
about tags themselves, such as their names and their senses, may be 
changed as well. All these events introduce information change in the 
underlying statistical model, and cause tagging dynamics. One possible 
strategy to deal with tagging dynamics is to set thresholds of 
statistical model changes. A model ajustment action is triggered when 
the change exceeds a pre-set threshold. For example, newly acquired user 
data that does not fit the existing tags will be tagged 'unknown' in the 
begining. When the number Content Items that have 'unknown' tags 
increase to a specific amount, new cluster/segmentation emerge among 
these Items and are properly tagged.

I would like to apply for project 'Behavior Analysis and Prediction' at 
the same time (at second priority). Because this project has strong 
correlation with the first one and lots of the work could be overlapping.
/
It turns out that Google did not grant us a student on this project so, 
well, it could be a possibility. I do think however that the 2 projects 
have only partial overlap. The one you're starting now doesn't use any 
instrumentation data for instance.

/3. Implementation Plan

Implementation of the automated tagging of user data feature could be 
broken down to three sub-tasks. Each sub-task accounts for about 1/3 of 
the whole development effort. Namely, to instrument the current Content 
Item class and its sub-classes with attribute vector information; To 
create new statistical class which accomodates information of the 
principle components, weight centroids of each tagged cluster, as well 
as all the available tags; To create callback functions or 'hooks' of 
the new statistical class, and instrument these functions to the user 
data update functions in Chandler.

   1. New member variables to Content Items and its sub-classes

      An attribute vector, which maps the values of each attribute of a 
Content Item, should be added as a member variable of Content Item 
class. The vector codes values of common attributes into vector bits. In 
this way, categorical or nominal types of attributes could be mapped to 
numerical types, and subsequently available for PCA computation.
/
That seems to be a straightforward addition though, again, I'm worry 
about adding any persistent data to Content Item, especially when this 
data is redundant with the existing one. See my comment here above 
though with "jumping the gun"...

/      One of the design considerations here is that different sub-class 
of Content Item owns different set of attributes. So should we put the 
attribute vector in each sub-class, or maintain a one-size-fits-all 
vector in Content Item? I propose to adapt a hybrid solution: the class 
Content Item maintains a vector called base_vector, while the subclass 
maintain a vector called kind_vector. Common attributes that are shared 
by all user data, such as entry-time and people-involved, are coded in 
base_vector. While sub-class specific attributes, such as a Mail 
Message's address or a Task's completion percentage, are coded in 
kind_vector. The vector which is used for PCA computation is a weighted 
concatenation of these two. When doing weighted concatenation, the 
kind_vector part of the attribute vector is mapped into an universal 
format, so camparision between two different sub-class data is made 
possible.
/
The worry here is that the base_vector will be too poor to be of much 
use unless we extract keywords from the body and displayName attributes 
of the base item (which, on second thought, we should do and might be 
the most semantically significant data...). The kind_vector will be 
where the semantically significant data is (email addresses in 
particular, this should give a good clue of where something should go in 
lots of cases). The mapping though is a pretty tough nut to crack and 
it's possible that only analysis of data of the same kind are really 
possible (though we can always get something with base_vector alone, 
which is already a progress compared to any other PIM out there). The 
mapping issue is something we face also in "attribute redirection", the 
problem we have when trying to populate say the "who" column in the 
summary table view for items of different kind. I hope that the people 
involved in "attribute redirection" will chime in (or I'll go and ask 
them... :) )

/   2. New class to hold statistical information and computation.

      A new class, which represents a statistical multivariate analysis 
method, should be added. Member variables of this class include a set of 
available tags, to be assigned to the data segments/clusters. This class 
should also contain the centroids or cores of the data 
segments/clusters, so when new user data is acquired, it could be 
compared with the centroids or cores of existing clusters, and then 
decision is made about which tag should be applied on. This class should 
has the member methods such as:
          * update_weights(ItemCollection C) to use a collection of user 
data to adjust the internal statistical parameters.
          * assign_tag(ContentItem I) to assign tag(s) to a given user 
data item based on its segment result.
          * data_segmentation(ItemCollection C) to segment a collection 
of user data into optimized partitions, etc.
/
/      Instead of making this class to be directly a PCA class, I would 
rather propose it to be an abstract class that does not rely on specific 
multivariable data analysis methods. In other words, this abstract class 
defines the 'Policies' of using an analysis method to fullfil automated 
tagging tasks, and leaves detailed implementation to its sub-classes. 
The PCA class should be derived from this abstract class, implement the 
interfaces the abstrat class defined, and owns specific data for PCA 
analysis method. If we would like to implement other analysis methods in 
the future, such as k-means or SVM, we just derive new sub-classes of 
this abstract class, just as the way we derive the PCA sub-class.

      For the proposed project, PCA will be the only fully implemented 
analysis method. I had developed a full-fledged PCA analyzer for a past 
project, the code could be looked at 
http://smpp-ric.org/viewvc/dataglove_pca/. The code was written in 
Matlab but translation into Python should be straightforward. The PCA 
class will also own some private member variables, such as threshold 
values and criterions to select principle components, in term of how 
much variation should be preserved.
/
This is what I alluded to above when talking about "modeling the 
clusters" (or centroids in your text). That seems like a natural idea to 
me and makes me question again the wisdom of storing the vector per 
content item in a persistent way...

/   3. Callbacks and 'hooks' instrumentation for Chandler user data updates.

      Callbacks, or 'hooks', are the mechanisms that glue Chandler 
Content Items and the statistical class together. These hooks are 
bilateral, i.e., some of them are provided by Content Item classes and 
to be used by statitical class, and some of them are vice versa. For 
example, the Content Item class needs to call an analytical method 
callback provided by the statistical class, during its update of user 
data content. The callback may trigger a model update for the internal 
data of the statistical class. On the other hand, when a new analytical 
methods is introduced or there is parameter change for the underlying 
model of the current analytical method, the statistical class calls a 
Content Item class callback to update the data view in Chandler UI.
/
Because its class mapping may change? (and therefore its tags or 
collection binding). Not sure otherwise why the Chandler UI would change...

/   4. Language considerations.

      There is no doubt that Python will be the final language for 
project implementation. Using SWIG to wrap modules written in other 
language, such as C/C++ is also feasible, albeit introduces some extra 
effort in code management. If we regard the tagging process as an 
information pipeline, it could be seen that data come in as untagged 
Content Items, processed by the multivariate analysis methods, and come 
out as tagged (sorted) Content Items. For fast prototype setting and 
experimental purposes, it is possible that during the development stage, 
part 2 of the information pipeline, i.e. multivariate analysis, be 
implemented in the form of a language other than python. In my opinion, 
at least for the PCA method this intermediate step could be done with 
complied Matlab modules.
/
Sounds good to me. Matlab will also provide you with visualization 
methods that can be very useful when experimenting with data sets.

/   5. Resource considerations.

      In case more subtle algorithms and strategies need to be used, 
which involves Natual Language Processing, Information Retrieval and 
Statistical Linguistics. Consulation help could be obtained through 
faculties in the Computer Science department of University of Illinois 
at Chicago. Quite a few UIC Professors are recognized experts in the 
above fields, including chairmen for ACM SIGMOD and ACM SIGKDD conferences.
/
/4. Timeline

The whole project spans from May 22, 2006 to August 21, 2006, a total 
time period of 13 weeks.

    * Before May 22 (interim period before kicking off)
      Scrutiny project-related Chandler code and general PCA methods 
implementations packages; Try to come up as rich and specific ideas as I 
could; Continue to update this proposal (I wish it to be a solution of 
quality to the project finally, no matter got accepted or rejected); 
Keep in touch with the mentor; Read literature of similar work, if there 
are any.
/
This is "done"... ;)

/    * May 22 to May 26 (week 1)
      Orientation of OSAF development system and other related documents 
kept by the mentor, if there are any; Commercement of development over 
Chandler code.
/
It seems to me that you're already well into this, having already 
compiled Chandler from scratch on Ubuntu and written a parcel as an 
exercise. We need to see where you can park your stuff in a public 
repository. One idea is to create a branch from our SVN repo. This would 
certainly make tracking and final integration easier. We usually though 
do not grant commit privileges automatically to newcomers. They 
typically need to submit a couple of patches that go through a review 
process. Another idea is to use some sort of sandbox. We have this 
problem for the other SoC students so we need to solve this soon...

/    * May 29 to June 16 (week 2,3,4)
      Instrumentation of Content Item class to add attribute vector 
support, include the base class and all current subclasses: Contacts, 
Calendar Events, Tasks, Mail Messages, and Notes; Implement one part of 
the PCA class which does data segmentation.
/
One thing you'll need is a good data set for experimentation. We have a 
3000 items calendar we use for perf testing but that's all auto 
generated data and finding a pattern in that won't work. Looks to me 
that we'll need to create some sort of test data set (with known 
clusters) for testing.

Another tricky aspect is defining the dimensionality of the space: 
should we use every single attribute as a dimension? should we extract 
some keywords from text (notes field, subject) and use them as well? I'm 
inclined to say "yes" to both, certainly the second question since most 
of the semantic will be there... We need to find something to extract 
keywords from a string. There's certainly some research to do to find a 
suitable algo or open source implementation of this (note that we do not 
use GPL licensed code in Chandler since we intend to be Apache licensed).

/    * June 19 to June 23 (week 5)
      Midterm report of the project to the mentor; Develop phase one 
postmortem; Continue to work on PCA class; Unit test of the completed 
modules.

    * June 26 to July 14 (week 6,7,8)
      Implement the other part of the PCA class which maps data segments 
to tags; If needed, implement a help class to for tag management; Unit 
test of the completed modules.

    * July 17 to July 28 (week 9,10)
      Develop phase two postmortem; Callbacks instrumentation in Content 
Item classes and statistical method class; Unit test of the completed 
modules; Integration test of the tagging pipeline.

    * July 31 to August 18 (week 11,12,13)
      Data-intensive test of the MVA auto-tagging feature; Tune up of 
the clustering/segmentation accuracy by optimization of MVA parameters; 
Document the whole project; Release the deliverables -- documentation, 
source code and SVN repository.
/
Sounds like a great plan, fairly ambitious but feasible. We'll need to 
find some real user target data somewhere around mid July.

So, the next steps are:
- define somewhere a sandbox or branch for you to start developing
- look at the schema for each kind and see which attribute should be 
used in the vector
- research a keyword extraction method
- define vector and vector encoding per content item

How does that sound?

That was a long email for a start so may be we should branch more 
focused discussion items in the future.

Cheers,
- Philippe



More information about the chandler-dev mailing list