[Chandler-dev] [SoC] Multivariate Analysis for PIM Data project
Philippe Bossut
pbossut at osafoundation.org
Wed May 24 23:58:26 PDT 2006
Hi Xun,
Congratulation again for being selected as a SoC student. I hope we'll
have a fruitful summer. For one thing, we do have a great project and
I'm glad we're going started on this.
Here under are some comments on your project plan (your text in /Italic/).
/Multivariate analysis for search and data exploration in PIM data
'The idea is to use multivariate analysis tools (Principal Component
Analysis, segmentation) to create a model of the data in a PIM and map
newly acquired data. The objective is to allow some level of sorting
(tagging) to be automatically done by the software' (quoted from Dr.
Bossut's description on OSAF wiki).
/
Left here for other list readers to get some context on what this
project is about.
/ 1. User Data Tagging
In Chandler 0.6, user data in PIM are described by Content Item
objects. Content Item is the abstract super-kind for things like
Contacts, Calendar Events, Tasks, Mail Messages, and Notes. Different
classes of user data are defined by Kinds, and consequently have
specific attributes to be recorded and persisted to Repository. Chandler
regards all the Kinds of user data as interexchangable. Consequently, a
Mail Message item could be a Task item at the same time. A list named
Tags specifies all the tags that a Content Item is associated with. As
there is no explicit UI to assign value to the tags that an Content Item
is associated with, by my understanding, the aim of the project is about
fully- or semi-automated value assigning to the Tags list.
/
We already had a private discussion on this that would be hard to rehash
so, to summarize, I think we agreed that we could:
- use the existing collections as a proxy to tags: the idea is that,
assuming a big enough data set, we can identify a cluster (using some
segmentation technique) that corresponds to said collections and,
therefore, automatically mark incoming data as part of some collections.
The good news is that since collections are not hierarchical, a false
positive won't bury an item in a deep tangle of "folders". We will
however have to identify which collections are relevant: some
collections (like "All", "Trash", "In", "Out") shouldn't be mapped,
others are strictly internal. If I'm not mistaken, we should limit this
to user defined SmartCollection (i.e. all the SmartCollection minus the
OOTB collections).
- experiment with some auto tagging: you had some ideas on this
- use the result of clustering to provide some "find similar" feature
/ 2. Model Analysis By PCA
All the attributes of a Content Item together form a vector. By
grouping these vectors that are close to each other and divide these
vectors that are far apart, the user data in Chandler could be
clustered/segmented. Each segmented subset user data could be assigned a
tag or a set of tags, in this way automated tagging is done. Whenever
new user data are acquired, they are assigned tags based on their
similarities with the existing data segments. By using PCA, principle
components which describe the majority of the sample variation are
figured out. These principle components construted a subspace of the
original attribute vector space. Every attribute vector is then
projected into the subspace that best encodes the variation of known
attribute vectors. By dividing projected vectors in the subspace along
the principle componets axes, we are hopeful to get segmentation results
of good quality.
It is worth noting that once Content Items are abstracted into
attribute vectors, there are also other segmentation techniques that
could be applied, include the classic k-means, and hyper-plane
separation methods such as Support Vector Machine (SVM). The
implementation effort of k-means is as same as PCA, and there is
off-the-shelf SVM library in C that could be used in a SWIG way.
/
This is the gist of it. One thing that worries me is the storage of the
vector per item. That could be quite a load on our repository. I
understand that you need the vectors to compute the distances
(similarities). Another approach though is to model the clusters with
few cluster specific data (center of gravity and ellipsoid dimension in
a reduced PCA space, this is where PCA comes handy really). But may be
I'm jumping the gun here. We need to have real data on what is the cost
of this vector.
/ 3. Tagging Dynamics
Tagging of Content Items is initially done over a given set of
user data. With the elapse of time, old user data may be modified or
deleted, and new user data maybe inserted. Under some cases, meta-data
about tags themselves, such as their names and their senses, may be
changed as well. All these events introduce information change in the
underlying statistical model, and cause tagging dynamics. One possible
strategy to deal with tagging dynamics is to set thresholds of
statistical model changes. A model ajustment action is triggered when
the change exceeds a pre-set threshold. For example, newly acquired user
data that does not fit the existing tags will be tagged 'unknown' in the
begining. When the number Content Items that have 'unknown' tags
increase to a specific amount, new cluster/segmentation emerge among
these Items and are properly tagged.
I would like to apply for project 'Behavior Analysis and Prediction' at
the same time (at second priority). Because this project has strong
correlation with the first one and lots of the work could be overlapping.
/
It turns out that Google did not grant us a student on this project so,
well, it could be a possibility. I do think however that the 2 projects
have only partial overlap. The one you're starting now doesn't use any
instrumentation data for instance.
/3. Implementation Plan
Implementation of the automated tagging of user data feature could be
broken down to three sub-tasks. Each sub-task accounts for about 1/3 of
the whole development effort. Namely, to instrument the current Content
Item class and its sub-classes with attribute vector information; To
create new statistical class which accomodates information of the
principle components, weight centroids of each tagged cluster, as well
as all the available tags; To create callback functions or 'hooks' of
the new statistical class, and instrument these functions to the user
data update functions in Chandler.
1. New member variables to Content Items and its sub-classes
An attribute vector, which maps the values of each attribute of a
Content Item, should be added as a member variable of Content Item
class. The vector codes values of common attributes into vector bits. In
this way, categorical or nominal types of attributes could be mapped to
numerical types, and subsequently available for PCA computation.
/
That seems to be a straightforward addition though, again, I'm worry
about adding any persistent data to Content Item, especially when this
data is redundant with the existing one. See my comment here above
though with "jumping the gun"...
/ One of the design considerations here is that different sub-class
of Content Item owns different set of attributes. So should we put the
attribute vector in each sub-class, or maintain a one-size-fits-all
vector in Content Item? I propose to adapt a hybrid solution: the class
Content Item maintains a vector called base_vector, while the subclass
maintain a vector called kind_vector. Common attributes that are shared
by all user data, such as entry-time and people-involved, are coded in
base_vector. While sub-class specific attributes, such as a Mail
Message's address or a Task's completion percentage, are coded in
kind_vector. The vector which is used for PCA computation is a weighted
concatenation of these two. When doing weighted concatenation, the
kind_vector part of the attribute vector is mapped into an universal
format, so camparision between two different sub-class data is made
possible.
/
The worry here is that the base_vector will be too poor to be of much
use unless we extract keywords from the body and displayName attributes
of the base item (which, on second thought, we should do and might be
the most semantically significant data...). The kind_vector will be
where the semantically significant data is (email addresses in
particular, this should give a good clue of where something should go in
lots of cases). The mapping though is a pretty tough nut to crack and
it's possible that only analysis of data of the same kind are really
possible (though we can always get something with base_vector alone,
which is already a progress compared to any other PIM out there). The
mapping issue is something we face also in "attribute redirection", the
problem we have when trying to populate say the "who" column in the
summary table view for items of different kind. I hope that the people
involved in "attribute redirection" will chime in (or I'll go and ask
them... :) )
/ 2. New class to hold statistical information and computation.
A new class, which represents a statistical multivariate analysis
method, should be added. Member variables of this class include a set of
available tags, to be assigned to the data segments/clusters. This class
should also contain the centroids or cores of the data
segments/clusters, so when new user data is acquired, it could be
compared with the centroids or cores of existing clusters, and then
decision is made about which tag should be applied on. This class should
has the member methods such as:
* update_weights(ItemCollection C) to use a collection of user
data to adjust the internal statistical parameters.
* assign_tag(ContentItem I) to assign tag(s) to a given user
data item based on its segment result.
* data_segmentation(ItemCollection C) to segment a collection
of user data into optimized partitions, etc.
/
/ Instead of making this class to be directly a PCA class, I would
rather propose it to be an abstract class that does not rely on specific
multivariable data analysis methods. In other words, this abstract class
defines the 'Policies' of using an analysis method to fullfil automated
tagging tasks, and leaves detailed implementation to its sub-classes.
The PCA class should be derived from this abstract class, implement the
interfaces the abstrat class defined, and owns specific data for PCA
analysis method. If we would like to implement other analysis methods in
the future, such as k-means or SVM, we just derive new sub-classes of
this abstract class, just as the way we derive the PCA sub-class.
For the proposed project, PCA will be the only fully implemented
analysis method. I had developed a full-fledged PCA analyzer for a past
project, the code could be looked at
http://smpp-ric.org/viewvc/dataglove_pca/. The code was written in
Matlab but translation into Python should be straightforward. The PCA
class will also own some private member variables, such as threshold
values and criterions to select principle components, in term of how
much variation should be preserved.
/
This is what I alluded to above when talking about "modeling the
clusters" (or centroids in your text). That seems like a natural idea to
me and makes me question again the wisdom of storing the vector per
content item in a persistent way...
/ 3. Callbacks and 'hooks' instrumentation for Chandler user data updates.
Callbacks, or 'hooks', are the mechanisms that glue Chandler
Content Items and the statistical class together. These hooks are
bilateral, i.e., some of them are provided by Content Item classes and
to be used by statitical class, and some of them are vice versa. For
example, the Content Item class needs to call an analytical method
callback provided by the statistical class, during its update of user
data content. The callback may trigger a model update for the internal
data of the statistical class. On the other hand, when a new analytical
methods is introduced or there is parameter change for the underlying
model of the current analytical method, the statistical class calls a
Content Item class callback to update the data view in Chandler UI.
/
Because its class mapping may change? (and therefore its tags or
collection binding). Not sure otherwise why the Chandler UI would change...
/ 4. Language considerations.
There is no doubt that Python will be the final language for
project implementation. Using SWIG to wrap modules written in other
language, such as C/C++ is also feasible, albeit introduces some extra
effort in code management. If we regard the tagging process as an
information pipeline, it could be seen that data come in as untagged
Content Items, processed by the multivariate analysis methods, and come
out as tagged (sorted) Content Items. For fast prototype setting and
experimental purposes, it is possible that during the development stage,
part 2 of the information pipeline, i.e. multivariate analysis, be
implemented in the form of a language other than python. In my opinion,
at least for the PCA method this intermediate step could be done with
complied Matlab modules.
/
Sounds good to me. Matlab will also provide you with visualization
methods that can be very useful when experimenting with data sets.
/ 5. Resource considerations.
In case more subtle algorithms and strategies need to be used,
which involves Natual Language Processing, Information Retrieval and
Statistical Linguistics. Consulation help could be obtained through
faculties in the Computer Science department of University of Illinois
at Chicago. Quite a few UIC Professors are recognized experts in the
above fields, including chairmen for ACM SIGMOD and ACM SIGKDD conferences.
/
/4. Timeline
The whole project spans from May 22, 2006 to August 21, 2006, a total
time period of 13 weeks.
* Before May 22 (interim period before kicking off)
Scrutiny project-related Chandler code and general PCA methods
implementations packages; Try to come up as rich and specific ideas as I
could; Continue to update this proposal (I wish it to be a solution of
quality to the project finally, no matter got accepted or rejected);
Keep in touch with the mentor; Read literature of similar work, if there
are any.
/
This is "done"... ;)
/ * May 22 to May 26 (week 1)
Orientation of OSAF development system and other related documents
kept by the mentor, if there are any; Commercement of development over
Chandler code.
/
It seems to me that you're already well into this, having already
compiled Chandler from scratch on Ubuntu and written a parcel as an
exercise. We need to see where you can park your stuff in a public
repository. One idea is to create a branch from our SVN repo. This would
certainly make tracking and final integration easier. We usually though
do not grant commit privileges automatically to newcomers. They
typically need to submit a couple of patches that go through a review
process. Another idea is to use some sort of sandbox. We have this
problem for the other SoC students so we need to solve this soon...
/ * May 29 to June 16 (week 2,3,4)
Instrumentation of Content Item class to add attribute vector
support, include the base class and all current subclasses: Contacts,
Calendar Events, Tasks, Mail Messages, and Notes; Implement one part of
the PCA class which does data segmentation.
/
One thing you'll need is a good data set for experimentation. We have a
3000 items calendar we use for perf testing but that's all auto
generated data and finding a pattern in that won't work. Looks to me
that we'll need to create some sort of test data set (with known
clusters) for testing.
Another tricky aspect is defining the dimensionality of the space:
should we use every single attribute as a dimension? should we extract
some keywords from text (notes field, subject) and use them as well? I'm
inclined to say "yes" to both, certainly the second question since most
of the semantic will be there... We need to find something to extract
keywords from a string. There's certainly some research to do to find a
suitable algo or open source implementation of this (note that we do not
use GPL licensed code in Chandler since we intend to be Apache licensed).
/ * June 19 to June 23 (week 5)
Midterm report of the project to the mentor; Develop phase one
postmortem; Continue to work on PCA class; Unit test of the completed
modules.
* June 26 to July 14 (week 6,7,8)
Implement the other part of the PCA class which maps data segments
to tags; If needed, implement a help class to for tag management; Unit
test of the completed modules.
* July 17 to July 28 (week 9,10)
Develop phase two postmortem; Callbacks instrumentation in Content
Item classes and statistical method class; Unit test of the completed
modules; Integration test of the tagging pipeline.
* July 31 to August 18 (week 11,12,13)
Data-intensive test of the MVA auto-tagging feature; Tune up of
the clustering/segmentation accuracy by optimization of MVA parameters;
Document the whole project; Release the deliverables -- documentation,
source code and SVN repository.
/
Sounds like a great plan, fairly ambitious but feasible. We'll need to
find some real user target data somewhere around mid July.
So, the next steps are:
- define somewhere a sandbox or branch for you to start developing
- look at the schema for each kind and see which attribute should be
used in the vector
- research a keyword extraction method
- define vector and vector encoding per content item
How does that sound?
That was a long email for a start so may be we should branch more
focused discussion items in the future.
Cheers,
- Philippe
More information about the chandler-dev
mailing list