[Chandler-dev] chandler MVA update: using named entity types for high-priority filtering

Xun Luo sherwoodluo at gmail.com
Thu Jun 22 22:47:22 PDT 2006


> I'd be interested to know how you made the "PCA work for predefined tags".
I have to say first that the feed plugin in Chandler 0.7alpha3 is very
different from its counterpart in 0.6.1, so all my work is on the 0.6.1 code.
1. Detailed implementation of classification for predefined tags.
1.1 Obtaining labeled data
    To make ContentItem tagging a supervised learning task, I chose the feed
channel, with slashdot.org news as input. The category (more accurately, the
'section') of each article is used as the pre-defined label.
    Two changes were made to the Chandler feed channel implementation to
achieve this. The code is based on version 0.6.1.
     a) Although Chandler displays a 'category' for each article, it is
actually mapped, incorrectly, to the 'subject' field of the feed. I changed
this to the 'section' field, so that it better reflects the slashdot.org
categories.
     b) The original feed channel retrieves only once, so the feed articles
are limited to 10-20. I added a time delay in the fetching code, so that it
could retrieve up to 300 articles.
    After these modifications, I was able to get 300 articles in 5
pre-defined categories (please refer to slashdot.org for its
categorization).
    file modified:
           parcel/feeds/block.py
           parcel/feeds/channel.py
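
    Change (b) can be pictured roughly as below. This is only a minimal
sketch in plain Python, not the modified 0.6.1 channel code: `fetch_once`,
the dict keys and the delay value are all hypothetical names chosen for
illustration.

```python
import time

def collect_articles(fetch_once, target=300, delay_seconds=600,
                     max_polls=1000, sleep=time.sleep):
    """Poll a feed repeatedly until `target` distinct articles are seen.

    `fetch_once` is a hypothetical callable that returns the articles
    currently in the feed as dicts with 'link', 'section' and 'text' keys.
    """
    seen = {}
    for _ in range(max_polls):
        for article in fetch_once():
            # Keep only the first copy of each article, keyed by its link.
            seen.setdefault(article["link"], article)
        if len(seen) >= target:
            break
        # Wait for the feed to roll over before polling again.
        sleep(delay_seconds)
    return list(seen.values())[:target]
```

    The 'section' value kept with each article is what serves as the label
later on.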

1.2 build the category vectors
    Fortunately, Lucene has an API for term vector access (please refer to
the 'Lucene in Action' book, chapter 5, for details about term vectors).
This satisfies a critical need for PCA. A term vector is a tuple of
{'term', 'frequency'} pairs, and there is exactly one per document. It
contains all the information we need for the PCA calculation; only the
format is a little different from a regular matrix row representation (think
of each vector as a compact representation of a matrix row that keeps only
the non-zero elements). The 300 articles can be represented in the form
below (greatly simplified, of course):
    Article 1:  { {'ipod', 2}, {'california', 1}, {'sales', 1}}
    Article 2:  { {'spam', 1}, {'microsoft', 1}}.
    .....
    Article 300: ....
By extending each vector to a length equal to the number of distinct words
in all 300 articles, these 300 vectors form a regular matrix.

    Article 1:  { {'ipod', 2}, {'california', 1}, {'sales', 1}, {'spam', 0},
{'microsoft', 0}}
    Article 2:  { {'ipod', 0}, {'california', 0}, {'sales', 0}, {'spam', 1},
{'microsoft', 1}}.
    .....
    Article 300: ....
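
The extension step can be sketched in a few lines of Python. This is an
illustration only: it assumes the term vectors have already been read out of
the index into plain dicts, and `build_matrix` is a made-up name, not part of
Lucene or Chandler.

```python
def build_matrix(term_vectors):
    """Turn sparse {term: frequency} dicts into equal-length rows.

    Returns the shared vocabulary (sorted, so the column order is stable)
    and one dense row per article, with 0 for terms the article lacks.
    """
    vocabulary = sorted({term for vec in term_vectors for term in vec})
    rows = [[vec.get(term, 0) for term in vocabulary] for vec in term_vectors]
    return vocabulary, rows

articles = [
    {"ipod": 2, "california": 1, "sales": 1},
    {"spam": 1, "microsoft": 1},
]
vocabulary, matrix = build_matrix(articles)
```

The columns come out in alphabetical order here rather than in the order
shown above; any fixed order works as long as every row uses the same one.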
Then PCA is used to transform the matrix. Articles that have the same
category are summed in the transformed space, and thus a "category vector"
is obtained for each category.
    file modified:
           parcel/feeds/block.py
           parcel/feeds/channel.py
           repository/persistence/FileContainer.py
           repository/persistence/DBRepository.py
           repository/persistence/DBTermIO.py
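
    To make the PCA and summing step concrete, here is a small pure-Python
sketch. It is not the code in the files above. It assumes the dense matrix is
a list of rows, finds the leading principal directions by power iteration
with deflation (one simple way to get them; a library SVD would do as well),
and all function names are invented for this example.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    mean = [sum(row[j] for row in rows) / n for j in range(d)]
    return mean, [[row[j] - mean[j] for j in range(d)] for row in rows]

def top_components(centered, k, iterations=200, seed=0):
    """Leading k principal directions via power iteration with deflation."""
    rng = random.Random(seed)
    d = len(centered[0])
    components = []
    for _component in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        for _step in range(iterations):
            # Remove the directions that were already found.
            for c in components:
                overlap = dot(v, c)
                v = [a - overlap * b for a, b in zip(v, c)]
            # Multiply by X^T X without building the d x d matrix.
            scores = [dot(row, v) for row in centered]
            v = [sum(s * row[j] for s, row in zip(scores, centered))
                 for j in range(d)]
            norm = math.sqrt(dot(v, v))
            if norm == 0.0:
                break
            v = [a / norm for a in v]
        components.append(v)
    return components

def project(row, mean, components):
    """Coordinates of one article in the PCA space."""
    shifted = [x - m for x, m in zip(row, mean)]
    return [dot(shifted, c) for c in components]

def category_vectors(rows, labels, k=2):
    """Sum the projected articles of each category."""
    mean, centered = center(rows)
    components = top_components(centered, k)
    sums = {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * k)
        for i, value in enumerate(project(row, mean, components)):
            acc[i] += value
    return mean, components, sums
```

    The mean and the components are returned along with the category
vectors because a new article has to be projected with the same ones.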

1.3 classify the new article
   Then, when a new article comes in, it is first transformed into PCA space
and its cosine distance to each of the category vectors is computed. Cosine
distance is defined as
        cosine(v1, v2) = [v1 (dot product) v2] / [length(v1) * length(v2)]
With this formula a larger value means a smaller angle between the two
vectors, so the category with the largest cosine value is chosen as the
category the new article belongs to. Actually, for my test I did not
classify new articles, but treated part of the 300 articles as unlabeled.
The reason is that retrieving new articles would entail quite some coding in
the modified feed parcel. The classification accuracy is about 76%.
    file modified:
           parcel/feeds/block.py
           parcel/feeds/channel.py
           repository/persistence/FileContainer.py
           repository/persistence/DBRepository.py
           repository/persistence/DBTermIO.py
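
    The classification step, written out as a short sketch. It assumes the
article and the category vectors are already in PCA space as plain lists;
`classify` is a hypothetical helper, not a function in the parcel. With the
cosine formula as given, a larger value means a smaller angle, so the sketch
picks the category with the largest value.

```python
import math

def cosine(v1, v2):
    """[v1 . v2] / [length(v1) * length(v2)]; 0.0 if either vector is zero."""
    dot_product = sum(a * b for a, b in zip(v1, v2))
    lengths = (math.sqrt(sum(a * a for a in v1)) *
               math.sqrt(sum(b * b for b in v2)))
    return dot_product / lengths if lengths else 0.0

def classify(article_vector, category_vectors):
    """Return the label whose category vector is closest in angle."""
    return max(category_vectors,
               key=lambda label: cosine(article_vector, category_vectors[label]))
```

    Holding back part of the labeled articles and running them through
`classify` is one way to estimate an accuracy figure like the one above.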

> I've high hopes that the Lucene synonym recognizer is going to help a
> lot here to extract named entities beyond this small set. This is  a
> great set though to start with.

2. Will synonym help in named entity extraction?
I don't think so. The reason is that synonym sets and named entity sets are
two different things. Lucene uses WordNet for synonym set generation. For
example, if you look up 'paris' in WordNet, you will find the following
synset:
    "city of light", "French capital".
But you will never find "London", although both "paris" and "london" are
named city entities. The same goes for people's names. Entity recognition
has to be done by NLP routines. Meanwhile, some named entities could be of
great importance, and they could serve directly as tags of the ContentItem.
That's my idea of how unsupervised tagging could be achieved, yet I admit
it's very hard, and I think it would be good work for Diana. There need to
be some heuristic rules, such as:
     if  there is <people name> and term "meeting" in the ContentItem, and
they form a sentence where the <people name> is subject
     then use <people name> as one of the tags for the ContentItem
This work is really outside my scope, yet I would like to help as best I can
to experiment with the rules. The result is uncertain; it depends on how
well the rules are tuned.
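
As an illustration only, the example rule could look like the sketch below.
It assumes some NLP step has already split the ContentItem into sentences
and reported each sentence's subject and its entity type; the dict layout
and the "People" type name are assumptions, not the output format of any
particular parser.

```python
def tags_from_meeting_rule(sentences):
    """Tag with <people name> when it is the subject of a 'meeting' sentence.

    Each sentence is a dict with hypothetical keys:
      'tokens'       - list of words in the sentence
      'subject'      - the subject found by the parser, or None
      'subject_type' - its entity type, e.g. 'People', or None
    """
    tags = []
    for sentence in sentences:
        has_meeting = any(t.lower() == "meeting" for t in sentence["tokens"])
        is_person = sentence.get("subject_type") == "People"
        subject = sentence.get("subject")
        if has_meeting and is_person and subject and subject not in tags:
            tags.append(subject)
    return tags

parsed = [
    {"tokens": ["Smith", "scheduled", "a", "meeting"],
     "subject": "Smith", "subject_type": "People"},
    {"tokens": ["The", "meeting", "is", "in", "Paris"],
     "subject": "meeting", "subject_type": None},
    {"tokens": ["Jones", "left", "early"],
     "subject": "Jones", "subject_type": "People"},
]
tags = tags_from_meeting_rule(parsed)
```

Only the first sentence satisfies both conditions, so only "Smith" becomes a
tag; the tuning mentioned above would be about rules of this kind.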

3. Minipar
I use Minipar just because it is a lightweight library that could help me
get a fast prototype.
> - What is MINIPAR's license?
     In its readme file, it says "A royalty-free license is granted for the
use of this software for NON_COMMERCIAL PURPOSES ONLY"
> - Does MINIPAR gives you a clue as to what the semantic of the entity
> is? (i.e. if it's a people name or location name)
     Yes, it does.
> - What about prefixing such entities with their semantic class? (e.g.
> "Smith" is coded in the content_vector as "People:Smith"). That way, if
> we extract other semantic entities from Chandler's data, we could cross
     This is feasible in Chandler, as Lucene supports a keyword for each
term. In Chandler, such keywords are already used when persisting data into
the Lucene index. This is a very good idea.
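
     A tiny sketch of the prefixing idea, independent of Lucene and of
Minipar's actual output format. The (class, text) pairs and the function
name are assumptions made for the example.

```python
def prefixed_term_vector(entities, plain_terms):
    """Count terms, coding each entity as '<semantic class>:<text>'.

    entities    - list of (semantic_class, text) pairs, e.g. ('People', 'Smith')
    plain_terms - list of ordinary words
    """
    vector = {}
    for semantic_class, text in entities:
        key = "%s:%s" % (semantic_class, text)
        vector[key] = vector.get(key, 0) + 1
    for term in plain_terms:
        vector[term] = vector.get(term, 0) + 1
    return vector
```

     With this coding, "People:Smith" and a plain word "smith" stay separate
entries in the content vector.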

4. Issues
4.1 following chandler evolution
The code change from 0.6.1 to 0.7alpha3 is REALLY huge. I honestly do not
think I will be able to port MVA to 0.7alpha3 within the SoC timeframe on my
own. I am confident of getting things working on 0.6.1, though.
4.2 getting more interested parties involved
I hope my explanation in section 1 is clear. If so, we can safely draw the
conclusion that MVA for pre-defined tag classification is fully practical in
Chandler. That is good news and proves the idea is not a toy. It would be
great if more people in OSAF became interested and participated in the
discussion on this topic.
I would say that NLP, which covers part of sections 3 and 4, is also very
promising and will be practical to implement, too.

Best,
Xun