[Chandler-dev] chandler MVA update: using named entity types for high-priority filtering

Xun Luo sherwoodluo at gmail.com
Mon Jun 19 11:11:45 PDT 2006


-------
I am still using the sub word-set approach for article classification.
Since PCA works o.k. for predefined tags, I have not looked into
adapting more advanced statistical algorithms, such as ICA or refined
sampling methods. I have also made little progress on wrapping Lucene,
apart from running through the examples in the "Lucene in Action" book.

++++
PCA and similar dimension reduction methods can keep track of basic
semantics, but they lack the power to capture high-level semantics,
such as the example Phillippe gave for address extraction.
So I am now trying a new add-on algorithm that extracts named entities
and uses them as automatic tags. Named entity extraction is not a
substitute for PCA but a higher-priority add-on. The pseudo-code of
the algorithm is as follows:

    global predefined_set = {'lbl_1', 'lbl_2', 'lbl_3', ...}
    assign_tag(content_vector)
    {
        /* first, extract named entities from the content vector */
        named_entity_set = extract_named_entities(content_vector);
        sort named entities in named_entity_set by term frequency (TF);
        foreach named entity that has TF > threshold (tentatively 2)
            add the named entity to the tag set of the item;

        /* then use the PCA classifier to classify the content vector */
        tag_classified = classify(content_vector, predefined_set);
        add tag_classified to the tag set of the item;
    }

Here, named entities are obtained through unsupervised learning while
tag_classified comes from supervised learning. The two should
complement each other to some extent.
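
For anyone who wants to experiment with the idea, here is a rough Python
sketch of the pseudo-code above. It is only an illustration under several
assumptions: it works on raw text rather than a content vector, the
extract_named_entities() function is a toy regex stand-in (runs of
capitalized words) and not MINIPAR, and classify() is a placeholder for
the PCA classifier that simply returns the first predefined tag. All
function and constant names are hypothetical.

```python
import re
from collections import Counter

# Placeholder labels, mirroring predefined_set in the pseudo-code.
PREDEFINED_TAGS = ["lbl_1", "lbl_2", "lbl_3"]
TF_THRESHOLD = 2  # tentative value from the pseudo-code


def extract_named_entities(text):
    """Toy stand-in for a real recognizer (assumption, not MINIPAR):
    treats each run of capitalized words as one named entity."""
    return re.findall(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b", text)


def classify(text, predefined_tags):
    """Placeholder for the PCA classifier (assumption): always
    returns the first predefined tag."""
    return predefined_tags[0]


def assign_tags(text, threshold=TF_THRESHOLD):
    """Return the tag list for one item: frequent named entities first,
    then the tag chosen by the classifier."""
    tags = []

    # Count how often each named entity occurs (its term frequency).
    counts = Counter(extract_named_entities(text))

    # most_common() yields entities sorted by descending frequency.
    for entity, tf in counts.most_common():
        if tf > threshold:
            tags.append(entity)

    # The classifier tag is added after the high-priority entity tags.
    classified = classify(text, PREDEFINED_TAGS)
    if classified not in tags:
        tags.append(classified)
    return tags


if __name__ == "__main__":
    sample = ("Chandler ships soon. we like Chandler. "
              "the Chandler team met in Boston.")
    print(assign_tags(sample))
```

In this sketch the entity tags come first in the list, which is one
simple way to reflect their higher priority; a real implementation
might store a priority with each tag instead.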

I have already downloaded a simple named entity recognizer, MINIPAR:
http://www.cs.ualberta.ca/~lindek/minipar.htm, which can recognize
people's names, organization names, and city names. Any suggestions or
input are welcome.

Xun