[Chandler-dev] chandler MVA update: using named entity types for high-priority filtering

Philippe Bossut pbossut at osafoundation.org
Thu Jun 22 08:15:40 PDT 2006


Hi Xun,

Xun Luo wrote:
> I am still using the sub word-set approach for article classification.
> As PCA works OK for predefined tags, I did not look further into
> adapting more advanced statistical algorithms, such as ICA and refined
> sampling methods. I have also made little progress on Lucene wrapping,
> beyond running through the examples in the "Lucene in Action" book.
I'd be interested to know how you made the "PCA work for predefined tags".
> ++++
> PCA and similar dimension reduction methods can keep track of basic
> semantics, but obviously lack the power to capture high-level
> semantics, such as the example Philippe gave for address extraction.
> So I am now trying a new add-on algorithm to extract named entities
> and use them as automatic tags. The named entity extraction is not
> a substitute for PCA but a higher-priority add-on. The
> pseudo-code of the algorithm is as follows:
>
>    global predefined_set = {'lbl_1', 'lbl_2', 'lbl_3', ...}
>    assign_tag(content_vector cv)
>    {
>         /* extract named entities from the content vector */
>         named_entity_set = extract_named_entities(cv);
>         sort named entities in named_entity_set by term frequency (TF);
>         foreach named entity that has TF > threshold (tentatively 2)
>             add the named entity to the tag set of the item;
>
>         /* then use the PCA classifier to classify the content vector */
>         tag_classified = classify(cv, predefined_set);
>         add tag_classified to the tag set of the item;
>    }
>
> Here, named entities are obtained through unsupervised learning while
> tag_classified comes from supervised learning. They will complement
> each other to some extent.
>
> I have already downloaded a simple named entity recognizer, MINIPAR:
> http://www.cs.ualberta.ca/~lindek/minipar.htm, which can recognize
> names of people, organizations and cities. Any suggestions or input
> are welcome.
I have high hopes that the Lucene synonym recognizer will help a lot
here in extracting named entities beyond this small set. It is a great
set to start with, though.
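
To make sure we read the quoted pseudo-code the same way, here is a rough
Python sketch of it. Everything in it beyond the pseudo-code is an
assumption: the content vector is taken to be a plain list of tokens, and
the named entity recognizer and the PCA classifier are passed in as
callables because neither is specified in this thread. The stubs at the
bottom are purely hypothetical stand-ins, not MINIPAR or the real
classifier.

```python
from collections import Counter


def assign_tags(tokens, extract_entities, classify, predefined_set, threshold=2):
    """Return the tag list for one item, named entities first."""
    tags = []

    # Count how often each extracted named entity occurs (term frequency).
    tf = Counter(extract_entities(tokens))

    # Highest-frequency entities first; keep those strictly above the threshold.
    for entity, count in tf.most_common():
        if count > threshold:
            tags.append(entity)

    # Then add the label chosen by the supervised classifier, if any.
    label = classify(tokens, predefined_set)
    if label is not None and label not in tags:
        tags.append(label)
    return tags


# Hypothetical stand-ins, for illustration only.
def stub_extract_entities(tokens):
    known = {"Smith", "Paris"}
    return [t for t in tokens if t in known]


def stub_classify(tokens, predefined_set):
    return "lbl_1"


tokens = "Smith met Smith in Paris and Smith called Paris".split()
tags = assign_tags(tokens, stub_extract_entities, stub_classify,
                   {"lbl_1", "lbl_2", "lbl_3"})
print(tags)
```

With the tentative threshold of 2, "Smith" (3 occurrences) becomes a tag
while "Paris" (2 occurrences) does not, since the test is TF > threshold.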

A couple of questions:
- What is MINIPAR's license?
- Does MINIPAR give you a clue as to what the semantic class of the
entity is (i.e. whether it's a person's name or a location name)?
- What about prefixing such entities with their semantic class (e.g.
"Smith" is coded in the content_vector as "People:Smith")? That way, if
we extract other semantic entities from Chandler's data, we could
cross-reference from the text to other Chandler fields (e.g.
"smith at foo.org" from the "From" field could be coded as
"People:Smith" and would map the same way as "People:Smith" extracted
from the text).
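
To illustrate the prefixing idea, here is a minimal Python sketch. The
function names are made up for this example, and it assumes (naively)
that the local part of an email address is the person's surname, which
will not hold in general; it is only meant to show how both sources
could end up with the same token.

```python
def entity_token(semantic_class, name):
    """Build a class-prefixed token such as "People:Smith"."""
    # capitalize() is a crude normalization: it would turn "McDonald"
    # into "Mcdonald", so a real version needs something smarter.
    return f"{semantic_class}:{name.strip().capitalize()}"


def person_token_from_address(address):
    """Map an address from the "From" field to a People token.

    Accepts both "smith@foo.org" and the "smith at foo.org" form.
    Assumption: the local part is the surname.
    """
    local_part = address.replace(" at ", "@").split("@", 1)[0]
    return entity_token("People", local_part)


from_text = entity_token("People", "Smith")
from_field = person_token_from_address("smith at foo.org")
print(from_text, from_field, from_text == from_field)
```

Both calls yield "People:Smith", so the entity found in the body text and
the one derived from the "From" field land on the same dimension of the
content vector.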

Cheers,
- Philippe

