[pylucene-dev] Examples?

Andrzej Bialecki ab at getopt.org
Fri Jun 25 02:56:47 PDT 2004


darryl wrote:

> Would you mind elaborating on what is wrong with the StandardAnalyzer? I 
> don't know enough about the different quirks to know what would be 
> appropriate to use. It seemed to me that adding analyzers would be 
> pretty straightforward.

The combination of QueryParser and StandardAnalyzer has numerous 
problems when parsing queries with special characters. The most glaring 
example is a query containing characters for which isLetterOrDigit is 
false: the query "one\-sided \(1\)" gets parsed into "one sided 1" no 
matter what escaping you use. StandardAnalyzer also internally uses a 
table of English stop words, so the query "to be or not to be" gets 
parsed into... an empty query. It also lower-cases everything, even 
within a phrase ("ONE Two" -> "one two"). At the same time it tries to 
be smart about detecting URLs and acronyms, so "I.B.M" gets parsed into 
"i.b.m", but "I.B.M." (notice the trailing dot) becomes "ibm"... and so 
on. To try out more examples, you can use Luke 
(http://www.getopt.org/luke), a diagnostic tool for Lucene indexes.
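
To make the first three behaviours concrete, here is a toy sketch in plain 
Python. It is not Lucene's StandardAnalyzer and does not call PyLucene; the 
function name toy_analyze and the short stop word set are made up for 
illustration, and the URL/acronym handling is not modelled at all.

```python
# Toy model only: this is NOT Lucene's StandardAnalyzer, just three of the
# behaviours described above reproduced in plain Python.

# Illustrative subset of English stop words (an assumption, not Lucene's table).
STOP_WORDS = {"a", "an", "and", "be", "not", "or", "the", "to"}


def toy_analyze(text, stop_words=STOP_WORDS):
    """Split on non-alphanumeric characters, lower-case, drop stop words."""
    tokens = []
    current = []
    for ch in text:
        if ch.isalnum():
            # Letters and digits extend the current token, lower-cased.
            current.append(ch.lower())
        elif current:
            # Any other character (space, backslash, dash, paren) ends it.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return [t for t in tokens if t not in stop_words]


print(toy_analyze(r"one\-sided \(1\)"))    # escapes vanish with the punctuation
print(toy_analyze("to be or not to be"))   # every token is a stop word
print(toy_analyze("ONE Two"))              # case is lost
```

With this toy model, the escaped query comes out as the tokens one, sided 
and 1, the Hamlet query comes out empty, and "ONE Two" comes out lower-cased.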

Surprisingly, StandardAnalyzer is often used for query parsing, probably 
because people are unaware of these quirks. Most of the time they would 
be better served by the SimpleAnalyzer or StopAnalyzer, or even just a 
LowerCaseAnalyzer... Of course, if the index was built _without_ 
lowercasing the tokens, a query with all lowercased terms will find only 
a subset of what you want, because case matters.

In summary, the StandardAnalyzer is not suitable for every situation, 
and it behaves in especially complicated ways when used together with 
the QueryParser. In addition, when building a query you should use an 
analyzer that is "compatible" with the one that was used to build the 
index. So it seems to me that adding more analyzers to PyLucene would be 
very useful.
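
The point about compatible analyzers can be sketched with a tiny in-memory 
inverted index in plain Python. Nothing here is Lucene or PyLucene API; the 
names keep_case, lower_case, build_index and search are hypothetical helpers 
invented for this illustration.

```python
from collections import defaultdict


def keep_case(text):
    """Hypothetical analyzer: whitespace tokens, case preserved."""
    return text.split()


def lower_case(text):
    """Hypothetical analyzer: whitespace tokens, lower-cased."""
    return text.lower().split()


def build_index(docs, analyzer):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyzer(text):
            index[token].add(doc_id)
    return index


def search(index, query, analyzer):
    """Return ids of documents containing every analyzed query token."""
    postings = [index.get(token, set()) for token in analyzer(query)]
    return set.intersection(*postings) if postings else set()


docs = {1: "ONE Two", 2: "one two", 3: "One two"}

# Index built WITHOUT lowercasing, query lowercased: only a subset matches.
case_index = build_index(docs, keep_case)
print(sorted(search(case_index, "ONE Two", lower_case)))

# Same analyzer on both sides: all three documents match.
lower_index = build_index(docs, lower_case)
print(sorted(search(lower_index, "ONE Two", lower_case)))
```

In the mismatched case only document 2 is found, because the lowercased 
query terms never meet the tokens "ONE", "Two" or "One" stored in the index.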

HTH

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)


