[pylucene-dev] Examples?
Andrzej Bialecki
ab at getopt.org
Fri Jun 25 02:56:47 PDT 2004
darryl wrote:
> Would you mind elaborating on what is wrong with the StandardAnalyzer? I
> don't know enough about different quirks to know what would be
> appropriate to use. It seemed to me that adding analyzers would be
> pretty straightforward.
The combination of QueryParser and StandardAnalyzer has numerous
problems when parsing queries with special characters. The most glaring
example is a query containing characters for which isLetterOrDigit is
false: e.g. the query "one\-sided \(1\)" gets parsed into "one sided 1"
no matter what escaping you use. StandardAnalyzer also internally uses a
table of English stop words, so the query "to be or not to be" gets
parsed into... an empty query. It also lower-cases everything, even
within a phrase ("ONE Two" -> "one two"). At the same time it tries to
be smart about detecting URLs and acronyms, so "I.B.M" gets parsed into
"i.b.m", but "I.B.M." (notice the trailing dot) becomes "ibm"... and so
on. You can use Luke (http://www.getopt.org/luke), a diagnostic tool for
Lucene indexes, to try out more examples.
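
To make the first three effects concrete, here is a rough pure-Python
sketch. It is NOT Lucene's tokenizer: toy_analyze and its STOP_WORDS set
are hypothetical names made up for illustration, and the stop word set is
only an assumed subset of an English table. It lower-cases the input,
splits on anything that is not a letter or digit, and drops stop words.

```python
import re

# Hypothetical subset of an English stop word table (an assumption for
# illustration, not the table Lucene actually ships).
STOP_WORDS = {"a", "an", "and", "be", "not", "or", "the", "to"}

def toy_analyze(text):
    """Lower-case, split on non-alphanumeric characters, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(toy_analyze("to be or not to be"))   # every token is a stop word
print(toy_analyze(r"one\-sided \(1\)"))    # the escapes make no difference
print(toy_analyze("ONE Two"))              # case is lost
```

The first call prints an empty list, which is how a perfectly reasonable
phrase query can turn into an empty query. The acronym and URL handling is
not modelled here at all.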
Surprisingly, StandardAnalyzer is often used for query parsing, probably
out of ignorance - most of the time people would probably prefer the
SimpleAnalyzer or StopAnalyzer, or even just a LowerCaseAnalyzer... Of
course, if the index was built _without_ lowercasing the tokens, a query
with all lowercased terms will find only a subset of what you want,
because the case matters.
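
The case mismatch can be sketched with a toy inverted index in plain
Python. Nothing here is PyLucene API: build_index, search, keep_case and
lower_case are hypothetical helpers, and the three documents are made up
for the example.

```python
from collections import defaultdict

def build_index(docs, analyze):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index

def search(index, query, analyze):
    """Return the sorted ids of documents matching any analyzed query term."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return sorted(hits)

def keep_case(text):
    # Hypothetical analyzer: whitespace split, case preserved.
    return text.split()

def lower_case(text):
    # Hypothetical analyzer: whitespace split, tokens lower-cased.
    return text.lower().split()

# Made-up sample documents.
docs = {1: "IBM mainframe", 2: "ibm laptop", 3: "Ibm research"}

# Index built WITHOUT lower-casing, queried WITH lower-casing.
mismatched = search(build_index(docs, keep_case), "IBM", lower_case)

# Same analyzer on both sides.
matched = search(build_index(docs, lower_case), "IBM", lower_case)

print(mismatched)
print(matched)
```

With the mismatched pair only document 2 is found; with the same analyzer
at index time and query time all three documents match.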
In summary, the StandardAnalyzer is not suitable for every situation,
and it works in an especially complicated way when used together with
the QueryParser. In addition, when building a query you should use an
analyzer that is "compatible" with the one that was used to build the
index. So it seems to me that adding more analyzers to PyLucene would be
very useful.
HTH
--
Best regards,
Andrzej Bialecki
-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)