[pylucene-dev] First shot at custom tokenfilter

Andi Vajda vajda at osafoundation.org
Mon Mar 26 20:00:27 PST 2007


On Mon, 26 Mar 2007, Ofer Nave wrote:

>> -----Original Message-----
>> From: pylucene-dev-bounces at osafoundation.org
>> [mailto:pylucene-dev-bounces at osafoundation.org] On Behalf Of Ofer Nave
>> Sent: Monday, March 26, 2007 5:01 PM
>>
>> I reimplemented my FooAnalyzer using this pattern and it
>> works now.  I still don't know why, but at least it works. :)
>
> Ever since I started using a custom Analyzer and TokenFilter, my index build
> script keeps crashing.  Usually it just freezes at a random point, and won't
> even respond to ctrl-c (I have to use kill -9 in another terminal).  One
> time it ended with: 'Fatal Python error: This thread state must be current
> when releasing'.  One time it finished successfully (out of about 20
> attempts).  This is from repeated runs without changing any code.

If you submit a piece of code that reproduces the problem, I can take a look 
at it (best would be something like a unit test, see PyLucene/test).

Also, what is your OS? Did you build PyLucene yourself? If so, with which gcj?
Does 'make test' pass? What is your version of Python?
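
For the OS and Python version questions, a small script along these lines
could gather the details to paste into a reply. It is only a sketch using the
standard library; the function name environment_report is made up for
illustration, and it does not cover the gcj version or the 'make test'
result, which have to be checked by hand.

```python
import platform
import sys


def environment_report():
    """Collect interpreter and OS details useful in a bug report.

    The name and the dictionary keys are illustrative, not part of PyLucene.
    """
    return {
        "python_version": platform.python_version(),
        "python_build": sys.version.replace("\n", " "),
        "os": platform.platform(),
        "machine": platform.machine(),
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print("%s: %s" % (key, value))
```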

Andi..

>
> I'm not creating any threads.  It's a straight python script, no apache or
> web stuff involved.  The only change has been the custom analyzer and
> tokenfilter.
>
> For reference:
>
> ---
> class TermJoinTokenFilter(object):
>
>    TOKEN_TYPE_JOINED = "JOINED"
>
>    def __init__(self, tokenStream):
>        self.tokenStream = tokenStream
>        self.a = None
>        self.b = None
>
>    def __iter__(self):
>        return self
>
>    def next(self):
>        if self.a:
>            # emitted prev last time - need to set next, emit prev + next,
>            # and reset prev to None
>            self.b = self.tokenStream.next()
>            if self.b is None:
>                return None
>            joined = PyLucene.Token(self.a.termText() + self.b.termText(),
>                                    self.a.startOffset(), self.a.endOffset(),
>                                    self.TOKEN_TYPE_JOINED)
>            joined.setPositionIncrement(0)
>            self.a = None
>            return joined
>        elif self.b:
>            # emitted prev + next last time - need to emit next, set prev
>            # to next, and reset next to None
>            self.a = self.b
>            self.b = None
>            return self.a
>        else:
>            # first call ever - set prev to first token and emit first token
>            self.a = self.tokenStream.next()
>            return self.a
>
> class TermJoinAnalyzer(object):
>
>    def __init__(self, analyzer=PyLucene.StandardAnalyzer()):
>        self.analyzer = analyzer
>
>    def tokenStream(self, fieldName, reader):
>        tokenStream = self.analyzer.tokenStream(fieldName, reader)
>        filter = TermJoinTokenFilter(tokenStream)
>        return tokenStream.tokenFilter(filter)
> ---
>
> -ofer
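
To see what sequence of tokens the filter above is meant to emit, its
three-branch logic can be traced without PyLucene at all. The sketch below is
a pure-Python stand-in, not the PyLucene API: StubToken, TermJoinLogic and
emitted_terms are hypothetical names, the token type "word" is a placeholder,
and the branches test "is not None" where the original tests truthiness. It
says nothing about the hang itself, only about the intended output.

```python
class StubToken:
    """Hypothetical stand-in for PyLucene.Token, holding only the fields used."""

    def __init__(self, text, start, end, type_="word"):
        self.text = text
        self.start = start
        self.end = end
        self.type = type_
        self.position_increment = 1


class TermJoinLogic:
    """Same three branches as TermJoinTokenFilter.next(), over a plain list."""

    def __init__(self, tokens):
        self._tokens = iter(tokens)
        self.a = None  # previous token
        self.b = None  # next token

    def _pull(self):
        # Mirrors tokenStream.next(): returns None when the stream is exhausted.
        return next(self._tokens, None)

    def next(self):
        if self.a is not None:
            # Emitted prev last time: fetch next and emit prev + next.
            self.b = self._pull()
            if self.b is None:
                return None
            joined = StubToken(self.a.text + self.b.text,
                               self.a.start, self.a.end, "JOINED")
            joined.position_increment = 0
            self.a = None
            return joined
        elif self.b is not None:
            # Emitted the joined token last time: emit next, which becomes prev.
            self.a = self.b
            self.b = None
            return self.a
        else:
            # First call: emit the first token and remember it as prev.
            self.a = self._pull()
            return self.a


def emitted_terms(words):
    """Run the logic over whitespace-separated words; return (text, type) pairs."""
    tokens, offset = [], 0
    for word in words:
        tokens.append(StubToken(word, offset, offset + len(word)))
        offset += len(word) + 1
    logic = TermJoinLogic(tokens)
    out = []
    while True:
        token = logic.next()
        if token is None:
            break
        out.append((token.text, token.type))
    return out


if __name__ == "__main__":
    for text, type_ in emitted_terms(["new", "york", "city"]):
        print(text, type_)
```

For the input new, york, city this yields new, newyork, york, yorkcity, city,
with the two joined terms carrying the JOINED type.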

