[pylucene-dev] First shot at custom tokenfilter
Ofer Nave
ofer at smarter.com
Wed Mar 28 18:03:15 PST 2007
Sorry for the delay. I have a concise test case now. See below for inline
comments. Code is at the bottom.
> -----Original Message-----
> From: pylucene-dev-bounces at osafoundation.org
> [mailto:pylucene-dev-bounces at osafoundation.org] On Behalf Of
> Andi Vajda
> Sent: Monday, March 26, 2007 9:00 PM
>
> On Mon, 26 Mar 2007, Ofer Nave wrote:
> > Ever since I started using a custom Analyzer and TokenFilter, my index
> > build script keeps crashing. Usually it just freezes at a random
> > point, and won't even respond to ctrl-c (I have to use kill -9 in
> > another terminal). One time it ended with: 'Fatal Python error: This
> > thread state must be current when releasing'. One time it finished
> > successfully (out of about 20 attempts). This is from repeated runs
> > without changing any code.
>
> If you submit a piece of code that reproduces the problem, I
> can take a look at it (best would be something like a unit
> test, see PyLucene/test).
Haven't had time to look at the unit testing framework, but the code is
simple and runs standalone.
> Also, what is your OS ? did you build PyLucene yourself ? If
> so, which gcj ?
> Does 'make test' pass ? What is your version of Python ?
Linux 2.6.9
Python 2.3.4
Lucene/PyLucene versions are included in the sample output below.
I believe the admin compiled PyLucene from source. The box has gcj version
3.4.5 20051201.
Sample code:
---
#!/usr/bin/python
import sys
import PyLucene
def main():
    print 'PyLucene', PyLucene.VERSION, 'Lucene', PyLucene.LUCENE_VERSION
    data = dict(album='Hail To The Thief', artist='Radiohead', ASIN='B000092ZYX')
    directory = '/tmp/crash'
    store = PyLucene.FSDirectory.getDirectory(directory, True)
    # store = PyLucene.RAMDirectory()
    # analyzer = PyLucene.StandardAnalyzer()
    analyzer = TermJoinAnalyzer()
    writer = PyLucene.IndexWriter(store, analyzer, True)
    docs = 0
    while True:
        doc = PyLucene.Document()
        doc.add(PyLucene.Field('album', data['album'], PyLucene.Field.Store.YES, PyLucene.Field.Index.TOKENIZED))
        doc.add(PyLucene.Field('artist', data['artist'], PyLucene.Field.Store.YES, PyLucene.Field.Index.TOKENIZED))
        doc.add(PyLucene.Field('ASIN', data['ASIN'], PyLucene.Field.Store.YES, PyLucene.Field.Index.UN_TOKENIZED))
        writer.addDocument(doc)
        docs += 1
        if docs % 5000 == 0:
            print docs

class TermJoinTokenFilter(object):
    TOKEN_TYPE_JOINED = "JOINED"
    def __init__(self, tokenStream):
        self.tokenStream = tokenStream
        self.a = None
        self.b = None
    def __iter__(self):
        return self
    def next(self):
        if self.a:  # emitted prev last time - need to set next, emit prev + next, and reset prev to None
            self.b = self.tokenStream.next()
            if self.b is None:
                return None
            joined = PyLucene.Token(self.a.termText() + self.b.termText(), self.a.startOffset(), self.a.endOffset(), self.TOKEN_TYPE_JOINED)
            joined.setPositionIncrement(0)
            self.a = None
            return joined
        elif self.b:  # emitted prev + next last time - need to emit next, set prev to next, and reset next to None
            self.a = self.b
            self.b = None
            return self.a
        else:  # first call ever - set prev to first token and emit first token
            self.a = self.tokenStream.next()
            return self.a

class TermJoinAnalyzer(object):
    def __init__(self, analyzer=PyLucene.StandardAnalyzer()):
        self.analyzer = analyzer
    def tokenStream(self, fieldName, reader):
        tokenStream = self.analyzer.tokenStream(fieldName, reader)
        filter = TermJoinTokenFilter(tokenStream)
        return tokenStream.tokenFilter(filter)

main()
---
It builds an index in /tmp/crash. You can change the path or, to avoid the
disk entirely, switch which Directory instantiation line is commented out.
It uses my TermJoinAnalyzer class to demonstrate the crash. To see the same
code run fine with StandardAnalyzer, switch which Analyzer instantiation
line is commented out.
I ran it with TermJoinAnalyzer three times, and all three times it crashed
within seconds - with three different errors, no less. :) When I ran it
with StandardAnalyzer, it worked fine for several minutes before I killed
it.
Here's the output from the three crashes:
---
[ofer@rnd01 ~/proj/search/trunk]$ bin/tmp.py
PyLucene 2.1.0-1 Lucene 2.1.0-509013
5000
10000
15000
20000
25000
Fatal Python error: auto-releasing thread-state, but no thread-state for this thread
Aborted
[ofer@rnd01 ~/proj/search/trunk]$ bin/tmp.py
PyLucene 2.1.0-1 Lucene 2.1.0-509013
5000
10000
15000
20000
25000
30000
35000
Fatal Python error: This thread state must be current when releasing
Aborted
[ofer@rnd01 ~/proj/search/trunk]$ bin/tmp.py
PyLucene 2.1.0-1 Lucene 2.1.0-509013
5000
10000
Traceback (most recent call last):
  File "bin/tmp.py", line 57, in ?
    main()
  File "bin/tmp.py", line 19, in main
    writer.addDocument(doc)
PyLucene.JavaError: java.lang.NullPointerException
---
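To make the intended behaviour of TermJoinTokenFilter easier to check independently of PyLucene, here is a minimal plain-Python sketch of the same three-branch logic in next(). Token and ListTokenStream are hypothetical stand-ins made up for illustration, not PyLucene classes, and the attribute names (text, start, end, position_increment) are assumptions. It only models the token sequence the filter is meant to emit; it cannot reproduce the crashes above, since no PyLucene code is involved.

```python
class Token(object):
    """Hypothetical stand-in for a Lucene token (not the PyLucene class)."""
    def __init__(self, text, start, end, type_="word"):
        self.text = text
        self.start = start
        self.end = end
        self.type = type_
        self.position_increment = 1


class ListTokenStream(object):
    """Hypothetical upstream stream; next() returns None when exhausted."""
    def __init__(self, tokens):
        self._tokens = iter(tokens)

    def next(self):
        return next(self._tokens, None)


class TermJoinModel(object):
    """Mirrors the branches of TermJoinTokenFilter.next() from the script."""
    def __init__(self, stream):
        self.stream = stream
        self.a = None  # previous token
        self.b = None  # next token

    def next(self):
        if self.a is not None:
            # Emitted prev last time: fetch next and emit prev + next.
            self.b = self.stream.next()
            if self.b is None:
                return None
            # As in the script, the joined token reuses prev's offsets.
            joined = Token(self.a.text + self.b.text,
                           self.a.start, self.a.end, "JOINED")
            joined.position_increment = 0
            self.a = None
            return joined
        elif self.b is not None:
            # Emitted the joined token last time: emit next, make it prev.
            self.a, self.b = self.b, None
            return self.a
        else:
            # First call: emit the first token and remember it as prev.
            self.a = self.stream.next()
            return self.a


def drain(token_filter):
    """Collect (text, type, position_increment) until the filter returns None."""
    out = []
    token = token_filter.next()
    while token is not None:
        out.append((token.text, token.type, token.position_increment))
        token = token_filter.next()
    return out


# Build input tokens with running character offsets.
tokens = []
offset = 0
for word in ["hail", "to", "the", "thief"]:
    tokens.append(Token(word, offset, offset + len(word)))
    offset += len(word) + 1

result = drain(TermJoinModel(ListTokenStream(tokens)))
```

For the four input words, the model emits hail, hailto, to, tothe, the, thethief, thief, with each joined token carrying a position increment of 0 so that it sits at the same position as the token before it.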
-ofer