[pylucene-dev] ubuntu forum post about PyLucene

Andi Vajda vajda at osafoundation.org
Thu Nov 1 22:13:14 PDT 2007


I found this interesting post comparing the GCJ and JCC PyLucene flavors on an 
Ubuntu forum:
     http://ubuntuforums.org/showthread.php?t=593327

Mostly correct. Taking the final points made, comments inline:

   1. GCJ version seems to be incompatible with python web frameworks, as well
      as mod-python

Yeah, the threading issue in PyLucene with GCJ is a long-standing pain that 
got resolved in PyLucene with JCC.

   2. GCJ has limits regarding file size for indexes, and sometimes cannot
      optimize your data

That is true with GCJ 3.x. GCJ 4.x has a fix for the 2 GB file size limit in 
the Java runtime classes. Of course, your mileage with GCJ 4.x will vary.

   3. GCJ is very, very fast making search

GCJ is faster than Sun's JRE at getting started. If your search is a 
short-lived program, GCJ is indeed faster. I did notice that this performance 
difference shrank as the program's running time got longer.

   4. JCC is more complicated to install and require java installed (at least
      jre)

Well, it depends. If you have to build your own GCJ, I'd argue that installing 
PyLucene with JCC is vastly simpler. Building openjdk on Linux is also 
comparatively easier (?) than building GCJ.

   5. Programs using JCC version always need LD_LIBRARY_PATH

Not anymore. By using "-Wl,-rpath=libpath" in setup.py's LFLAGS, this problem 
- and arguably, security issue - is resolved. No need to set LD_LIBRARY_PATH 
anymore. svn trunk's version of JCC's setup.py has an example.
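
As a rough illustration of the rpath approach, the sketch below builds linker 
flags that embed the library directory into the extension, so the dynamic 
loader can find libjvm without LD_LIBRARY_PATH. The helper name rpath_lflags 
and the JRE directory are hypothetical and not taken from JCC's setup.py; 
check the svn trunk version for the flags actually used.

```python
def rpath_lflags(lib_dirs):
    """Return GNU ld flags that link against and embed each directory."""
    flags = []
    for lib_dir in lib_dirs:
        # -L lets the linker find the library at build time.
        flags.append("-L%s" % lib_dir)
        # -Wl,-rpath= records the directory in the binary for run time.
        flags.append("-Wl,-rpath=%s" % lib_dir)
    return flags


# Hypothetical JRE location; adjust for your own system.
JRE_LIB_DIRS = ["/usr/lib/jvm/java/jre/lib/i386/client"]
LFLAGS = rpath_lflags(JRE_LIB_DIRS) + ["-ljvm"]
```

A list like this would typically be handed to the extension build as extra 
link arguments. The "-Wl,-rpath=" spelling assumes GNU ld on Linux.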

   6. JCC needs to start java VM everytime you run the program, so in cases
      like mine (cgi application) it's a bit slower

Yes, that's true. I spent some time today trying to detect a missing call to 
initVM(), but it's more complicated than I thought without adding the check 
everywhere. I considered adding it to findClass() only, a relatively slow 
operation the first time, but that also proved harder than expected. More on this 
later. In the meantime, I put BIG notices at the top of both PyLucene's and 
JCC's README files about the need to call initVM() before calling into the VM.
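
To show what "adding the check everywhere" would mean, here is a minimal 
sketch in pure Python. It is not how JCC is implemented; init_vm, requires_vm 
and search are hypothetical stand-ins for initVM() and for wrapped calls that 
cross into the VM.

```python
import functools

_vm_started = False  # stand-in for "a Java VM exists in this process"


def init_vm():
    """Hypothetical stand-in for initVM(); only records that it ran."""
    global _vm_started
    _vm_started = True


def requires_vm(func):
    """Raise a clear error when func is called before init_vm()."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if not _vm_started:
            raise RuntimeError("call initVM() before %s()" % func.__name__)
        return func(*args, **kwargs)
    return wrapper


@requires_vm
def search(query):
    # Placeholder for a call that would cross into the Java VM.
    return "results for %s" % query
```

Every entry point needs the guard, and each call pays for the extra test, 
which is why a single check in one slow path such as findClass() is tempting.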

To dispel another fallacy in the post, initVM() is indeed documented along 
with all its arguments in JCC's README file, starting at line 189 of [1].

   7. JCC is about 3 times slower than GCJ when searching records, but seems to
      be fast importing data

See comment (3)

   8. JCC seems to be more stable and can optimize indexes bigger than 2.4GB

Yes, the Sun-originating VMs are much more mature than GCJ's in many ways. Now 
that Sun is sponsoring an open source JDK and JRE, openjdk [2], I expect most 
of the open source energy in Java land to focus on it (see the IcedTea [3] 
project) instead of GCJ. The amount of traffic on the GCJ mailing list is not 
what it used to be...

Andi..

[1] http://svn.osafoundation.org/pylucene/trunk/jcc/README
[2] http://openjdk.java.net/
[3] http://fitzsim.org/blog/?p=16 and http://fitzsim.org/blog/?p=17


