Two-word summary: Lucene rocks. Nine-word summary: It indexed 3 gigs of text in 20 minutes.
I've wanted to figure out Lucene for a while but never got around to it (the Lucene book is very outdated and none of its example code works, for instance). Today I did something simpler: a little experiment in indexing.
I have a directory of about 3.2 GB of XML documents (medical journal papers downloaded from ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.tar.gz -- the download itself is about 700 MB). I wondered how long the simple disk-based Lucene demo would take to index it using default settings (http://lucene.apache.org/java/2_3_1/demo.html).
System stats: 7200 RPM 300 GB disk; Windows XP SP2; quad-core 2.4 GHz Core 2; 2 GB DDR2-800 RAM.
It took 23 minutes, the last 5 of which were merely flattening index chunks into a single file so that searches run faster.
So about 20 minutes for 3 gigs of text. The final index file was about 1/5 the size of the original source text at 646 MB.
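For anyone who wants the numbers above as throughput, here is a rough back-of-the-envelope sketch. It only uses the figures from this post; the variable names are made up for illustration, and treating 1 GB as 1024 MB is an assumption.

```python
# Figures reported in this post. The GB -> MB conversion (1 GB = 1024 MB)
# is an assumption; the post only says "about 3.2 GB".
SOURCE_MB = 3.2 * 1024   # XML source text
INDEX_MB = 646           # final index size
TOTAL_MIN = 23           # whole run, including the final merge
MERGE_MIN = 5            # time spent flattening index chunks into one file


def throughput_mb_per_s(size_mb, minutes):
    """Average megabytes of source text processed per second."""
    return size_mb / (minutes * 60)


overall = throughput_mb_per_s(SOURCE_MB, TOTAL_MIN)
indexing_only = throughput_mb_per_s(SOURCE_MB, TOTAL_MIN - MERGE_MIN)
index_ratio = INDEX_MB / SOURCE_MB

print(f"overall:        {overall:.2f} MB/s")
print(f"excluding merge: {indexing_only:.2f} MB/s")
print(f"index / source:  {index_ratio:.2f}")
```

That works out to roughly 2.4 MB/s for the whole run, roughly 3.0 MB/s if you leave out the final merge, and an index about 0.2 times the size of the source.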
Memory usage was very reasonable - it hovered around 30-40 MB (unlike, say, Java IDEs which use up 200 MB or so).
Ultimately a benchmark like this is disk-bound, but that's still fast as shit in my opinion. I had to whip together a ghetto homegrown indexing system at work several months ago (I've never had time to optimize it), and this blows away what I created.
The longest part of the process was trying to figure out which versions of Java Lucene and Zend PHP Lucene were compatible. FYI, from the Zend Framework manual:
Lucene 2.1 index format support (which is also used in Lucene 2.2) is included in the current "trunk" branch. It is available via SVN in current nightly snapshots.
We hope to include Lucene 2.1 index format support in ZF 1.5.0. The current release (ZF V1.0.4) works with Lucene 1.9-2.0 index formats.
http://framework.zend.com/manual/en/zend.search.lucene.html#...
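The compatibility note quoted above can be written down as a small lookup, which is handy if you are scripting around both tools. This is only a sketch of what the quote states: the table and function names are hypothetical, and any pairing the quote does not mention is treated as unconfirmed rather than as known-broken.

```python
# Java Lucene versions whose indexes each Zend Framework build can read,
# taken only from the quote above. Names here are illustrative, not part
# of any Zend or Lucene API.
READABLE_BY_ZF = {
    "1.0.4": {"1.9", "2.0"},   # "works with Lucene 1.9-2.0 index formats"
    "trunk": {"2.1", "2.2"},   # 2.1 format, "also used in Lucene 2.2"
}


def is_confirmed_compatible(zf_build, lucene_version):
    """True only if the quoted note explicitly covers this pairing.

    False means "not confirmed by the quote", which includes builds and
    versions the quote never mentions.
    """
    return lucene_version in READABLE_BY_ZF.get(zf_build, set())


print(is_confirmed_compatible("1.0.4", "2.0"))
print(is_confirmed_compatible("1.0.4", "2.2"))
print(is_confirmed_compatible("trunk", "2.2"))
```

Whether the trunk build still reads the older 1.9-2.0 formats is not stated in the quote, so the table leaves that out.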