Observation: Lucene rocks
31 points by henning on March 16, 2008 | 14 comments
Two-word summary: Lucene rocks. Nine-word summary: It indexed 3 gigs of text in 20 minutes.

I've wanted to figure out Lucene but never got around to it (the Lucene book is very outdated and none of the example code works, for instance). Today I did something simpler: a little experiment in indexing.

I have a directory of about 3.2 GB of XML documents (medical journal papers downloaded from ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.tar.gz -- it's about a 700 MB file). I wondered how long the simple disk-based Lucene demo (http://lucene.apache.org/java/2_3_1/demo.html) would take to index it using default settings.

System stats: 7200 RPM 300 GB disk; Windows XP SP2; quad-core 2.4 GHz Core 2; 2 GB DDR2-800 RAM.

It took 23 minutes, the last 5 of which were merely flattening index chunks into a single file so that searches run faster.

So about 20 minutes for 3 gigs of text. The final index file was about 1/5 the size of the original source text at 646 MB.
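
For anyone who wants the numbers as rates, here is a quick back-of-the-envelope sketch in Python. It only uses figures from this post; treating 3.2 GB as 3200 MB and splitting the run into 18 minutes of indexing plus 5 minutes of merging are my own simplifying assumptions.

```python
# Back-of-the-envelope figures from the post (sizes taken as decimal MB).
source_mb = 3200        # "about 3.2 GB" of XML (assumed 1 GB = 1000 MB)
index_mb = 646          # final index size
total_minutes = 23      # whole run, including the final merge
merge_minutes = 5       # the last 5 minutes spent flattening index chunks

indexing_minutes = total_minutes - merge_minutes
overall_mb_per_s = source_mb / (total_minutes * 60)
indexing_mb_per_s = source_mb / (indexing_minutes * 60)
index_ratio = index_mb / source_mb

print(f"overall throughput:  {overall_mb_per_s:.1f} MB/s")
print(f"indexing-only phase: {indexing_mb_per_s:.1f} MB/s")
print(f"index / source size: {index_ratio:.2f}")
```

That works out to roughly 2.3 MB/s over the whole run and an index about 0.20 times the size of the source, which matches the "about 1/5" figure.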

Memory usage was very reasonable - it hovered around 30-40 MB (unlike, say, Java IDEs which use up 200 MB or so).

Ultimately a benchmark like this is disk-bound, but that's still fast as shit in my opinion. I had to whip together a ghetto homegrown indexing system at work several months ago (I've never had time to optimize it), and this blows away what I created.




Agreed. I was able to set up full text indexing/searching in a few hours.

The longest part of the process was trying to figure out which versions of Java Lucene and Zend PHP Lucene were compatible. FYI:

Lucene 2.1 index format support (which is also used in Lucene 2.2) is included in the current "trunk" branch. It is available via SVN in current nightly snapshots.

We hope to include Lucene 2.1 index format support in ZF 1.5.0. The current release (ZF V1.0.4) works with Lucene 1.9-2.0 index formats.

http://framework.zend.com/manual/en/zend.search.lucene.html#...


And the biggest difference between my ghetto system and Lucene is that searches with lots of results are very, very fast.


I've heard good things about Lucene, but we use Sphinx: http://www.sphinxsearch.com/

For our tests, it indexed much faster than the common Lucene implementations, and for our needs was also a tad faster overall. I haven't tried the newest version, though.


I don't know what kind of testing you've done but nothing even approaches the speed of Lucene. It's by far the fastest open source search engine currently available. If you're using Rails, I cannot recommend Solr enough. It's amazing.

Cutting's a genius.


Do you have references for "nothing even approaches"? Specifically compared to Sphinx? The only comparisons I've found show Sphinx coming out ahead in many indexing/search cases (if only slightly). See my other comment on this thread with links to benchmarks where Sphinx clearly "comes close". We did a good bit of research on this, so it does feel odd that you'd say "nothing even approaches".

It was also ridiculously easy to get Sphinx up and going. Lucene is a killer engine, no doubt, but Sphinx's ROI alone won us over.



I've used Sphinx in one of my PHP/MySQL projects, and it's much faster than any other (free/open) data indexing platform I've used. Configuring Sphinx and getting it to run takes a bit of effort, but it's worth it.


If you're working in Rails and you just want simple search, Sphinx is the fastest path to getting started.

I think of Lucene as more implementor-neutral than Sphinx; Lucene is an API as well as a Java library.



I second your observation, though my recent foray into Lucene was far simpler. I used the RAMDirectory feature to build an index in memory for a large list of names (and our queries go through a thick OR/M). The user of the application needs to be able to filter the list by keywords and doing the query each time was taking too long (2 or 3 seconds). It's now near instantaneous.

I think for 10,000 documents (two fields: name and id) it takes 20 seconds to build the index in Lucene .NET.

I had always heard of using Lucene for really large datasets and thought it might be overkill for speeding up a somewhat small part of one application dialog. In reality it took a single reference to the Lucene .NET dll and a few functions to build the documents and add them to the index.
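
To give a rough idea of why an in-memory index makes keyword filtering this fast, here is a toy sketch in Python. It is not the Lucene .NET API and not how RAMDirectory works internally; the function names and sample records are made up for illustration. It just builds an inverted index (token to set of ids) once, so each query becomes a few set intersections instead of a trip through the OR/M.

```python
from collections import defaultdict


def build_index(records):
    """Map each lowercase token in a name to the set of record ids containing it."""
    index = defaultdict(set)
    for record_id, name in records:
        for token in name.lower().split():
            index[token].add(record_id)
    return index


def search(index, query):
    """Return ids of records whose names contain every keyword in the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    # Copy the first posting set so intersections never mutate the index.
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical sample data standing in for the real (id, name) list.
records = [(1, "Alice Smith"), (2, "Bob Smith"), (3, "Alice Jones")]
index = build_index(records)
print(search(index, "smith"))
print(search(index, "alice smith"))
```

This only matches whole tokens exactly; a real engine like Lucene adds analyzers, prefix and fuzzy queries, and relevance scoring on top of the same basic structure.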


Has anyone compared Lucene to Xapian? I have never tried Lucene, but have been very happy with Xapian.

http://xapian.org/



Plucene is slow as hell.

You'd better use KinoSearch, which also uses the same index format as Lucene.

Some benchmarks are on this site: http://marvinhumphrey.com/kinosearch/benchmarks.html


Lucene is pretty cool and it's a lot better than anything I've seen so far (including Ferret). The only problem I've experienced with it was index corruption, which is fairly common and frustrating (though in fairness it could have been due to my sysadmin skills).



