By Steven J. Owens
Jakarta Lucene (http://jakarta.apache.org/lucene/) is a high-performance,
full-featured, open-source text search engine API written in Java by Doug Cutting.
Note that Lucene is specifically an API, not an application. This means that all the
hard parts have been done, but the easy programming has been left to you. The
payoff for you is that, unlike with a normal search engine application, you spend
less time wading through tons of options and instead build a search application
that is specifically suited to what you're doing. Lucene is startlingly easy to
develop with and use.
I'm going to assume that you're a basically competent programmer and that you
are basically competent in Java.
Overview
I'm going to try to use emphasis tags any time I introduce a Lucene API class
name.
Here's a simple attempt to diagram how the Lucene classes go together:
Index
    Document 1
        Field A (name/value)
        Field B (name/value)
    Document 2
        Field A (name/value)
        Field B (name/value)
At the heart of Lucene is an Index. This class usually gets its data from a
filesystem directory that contains a certain set of files that follow a certain
structure, but it doesn't absolutely have to be a directory.
You pump data into the Index, then do searches on the Index to get results out. To
build the Index, you use an IndexWriter object. To run a search on the Index you
use an IndexSearcher object.
The search itself is a Query object, which you pass into IndexSearcher.search().
IndexSearcher.search() returns a Hits object, which contains a Vector of Document
objects.
Document objects are stored in the Index, but they have to be put into the Index
at some point, and that's your job. You have to select what data to enter in, and
convert them into Documents. You read in each data file (or database entry, or
whatever), instantiate a Document for it, break down the data into chunks and
store the chunks in the Document as Field objects (name/value pairs). When
you're done building a Document, you write it to the Index using the IndexWriter.
Queries can be quite complicated, so Lucene includes a tool to help generate
Query objects, called a QueryParser. The QueryParser takes a query string, much
like what you'd put into an Internet search engine, and generates a Query object.
Note: There's a gotcha that often pops up, so even though it's a lower-level detail,
I'm going to mention it here. It's the Analyzer. Lucene indexes text, and part of the
first step is cleaning up the text. You use an Analyzer to do this - it drops out
punctuation and commonly occurring but meaningless words (the, a, an, etc).
Lucene provides a couple of different Analyzers, and you can make your own,
but the BIG GOTCHA people keep running into is that you must make sure you
use the same sort of analyzer for both indexing and searching. You must
feed the same sort of Analyzer to the QueryParser that you originally fed
to the IndexWriter.
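Here's a rough sketch of that gotcha in plain Java. To be clear, nothing below is Lucene's API: ToyAnalyzer, CLEANING, RAW and matches() are made-up names that only model the idea of turning text into tokens. It indexes a sentence with one analyzer, then asks whether a query matches when it is analyzed the same way and when it is analyzed differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model only: these are NOT Lucene classes, just a JDK-only illustration.
class AnalyzerMismatchSketch {

    /** Hypothetical stand-in for the role an Analyzer plays: text in, tokens out. */
    interface ToyAnalyzer {
        List<String> tokens(String text);
    }

    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "an"));

    /** Lowercases, splits on anything that is not a letter or digit, drops stop words. */
    static final ToyAnalyzer CLEANING = text -> {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    };

    /** Splits on whitespace only: keeps case, punctuation and stop words. */
    static final ToyAnalyzer RAW = text -> {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    };

    /** True if every query token (per queryAnalyzer) is among the indexed tokens. */
    static boolean matches(String indexedText, ToyAnalyzer indexAnalyzer,
                           String query, ToyAnalyzer queryAnalyzer) {
        Set<String> indexed = new HashSet<>(indexAnalyzer.tokens(indexedText));
        return indexed.containsAll(queryAnalyzer.tokens(query));
    }

    public static void main(String[] args) {
        String text = "Let's get together for dinner tonight!";
        // Same analyzer on both sides.
        System.out.println(matches(text, CLEANING, "Dinner tonight!", CLEANING));
        // Index cleaned up, query left raw.
        System.out.println(matches(text, CLEANING, "Dinner tonight!", RAW));
    }
}
```

With the same analyzer on both sides the query tokens line up with the indexed tokens and the match succeeds. With the raw analyzer on the query side, "Dinner" and "tonight!" never appear in the index, so the search quietly finds nothing.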
Moving on... did you notice what's not in the above? Lucene handles the indexing,
searching and retrieving, but it doesn't handle:
• managing the process (instantiating the objects and hooking them together,
both for indexing and for searching)
• selecting the data files
• parsing the data files
• getting the search string from the user
• displaying the search results to the user
Those are all your job. There are some helpful tools and some good examples
available in the Lucene contrib space, but generally Lucene is focused on doing
the indexing and searching, and leaves all of the rest up to you (so you can make
exactly the search solution you want).
I'm going to assume that typical uses for Lucene are either command-line driven,
or web-driven. The example code I mentioned above is for a command-line driven
searchable recipe database. Someday I'm going to build an example of how to
make a web-driven Lucene application and add it to this tutorial.
Don't Get Clever
You'll notice, as we get into this, a common theme. You'll notice the same theme if
you hang out on the lucene-user list and listen to Doug Cutting answering
questions. That theme is: don't get clever. All the cleverness you'll ever need
has been put into really, really fast indexing and searching. This isn't to
say it's always best to use brute force, but in Lucene, if there's a simple way to do
it, that way probably makes the most sense. Remember Knuth: "premature optimization
is the root of all evil."
Indexing Or Searching
At the top, you're either pumping data into your search application (indexing) or
pulling data out of it (searching).
I'm going to go over these classes in more or less the order you'd encounter them
by going through the sample source files. Well, to be exact, I'm going to go
through them in the order the data would go through them, in going from an input
file to the output of a search request.
If you're not sure you're ready to dive into this depth, take a look at my not-so-
nitty-gritty overview.
Indexing In Depth
You index by creating Documents full of Fields (which contain name/value pairs)
and pumping them into an IndexWriter, which parses the contents of the Field
values into tokens and creates an index.
Document Objects
Lucene doesn't index files, it indexes Document objects. To index and then search
files, you first need to write code that converts your files into Document objects.
A Document object is a collection of Field objects (name/value pairs). So, for each
file, instantiate a Document, then populate it with Fields.
This is the first potentially tricky bit, depending on what kind of files you're
indexing, how much the data in those files is structured, and how much of that
structure you want to preserve. Lucene just handles name/value pairs. Email, for
example, is mostly name/value oriented:
• to: fred
• from: barney
• subject: dinner?
• body: Let's get together for dinner tonight!
For more complex files, you have to "flatten" that structure out into a set of
name/value fields.
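As a sketch of what "flattening" might look like for the email example, here is a small plain-Java helper that turns a raw message into name/value pairs. The class and method names are hypothetical, and the result is an ordinary Map rather than a Lucene Document. It assumes "\n" line endings, one header per line, and a blank line before the body.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, JDK only: produces name/value pairs, not a Lucene Document.
class EmailFlattenSketch {

    /** Header lines ("name: value") become fields; everything after the blank line is "body". */
    static Map<String, String> flatten(String rawMessage) {
        Map<String, String> fields = new LinkedHashMap<>();
        // Split once, at the first blank line, into headers and body.
        String[] parts = rawMessage.split("\n\n", 2);
        for (String line : parts[0].split("\n")) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
                String value = line.substring(colon + 1).trim();
                fields.put(name, value);
            }
        }
        fields.put("body", parts.length > 1 ? parts[1].trim() : "");
        return fields;
    }

    public static void main(String[] args) {
        String raw = "To: fred\nFrom: barney\nSubject: dinner?\n\n"
                   + "Let's get together for dinner tonight!";
        // Print each flattened field as name=value.
        for (Map.Entry<String, String> field : flatten(raw).entrySet()) {
            System.out.println(field.getKey() + "=" + field.getValue());
        }
    }
}
```

Each entry in the resulting map is one candidate Field. Real mail has folded headers, repeated headers and attachments, so treat this as the shape of the job rather than a mail parser.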
By the way, I'm saying "files" here, but the data source could really be anything -
chunks of a very large file, rows returned from an SQL query, individual email
messages from a mailbox file.
A minimum, as in the standard Lucene examples, would be:
A field containing...                Which you'll use to...
the path to the original document    actually show the user the original document after the search
a modification date                  compare against the original document's modification date, to see if it needs to be reindexed
the contents of the file             run the search against
Note: This is an example, not a requirement. For example, if you don't have a
modification date, don't sweat it, you just have to reindex all of your files every
time (and in fact, that's the standard recommended approach for reindexing,
under the "don't get clever" rule of thumb).
IndexWriter
The IndexWriter's job is to take the input (a Document), feed it through the
Analyzer you instantiate it with, and create an index. Using the IndexWriter itself
is fairly simple. You instantiate it with parameters for where to put the index files
and the Analyzer you want it to use for cleaning up the tokens. Then feed
Documents into IndexWriter.addDocument(). The actual index is a set of data files
that the IndexWriter creates in a location defined (depending on how you
instantiate the IndexWriter) by a Lucene Directory object, a File, or a path string.
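To make "feed Documents in, get an index out" a little more concrete, here is a toy inverted index in plain Java. It is only a mental model: ToyIndexSketch and its methods are made up for illustration, a document is just a Map of field names to values, and the real IndexWriter does far more (and writes its data to files rather than keeping it in memory).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy model only, built on the JDK; none of this is Lucene's real API.
class ToyIndexSketch {
    // "field:token" -> ids of the documents whose field contains that token
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();
    // Stored copies of the documents, so a search can hand them back.
    private final List<Map<String, String>> stored = new ArrayList<>();

    /** Tokenizes every field value and records which document each token came from. */
    int addDocument(Map<String, String> fields) {
        int id = stored.size();
        stored.add(new LinkedHashMap<>(fields));
        for (Map.Entry<String, String> field : fields.entrySet()) {
            String[] tokens = field.getValue().toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
            for (String token : tokens) {
                if (!token.isEmpty()) {
                    String key = field.getKey() + ":" + token;
                    postings.computeIfAbsent(key, k -> new TreeSet<>()).add(id);
                }
            }
        }
        return id;
    }

    /** Ids of the documents whose named field contains the term (case-insensitive). */
    SortedSet<Integer> lookup(String field, String term) {
        SortedSet<Integer> ids = postings.get(field + ":" + term.toLowerCase(Locale.ROOT));
        return ids == null ? new TreeSet<Integer>() : ids;
    }

    /** The stored fields of a previously added document. */
    Map<String, String> document(int id) {
        return stored.get(id);
    }

    public static void main(String[] args) {
        ToyIndexSketch index = new ToyIndexSketch();
        Map<String, String> mail = new LinkedHashMap<>();
        mail.put("subject", "dinner?");
        mail.put("body", "Let's get together for dinner tonight!");
        index.addDocument(mail);
        // Look the term up, then fetch the stored document for each matching id.
        for (int id : index.lookup("body", "dinner")) {
            System.out.println(index.document(id).get("subject"));
        }
    }
}
```

The point of the sketch is the division of labor: you decide what goes into each field, and the indexing side turns field values into tokens and remembers where each token occurred.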
Directory Objects
You can also store the index in a Lucene Directory object. A Lucene Directory is an
abstraction around the Java filesystem classes. Using a Directory lets the Lucene
classes hide what exactly is going on. This in turn lets you do clever behind-the-
scenes things like keeping the file cached in memory for really high performance
by using the RAM-based Directory class (Lucene comes with two Directory classes,
one for file-based and one for RAM-based).
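The idea of hiding the storage behind an abstraction can be sketched in a few lines of plain Java. ToyDirectory and its two implementations below are hypothetical and much smaller than Lucene's Directory; they only show how the calling code can stay the same whether the bytes live in memory or on disk.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Toy model only: a hypothetical interface, far smaller than Lucene's Directory.
interface ToyDirectory {
    void write(String name, byte[] data);
    byte[] read(String name);
}

/** Keeps every "file" in a map in memory. */
class RamToyDirectory implements ToyDirectory {
    private final Map<String, byte[]> files = new HashMap<>();

    public void write(String name, byte[] data) {
        files.put(name, data.clone());
    }

    public byte[] read(String name) {
        return files.get(name).clone();
    }
}

/** Keeps every "file" as a real file under one filesystem directory. */
class FsToyDirectory implements ToyDirectory {
    private final Path root;

    FsToyDirectory(Path root) {
        this.root = root;
    }

    /** Convenience for demos: backs the directory with a fresh temp folder. */
    static FsToyDirectory inTempDir() {
        try {
            return new FsToyDirectory(Files.createTempDirectory("toy-index"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public void write(String name, byte[] data) {
        try {
            Files.write(root.resolve(name), data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public byte[] read(String name) {
        try {
            return Files.readAllBytes(root.resolve(name));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class ToyDirectoryDemo {
    /** Code written against the interface cannot tell which storage is underneath. */
    static String roundTrip(ToyDirectory dir, String text) {
        dir.write("segment", text.getBytes(StandardCharsets.UTF_8));
        return new String(dir.read("segment"), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(new RamToyDirectory(), "in memory"));
        System.out.println(roundTrip(FsToyDirectory.inTempDir(), "on disk"));
    }
}
```

Swapping the RAM-backed object for the file-backed one changes nothing in roundTrip(), which is the property that lets an index move between memory and disk without touching the indexing or searching code.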
Searching In Depth
To actually do the search, you need an IndexSearcher, but we'll get to that in a
moment; before you can even think about feeding the IndexSearcher a query, you
have to have a Query object. The IndexSearcher does the actual munging through
the index, but it only understands Query objects.
IndexSearchers
To get an IndexSearcher you simply instantiate an IndexSearcher with a single
argument that tells Lucene where to find an existing index. The argument is either
of these two:
• a string containing the path to the index directory,
• a Lucene Directory object (see the section about Directory objects under
"Indexing In Depth", above)
Digression: IndexReaders
(You can safely skip this section, as it's just me meandering through the Lucene
source code; not a whole lot of practical value here yet).
There's actually a third option for instantiating an IndexSearcher; you can
instantiate it with any class that is a concrete subclass of the abstract class
IndexReader.
This makes more sense if you take a peek at the code for IndexSearcher. The
other two constructors just turn your file path or Directory object into an
IndexReader by calling the static method IndexReader.open(). Just for kicks, let's
do a little more digging and see that IndexReader.open() takes either a String file
path or a Java File object and uses them to instantiate a Lucene Directory object,
then calls open(Directory).
NOTE: I have to admit, I'm a little confused at this point, since the API docs say
IndexReader is abstract (which means it can't be instantiated). Presumably that
means IndexReader.open(), a static factory method, instantiates an appropriate
concrete subclass of IndexReader and returns it. However, the API docs don't
show any concrete subclasses of IndexReader. Since I'm too lazy at the moment to
look through the source... oh, all right, I'm not too lazy to look through the source.
Hm. It appears the API docs are out of date; the com/Lucene/index directory
appears to contain a SegmentReader, which IndexReader.open() uses.
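For anyone who hasn't run into the pattern, here is the general shape of an abstract class with a static factory method, in plain Java. ToyReader and ToySegmentReader are made-up stand-ins; this is not a copy of the Lucene source, just the same arrangement in miniature.

```java
// Toy model only: hypothetical names that mirror the shape described above.
abstract class ToyReader {
    /** Something every concrete reader must be able to answer. */
    abstract int numDocs();

    /** Static factory: callers never name the concrete class they receive. */
    static ToyReader open(String path) {
        return new ToySegmentReader(path);
    }
}

/** The concrete subclass that the factory hands back. */
class ToySegmentReader extends ToyReader {
    private final String path;

    ToySegmentReader(String path) {
        this.path = path;
    }

    @Override
    int numDocs() {
        // Placeholder: a real reader would get this from the index at 'path'.
        return 0;
    }

    @Override
    public String toString() {
        return "ToySegmentReader(" + path + ")";
    }
}

class ToyReaderDemo {
    public static void main(String[] args) {
        // 'new ToyReader()' would not compile; the factory is the only way in.
        ToyReader reader = ToyReader.open("index");
        System.out.println(reader);
    }
}
```

The caller only ever sees the abstract type, which is why a concrete subclass can be missing from the API docs without anything breaking for users of the factory.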
Multiple Indexes
If you're searching a single index, you use an IndexSearcher with a single index. If
you need to search across multiple indexes, you instantiate one IndexSearcher per
index, create an array, stick the IndexSearcher instances in the array, and
instantiate a MultiSearcher with the array as an argument.
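One way to picture searching across several indexes is "ask each searcher, then merge the answers by score". The plain-Java sketch below does exactly that with hypothetical types (ToySearcher, ScoredHit, searchAll). It is not how MultiSearcher is implemented, and it assumes the scores coming from different indexes are directly comparable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy model only: made-up types, not Lucene's Searcher or MultiSearcher.
class ToyMultiSearchSketch {

    /** One result: where the original document lives and how well it matched. */
    static final class ScoredHit {
        final String path;
        final double score;

        ScoredHit(String path, double score) {
            this.path = path;
            this.score = score;
        }
    }

    /** Hypothetical stand-in for "something that can search one index". */
    interface ToySearcher {
        List<ScoredHit> search(String query);
    }

    /** Runs the query against every searcher and merges the hits, best score first. */
    static List<ScoredHit> searchAll(ToySearcher[] searchers, String query) {
        List<ScoredHit> merged = new ArrayList<>();
        for (ToySearcher searcher : searchers) {
            merged.addAll(searcher.search(query));
        }
        merged.sort(Comparator.comparingDouble((ScoredHit h) -> h.score).reversed());
        return merged;
    }

    public static void main(String[] args) {
        // Two canned "indexes" that ignore the query and return fixed hits.
        ToySearcher recipes = q -> Arrays.asList(new ScoredHit("recipes/stew.txt", 0.9));
        ToySearcher mail = q -> Arrays.asList(new ScoredHit("mail/42.txt", 0.5));
        for (ScoredHit hit : searchAll(new ToySearcher[] {recipes, mail}, "dinner")) {
            System.out.println(hit.path);
        }
    }
}
```

Note that the array of searchers is the only thing the merging code needs, which matches the "stick the instances in an array" recipe above.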
Hits
IndexSearcher.search(Query) returns a "Hits" object, which is sort of like a Vector,
containing a ranked list of Lucene Document objects. These are the same
Document objects you fed into the IndexWriter, but specifically the ones that
matched your search. Now you need to format the hits for a display, or
manufacture HREFs pointing to the original documents, or whatever you were
basically planning to do with the search results.
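As a sketch of the "manufacture HREFs" step, here is a plain-Java helper that turns a ranked list of documents into an HTML list of links. It works on ordinary Maps rather than on a Hits object, and it assumes each document carries a "path" field like the one in the minimum field set described earlier; the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, JDK only: operates on plain Maps, not on a Lucene Hits object.
class HitsRenderSketch {

    /** Minimal HTML escaping for text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Renders documents, already in rank order, as an ordered list of links. */
    static String render(List<Map<String, String>> rankedDocs) {
        StringBuilder html = new StringBuilder("<ol>\n");
        for (Map<String, String> doc : rankedDocs) {
            String path = escape(doc.get("path"));
            html.append("  <li><a href=\"").append(path).append("\">")
                .append(path).append("</a></li>\n");
        }
        return html.append("</ol>\n").toString();
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = new ArrayList<>();
        Map<String, String> doc = new HashMap<>();
        doc.put("path", "recipes/stew.txt");
        docs.add(doc);
        System.out.print(render(docs));
    }
}
```

Storing the path as a field at indexing time is what makes this step possible: the search hands back documents, and the path is how you get from a document to something the user can click.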
Someday To Come
Next we'll go through this process again, and actually build an example program
to index some files and then do searches against that index.
After that, we'll actually build a basic web search engine, using servlets and JSP.
We've already seen that Lucene is a piece of cake to use, and the servlet/JSP stuff
isn't much harder (unless you want to make it harder, which of course is possible
to do). This will also introduce the whole question of multithreading Lucene.
Fortunately, Lucene makes this really, really easy, because most - or all - of the
key Lucene classes are thread-safe.
Copyright 2001 by Steven J. Owens, all rights reserved.