
Lucene Tutorial

By Steven J. Owens
Jakarta Lucene (http://jakarta.apache.org/lucene/) is a high-performance,
full-featured, open-source text search engine API, written in Java by Doug Cutting.
Note that Lucene is specifically an API, not an application. This means that all the
hard parts have been done, but the easy programming has been left to you. The
payoff for you is that, unlike a typical search engine application, you spend less
time wading through tons of options and more time building a search application
that is specifically suited to your needs. Lucene is startlingly easy to develop
with and use.
I'm going to assume that you're a basically competent programmer, and
reasonably comfortable in Java.

Use the Source, Luke


This tutorial is a brief overview; the Lucene distribution comes with four example
classes:
• FileDocument
• IndexFiles
• SearchFiles
• DeleteFiles
These classes are really a good introduction to how to use Lucene. I wrote this
tutorial because I find it easier to follow code if I have a general idea of what's
going on, but it was tricky to write because it starts to look like the source code.
Lucene really does make it that easy.

Overview
I'm going to try to use emphasis tags any time I introduce a Lucene API class
name.
Here's a simple attempt to diagram how the Lucene classes go together:
Index
    Document 1
        Field A (name/value)
        Field B (name/value)
    Document 2
        Field A (name/value)
        Field B (name/value)
At the heart of Lucene is an Index. The Index usually gets its data from a
filesystem directory that contains a set of files following a particular
structure, but it doesn't absolutely have to be a directory.
You pump data into the Index, then do searches on the Index to get results out. To
build the Index, you use an IndexWriter object. To run a search on the Index you
use an IndexSearcher object.
The search itself is a Query object, which you pass into IndexSearcher.search().
IndexSearcher.search() returns a Hits object, which holds a ranked list of
matching Document objects.
Document objects are stored in the Index, but they have to be put into the Index
at some point, and that's your job. You have to select what data to index and
convert it into Documents. You read in each data file (or database entry, or
whatever), instantiate a Document for it, break the data down into chunks, and
store the chunks in the Document as Field objects (name/value pairs). When
you're done building a Document, you write it to the Index using the IndexWriter.
Queries can be quite complicated, so Lucene includes a tool to help generate
Query objects, called a QueryParser. The QueryParser takes a query string, much
like what you'd put into an Internet search engine, and generates a Query object.
Note: There's a gotcha that often pops up, so even though it's a lower-level detail,
I'm going to mention it here. It's the Analyzer. Lucene indexes text, and part of the
first step is cleaning up the text. You use an Analyzer to do this - it drops out
punctuation and commonly occurring but meaningless words (the, a, an, etc).
Lucene provides a couple of different Analyzers, and you can write your own,
but the BIG GOTCHA people keep running into is that you must make sure you
use the same sort of analyzer for both indexing and searching. You must
feed the same sort of Analyzer to the QueryParser that you originally fed
to the IndexWriter.
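
To make all of that concrete before we move on, here's a minimal sketch of the
whole round trip: index one Document, then search it back out. The index path
and field name are made up for illustration, and note the same sort of Analyzer
on both sides:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class RoundTrip {
        public static void main(String[] args) throws Exception {
            // Indexing: note the Analyzer handed to the IndexWriter...
            IndexWriter writer = new IndexWriter("/tmp/demo-index",
                                                 new StandardAnalyzer(), true);
            Document doc = new Document();
            doc.add(Field.Text("contents", "Let's get together for dinner tonight!"));
            writer.addDocument(doc);
            writer.close();

            // Searching: ...and the SAME sort of Analyzer handed to QueryParser.
            Query query = QueryParser.parse("dinner", "contents",
                                            new StandardAnalyzer());
            IndexSearcher searcher = new IndexSearcher("/tmp/demo-index");
            Hits hits = searcher.search(query);
            System.out.println(hits.length() + " hit(s)");
            searcher.close();
        }
    }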
Moving on... did you notice what's not in the above? Lucene handles the indexing,
searching and retrieving, but it doesn't handle:
• managing the process (instantiating the objects and hooking them together,
both for indexing and for searching)
• selecting the data files
• parsing the data files
• getting the search string from the user
• displaying the search results to the user
Those are all your job. There are some helpful tools and some good examples
available in the Lucene contrib space, but generally Lucene is focused on doing
the indexing and searching, and leaves all of the rest up to you (so you can make
exactly the search solution you want).
I'm going to assume that typical uses for Lucene are either command-line driven,
or web-driven. The example code I mentioned above is for a command-line driven
searchable recipe database. Someday I'm going to build an example of how to
make a web-driven Lucene application and add it to this tutorial.
Don't Get Clever
You'll notice, as we get into this, a common theme. You'll notice the same theme if
you hang out on the lucene-user list and listen to Doug Cutting answering
questions. That theme is don't get clever, all the cleverness you'll ever need
has been put into really, really fast indexing and searching. This isn't to
say it's always best to use brute force, but in Lucene, if there's a simple way to do
it, that way probably makes the most sense. Remember Knuth: "premature
optimization is the root of all evil."

Indexing Or Searching
At the top, you're either pumping data into your search application (indexing) or
pulling data out of it (searching).
I'm going to go over these classes in more or less the order you'd encounter them
by going through the sample source files. Well, to be exact, I'm going to go
through them in the order the data would go through them, in going from an input
file to the output of a search request.
If you're not sure you're ready to dive into this depth, take a look at my not-so-
nitty-gritty overview.

Indexing In Depth
You index by creating Documents full of Fields (which contain name/value pairs)
and pumping them into an IndexWriter, which parses the contents of the Field
values into tokens and creates an index.

Document Objects
Lucene doesn't index files, it indexes Document objects. To index and then search
files, you first need to write code that converts your files into Document objects.
A Document object is a collection of Field objects (name/value pairs). So, for each
file, instantiate a Document, then populate it with Fields.
This is the first potentially tricky bit, depending on what kind of files you're
indexing, how much the data in those files is structured, and how much of that
structure you want to preserve. Lucene just handles name/value pairs. Email, for
example, is mostly name/value oriented:
• to: fred
• from: barney
• subject: dinner?
• body: Let's get together for dinner tonight!
For more complex files, you have to "flatten" that structure out into a set of
name/value fields.
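
For instance, here's a sketch of flattening that email into a Document
(Field.Keyword and Field.Text are Lucene's factory methods for building Fields;
more on them below):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class MailDocument {
        // Flatten one email message into a Lucene Document.
        public static Document fromMessage(String to, String from,
                                           String subject, String body) {
            Document doc = new Document();
            doc.add(Field.Keyword("to", to));        // kept as one exact token
            doc.add(Field.Keyword("from", from));
            doc.add(Field.Text("subject", subject)); // tokenized, indexed, stored
            doc.add(Field.Text("body", body));
            return doc;
        }
    }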
By the way, I'm saying "files" here, but the data source could really be anything -
chunks of a very large file, rows returned from an SQL query, individual email
messages from a mailbox file.
A minimum, as in the standard Lucene examples, would be:
• a field containing the path to the original document, which you'll use to
actually show the user the original document after the search
• a field containing a modification date, which you'll use to compare against
the original file's modification date, to see if it needs to be reindexed
• a field containing the contents of the file, which you'll run the search
against
Note: This is an example, not a requirement. For example, if you don't have a
modification date, don't sweat it, you just have to reindex all of your files every
time (and in fact, that's the standard recommended approach for reindexing,
under the "don't get clever" rule of thumb).

The All Field


You also ought to really think about glomming all of the Field data together and
storing it as some sort of "all" Field. This is the easiest way to set it up so your
users can search all Fields at once, if they want. Yes, you could come up with a
complex scheme to rewrite your users' query so it searches across all of the
known fields, but remember, keep it simple.
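
A sketch of that glomming, with hypothetical subject and body values:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class AllField {
        // Glom the individual field values into one catch-all "all" field,
        // so users can search everything at once. UnStored means tokenized
        // and indexed, but not kept in the index files (see the Field
        // factory method table below).
        public static void addAllField(Document doc, String subject, String body) {
            doc.add(Field.UnStored("all", subject + " " + body));
        }
    }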

Digression: Field Objects


A Field object contains a name (a String) and a value (a String or a Reader), and
three booleans that control whether or not the value will be indexed for searches,
tokenized prior to indexing, and stored in the index so it can be returned with the
search.
Let me explain those three booleans a bit more.
• Indexed for searches - sometimes you'll want to have fields available in your
Documents that don't really have anything to do with searching. Two
examples I can think of off the top of my head are creation dates and file
names, so you can compare when the Document was created against the
file modification date, and decide if the document needs to be reindexed.
Since these fields won't ever make sense to use in an actual search, you can
decrease the amount of work Lucene does by marking them as not indexed
for searches.
• Tokenized prior to indexing - tokenizing refers to taking a piece of text and
cleaning it up, and breaking it down into individual pieces (tokens) for the
indexer. This is done by the Analyzer. Some fields you may not want to be
tokenized, for example a serial number field.
• Stored in the index - even if a field is entirely indexed, it doesn't necessarily
mean that it'll be easy for Lucene to reconstruct it. Although Lucene is a
search index, and not a database, if your fields are reasonably small, you
can ask Lucene to store them in the index. With the fields stored in the
index, instead of using the Document to locate the original file or data and
load it, you can actually pull the data out of the Document. This works best
with fairly small fields and documents that you'd need to parse for display
anyway.
Some fields contain bulk data and are so large that you don't really want to
store them in the index. You can still make your life a little easier by storing
not just the filename, but a Reader object in the Field. This makes it simpler
for your application to just get the Reader out of the Hit and use it to read in
the data to display it to the user.
The Field class itself is pretty simple; it pretty much consists of the instance
variables of the field, accessor methods for those instance variables, a toString()
method, and a normal constructor. The only special part is several convenient
static factory methods for manufacturing fields. These factory methods build
Fields that are appropriate for several typical uses. I've listed them in order of how
often they'd likely be used (in my unqualified opinion):
(Note: Yes, these method names are capitalized; if I had to guess, I'd say it's
probably because they're factory methods - they instantiate and return Field
objects with particular parameters.)
Factory Method                                Tokenized  Indexed  Stored  Use for
Field.Text(String name, String value)         Yes        Yes      Yes     contents you want stored
Field.Text(String name, Reader value)         Yes        Yes      No      contents you don't want stored
Field.Keyword(String name, String value)      No         Yes      Yes     values you don't want broken down
Field.UnIndexed(String name, String value)    No         No       Yes     values you don't want indexed
Field.UnStored(String name, String value)     Yes        Yes      No      values you don't want stored
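
Here's a made-up record that uses several of these factory methods side by
side, with comments on what each choice buys you:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class FieldChoices {
        public static Document build(String title, String body) {
            Document doc = new Document();
            doc.add(Field.Keyword("serial", "SN-1234-5678")); // one exact token
            doc.add(Field.Text("title", title));       // tokenized, indexed, stored
            doc.add(Field.UnStored("body", body));     // searchable, not stored
            doc.add(Field.UnIndexed("note", "draft")); // carried along, not searchable
            return doc;
        }
    }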

IndexWriter
The IndexWriter's job is to take the input (a Document), feed it through the
Analyzer you instantiate it with, and create an index. Using the IndexWriter itself
is fairly simple. You instantiate it with parameters for where to put the index files
and the Analyzer you want it to use for cleaning up the tokens. Then feed
Documents into IndexWriter.addDocument(). The actual index is a set of data files
that the IndexWriter creates in a location defined (depending on how you
instantiate the IndexWriter) by a Lucene Directory object, a File, or a path string.
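
In code, that's about all there is to it (a sketch; the path and analyzer
choice are arbitrary):

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    public class BuildIndex {
        public static void index(Document[] docs) throws IOException {
            // true = create a fresh index at this path, replacing any old one
            IndexWriter writer = new IndexWriter("/tmp/demo-index",
                                                 new StandardAnalyzer(), true);
            for (int i = 0; i < docs.length; i++) {
                writer.addDocument(docs[i]);
            }
            writer.optimize(); // merge index segments for faster searching
            writer.close();    // also releases the write lock
        }
    }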

Directory Objects
You can also store the index in a Lucene Directory object. A Lucene Directory is an
abstraction around the java filesystem classes. Using a Directory lets the Lucene
classes hide what exactly is going on. This in turn lets you do clever behind-the-
scenes things like keeping the file cached in memory for really high performance
by using the RAM-based Directory class (Lucene comes with two Directory classes,
one for file-based and one for RAM-based).
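
A sketch of both flavors (FSDirectory and RAMDirectory are the two Directory
classes that come with Lucene):

    import java.io.IOException;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    public class Directories {
        public static IndexWriter openWriter(boolean inMemory) throws IOException {
            Directory dir;
            if (inMemory) {
                // RAM-based: really fast, but gone when the JVM exits.
                dir = new RAMDirectory();
            } else {
                // File-based: the usual case; true = create the directory files.
                dir = FSDirectory.getDirectory("/tmp/demo-index", true);
            }
            return new IndexWriter(dir, new SimpleAnalyzer(), true);
        }
    }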

Analyzers and Tokenizers


The analyzer's job is to take apart a string of text and give you back a stream of
tokens. The tokens are usually words from the text content of the string, and
that's what gets stored (along with the location and other details) in the index.
Each analyzer includes one or more tokenizers and may include filters. The
tokenizers take care of the actual rules for where to break the text up into words
(typically whitespace). The filters do any post-tokenizing work on the tokens
(typically dropping out punctuation and commonly occurring words like "the",
"an", "a", etc).
Lucene provides an Analyzer abstract class, and three implementations of
Analyzer. Glossing over the details:
SimpleAnalyzer      seems to just use a Tokenizer that converts all of the input
                    to lower case.
StopAnalyzer        includes the lower-case filter, and also has a filter that
                    drops out any "stop words", words like articles (a, an, the,
                    etc) that occur so commonly in English that they might as
                    well be noise for searching purposes. StopAnalyzer comes
                    with a set of stop words, but you can instantiate it with
                    your own array of stop words.
StandardAnalyzer    does both lower-case and stop-word filtering, and in
                    addition tries to do some basic clean-up of words, for
                    example taking out apostrophes (') and removing periods from
                    acronyms (e.g. "T.L.A." becomes "TLA").
These analyzers are designed for English text. Several analyzers for other
languages have been developed by Lucene users; check the Lucene Sandbox. If you
can't find an analyzer for your language, it's pretty straightforward to
implement your own. Use a SimpleAnalyzer for now, to learn how it works.
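
For example, here's a minimal sketch of instantiating StopAnalyzer with a
custom stop list (the word list is arbitrary, just for illustration):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.StopAnalyzer;

    public class CustomStops {
        // An arbitrary example stop list; StopAnalyzer also comes with a
        // built-in default set of English stop words.
        private static final String[] STOP_WORDS = { "a", "an", "the", "recipe" };

        public static Analyzer make() {
            return new StopAnalyzer(STOP_WORDS);
        }
    }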

Searching In Depth
To actually do the search, you need an IndexSearcher, but we'll get to that in a
moment; before you can even think about feeding the IndexSearcher a query, you
have to have a Query object. The IndexSearcher does the actual munging through
the index, but it only understands Query objects.

Query and QueryParser Objects


You produce the Query object by feeding the user's argument string into
QueryParser.parse(), along with a string for the default field to search (if the user
doesn't specify which field to search) and an Analyzer. The Analyzer is what
QueryParser uses to tokenize the argument string. (Gotcha Warning: remember,
again, you have to make sure that you use the same flavor Analyzer for tokenizing
the argument string as you used for tokenizing the Index. StopAnalyzer is
probably a safe choice for this, since that's the one used in the example code.)
QueryParser.parse() returns a Query.
QueryParser has a static version of parse(), which I guess is there for convenience.
You can instantiate a QueryParser with an Analyzer and default field String and
keep it around. However, note that QueryParser is not thread-safe, so each thread
will need its own QueryParser.
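
In code, both flavors look something like this (a sketch; "contents" as the
default field is an assumption carried over from the examples above):

    import org.apache.lucene.analysis.StopAnalyzer;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class ParseQueries {
        // Static convenience version: query string, default field, analyzer.
        public static Query parseOnce(String userInput) throws ParseException {
            return QueryParser.parse(userInput, "contents", new StopAnalyzer());
        }

        // Reusable instance; but remember, one QueryParser per thread.
        public static Query parseWithInstance(String userInput) throws ParseException {
            QueryParser parser = new QueryParser("contents", new StopAnalyzer());
            return parser.parse(userInput);
        }
    }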

Digression: Thread Safety


Doug Cutting has posted on the topic of thread safety a couple of times. Indexing
and searching are not only thread safe, but process safe. What this means is that:
• Multiple index searchers can read the Lucene index files at the same time.
• An index writer or reader can edit the Lucene index files while searches are
ongoing.
• Multiple index writers or readers can try to edit the Lucene index files at the
same time (it's important for the index writer/reader to be closed so it will
release the file lock).
However, the query parser is not thread safe, so each thread using the index
should have its own query parser.
The index writer, however, is thread safe, so you can update the index while
people are searching it. You then have to make sure that the threads with open
index searchers close them and open new ones, to get the newly updated data.
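
Here's a sketch of that close-and-reopen pattern (the class and method names
are mine, just for illustration):

    import java.io.IOException;
    import org.apache.lucene.search.IndexSearcher;

    public class SearcherRefresh {
        private final String indexPath;
        private IndexSearcher searcher;

        public SearcherRefresh(String indexPath) throws IOException {
            this.indexPath = indexPath;
            this.searcher = new IndexSearcher(indexPath);
        }

        // Call after an IndexWriter has updated the index and closed,
        // so subsequent searches see the newly updated data.
        public synchronized void refresh() throws IOException {
            searcher.close();
            searcher = new IndexSearcher(indexPath);
        }

        public synchronized IndexSearcher current() {
            return searcher;
        }
    }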

IndexSearchers
To get an IndexSearcher, you simply instantiate an IndexSearcher with a single
argument that tells Lucene where to find an existing index. The argument is one
of these two:
• a string containing a path to the file,
• a Lucene Directory object (see the section about Directory objects under
"Indexing In Depth", above)

Digression: IndexReaders
(You can safely skip this section, as it's just me meandering through the Lucene
source code; not a whole lot of practical value here yet).
There's actually a third option for instantiating an IndexSearcher; you can
instantiate it with any class that is a concrete subclass of the abstract class
IndexReader.
This makes more sense if you take a peek at the code for IndexSearcher. The
other two constructors just turn your file path or Directory object into an
IndexReader by calling the static method IndexReader.open(). Just for kicks, let's
do a little more digging and see that IndexReader.open() takes either a String file
path or a java File object, uses it to instantiate a Lucene Directory object, and
then calls open(Directory).
NOTE: I have to admit, I'm a little confused at this point, since the API docs say
IndexReader is abstract (which means it can't be instantiated). Presumably that
means IndexReader.open(), a static factory method, instantiates an appropriate
concrete subclass of IndexReader and returns it. However, the API docs don't
show any concrete subclasses of IndexReader. Since I'm too lazy at the moment to
look through the source... oh, all right, I'm not too lazy to look through the source.
Hm. It appears the API docs are out of date; the com/lucene/index directory
appears to contain a SegmentReader, which IndexReader.open() uses.

Multiple Indexes
If you're searching a single index, you use an IndexSearcher with a single index. If
you need to search across multiple indexes, you instantiate one IndexSearcher per
index, create an array, stick the IndexSearcher instances in the array, and
instantiate a MultiSearcher with the array as an argument.
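
A sketch, with made-up index paths:

    import java.io.IOException;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Searchable;

    public class SearchTwoIndexes {
        public static MultiSearcher open() throws IOException {
            // One IndexSearcher per index, collected into an array.
            Searchable[] searchers = new Searchable[2];
            searchers[0] = new IndexSearcher("/tmp/index-one");
            searchers[1] = new IndexSearcher("/tmp/index-two");
            return new MultiSearcher(searchers);
        }
    }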

Doing The Search


To actually do the search, you take the argument string the user enters, pass it to
a QueryParser and get back a parsed Query object (and remember (third time's
the charm) to use the right kind of Analyzer when you instantiate the
QueryParser; use the same sort of Analyzer that you used when you built
the index; the QueryParser'll use the Analyzer to tokenize the argument string).
Then you feed the parsed Query to the IndexSearcher.search(). The return is a
Hits object, which is a collection of Document objects for documents that matched
the search parameters. The Hits object also includes a score for each Document,
indicating how well it matched.

Hits
IndexSearcher.search(Query) returns a "Hits" object, which is sort of like a Vector,
containing a ranked list of Lucene Document objects. These are the same
Document objects you fed into the IndexWriter, but specifically the ones that
matched your search. Now you need to format the hits for a display, or
manufacture HREFs pointing to the original documents, or whatever you were
basically planning to do with the search results.
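
A sketch of walking the Hits, assuming you stored a "path" field as in the
indexing examples above:

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.Hits;

    public class ShowHits {
        public static void print(Hits hits) throws IOException {
            for (int i = 0; i < hits.length(); i++) {
                Document doc = hits.doc(i); // the i-th matching Document
                // score first (how well it matched), then the stored field
                System.out.println(hits.score(i) + "  " + doc.get("path"));
            }
        }
    }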

What's Not Mentioned Here


There are classes in the Lucene project that didn't get mentioned here, or only got
mentioned in passing. After all, the point of a tutorial is as much what NOT to tell
you (yet) as what to tell you. Otherwise I'd just say Use The Source, Luke.
I highly recommend sitting down with this tutorial and following through the
source of the demo classes first. Then, go back and do it again, only this time
when the demo class does something with a Lucene class, go look at the source of
the Lucene class and see what it's doing. Not only is this a good way to learn
about Lucene, it's an excellent way to learn more about programming.

Someday To Come
Next we'll go through this process again, and actually build an example program
to index some files and then do searches against that index.
After that, we'll actually build a basic web search engine, using servlets and JSP.
We've already seen that Lucene is a piece of cake to use, and the servlet/jsp stuff
isn't much harder (unless you want to make it harder, which of course is possible
to do). This will also introduce the whole question of multithreading Lucene.
Fortunately, Lucene makes this really, really easy, because most - or all - of the
key Lucene classes are thread-safe.
Copyright 2001 by Steven J. Owens, all rights reserved.
