by Bertrand Delacretaz
August 09, 2006
Solr (pronounced "solar") builds on the well-known Lucene search engine library
to create an enterprise search server with a simple HTTP/XML interface. With
Solr, you can index large collections of documents based on strongly typed field
definitions, taking advantage of Lucene's powerful full-text search features.
This article describes Solr's indexing interface and its main features, and
shows how field-type definitions are used for precise content analysis.
The tutorial provided on the Solr website gives a good overview of how Solr works
and integrates with your system.
To run Solr, you'll need a Java 1.5 virtual machine and, optionally, a scripting
environment (a bash shell, or Cygwin if you're running Windows) to run the
provided utility and test scripts.
The HTTP/XML interface of the indexer has two main access points: the update
URL, which maintains the index, and the select URL, which is used for queries. In
the default configuration, they are found at:
http://localhost:8983/solr/update
http://localhost:8983/solr/select
An <add> element POSTed to the update URL tells Solr to add the enclosed
document to the index (or replace it if it's already indexed). With the default
configuration, the id field serves as the document's unique identifier: posting
another document with the same id overwrites existing fields and adds new ones
to the indexed data.
Note that the added document isn't yet visible in queries. To speed up the
addition of multiple documents (an <add> element can contain multiple <doc>
elements), changes aren't committed after each document, so we must POST an
XML document containing a <commit> element to make our changes visible.
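Assuming a Solr instance is running at the default port, the commit step can be
sketched with curl directly (this is essentially what the provided scripts do;
the URL and content type match the default configuration described above):

```shell
# POST a <commit/> element to the update URL so that
# previously added documents become visible to queries
curl http://localhost:8983/solr/update \
     --data-binary '<commit/>' \
     -H 'Content-Type: text/xml; charset=utf-8'
```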
This is all handled by the post.sh script provided in the Solr examples, which uses
curl to do the POST. Clients for several languages (Ruby, PHP, Java, Python) are
provided on the Solr wiki, but, of course, any language that can do an HTTP POST
will be able to talk to Solr.
Once we have indexed some data, an HTTP GET on the select URL does the
querying. The example below searches for the word "video" in the default search
field and asks for the name and id fields to be included in the response.
$ export URL="http://localhost:8983/solr/select/"
$ curl "$URL?indent=on&q=video&fl=name,id"
As you can imagine, there's much more to this, but those are the basics: POST an
XML document to have it indexed, do another POST to commit changes, and make
a GET request to query the index.
This simple and thin interface makes it easy to create a system-wide indexing
service with Solr: convert the relevant parts of your business objects or
documents to the simple XML required for indexing, and index all of your data in a
single place--whatever its source--combining full-text and typed fields. At this
point, the data-mining area of your brain should start blinking happily--at least
mine does!
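The XML needed for indexing is simple enough to generate from almost any
source. As a rough sketch of that conversion step, here is a small
(hypothetical) shell function that wraps an id/name pair in an `<add>` document
of the form Solr's update URL expects; real code would also XML-escape the
field values:

```shell
#!/bin/sh
# Wrap an id and a name in the <add><doc> XML that Solr's
# update URL expects. Values are inserted as-is: production
# code must escape &, <, and > in the field contents.
solr_add_doc() {
  printf '<add><doc>'
  printf '<field name="id">%s</field>' "$1"
  printf '<field name="name">%s</field>' "$2"
  printf '</doc></add>\n'
}

solr_add_doc "SP2514N" "Samsung SpinPoint P120 SP2514N"
```

The output of such a function can then be piped straight into a curl POST
against the update URL, followed by a commit.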
Now that we have the basics covered, let's examine the indexing and search
interfaces in more detail.