
Solr: Indexing XML with Lucene and REST

by Bertrand Delacretaz
August 09, 2006

Solr (pronounced "solar") builds on the well-known Lucene search engine library
to create an enterprise search server with a simple HTTP/XML interface. With
Solr, you can index large collections of documents based on strongly typed field
definitions, taking full advantage of Lucene's powerful full-text search
features. This article describes Solr's indexing interface and its main features, and
shows how field-type definitions are used for precise content analysis.

Solr began at CNET Networks, where it is used to provide high-relevancy search
and faceted browsing capabilities. Although quite new as a public project (the
code was first published in January 2006), it is already used for several high-traffic
websites.

The project is currently in incubation at the Apache Software Foundation (ASF).
This means that it is a candidate for becoming an official project of the ASF, after
an observation phase during which the project's community and code are
examined for conformity to the ASF's principles (see the incubator homepage for
more info).

Solr in Ten Minutes

The tutorial provided on the Solr website gives a good overview of how Solr works
and integrates with your system.

To run Solr, you'll need a Java 1.5 virtual machine and, optionally, a scripting
environment (a bash shell, or Cygwin if you're running Windows) to run the
provided utility and test scripts.

The HTTP/XML interface of the indexer has two main access points: the update
URL, which maintains the index, and the select URL, which is used for queries. In
the default configuration, they are found at:
http://localhost:8983/solr/update
http://localhost:8983/solr/select

To add a document to the index, we POST an XML representation of the fields to
index to the update URL. The XML looks like the example below, with a <field>
element for each field to index. Such documents represent the metadata and
content of the actual documents or business objects that we're indexing. Any data
is indexable as long as it can be converted to this simple format.
<add>
  <doc>
    <field name="id">9885A004</field>
    <field name="name">Canon PowerShot SD500</field>
    <field name="category">camera</field>
    <field name="features">3x optical zoom</field>
    <field name="features">aluminum case</field>
    <field name="weight">6.4</field>
    <field name="price">329.95</field>
  </doc>
</add>

The <add> element tells Solr that we want to add the document to the index (or
replace it if it's already indexed), and with the default configuration, the id field is
used as a unique identifier for the document. Posting another document with the
same id will overwrite existing fields and add new ones to the indexed data.

Note that the added document isn't yet visible in queries. To speed up the
addition of multiple documents (an <add> element can contain multiple <doc>
elements), changes aren't committed after each document, so we must POST an
XML document containing a <commit> element to make our changes visible.

This is all handled by the post.sh script provided in the Solr examples, which uses
curl to do the POST. Clients for several languages (Ruby, PHP, Java, Python) are
provided on the Solr wiki, but, of course, any language that can do an HTTP POST
will be able to talk to Solr.
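As a rough sketch of what such a client looks like (the function and variable
names below are illustrative, not taken from any of the official client libraries),
the <add> payload can be built and POSTed using only Python's standard library:

```python
# Hypothetical minimal Solr update client: builds the <add> XML shown
# above and POSTs it, followed by a <commit/> to make the changes visible.
# Assumes the default example server location from the Solr tutorial.
from xml.sax.saxutils import escape
from urllib.request import Request, urlopen

UPDATE_URL = "http://localhost:8983/solr/update"

def add_xml(fields):
    """Render one document's fields as Solr <add> XML.
    A list value becomes repeated <field> elements (as with "features")."""
    parts = ["<add><doc>"]
    for name, value in fields.items():
        values = value if isinstance(value, list) else [value]
        for v in values:
            parts.append('<field name="%s">%s</field>'
                         % (escape(name), escape(str(v))))
    parts.append("</doc></add>")
    return "".join(parts)

def post_xml(xml_payload, url=UPDATE_URL):
    """POST an XML payload to the update URL (requires a running Solr)."""
    req = Request(url, data=xml_payload.encode("utf-8"),
                  headers={"Content-Type": "text/xml"})
    return urlopen(req).read()

doc = {"id": "9885A004",
       "name": "Canon PowerShot SD500",
       "features": ["3x optical zoom", "aluminum case"]}
payload = add_xml(doc)
# With a Solr instance running:
# post_xml(payload)        # index (or replace) the document
# post_xml("<commit/>")    # commit, so queries can see the change
```

Note that the commit is just another POST to the same update URL, exactly as
described above.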

Once we have indexed some data, an HTTP GET on the select URL does the
querying. The example below searches for the word "video" in the default search
field and asks for the name and id fields to be included in the response.
$ export URL="http://localhost:8983/solr/select/"
$ curl "$URL?indent=on&q=video&fl=name,id"

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <responseHeader>
    <status>0</status><QTime>1</QTime>
  </responseHeader>
  <result numFound="2" start="0">
    <doc>
      <str name="id">MA147LL/A</str>
      <str name="name">Apple 60 GB iPod Black</str>
    </doc>
    <doc>
      <str name="id">EN7800GTX/2DHTV/256M</str>
      <str name="name">ASUS Extreme N7800GTX</str>
    </doc>
  </result>
</response>
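On the client side, this response format is easy to consume. Here is a minimal,
hypothetical sketch (again using only Python's standard library) that pulls the
id and name fields out of a response like the one above:

```python
# Sketch: parsing a Solr select response with the standard library.
# RESPONSE is the example response shown above, inlined for illustration;
# in practice the text would come from an HTTP GET on the select URL.
import xml.etree.ElementTree as ET

RESPONSE = """<?xml version="1.0" encoding="UTF-8"?>
<response>
  <responseHeader><status>0</status><QTime>1</QTime></responseHeader>
  <result numFound="2" start="0">
    <doc>
      <str name="id">MA147LL/A</str>
      <str name="name">Apple 60 GB iPod Black</str>
    </doc>
    <doc>
      <str name="id">EN7800GTX/2DHTV/256M</str>
      <str name="name">ASUS Extreme N7800GTX</str>
    </doc>
  </result>
</response>"""

def parse_docs(xml_text):
    """Return a list of {field-name: value} dicts, one per <doc> in the result."""
    root = ET.fromstring(xml_text)
    return [{f.get("name"): f.text for f in doc}
            for doc in root.find("result").findall("doc")]

for d in parse_docs(RESPONSE):
    print(d["id"], "-", d["name"])
```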

As you can imagine, there's much more to this, but those are the basics: POST an
XML document to have it indexed, do another POST to commit changes, and make
a GET request to query the index.
This simple and thin interface makes it easy to create a system-wide indexing
service with Solr: convert the relevant parts of your business objects or
documents to the simple XML required for indexing, and index all of your data in a
single place--whatever its source--combining full-text and typed fields. At this
point, the data-mining area of your brain should start blinking happily--at least
mine does!

Now that we have the basics covered, let's examine the indexing and search
interfaces in more detail.
