Indexing and Search Using Embedded Lucene
public void createAndClearLuceneIndex()
       throws CorruptIndexException,
              LockObtainFailedException,
              IOException {
  deleteFilePath(new File(data_store_file_root + "/lucene_index"));
  File index_dir = new File(data_store_file_root + "/lucene_index");
  new IndexWriter(index_dir, new StandardAnalyzer(), true).close();
}

If you are using an existing disk-based index that you want to reuse, then do not call the method createAndClearLuceneIndex. The last argument to the IndexWriter constructor is a flag: when true, a new index is created, overwriting any existing index. I use the utility method deleteFilePath to make sure that all files from any previous indices using the same top-level file path are deleted. The method addDocumentToIndex is used to add new documents to the index. Here we call the IndexWriter constructor with a value of false for the last argument to avoid overwriting the index each time addDocumentToIndex is called:
public void addDocumentToIndex(
       String document_original_uri,
       String document_plain_text)
       throws CorruptIndexException, IOException {
  File index_dir = new File(data_store_file_root + "/lucene_index");
  writer = new IndexWriter(index_dir,
                           new StandardAnalyzer(), false);
  Document doc = new Document();
  // store URI in index; do not index:
  doc.add(new Field("uri", document_original_uri,
                    Field.Store.YES, Field.Index.NO));
  // store text in index; index:
  doc.add(new Field("text", document_plain_text,
                    Field.Store.YES, Field.Index.TOKENIZED));
  writer.addDocument(doc);
  writer.optimize();  // optional
  writer.close();
}
You can add fields as needed when you create individual Lucene Document objects, but you will want to use the same fields throughout your application: it is not good to have different documents in an index with different fields. There are a few things that you may want to change if you use this class as an implementation example in your own projects. If you are adding many documents to the index in a short time period, then it is inefficient to open the index, add one document, and then optimize and close the index; you might want to add a method that accepts collections of URIs and document text strings for batch inserts. You also may not want to store the document text in the index if you are already storing the document text somewhere else, perhaps in a database.
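The batch-insert idea can be sketched independently of the Lucene API. In this hypothetical sketch (the class name BatchInserter and its methods are illustrative, not part of Lucene or of my LuceneManager class), (URI, text) pairs are buffered and handed to a single flush action, so in a real implementation one IndexWriter would be opened, used for every buffered document, and closed once per batch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of batching index inserts: buffer (URI, text)
// pairs and hand them to a single flush action, so the index is opened
// and closed once per batch instead of once per document.
class BatchInserter {
  private final List<String[]> buffer = new ArrayList<>();
  private final Consumer<List<String[]>> flushAction;

  BatchInserter(Consumer<List<String[]>> flushAction) {
    this.flushAction = flushAction;
  }

  void add(String uri, String text) {
    buffer.add(new String[]{uri, text});
  }

  // In a real LuceneManager the flush action would open one IndexWriter,
  // call addDocument() for every buffered pair, then optimize and close.
  void flush() {
    flushAction.accept(new ArrayList<>(buffer));
    buffer.clear();
  }
}
```

The point of the design is that the expensive open/optimize/close cycle is paid once per flush rather than once per document.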
There are two search methods in my LuceneManager class: one returns just the document URIs for search matches and the other returns both URIs and the original document text. Both of these methods open an instance of IndexReader for each query. For high search volume in a multi-threaded environment, you may want to create a pool of IndexReader instances and reuse them. There are several text analyzer classes in Lucene, and you should use the same analyzer class when adding indexed text fields to the index as when you perform queries. In the two search methods I use the same StandardAnalyzer class that I used when adding documents to the index. The following method returns a list of string URIs for matched documents:
public List<String> searchIndexForURIs(String search_query)
       throws ParseException, IOException {
  reader = IndexReader.open(data_store_file_root + "/lucene_index");
  List<String> ret = new ArrayList<String>();
  Searcher searcher = new IndexSearcher(reader);
  Analyzer analyzer = new StandardAnalyzer();
  QueryParser parser = new QueryParser("text", analyzer);
  Query query = parser.parse(search_query);
  Hits hits = searcher.search(query);
  for (int i = 0; i < hits.length(); i++) {
    System.out.println("searchIndexForURIs: hit: " + hits.doc(i));
    Document doc = hits.doc(i);
    String uri = doc.get("uri");
    ret.add(uri);
  }
  reader.close();
  return ret;
}
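As an aside, the IndexReader pooling mentioned earlier can be sketched generically. In this sketch the pooled type is a type parameter T (it would be IndexReader in a real deployment); the class name SimplePool and its methods are hypothetical, not part of Lucene, and a production pool would also handle reopening stale readers:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

// Hypothetical sketch of a fixed-size resource pool: a set number of
// shared instances (IndexReader in the real case, any T here) is handed
// out to query threads and returned afterwards, so each query does not
// pay the cost of opening its own reader.
class SimplePool<T> {
  private final BlockingQueue<T> pool;

  SimplePool(Supplier<T> factory, int size) {
    pool = new ArrayBlockingQueue<>(size);
    for (int i = 0; i < size; i++) pool.add(factory.get());
  }

  // Blocks until an instance is free.
  T acquire() {
    try {
      return pool.take();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException(e);
    }
  }

  void release(T instance) { pool.add(instance); }
}
```

A blocking queue gives the required thread safety for free: acquire blocks when all instances are checked out, and release makes an instance available again.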
The Lucene class Hits holds returned search matches; here we use its APIs to get the number of hits and, for each hit, to get back an instance of the Lucene class Document. Note that field values are retrieved by name, in this case “uri.” The other search method in my utility class, searchIndexForURIsAndDocText, is almost the same as searchIndexForURIs, so I will only show the differences:
public List<String[]> searchIndexForURIsAndDocText(
       String search_query) throws Exception {
  List<String[]> ret = new ArrayList<String[]>();
  ...
  for (int i = 0; i < hits.length(); i += 1) {
    Document doc = hits.doc(i);
    System.out.println("hit: " + hits.doc(i));
    String[] pair =
        new String[]{doc.get("uri"), doc.get("text")};
    ret.add(pair);
  }
  ...
  return ret;
}

Here we also return the original text from matched documents, which we get by fetching the named field “text.” The following code snippet is an example of using the LuceneManager class:
LuceneManager lm = new LuceneManager("/tmp");
// start fresh: create a new index:
lm.createAndClearLuceneIndex();
lm.addDocumentToIndex("file:/tmp/test1.txt",
    "This is a test for index and a test for search.");
lm.addDocumentToIndex("file:/tmp/test2.txt",
    "Please test the index code.");
lm.addDocumentToIndex("file:/tmp/test3.txt",
    "Please test the index code before tomorrow.");
// get URIs of matching documents:
List<String> doc_uris =
    lm.searchIndexForURIs("test, index");
System.out.println("Matched document URIs: " + doc_uris);
// get URIs and document text for matching documents:
List<String[]> doc_uris_with_text =
    lm.searchIndexForURIsAndDocText("test, index");
for (String[] uri_and_text : doc_uris_with_text) {
  System.out.println("Matched document URI: " +
                     uri_and_text[0]);
  System.out.println("        document text: " +
                     uri_and_text[1]);
}
Here is the sample output (with the debug printout from deleting the old disk-based test index removed):
Matched document URIs: [file:/tmp/test1.txt,
  file:/tmp/test2.txt, file:/tmp/test3.txt]
Matched document URI: file:/tmp/test1.txt
        document text: This is a test for index and a test for search.
Matched document URI: file:/tmp/test2.txt
        document text: Please test the index code.
Matched document URI: file:/tmp/test3.txt
        document text: Please test the index code before tomorrow.
I use the Lucene library frequently on customer projects, and although tailoring Lucene to specific applications is not simple, the wealth of options for analyzing text and maintaining disk-based indices makes Lucene a very good tool. Lucene is also very efficient and scales well to very large indices.
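Part of the reason indexed search scales is the underlying data structure: an inverted index mapping each term to the documents that contain it, so a query looks up terms directly instead of scanning document text. Here is a toy plain-Java sketch of the principle (no Lucene; the class name and its simple whitespace/punctuation tokenization are illustrative only, and Lucene's on-disk structures are far more elaborate):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: maps each lower-cased term to the set of document
// URIs containing it, giving direct term-to-document lookup at query time.
class ToyInvertedIndex {
  private final Map<String, Set<String>> index = new HashMap<>();

  void add(String uri, String text) {
    for (String term : text.toLowerCase().split("\\W+")) {
      if (!term.isEmpty())
        index.computeIfAbsent(term, t -> new TreeSet<>()).add(uri);
    }
  }

  // Returns documents matching any query term (OR semantics).
  Set<String> search(String query) {
    Set<String> hits = new TreeSet<>();
    for (String term : query.toLowerCase().split("\\W+"))
      hits.addAll(index.getOrDefault(term, Collections.emptySet()));
    return hits;
  }
}
```

Query cost in this sketch depends on the number of matching documents per term, not on the total volume of indexed text, which is the essential property that lets real systems like Lucene scale.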
In Section 10.5 we will look at the Nutch system, which is built on top of Lucene and provides a complete turnkey (but also highly customizable) solution for implementing search in large-scale projects where it does not make sense to use Lucene in an embedded mode as we did in this section.