10.4 Indexing and Search Using Embedded Lucene

public void createAndClearLuceneIndex()
       throws CorruptIndexException,
              LockObtainFailedException,
              IOException {
  deleteFilePath(new File(data_store_file_root +
                          "/lucene_index"));
  File index_dir = new File(data_store_file_root +
                            "/lucene_index");
  new IndexWriter(index_dir,
                  new StandardAnalyzer(),
                  true).close();
}

If you are using an existing disk-based index that you want to reuse, then do not call the method createAndClearLuceneIndex. The last argument to the IndexWriter constructor is a flag to create a new index, overwriting any existing indices. I use the utility method deleteFilePath to make sure that all files from any previous indices using the same top-level file path are deleted. The method addDocumentToIndex is used to add new documents to the index. Here we call the IndexWriter constructor with a value of false for the last argument to avoid overwriting the index each time the method addDocumentToIndex is called:

public void addDocumentToIndex(
        String document_original_uri,
        String document_plain_text)
       throws CorruptIndexException, IOException {
  File index_dir =
      new File(data_store_file_root + "/lucene_index");
  writer = new IndexWriter(index_dir,
                           new StandardAnalyzer(),
                           false);
  Document doc = new Document();
  // store URI in index; do not index:
  doc.add(new Field("uri",
                    document_original_uri,
                    Field.Store.YES,
                    Field.Index.NO));
  // store text in index; index:
  doc.add(new Field("text",
                    document_plain_text,
                    Field.Store.YES,
                    Field.Index.TOKENIZED));
  writer.addDocument(doc);
  writer.optimize(); // optional
  writer.close();
}

You can add fields as needed when you create individual Lucene Document objects, but you will want to use the same set of fields for every document in your application: it is not good to have different documents in an index with different fields.

There are a few things that you may want to change if you use this class as an implementation example in your own projects. If you are adding many documents to the index in a short time period, then it is inefficient to open the index, add one document, and then optimize and close the index. You might want to add a method that takes collections of URIs and document text strings for batch inserts (a sketch of such a method follows this discussion). You also may not want to store the document text in the index if you are already storing document text somewhere else, perhaps in a database.

There are two search methods in my LuceneManager class: one just returns the document URIs for search matches and the other returns both URIs and the original document text. Both of these methods open an instance of IndexReader for each query. For high search volume operations in a multi-threaded environment, you may want to create a pool of IndexReader instances and reuse them (also sketched below). There are several text analyzer classes in Lucene and you should use the same analyzer class when adding indexed text fields to the index as when you perform queries. In the two search methods I use the same StandardAnalyzer class that I used when adding documents to the index.
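Here is a minimal sketch of the batch-insert method suggested above; it is not part of the LuceneManager class as written. The method name addDocumentsToIndex and the use of two parallel lists are my own choices for illustration, and the sketch assumes the same data_store_file_root field and the same "uri" and "text" field names used elsewhere in this section:

// Sketch: open the IndexWriter once, add every document,
// then optimize and close once at the end.
public void addDocumentsToIndex(List<String> uris,
                                List<String> texts)
       throws CorruptIndexException, IOException {
  File index_dir =
      new File(data_store_file_root + "/lucene_index");
  IndexWriter writer =
      new IndexWriter(index_dir,
                      new StandardAnalyzer(),
                      false);
  for (int i = 0; i < uris.size(); i++) {
    Document doc = new Document();
    doc.add(new Field("uri", uris.get(i),
                      Field.Store.YES, Field.Index.NO));
    doc.add(new Field("text", texts.get(i),
                      Field.Store.YES,
                      Field.Index.TOKENIZED));
    writer.addDocument(doc);
  }
  writer.optimize(); // once, after all of the adds
  writer.close();
}

A reader pool could be as simple as the following sketch, again an illustration rather than code from this chapter. It uses the standard java.util.concurrent classes and the same IndexReader.open call that the search methods below use; callers borrow a reader, run a query, and return the reader when done:

import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import org.apache.lucene.index.IndexReader;

public class IndexReaderPool {
  private final BlockingQueue<IndexReader> pool;
  public IndexReaderPool(String index_path, int size)
         throws IOException {
    pool = new ArrayBlockingQueue<IndexReader>(size);
    for (int i = 0; i < size; i++) {
      pool.add(IndexReader.open(index_path));
    }
  }
  public IndexReader borrowReader()
         throws InterruptedException {
    return pool.take(); // blocks if all readers are in use
  }
  public void returnReader(IndexReader reader) {
    pool.offer(reader);
  }
}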
The following method returns a list of string URIs for matched documents:

public List<String> searchIndexForURIs(String search_query)
       throws ParseException, IOException {
  reader = IndexReader.open(data_store_file_root +
                            "/lucene_index");
  List<String> ret = new ArrayList<String>();
  Searcher searcher = new IndexSearcher(reader);
  Analyzer analyzer = new StandardAnalyzer();
  QueryParser parser = new QueryParser("text", analyzer);
  Query query = parser.parse(search_query);
  Hits hits = searcher.search(query);
  for (int i = 0; i < hits.length(); i++) {
    System.out.println("searchIndexForURIs: hit: " +
                       hits.doc(i));
    Document doc = hits.doc(i);
    String uri = doc.get("uri");
    ret.add(uri);
  }
  reader.close();
  return ret;
}

The Lucene class Hits is used for returned search matches and here we use its APIs to get the number of hits and, for each hit, to get back an instance of the Lucene class Document. Note that the field values are retrieved by name, in this case "uri." The other search method in my utility class, searchIndexForURIsAndDocText, is almost the same as searchIndexForURIs so I will only show the differences:

public List<String[]> searchIndexForURIsAndDocText(
        String search_query) throws Exception {
  List<String[]> ret = new ArrayList<String[]>();
  ...
  for (int i = 0; i < hits.length(); i += 1) {
    Document doc = hits.doc(i);
    System.out.println("hit: " + hits.doc(i));
    String[] pair =
        new String[]{doc.get("uri"), doc.get("text")};
    ret.add(pair);
  }
  ...
  return ret;
}

Here we also return the original text from matched documents, which we get by fetching the named field "text." The following code snippet is an example of using the LuceneManager class:

LuceneManager lm = new LuceneManager("/tmp");
// start fresh: create a new index:
lm.createAndClearLuceneIndex();
lm.addDocumentToIndex("file://tmp/test1.txt",
    "This is a test for index and a test for search.");
lm.addDocumentToIndex("file://tmp/test2.txt",
    "Please test the index code.");
lm.addDocumentToIndex("file://tmp/test3.txt",
    "Please test the index code before tomorrow.");
// get URIs of matching documents:
List<String> doc_uris =
    lm.searchIndexForURIs("test, index");
System.out.println("Matched document URIs: " + doc_uris);
// get URIs and document text for matching documents:
List<String[]> doc_uris_with_text =
    lm.searchIndexForURIsAndDocText("test, index");
for (String[] uri_and_text : doc_uris_with_text) {
  System.out.println("Matched document URI: " +
                     uri_and_text[0]);
  System.out.println("     document text: " +
                     uri_and_text[1]);
}

and here is the sample output (with the debug printout from deleting the old test disk-based index removed):

Matched document URIs: [file://tmp/test1.txt,
                        file://tmp/test2.txt,
                        file://tmp/test3.txt]
Matched document URI: file://tmp/test1.txt
     document text: This is a test for index and
                    a test for search.
Matched document URI: file://tmp/test2.txt
     document text: Please test the index code.
Matched document URI: file://tmp/test3.txt
     document text: Please test the index code
                    before tomorrow.

I use the Lucene library frequently on customer projects and although tailoring Lucene to specific applications is not simple, the wealth of options for analyzing text and maintaining disk-based indices makes Lucene a very good tool. Lucene is also very efficient and scales well to very large indices.

In Section 10.5 we will look at the Nutch system, which is built on top of Lucene and provides a complete turnkey (but also highly customizable) solution to implementing search in large scale projects where it does not make sense to use Lucene in an embedded mode as we did in this section.

10.5 Indexing and Search with Nutch Clients

This is the last section in this book, and we have a great topic for finishing the book: the Nutch system, a very useful tool for information storage and retrieval. Out of the box, it only takes about 15 minutes to set up a "vanilla" Nutch server with the default web interface for searching documents. Nutch can be configured to index documents on a local file system and contains utilities for processing a wide range of document types (Microsoft Office, OpenOffice.org, PDF, HTML, etc.). You can also configure Nutch to spider remote and local private (usually on a company LAN) web sites.

The Nutch web site (http://lucene.apache.org/nutch) contains binary distributions and tutorials for quickly setting up a Nutch system and I will not repeat all of these directions here. What I do want to show you is how I usually use the Nutch system on customer projects: after I configure Nutch to periodically "spider" customer-specific data sources, I then use a web services client library to integrate Nutch with other systems that need both document repository and search functionality.

Although you can tightly couple your Java applications with Nutch using the Nutch API, I prefer to use the OpenSearch API, an extension of RSS 2.0 for performing search using web service calls. OpenSearch was originally developed for Amazon's A9.com search engine and may become widely adopted since it is a reasonable standard. More information on the OpenSearch standard can be found at http://www.opensearch.org but I will cover the basics here.
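To make the web service idea concrete, here is a minimal client sketch in plain Java using only the JDK's networking and XML libraries. It assumes only that the server answers an HTTP GET with an RSS 2.0 feed containing the standard title and link elements for each hit; the endpoint URL and query parameter name are placeholders that you would replace with the values for your own Nutch installation:

import java.net.URL;
import java.net.URLEncoder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class OpenSearchClientSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder endpoint: adjust the host, port, and
    // parameter name for your own Nutch server.
    String query = URLEncoder.encode("test index", "UTF-8");
    URL url =
        new URL("http://localhost:8080/opensearch?query="
                + query);
    // Parse the RSS 2.0 response with the standard JAXP
    // DOM parser.
    Document rss = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(url.openStream());
    // Each search hit is an <item> element with <title>
    // and <link> children.
    NodeList items = rss.getElementsByTagName("item");
    for (int i = 0; i < items.getLength(); i++) {
      Element item = (Element) items.item(i);
      String title = item.getElementsByTagName("title")
                         .item(0).getTextContent();
      String link = item.getElementsByTagName("link")
                        .item(0).getTextContent();
      System.out.println(title + " -> " + link);
    }
  }
}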

10.5.1 Nutch Server Fast Start Setup

For completeness, I will quickly go over the steps I use to set up Tomcat version 6 with Nutch. For this discussion, I assume that you have unpacked Tomcat and changed the directory name to Tomcat6_Nutch, that you have removed all files from the directory Tomcat6_Nutch/webapps/, and that you have then moved the nutch-0.9.war file (I am using Nutch version 0.9) to the Tomcat webapps directory, changing its name to ROOT.war:

Tomcat6_Nutch/webapps/ROOT.war

I then move the directory nutch-0.9 to:

Tomcat6_Nutch/nutch

The file Tomcat6_Nutch/nutch/conf/crawl-urlfilter.txt needs to be edited to specify a combination of local and remote data sources; here I have configured it to spider just my http://knowledgebooks.com web site (the only changes I had to make are two lines, one of them a comment line, containing the string "knowledgebooks.com"):

# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG| ...

# skip URLs containing certain characters as
# probable queries, etc.
-[?*!@=]
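The two "knowledgebooks.com" lines themselves are not shown in the excerpt above. As an illustration only, an accept rule added to crawl-urlfilter.txt usually follows the pattern of Nutch's default host-accept entry, so for this site it would look something like the following (check the comments in your own copy of the file for the exact syntax used by your Nutch version):

# accept hosts in knowledgebooks.com
+^http://([a-z0-9]*\.)*knowledgebooks.com/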