10.3 Down to the Bare Metal: In-Memory Index and Search

        id = count++;
      }
      public String toString() {
        int len = text.length();
        if (len > 25) len = 25;
        return "[Document id: " + id + ": " +
               text.substring(0, len) + "...]";
      }
    }

We will write a class InMemorySearch that indexes instances of the TestDocument class and supplies an API for search. The first decision to make is how to store the index that maps search terms to the documents that contain them. One simple idea would be to use a map that maintains a set of document IDs for each search term, something like:

    Map<String, Set<Integer>> index;

This would be easy to implement but leaves much to be desired, so we will take a different approach. We would like to rank documents by relevance, but a relevance measure based only on whether a document contains all or most of the search terms is weak. We will improve the index by also storing a score of how many times a search term occurs in a document, scaled by the number of words in that document. Since our document model does not contain links to other documents, we will not use a Google-like page ranking algorithm that increases the relevance of search results based on the number of incoming links to matched documents.

We will again use a utility class, assuming same-package data visibility, to hold a document ID and a search term count. I used generics in the first version of this class to allow alternative types for counting word use in a document, and later changed the code to hardwire the ID and word count types to native integer values for runtime efficiency and lower memory use. Here is the second version of the code:

    class IdCount implements Comparable<IdCount> {
      int id = 0;
      int count = 0;
      public IdCount(int k, int v) {
        this.id = k;
        this.count = v;
      }
      public String toString() {
        return "[IdCount: " + id + " : " + count + "]";
      }
      @Override
      public int compareTo(IdCount o) {
        // don't use o.count - count: avoid overflows
        if (o.count == count) return 0;
        if (o.count > count) return 1;
        return -1;
      }
    }

We can now define the data structure for our index:

    Map<String, TreeSet<IdCount>> index =
        new Hashtable<String, TreeSet<IdCount>>();

The following code is used to add documents to the index. I scale word counts using the constant MAX_WORDS_PER_DOCUMENT (the largest document size I expect) divided by the number of words in the document; in principle it would be better to use a Float value, but I prefer working with and debugging code that uses integers because the debug output is more readable. The reason the number of times a word appears in a document needs to be scaled by the size of the document is fairly obvious: if a given word appears once in a document with 10 words and once in another document with 1000 words, then the word is much more relevant to finding the first document.

    public void add(TestDocument document) {
      Map<String, Integer> wcount =
          new Hashtable<String, Integer>();
      StringTokenizer st =
          new StringTokenizer(document.text.toLowerCase(),
                              " .,;:");
      int num_words = st.countTokens();
      if (num_words == 0) return;
      while (st.hasMoreTokens()) {
        String word = st.nextToken();
        System.out.println(word);
        if (wcount.containsKey(word)) {
          wcount.put(word, wcount.get(word) +
                     (MAX_WORDS_PER_DOCUMENT / num_words));
        } else {
          wcount.put(word,
                     MAX_WORDS_PER_DOCUMENT / num_words);
        }
      }
      for (String word : wcount.keySet()) {
        TreeSet<IdCount> ts;
        if (index.containsKey(word)) {
          ts = index.get(word);
        } else {
          ts = new TreeSet<IdCount>();
          index.put(word, ts);
        }
        ts.add(new IdCount(document.id,
                           wcount.get(word) *
                           MAX_WORDS_PER_DOCUMENT /
                           num_words));
      }
    }

If a word is in the index hash table then its hash value is a sorted TreeSet of IdCount objects. The sort order is by decreasing scaled word count.
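To make the scaling concrete, here is a small sketch of the score calculation. It is not part of the InMemorySearch listing: the helper class ScaledScoreDemo is made up for illustration, and since MAX_WORDS_PER_DOCUMENT is not defined in the code shown here, the value 1000 is an assumption (it is consistent with the example debug output shown below).

    // Hypothetical helper, not from the book's source, showing how the
    // scaled score computed in add() behaves with integer arithmetic.
    public class ScaledScoreDemo {
      // assumed value; the real constant is defined elsewhere in InMemorySearch
      static final int MAX_WORDS_PER_DOCUMENT = 1000;

      static int score(int occurrences, int numWordsInDocument) {
        // per-occurrence weight accumulated in wcount:
        int wcount = occurrences *
            (MAX_WORDS_PER_DOCUMENT / numWordsInDocument);
        // rescaled value stored in the IdCount object:
        return wcount * MAX_WORDS_PER_DOCUMENT / numWordsInDocument;
      }

      public static void main(String[] args) {
        System.out.println(score(1, 10));    // one hit in a 10 word document: 10000
        System.out.println(score(1, 1000));  // one hit in a 1000 word document: 1
      }
    }

A word that occurs once in a short document ends up with a much larger stored score than the same word occurring once in a long document, which is exactly the relevance behavior described above.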
Notice that I converted all tokenized words in the document text to lower case but I did not stem the words. For some applications you may want to use a word stemmer as we did in Section 9.1. I used the temporary hash table wcount to hold word counts for the document being indexed; once wcount was created and filled, I looked up the TreeSet for each word (creating it if it did not yet exist) and added a new IdCount object representing the currently indexed document and the scaled number of occurrences of the word that is the index hash table key.

For development it is useful to have a method that prints out the entire index; the following method serves this purpose:

    public void debug() {
      System.out.println("Debug: dump of search index:\n");
      for (String word : index.keySet()) {
        System.out.println("\n" + word);
        TreeSet<IdCount> ts = index.get(word);
        Iterator<IdCount> iter = ts.iterator();
        while (iter.hasNext()) {
          System.out.println("   " + iter.next());
        }
      }
    }

Here are a few lines of example code to create an index and add three test documents:

    InMemorySearch ims = new InMemorySearch();
    TestDocument doc1 =
      new TestDocument("This is a test for index and a test for search.");
    ims.add(doc1);
    TestDocument doc2 =
      new TestDocument("Please test the index code.");
    ims.add(doc2);
    TestDocument doc3 =
      new TestDocument("Please test the index code before tomorrow.");
    ims.add(doc3);
    ims.debug();

The method debug produces the following output (most of it is not shown here for brevity). Remember that each IdCount object holds a data pair: the document integer ID and a scaled integer word count for that document. Also notice that each TreeSet is sorted in descending order of scaled word count.

    Debug: dump of search index:

    code
       [IdCount: 1 : 40000]
       [IdCount: 2 : 20285]

    please
       [IdCount: 1 : 40000]
       [IdCount: 2 : 20285]

    index
       [IdCount: 1 : 40000]
       [IdCount: 2 : 20285]
       [IdCount: 0 : 8181]
    ...

Given the hash table index it is simple to take a list of search words and return a sorted list of matching documents. We use a temporary hash table ordered_results that maps document IDs to the current search result score for each document. We tokenize the string containing the search terms and, for each search word, look up its score count in the temporary map ordered_results if it already exists (creating a new entry otherwise) and increment the score count. The map ordered_results is ordered afterwards by sorting its keys by their hash table values:

    public List<Integer> search(String search_terms,
                                int max_terms) {
      List<Integer> ret =
          new ArrayList<Integer>(max_terms);
      // temporary map to hold ordered search results:
      final Map<Integer, Integer> ordered_results =
          new Hashtable<Integer, Integer>(0);
      StringTokenizer st =
          new StringTokenizer(search_terms.toLowerCase(),
                              " .,;:");
      while (st.hasMoreTokens()) {
        String word = st.nextToken();
        Iterator<IdCount> word_counts =
            index.get(word).iterator();
        while (word_counts.hasNext()) {
          IdCount ts = word_counts.next();
          Integer id = ts.id;
          if (ordered_results.containsKey(id)) {
            ordered_results.put(id,
                ordered_results.get(id) + ts.count);
          } else {
            ordered_results.put(id, ts.count);
          }
        }
      }
      List<Integer> keys =
          new ArrayList<Integer>(ordered_results.keySet());
      Collections.sort(keys, new Comparator<Integer>() {
        public int compare(Integer a, Integer b) {
          return -ordered_results.get(a).
                    compareTo(ordered_results.get(b));
        }
      });
      int count = 0;
      result_loop:
      for (Integer id : keys) {
        if (count++ >= max_terms) break result_loop;
        ret.add(id);
      }
      return ret;
    }

For the previous example using the three short test documents, we can search the index, in this case for a maximum of 2 results, using:

    List<Integer> search_results = ims.search("test index", 2);
    System.out.println("result doc IDs: " + search_results);

getting the results:

    result doc IDs: [1, 2]

If you want to use this "bare metal" indexing and search library, there are a few details that still need to be implemented. You will probably want to persist the TestDocument objects; this can be done simply by tagging the class with the Serializable interface and writing serialized files using the document ID as the file name (a small sketch of this idea appears below). You might also want to serialize the InMemorySearch class.

While I sometimes implement custom indexing and search libraries for projects that require a lightweight and flexible approach to indexing and search, as we did in this section, I usually use either the Lucene search library or a combination of the Hibernate Object Relational Mapping (ORM) library with Lucene (Hibernate Search). We will look at Lucene in Section 10.4.
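Before moving on to Lucene, here is a minimal sketch of the document persistence idea mentioned above. It is not part of the book's source code: the DocumentStore class name and the ".ser" file suffix are made up, and the sketch assumes that TestDocument has been tagged with java.io.Serializable and that this class lives in the same package so it can read the package-visible id field.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;

    public class DocumentStore {
      // write one document to <directory>/<id>.ser
      public static void save(TestDocument doc, String directory)
          throws IOException {
        ObjectOutputStream out = new ObjectOutputStream(
            new FileOutputStream(new File(directory, doc.id + ".ser")));
        try {
          out.writeObject(doc);
        } finally {
          out.close();
        }
      }

      // read a document back using its integer ID as the file name
      public static TestDocument load(int id, String directory)
          throws IOException, ClassNotFoundException {
        ObjectInputStream in = new ObjectInputStream(
            new FileInputStream(new File(directory, id + ".ser")));
        try {
          return (TestDocument) in.readObject();
        } finally {
          in.close();
        }
      }
    }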

10.4 Indexing and Search Using Embedded Lucene

Books have been written on the Lucene indexing and search library, and in this short section we will look at a brief application example that you can use as a quick reference for starting Lucene based projects. I consider Lucene to be an important tool for building intelligent text processing systems.

Lucene supports the concept of a document with one or more fields. Fields can either be indexed or not, and optionally stored in a disk-based index. Searchable fields can be automatically tokenized using either one of Lucene's built-in text tokenizers or a customized tokenizer that you supply.

When I am starting a new project using Lucene I begin with a template class LuceneManager that you can find in the file src-index-search/LuceneManager.java. I usually clone this file and make any quick changes needed for adding fields to documents, etc. We will look at a few important code snippets in the class LuceneManager; you can refer to the source code for more details. We start by looking at how indices are stored and managed on disk. The class constructor stores the file path to the Lucene disk index. You can optionally use the method createAndClearLuceneIndex to delete an existing Lucene index (if one exists) and create an empty index.

    public LuceneManager(String data_store_file_root) {
      this.data_store_file_root = data_store_file_root;
    }

    public void createAndClearLuceneIndex()
           throws CorruptIndexException,
                  LockObtainFailedException,
                  IOException {
      deleteFilePath(new File(data_store_file_root +
                              "/lucene_index"));
      File index_dir =
          new File(data_store_file_root + "/lucene_index");
      new IndexWriter(index_dir, new StandardAnalyzer(),
                      true).close();
    }

If you are using an existing disk-based index that you want to reuse, then do not call the method createAndClearLuceneIndex. The last argument to the IndexWriter constructor is a flag that creates a new index, overwriting any existing indices. I use the utility method deleteFilePath to make sure that all files from any previous indices using the same top level file path are deleted. The method addDocumentToIndex is used to add new documents to the index. Here we call the IndexWriter constructor with a value of false for the last argument to avoid overwriting the index each time the method addDocumentToIndex is called.

    public void addDocumentToIndex(
           String document_original_uri,
           String document_plain_text)
           throws CorruptIndexException, IOException {
      File index_dir =
          new File(data_store_file_root + "/lucene_index");
      writer = new IndexWriter(index_dir,
                               new StandardAnalyzer(), false);
      Document doc = new Document();
      // store the URI in the index; do not index it
      doc.add(new Field("uri",
                        document_original_uri,
                        Field.Store.YES,
                        Field.Index.NO));
      // store the plain text in the index; index (tokenize) it
      doc.add(new Field("text",
                        document_plain_text,
                        Field.Store.YES,
                        Field.Index.TOKENIZED));
      writer.addDocument(doc);
      writer.optimize();  // optional
      writer.close();
    }
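To show how these two methods fit together, here is a minimal usage sketch. It is not from the book's source code: the index path and the example documents are made up for illustration, and only the LuceneManager methods shown above are used.

    public class LuceneManagerDemo {
      public static void main(String[] args) throws Exception {
        // hypothetical index location and documents, for illustration only
        LuceneManager lm = new LuceneManager("/tmp/lucene_demo");
        lm.createAndClearLuceneIndex();  // start from an empty index
        lm.addDocumentToIndex("file:///tmp/doc1.txt",
            "Lucene makes it easy to index and search plain text.");
        lm.addDocumentToIndex("file:///tmp/doc2.txt",
            "A second test document for the embedded Lucene index.");
      }
    }

Because addDocumentToIndex opens and closes an IndexWriter on every call, this style is fine for adding a few documents at a time; for bulk indexing you would want to keep a single writer open across many addDocument calls.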