Down to the Bare Metal: In-Memory Index and Search
    id = count++;
  }
  public String toString() {
    int len = text.length();
    if (len > 25) len = 25;
    return "[Document id: " + id + ": " + text.substring(0, len) + "...]";
  }
}

We will write a class InMemorySearch that indexes instances of the TestDocument
class and supplies an API for search. The first decision to make is how to store
the index that maps search terms to documents that contain the search terms. One
simple idea would be to use a map to maintain a set of document IDs for each
search term; something like:

  Map<String, Set<Integer>> index;

This would be easy to implement but leaves much to be desired, so we will take a
different approach. We would like to rank documents by relevance, but a relevance
measure based only on containing all or most of the search terms is weak. We will
improve the index by also storing a score of how many times a search term occurs
in a document, scaled by the number of words in the document. Since our document
model does not contain links to other documents, we will not use a Google-like
page-ranking algorithm that increases the relevance of search results based on
the number of incoming links to matched documents. We will again use a utility
class (assuming same-package data visibility) to hold a document ID and a
search-term count. I used generics for the first version of this class to allow
alternative types for counting word use in a document, and later changed the code
to hardwire the ID and word-count types to native integer values for runtime
efficiency and to use less memory. Here is the second version of the code:
class IdCount implements Comparable<IdCount> {
  int id = 0;
  int count = 0;
  public IdCount(int k, int v) {
    this.id = k;
    this.count = v;
  }
  public String toString() {
    return "[IdCount: " + id + " : " + count + "]";
  }
  @Override
  public int compareTo(IdCount o) {
    // don't use (o.count - count): avoid overflows
    if (o.count == count) return 0;
    if (o.count > count) return 1;
    return -1;
  }
}
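To see the effect of this comparator, here is a small standalone demo (my own code, not from the book; the class name IdCountDemo is mine) showing that a TreeSet of IdCount objects iterates from largest to smallest count:

```java
import java.util.TreeSet;

// Hypothetical demo: a TreeSet ordered by the IdCount comparator
// iterates largest count first, which is the order search results need.
public class IdCountDemo {
    static class IdCount implements Comparable<IdCount> {
        int id, count;
        IdCount(int id, int count) { this.id = id; this.count = count; }
        @Override public int compareTo(IdCount o) {
            // same logic as above: larger counts sort first
            if (o.count == count) return 0;
            return (o.count > count) ? 1 : -1;
        }
        @Override public String toString() {
            return "[IdCount: " + id + " : " + count + "]";
        }
    }

    public static void main(String[] args) {
        TreeSet<IdCount> ts = new TreeSet<>();
        ts.add(new IdCount(0, 8181));
        ts.add(new IdCount(1, 40000));
        ts.add(new IdCount(2, 20285));
        for (IdCount ic : ts) {
            System.out.println(ic); // prints id 1, then 2, then 0
        }
    }
}
```

One caveat if you adapt this design: because compareTo returns 0 whenever two counts are equal, a TreeSet will treat two documents with identical scaled counts as duplicates and keep only one of them.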
We can now define the data structure for our index:

  Map<String, TreeSet<IdCount>> index =
      new Hashtable<String, TreeSet<IdCount>>();

The following code is used to add documents to the index. I score word counts by
dividing the maximum word count I expect for documents by the size of the
document; in principle it would be better to use a Float value, but I prefer
working with and debugging code using integers – debug output is more readable.
The reason why the number of times a word appears in a document needs to be
scaled by the size of the document is fairly obvious: if a given word appears
once in a document with 10 words and once in another document with 1000 words,
then the word is much more relevant to finding the first document.
public void add(TestDocument document) {
  Map<String, Integer> wcount =
      new Hashtable<String, Integer>();
  StringTokenizer st =
      new StringTokenizer(document.text.toLowerCase(), " .,;:");
  int num_words = st.countTokens();
  if (num_words == 0)
    return;
  while (st.hasMoreTokens()) {
    String word = st.nextToken();
    System.out.println(word);
    if (wcount.containsKey(word)) {
      wcount.put(word, wcount.get(word) +
                       MAX_WORDS_PER_DOCUMENT / num_words);
    } else {
      wcount.put(word, MAX_WORDS_PER_DOCUMENT / num_words);
    }
  }
  for (String word : wcount.keySet()) {
    TreeSet<IdCount> ts;
    if (index.containsKey(word)) {
      ts = index.get(word);
    } else {
      ts = new TreeSet<IdCount>();
      index.put(word, ts);
    }
    ts.add(new IdCount(document.id,
                       wcount.get(word) * MAX_WORDS_PER_DOCUMENT / num_words));
  }
}
If a word is in the index hash table then the hash value will be a sorted TreeSet
of IdCount objects. Sort order is in decreasing size of the scaled word count.
Notice that I converted all tokenized words in document text to lower case but I
did not stem the words. For some applications you may want to use a word stemmer
as we did in Section 9.1. I used the temporary hash table wcount to hold word
counts for the document being indexed; once wcount was created and filled, I
looked up the TreeSet for each word (creating it if it did not yet exist) and
added new IdCount objects to represent the currently indexed document and the
scaled number of occurrences of the word that is the index hash table key.
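To make the integer scaling concrete, here is a small sketch of the arithmetic (my own demo, not the book's code). The debug output shown later in this section is consistent with MAX_WORDS_PER_DOCUMENT being 1000, so that value is assumed here:

```java
// Hypothetical sketch of the scoring arithmetic (not the book's code).
// Assumes MAX_WORDS_PER_DOCUMENT = 1000, which is consistent with the
// debug output shown later in this section.
public class ScoreDemo {
    static final int MAX_WORDS_PER_DOCUMENT = 1000;

    // Scaled score for a word occurring 'occurrences' times in a
    // document of 'num_words' words, using integer division as in
    // the add() method above.
    static int score(int occurrences, int num_words) {
        int perOccurrence =
            occurrences * (MAX_WORDS_PER_DOCUMENT / num_words);
        return perOccurrence * MAX_WORDS_PER_DOCUMENT / num_words;
    }

    public static void main(String[] args) {
        // one occurrence in a short document outscores one occurrence
        // in a longer document:
        System.out.println(score(1, 5));   // 40000
        System.out.println(score(1, 7));   // 20285
        System.out.println(score(1, 11));  // 8181
    }
}
```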
For development it is good to have a method that prints out the entire index; the following method serves this purpose:
public void debug() {
  System.out.println("Debug: dump of search index:\n");
  for (String word : index.keySet()) {
    System.out.println("\n" + word);
    TreeSet<IdCount> ts = index.get(word);
    Iterator<IdCount> iter = ts.iterator();
    while (iter.hasNext()) {
      System.out.println("   " + iter.next());
    }
  }
}

Here are a few lines of example code to create an index and add three test
documents:

InMemorySearch ims = new InMemorySearch();
TestDocument doc1 =
    new TestDocument("This is a test for index and a test for search.");
ims.add(doc1);
TestDocument doc2 =
    new TestDocument("Please test the index code.");
ims.add(doc2);
TestDocument doc3 =
    new TestDocument("Please test the index code before tomorrow.");
ims.add(doc3);
ims.debug();
The method debug produces the following output (most is not shown for brevity).
Remember that each IdCount object contains a data pair: the document integer ID
and a scaled integer word count in the document. Also notice that the TreeSet is
sorted in descending order of scaled word count.
Debug: dump of search index:

code
   [IdCount: 1 : 40000]
   [IdCount: 2 : 20285]

please
   [IdCount: 1 : 40000]
   [IdCount: 2 : 20285]

index
   [IdCount: 1 : 40000]
   [IdCount: 2 : 20285]
   [IdCount: 0 : 8181]
...

Given the hash table index it is simple to take a list of search words and
return a sorted list of matching documents. We will use a temporary hash table
ordered_results that maps document IDs to the current search result score for
that document. We tokenize the string containing search terms, and for each
search word we look up its score count in the temporary map ordered_results if
it exists, creating a new entry otherwise, and increment the score count. Note
that the map ordered_results is ordered later by sorting the keys by the hash
table value:
public List<Integer> search(String search_terms, int max_terms) {
  List<Integer> ret = new ArrayList<Integer>(max_terms);
  // temporary hash table to keep ordered search results:
  final Map<Integer, Integer> ordered_results =
      new Hashtable<Integer, Integer>(0);
  StringTokenizer st =
      new StringTokenizer(search_terms.toLowerCase(), " .,;:");
  while (st.hasMoreTokens()) {
    String word = st.nextToken();
    Iterator<IdCount> word_counts = index.get(word).iterator();
    while (word_counts.hasNext()) {
      IdCount ts = word_counts.next();
      Integer id = ts.id;
      if (ordered_results.containsKey(id)) {
        ordered_results.put(id, ordered_results.get(id) + ts.count);
      } else {
        ordered_results.put(id, ts.count);
      }
    }
  }
  List<Integer> keys =
      new ArrayList<Integer>(ordered_results.keySet());
  Collections.sort(keys, new Comparator<Integer>() {
    public int compare(Integer a, Integer b) {
      return -ordered_results.get(a).
                 compareTo(ordered_results.get(b));
    }
  });
  int count = 0;
  result_loop:
  for (Integer id : keys) {
    if (count++ >= max_terms) break result_loop;
    ret.add(id);
  }
  return ret;
}

For the previous example using the three short test documents, we can search the
index, in this case for a maximum of 2 results, using:
List<Integer> search_results = ims.search("test index", 2);
System.out.println("result doc IDs: " + search_results);

getting the results:

result doc IDs: [1, 2]
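One caveat about the search method above: index.get(word) returns null for a search term that was never indexed, so searching for an unknown word would throw a NullPointerException. A defensive variant would skip missing terms before iterating. Here is the idea in isolation (my own illustration, not the book's code):

```java
import java.util.Hashtable;
import java.util.Map;

// Hypothetical illustration (not the book's code): Map.get() returns
// null for a missing key, so unknown search terms must be skipped
// before iterating over their index entries.
public class NullGuardDemo {
    public static void main(String[] args) {
        Map<String, Integer> index = new Hashtable<>();
        index.put("test", 40000);
        for (String word : new String[] {"test", "unknown"}) {
            Integer score = index.get(word);
            if (score == null) continue; // term never indexed: skip it
            System.out.println(word + " -> " + score);
        }
    }
}
```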
If you want to use this "bare metal" indexing and search library, there are a few
details that still need to be implemented. You will probably want to persist the
TestDocument objects, and this can be done simply by tagging the class with the
Serializable interface and writing serialized files using the document ID as the
file name. You might also want to serialize the InMemorySearch class.
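As a hedged sketch of that persistence idea (the DocumentStore class and its method names are mine, not the book's; it assumes the document class implements java.io.Serializable):

```java
import java.io.*;

// Hypothetical helper (not from the book): write and read a serialized
// document using its integer ID as the file name.
public class DocumentStore {
    public static void save(Serializable document, int id, File dir)
            throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new FileOutputStream(new File(dir, id + ".ser")))) {
            out.writeObject(document);
        }
    }

    public static Object load(int id, File dir)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new FileInputStream(new File(dir, id + ".ser")))) {
            return in.readObject();
        }
    }
}
```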
While I sometimes implement custom indexing and search libraries for projects
that require a lightweight and flexible approach to indexing and search, as we
did in this section, I usually use either the Lucene search library or a
combination of the Hibernate Object Relational Mapping (ORM) library with Lucene
(Hibernate Search). We will look at Lucene in Section 10.4.