Finding the Index for Partially Matched Strings

    if (key == FAST_KEY1)
      return value1;
    else if (key.equals(FASTISH_KEY2))
      return value2;
    else if (key.equals(possibly_fast_key_assigned_at_runtime))
      return value3;
    else
      return hash.get(key);
  }

11.7 Finding the Index for Partially Matched Strings

The problem considered here concerns a large number of string keys that need to be accessed by full or partial match. Each string is unique, so the full-match access can easily be handled by a standard hash-table structure (e.g., java.util.HashMap). But the partial-match access needs to collect all objects that have string keys starting with a particular substring. So, for example, if you had the hash consisting of keys and values:

    hello   1
    bye     2
    hi      3

Then the full match for key hi retrieves 3, and the partial match against strings starting with h retrieves the collection {1,3}.

Using a hash-table structure for the partial-match access is expensive because it requires that all keys be iterated over, and then the object corresponding to each matching key needs to be collated. Of course, I am considering here a large collection of strings. Alternatives are not usually necessary for a few or even a few thousand strings. But for large collections, performance-tuning techniques become necessary.

The tuning procedure here should be to look for data structures that quickly match any partial string. The task is somewhat simpler than the most generic version of this type of problem, because you need to match only the first few consecutive characters. This means that some sort of tree structure is probably ideal. Of the structures available from the JDK, TreeMap looks like it can provide exactly the required functionality; it gives a minimal baseline and, if the performance is adequate, there is no more tuning to do. But TreeMap is 5 to 10 times slower than HashMap for access and update. The target is to obtain HashMap access speed for single-key access.

Don't get carried away searching for the perfect data structure. Thinking laterally, you can consider other possibilities.
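Before moving on to the alternatives, it may help to see what the TreeMap baseline mentioned above could look like. The following is only a sketch and is not code from the text: the class name TreeMapPrefixDemo and the method partialMatch are hypothetical, it uses generics for brevity, and it assumes that no key contains the character \uFFFF.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Baseline sketch: full and partial match using the JDK's TreeMap.
class TreeMapPrefixDemo {

    // Returns the values whose keys start with the given prefix.
    // Assumption: no key contains the character \uFFFF.
    static List<Integer> partialMatch(TreeMap<String, Integer> map, String prefix) {
        // subMap includes the lower bound and excludes the upper bound.
        SortedMap<String, Integer> band = map.subMap(prefix, prefix + '\uFFFF');
        return new ArrayList<>(band.values());
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> map = new TreeMap<>();
        map.put("hello", 1);
        map.put("bye", 2);
        map.put("hi", 3);
        System.out.println(map.get("hi"));          // full match: prints 3
        System.out.println(partialMatch(map, "h")); // partial match: prints [1, 3]
    }
}
```

This gives the required behavior with very little code, which is why it serves as a baseline; the cost is the slower single-key access noted above.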
If you have the strings in a sorted collection, you can apply a binary search to find the index of the string that is greater than or less than the partial string, and then obtain all the strings (and hence corresponding objects) in between. More specifically, from the hash table you can construct a sorted array of all the keys. Then, if you want to find all strings starting with h, you can run a binary search for the strings h and h\uFFFF. This gives the indexes that bound the band of all keys starting with h. Note that a binary search can return the index where the string would be even if it is not actually in the array. (The correct solution actually goes from h inclusive to i exclusive, but this solution will do for strings that don't include the character \uFFFF.)

Having parallel collections can lead to all sorts of problems in making sure both collections contain the same elements. Solutions that involve parallel collections should hide all accesses and updates to the parallel collections through a separate object to ensure that all accesses and updates are consistent. The solution here is suitable mainly when the collections are updated infrequently, e.g., they are built once or periodically, and read from quite often.
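The stricter range mentioned above (from h inclusive to i exclusive) can be sketched as follows. This is an illustrative variant only, not code from the text: the class PrefixBand and its methods lowerBound, successor, and match are hypothetical names. The upper bound is formed by incrementing the last character of the prefix, so keys containing \uFFFF are also handled.

```java
import java.util.Arrays;

// Sketch: prefix band over a sorted array using a half-open range.
class PrefixBand {

    // Index of the first element >= key in a sorted array (arr.length if none).
    static int lowerBound(String[] arr, String key) {
        int lo = 0, hi = arr.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (arr[mid].compareTo(key) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    // Smallest string that sorts after every string starting with prefix,
    // or null if there is no such bound (empty prefix, or all chars \uFFFF).
    static String successor(String prefix) {
        int end = prefix.length();
        while (end > 0 && prefix.charAt(end - 1) == '\uFFFF') end--;
        if (end == 0) return null;
        char bumped = (char) (prefix.charAt(end - 1) + 1);
        return prefix.substring(0, end - 1) + bumped;
    }

    // All keys in the sorted array that start with prefix.
    static String[] match(String[] sorted, String prefix) {
        int start = lowerBound(sorted, prefix);
        String upper = successor(prefix);
        int end = (upper == null) ? sorted.length : lowerBound(sorted, upper);
        return Arrays.copyOfRange(sorted, start, end);
    }

    public static void main(String[] args) {
        String[] keys = {"hello", "hell", "alpha", "bye", "hello2",
                         "solly", "sally", "silly", "zorro", "hi"};
        Arrays.sort(keys);
        System.out.println(Arrays.toString(match(keys, "hel"))); // [hell, hello, hello2]
    }
}
```

For the prefix h, successor returns i, so the band is exactly the keys from h inclusive to i exclusive.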
Here is a class implementing the sorted-array solution:

package tuning.struct;

import java.util.Hashtable;
import java.util.Enumeration;

public class PartialSearcher
{
  Hashtable hash;
  String[] sortedArray;

  public static void main(String args[])
  {
    //Populate a Hashtable with ten strings
    Hashtable h = new Hashtable();
    h.put("hello", new Integer(1));
    h.put("hell", new Integer(2));
    h.put("alpha", new Integer(3));
    h.put("bye", new Integer(4));
    h.put("hello2", new Integer(5));
    h.put("solly", new Integer(6));
    h.put("sally", new Integer(7));
    h.put("silly", new Integer(8));
    h.put("zorro", new Integer(9));
    h.put("hi", new Integer(10));

    //Create the searching object
    PartialSearcher p = new PartialSearcher(h);
    //Match against all string keys given by
    //the first command line argument
    Object[] objs = p.match(args[0]);
    //And print the matches out
    for (int i = 0; i < objs.length; i++)
      System.out.println(objs[i]);
  }

  public PartialSearcher(Hashtable h)
  {
    hash = h;
    createSortedArray();
  }

  public Object[] match(String s)
  {
    //find the start and end positions of strings that match the key
    int startIdx = binarySearch(sortedArray, s, 0, sortedArray.length-1);
    int endIdx = binarySearch(sortedArray, s + '\uFFFF', 0, sortedArray.length-1);

    //and return an array of the matched keys
    Object[] objs = new Object[endIdx-startIdx];
    for (int i = startIdx; i < endIdx; i++)
      objs[i-startIdx] = sortedArray[i];
    return objs;
  }

  public void createSortedArray()
  {
    //Create a sorted array of the keys of the hash table
    sortedArray = new String[hash.size()];
    Enumeration e = hash.keys();
    for (int i = 0; e.hasMoreElements(); i++)
      sortedArray[i] = (String) e.nextElement();
    quicksort(sortedArray, 0, sortedArray.length-1);
  }

  /**
   * Semi-standard binary search returning index of match location or
   * where the location would match if it is not present.
   */
  public static int binarySearch(String[] arr, String elem,
                                 int fromIndex, int toIndex)
  {
    int mid, cmp;
    while (fromIndex <= toIndex)
    {
      mid = (fromIndex + toIndex)/2;
      if ( (cmp = arr[mid].compareTo(elem)) < 0)
        fromIndex = mid + 1;
      else if (cmp > 0)
        toIndex = mid - 1;
      else
        return mid;
    }
    return fromIndex;
  }

  /**
   * Standard quicksort
   */
  public void quicksort(String[] arr, int lo, int hi)
  {
    if ( lo >= hi )
      return;

    int mid = ( lo + hi ) / 2;
    String tmp;
    String middle = arr[ mid ];

    if ( arr[ lo ].compareTo(middle) > 0 )
    {
      arr[ mid ] = arr[ lo ];
      arr[ lo ] = middle;
      middle = arr[ mid ];
    }

    if ( middle.compareTo(arr[ hi ]) > 0 )
    {
      arr[ mid ] = arr[ hi ];
      arr[ hi ] = middle;
      middle = arr[ mid ];

      if ( arr[ lo ].compareTo(middle) > 0 )
      {
        arr[ mid ] = arr[ lo ];
        arr[ lo ] = middle;
        middle = arr[ mid ];
      }
    }

    int left = lo + 1;
    int right = hi - 1;

    if ( left >= right )
      return;

    for ( ;; )
    {
      while ( arr[ right ].compareTo(middle) > 0 )
      {
        right--;
      }

      while ( left < right && arr[ left ].compareTo(middle) <= 0 )
      {
        left++;
      }

      if ( left < right )
      {
        tmp = arr[ left ];
        arr[ left ] = arr[ right ];
        arr[ right ] = tmp;
        right--;
      }
      else
      {
        break;
      }
    }

    quicksort(arr, lo, left);
    quicksort(arr, left + 1, hi);
  }
}

Note that this solution is more generic than for only string keys. Any type of object can be used as a key as long as you can create a methodology to compare the order of the keys. This is therefore a reasonable solution for several types of indexing.
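To illustrate the point about non-string keys, here is a minimal sketch of the same sorted-array idea applied to any key type that has an ordering. It is not code from the text: the class SortedRangeIndex and its methods are hypothetical names, the ordering is supplied as a java.util.Comparator, and the JDK's Arrays.sort stands in for the hand-written quicksort.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch: range lookup over a sorted array of arbitrary keys.
class SortedRangeIndex<K> {
    private final K[] sorted;
    private final Comparator<? super K> order;

    SortedRangeIndex(K[] keys, Comparator<? super K> order) {
        this.sorted = keys.clone();      // keep the caller's array untouched
        this.order = order;
        Arrays.sort(this.sorted, order);
    }

    // Index of the first key >= the given key (sorted.length if none).
    private int lowerBound(K key) {
        int lo = 0, hi = sorted.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (order.compare(sorted[mid], key) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    // All keys k with from <= k < to, in sorted order.
    List<K> range(K from, K to) {
        int start = lowerBound(from);
        int end = Math.max(start, lowerBound(to)); // guards against from > to
        return Arrays.asList(Arrays.copyOfRange(sorted, start, end));
    }

    public static void main(String[] args) {
        Integer[] keys = {42, 7, 19, 3, 25};
        Comparator<Integer> natural = Comparator.naturalOrder();
        SortedRangeIndex<Integer> index = new SortedRangeIndex<>(keys, natural);
        System.out.println(index.range(5, 26)); // prints [7, 19, 25]
    }
}
```

The only requirement on the key type is the comparator, which corresponds to the "methodology to compare the order of the keys" described above.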

11.8 Search Trees