
When case-insensitive searches are required, one standard optimization is to use a second collection containing all the strings uppercased. This second collection is used for comparisons, thus avoiding the need to repeatedly uppercase each character in the search methods. For example, if you have a hash table containing String keys, you need to iterate over all the keys to match keys case-insensitively. But if you have a second hash table with all the keys uppercased, retrieving the key simply requires you to uppercase the element being searched for:

    //The slow version, iterating through all the keys ignoring case
    //until the key matches. (hash is a Hashtable)
    public Object slowlyGet(String key)
    {
      Enumeration e = hash.keys();
      String hkey;
      while(e.hasMoreElements())
      {
        if (key.equalsIgnoreCase(hkey = (String) e.nextElement()))
          return hash.get(hkey);
      }
      return null;
    }

    //The fast version assumes that a second hashtable was created
    //with all the keys uppercased. Access is straightforward.
    public Object quicklyGet(String key)
    {
      return uppercasedHash.get(key.toUpperCase());
    }

However, note that String.toUpperCase() and String.toLowerCase() each create a complete copy of the String object with a new char array. Unlike String.substring(), String.toUpperCase() has a processing time that is linearly dependent on the size of the string, and it also creates an extra object (a new char array). This means that repeatedly using String.toUpperCase() and String.toLowerCase() can impose a heavy overhead on an application. For each particular problem, you need to ensure that the extra temporary objects created and the extra processing overheads still provide a performance benefit rather than causing a new bottleneck in the application.
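The two lookup strategies can be put side by side in a small self-contained sketch. The class name CaseInsensitiveIndex and its field names are hypothetical, HashMap stands in for the Hashtable used in the text, and Locale.ROOT is an assumption made here so that uppercasing does not depend on the default locale. It assumes no two keys differ only by case.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not from the text): keeps a second, uppercased
// index so case-insensitive lookups need not scan every key.
final class CaseInsensitiveIndex {
    private final Map<String, Object> byOriginalKey = new HashMap<>();
    private final Map<String, Object> byUppercasedKey = new HashMap<>();

    void put(String key, Object value) {
        byOriginalKey.put(key, value);
        // The uppercase copy is created once, when the key is stored.
        byUppercasedKey.put(key.toUpperCase(Locale.ROOT), value);
    }

    // Slow path: compares the search key against every stored key.
    Object slowlyGet(String key) {
        for (Map.Entry<String, Object> e : byOriginalKey.entrySet()) {
            if (key.equalsIgnoreCase(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }

    // Fast path: one uppercase conversion plus one hash lookup.
    Object quicklyGet(String key) {
        return byUppercasedKey.get(key.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        CaseInsensitiveIndex index = new CaseInsensitiveIndex();
        index.put("Hello", "greeting");
        System.out.println(index.slowlyGet("HELLO"));   // prints greeting
        System.out.println(index.quicklyGet("hello"));  // prints greeting
    }
}
```

The fast path still allocates one temporary uppercased string per lookup, which is the overhead the paragraph above warns about; whether it pays off depends on how many keys the slow path would otherwise have to scan.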

5.6 Sorting Internationalized Strings

One big advantage you get with Strings is that they are built (almost) from the ground up to support internationalization. This means that the Unicode character set is the lingua franca in Java. Unfortunately, because Unicode uses two-byte characters, many string libraries based on one-byte characters that can be ported into Java do not work so well. Most string-search optimizations use tables to assist string searches, but the table size is related to the size of the character set. For example, traditional Boyer-Moore string search requires a lot of memory and a long initialization phase when used with Unicode.

The Boyer-Moore String-Search Algorithm

Boyer-Moore string search uses a table of characters to skip comparisons. Here's a simple example with none of the complexities. Assume you are matching "abcd" against a string. The "abcd" is aligned against the first four characters of the string. The fourth character of the string is checked first. If that fourth character is none of a, b, c, or d, the "abcd" can be skipped to be matched against the fifth to eighth characters, and the matching proceeds in the same way. If instead the fourth character of the string is b, the "abcd" can be skipped to align the b against the fourth character, and the matching proceeds as before. For optimum speed, this algorithm requires several arrays giving skip distances for each possible character in the character set.

For more detail, see the Knuth book listed in Chapter 15, or the paper "Fast Algorithms for Sorting and Searching Strings," by Jon Bentley and Robert Sedgewick, Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms, January 1997. There is also a web site that describes a large number of string-searching algorithms at http://www-igm.univ-mlv.fr/~lecroq/string/.
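The skip-table idea from the sidebar can be sketched as follows. This is not the full Boyer-Moore algorithm: it is the simpler Horspool variant, which keeps only one skip table, indexed by the text character aligned with the last pattern position. The class name HorspoolSearch is hypothetical. The table has one entry for every possible char value, which illustrates why the table size grows with the character set.

```java
import java.util.Arrays;

// A minimal sketch of the Horspool simplification of Boyer-Moore.
// The class and method names are illustrative, not from the text.
final class HorspoolSearch {

    // Returns the index of the first occurrence of pattern in text, or -1.
    static int indexOf(String text, String pattern) {
        int m = pattern.length();
        int n = text.length();
        if (m == 0) return 0;
        if (m > n) return -1;

        // One entry per possible char: 65,536 ints for Java's 16-bit chars.
        int[] skip = new int[Character.MAX_VALUE + 1];
        Arrays.fill(skip, m);               // chars not in the pattern skip a full length
        for (int i = 0; i < m - 1; i++) {
            skip[pattern.charAt(i)] = m - 1 - i;  // distance from the pattern's last position
        }

        int pos = 0;
        while (pos <= n - m) {
            // Compare the aligned window right to left.
            int j = m - 1;
            while (j >= 0 && text.charAt(pos + j) == pattern.charAt(j)) {
                j--;
            }
            if (j < 0) return pos;
            // Skip based on the text char under the pattern's last position.
            pos += skip[text.charAt(pos + m - 1)];
        }
        return -1;
    }

    public static void main(String[] args) {
        // The fourth text char is 'b', so "abcd" is moved two places,
        // aligning its 'b' with it, as in the sidebar's example.
        System.out.println(indexOf("xxxbabcd", "abcd"));  // prints 4
    }
}
```

With one-byte characters the same table would need only 256 entries, so both the memory and the initialization cost shrink accordingly.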
Furthermore, sorting international Strings requires the ability to handle many kinds of localization issues, such as the sorted location for accented characters, characters that can be treated as character pairs, and so on. In these cases, it is difficult (and usually impossible) to handle the general case yourself. It is almost always easier to use the String helper classes Java provides, for example, the java.text.Collator class. [10]

[10] The code that handles this type of work didn't really start to get integrated in Java until 1.1, and did not start to be optimized until JDK 1.2. An article by Laura Werner of IBM in the February 1999 issue of the Java Report, "Efficient Text Searching in Java," covers the optimizations added to the java.text.Collator class for JDK 1.2. There is also a useful StringSearch class available at the IBM alphaWorks site (http://www.alphaworks.ibm.com).

Using the java.text.CollationKey object to represent each string is a standard optimization for repeated comparisons of internationalized Strings. You can use this when sorting an array of Strings, for example. CollationKeys perform more than twice as fast as using java.text.Collator.compare(). It is probably easiest to see how to use collation keys with a particular example. So let's look at tuning an internationalized String sort. For this, I use a standard quicksort algorithm (the quicksort implementation can be found in Section 11.7). The only modification to the standard quicksort is that for each optimization, the quicksort needs to be adjusted to use the appropriate comparison method and the appropriate data type. For example, the generic quicksort that sorts an array of Comparable objects has the signature:

    public static void quicksort(Comparable[] arr, int lo, int hi)

and uses the Comparable.compareTo(Object) method when comparing two Comparable objects.
On the other hand, a generic quicksort that sorts objects based on a java.util.Comparator has the signature:

    public static void quicksort(Object[] arr, int lo, int hi, Comparator c)

and uses the java.util.Comparator.compare(Object, Object) method when comparing any two objects. (See java.util.Arrays.sort() for a specific example.) In each case the underlying algorithm is the same. Only the comparison method changes (and in general the data type too, though not in these examples, where the data type was Object).

The obvious first test, to get a performance baseline, is the straightforward internationalized sort:

    public void runsort() {
      quicksort(stringArray,0,stringArray.length-1, Collator.getInstance());
    }

    public static void quicksort(String[] arr, int lo, int hi,
                                 java.text.Collator c)
    {
      ...
      int mid = (lo + hi) / 2;
      String middle = arr[ mid ];  //String data type
      ...
      //uses Collator.compare(String, String)
      if( c.compare(arr[ lo ], middle) < 0 )
      ...
    }

I use a large dictionary of words for the array of strings, inserted in random order, and I use the same random order for each of the tests. The first test took longer than expected. Looking at the Collator class, I can see that it does a huge amount, and I cannot possibly bypass its internationalized support if I want to support internationalized strings. [11]

[11] The kind of investment made in building such global support is beyond most projects; it is almost always much cheaper to buy the support. In this case, Taligent put a huge number of man years into the globalization you get for free with the JDK.

However, as previously mentioned, the Collator class comes with the java.text.CollationKey class specifically to provide for this type of speedup. It is simple to convert the sort in order to use this. You still need the Collator to generate the CollationKeys, so add a conversion method.
The sort now looks like:

    public void runsort() {
      quicksort(stringArray,0,stringArray.length-1, Collator.getInstance());
    }

    public static void quicksort(String[] arr, int lo, int hi, Collator c)
    {
      //convert to an array of CollationKeys
      CollationKey keys[] = new CollationKey[arr.length];
      for (int i = arr.length-1; i >= 0; i--)
        keys[i] = c.getCollationKey(arr[i]);

      //Run the sort on the collation keys
      quicksort_collationKey(keys, 0, arr.length-1);

      //and unwrap so that we get our Strings in sorted order
      for (int i = arr.length-1; i >= 0; i--)
        arr[i] = keys[i].getSourceString();
    }

    public static void quicksort_collationKey(CollationKey[] arr, int lo, int hi)
    {
      ...
      int mid = (lo + hi) / 2;
      CollationKey middle = arr[ mid ];  //CollationKey data type
      ...
      //uses CollationKey.compareTo(CollationKey)
      if( arr[ lo ].compareTo(middle) < 0 )
      ...
    }

Normalizing the time for the first test to 100, this test is much faster and takes half the time (see Table 5-8). This is despite the extra cost imposed by a whole new populated array of CollationKey objects, one for each string. Can it do better?

Well, there is nothing further in the java.text package that suggests so. Instead look at the String class, and consider its implementation of the String.compareTo() method. This is a simple lexicographic ordering, basically treating the char array as a sequence of numbers and ordering sequence pairs as if there is no meaning to the objects being Strings. Obviously, this is useless for internationalized support, but it is much faster. A quick test shows that sorting the test String array using the String.compareTo() method takes just 3% of the time of the first test, which seems much more reasonable.

But is this test incompatible with the desired internationalized sort? Well, maybe not. Sort algorithms usually execute faster if they operate on a partially sorted array. Perhaps using the String.compareTo() sort first might bring the array considerably closer to the final ordering of the internationalized sort, and at a fairly low cost.
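How the two orderings discussed above differ, and how much they still agree, can be seen in a small sketch. The class name OrderingComparison and the sample words are made up for illustration, java.util.Arrays.sort() stands in for the quicksort used in the text, and Locale.US is chosen only to make the result reproducible.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Illustrative only: contrasts String.compareTo() ordering with Collator ordering.
final class OrderingComparison {

    // Sorts a copy using String.compareTo(), i.e. by char values.
    static String[] sortedByCharValues(String[] words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Sorts a copy using the locale's Collator.
    static String[] sortedByCollator(String[] words, Locale locale) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        String[] words = { "zebra", "Zulu", "apple", "Mango" };
        // Uppercase letters have smaller char values than lowercase ones.
        System.out.println(Arrays.toString(sortedByCharValues(words)));
        // prints [Mango, Zulu, apple, zebra]
        System.out.println(Arrays.toString(sortedByCollator(words, Locale.US)));
        // prints [apple, Mango, zebra, Zulu]
    }
}
```

For words that share the same case and use no accented characters, the two methods agree, which is why a cheap String.compareTo() pass can leave a typical word list close to its collated order.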
Testing this presort idea is straightforward:

    public void runsort() {
      quicksort(stringArray,0,stringArray.length-1, Collator.getInstance());
    }

    public static void quicksort(String[] arr, int lo, int hi, Collator c)
    {
      //simple sort using String.compareTo()
      simple_quicksort(arr, lo, hi);

      //Full international sort on a hopefully partially sorted array
      intl_quicksort(arr, lo, hi, c);
    }

    public static void simple_quicksort(String[] arr, int lo, int hi)
    {
      ...
      int mid = (lo + hi) / 2;
      String middle = arr[ mid ];  //uses String data type
      ...
      //uses String.compareTo(String)
      if( arr[ lo ].compareTo(middle) < 0 )
      ...
    }

    public static void intl_quicksort(String[] arr, int lo, int hi, Collator c)
    {
      //convert to an array of CollationKeys
      CollationKey keys[] = new CollationKey[arr.length];
      for (int i = arr.length-1; i >= 0; i--)
        keys[i] = c.getCollationKey(arr[i]);

      //Run the sort on the collation keys
      quicksort_collationKey(keys, 0, arr.length-1);

      //and unwrap so that we get our Strings in sorted order
      for (int i = arr.length-1; i >= 0; i--)
        arr[i] = keys[i].getSourceString();
    }

    public static void quicksort_collationKey(CollationKey[] arr, int lo, int hi)
    {
      ...
      int mid = (lo + hi) / 2;
      CollationKey middle = arr[ mid ];  //CollationKey data type
      ...
      //uses CollationKey.compareTo(CollationKey)
      if( arr[ lo ].compareTo(middle) < 0 )
      ...
    }

This double-sorting implementation reduces the international sort time to a quarter of the original test time (see Table 5-8). Partially sorting the list first using a much simpler (and quicker) comparison test has doubled the speed of the total sort as compared to using only the CollationKeys optimization.

Table 5-8, Timings Using Different Sorting Strategies

    Sort Using:          1.2     1.3     HotSpot 1.0     1.1.6
    Collator             100      55          42          1251
    CollationKeys         49      25          36           117
    Sorted twice          22      11          15            58
    String.compareTo       3       2           4             3

Of course, these optimizations have improved the situation only for the particular locale I have tested (my default locale is set for US English). However, running the test in a sampling of other locales (European and Asian locales), I find similar relative speedups. Without using locale-specific dictionaries, this locale variation test may not be fully valid. But the speedup will likely hold across all Latinized alphabets. You can also create a simple partial-ordering sort class specific to some locales, which provides a similar speedup. For example, by duplicating the effect of using String.compareTo(), you can provide the basis for a customized partial sorter:

    public class PartialSorter {
      String source;
      char[] stringArray;

      public PartialSorter(String s)
      {
        //retain the original string
        source = s;
        //and get the array of characters for our customized comparison
        stringArray = new char[s.length()];
        s.getChars(0, stringArray.length, stringArray, 0);
      }

      /* This compare method should be customized for different locales */
      public static int compare(char[] arr1, char[] arr2)
      {
        //basically the String.compareTo() algorithm
        int n = Math.min(arr1.length, arr2.length);
        for (int i = 0; i < n; i++)
        {
          if (arr1[i] != arr2[i])
            return arr1[i] - arr2[i];
        }
        return arr1.length - arr2.length;
      }

      public static void quicksort(String[] arr, int lo, int hi)
      {
        //convert to an array of PartialSorters
        PartialSorter keys[] = new PartialSorter[arr.length];
        for (int i = arr.length-1; i >= 0; i--)
          keys[i] = new PartialSorter(arr[i]);
        quicksort_mysorter(keys, 0, arr.length-1);
        //and unwrap so that we get our Strings in sorted order
        for (int i = arr.length-1; i >= 0; i--)
          arr[i] = keys[i].source;
      }

      public static void quicksort_mysorter(PartialSorter[] arr, int lo, int hi)
      {
        ...
        int mid = (lo + hi) / 2;
        PartialSorter middle = arr[ mid ];  //PartialSorter data type
        ...
        //Use the PartialSorter.compare() method to compare the char arrays
        if( compare(arr[ lo ].stringArray, middle.stringArray) < 0 )
        ...
      }
    }

This PartialSorter class works similarly to the CollationKey class, wrapping a string and providing its own comparison method. The particular comparison method shown here is just an implementation of the String.compareTo() method. It is pointless to use it exactly as defined here, because object-creation overhead means that using the PartialSorter is twice as slow as using String.compareTo() directly. But customizing the PartialSorter.compare() method for any particular locale is a reasonable task: remember, we are only interested in a simple algorithm that handles a partial sort, not the full intricacies of completely accurate locale-specific comparison.

Generally, you cannot expect to support internationalized strings and retain the performance of simple one-byte-per-character strings. But, as shown here, you can certainly improve the performance by some useful amounts.
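For readers who want to try the collation-key and sorted-twice strategies without the quicksort from Section 11.7, here is a minimal runnable sketch. It substitutes java.util.Arrays.sort() for the quicksort, so its timings will not match Table 5-8. The class and method names are hypothetical, and the arrays are assumed to contain no null elements.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// A sketch of two strategies from this section, using Arrays.sort()
// in place of the quicksort. Names are illustrative only.
final class CollationKeySort {

    // Sorts arr in place, building one CollationKey per string up front
    // so that each comparison works on precomputed keys.
    static void sort(String[] arr, Collator collator) {
        CollationKey[] keys = new CollationKey[arr.length];
        for (int i = 0; i < arr.length; i++) {
            keys[i] = collator.getCollationKey(arr[i]);
        }
        Arrays.sort(keys);  // uses CollationKey.compareTo(CollationKey)
        // Unwrap so that the strings end up in sorted order.
        for (int i = 0; i < arr.length; i++) {
            arr[i] = keys[i].getSourceString();
        }
    }

    // "Sorted twice": a cheap String.compareTo() pass first, then the
    // full internationalized sort on the partially ordered array.
    static void presortThenSort(String[] arr, Collator collator) {
        Arrays.sort(arr);
        sort(arr, collator);
    }

    public static void main(String[] args) {
        String[] words = { "zebra", "Zulu", "apple", "Mango" };
        presortThenSort(words, Collator.getInstance(Locale.US));
        System.out.println(Arrays.toString(words));  // prints [apple, Mango, zebra, Zulu]
    }
}
```

Whether the presort pays off depends on the sort algorithm and the data, so it is worth measuring both methods on your own word list before adopting either.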
