When case-insensitive searches are required, one standard optimization is to use a second collection containing all the strings uppercased. This second collection is used for comparisons, thus avoiding the need to repeatedly uppercase each character in the search methods. For example, if you have a hash table containing String keys, you need to iterate over all the keys to match keys case-insensitively. But if you have a second hash table with all the keys uppercased, retrieving the key simply requires you to uppercase the element being searched for:

    //The slow version, iterating through all the keys ignoring case
    //until the key matches. (hash is a Hashtable)
    public Object slowlyGet(String key)
    {
      Enumeration e = hash.keys();
      String hkey;
      while (e.hasMoreElements())
      {
        //Enumeration provides nextElement() to step through the keys
        if (key.equalsIgnoreCase(hkey = (String) e.nextElement()))
          return hash.get(hkey);
      }
      return null;
    }

    //The fast version assumes that a second hashtable was created
    //with all the keys uppercased. Access is straightforward.
    public Object quicklyGet(String key)
    {
      return uppercasedHash.get(key.toUpperCase());
    }

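
The fast version depends on the second table being populated whenever the first one is. As a minimal sketch of one way to keep the two in step, assuming a modern Java runtime with generics: the class name CaseInsensitiveTable and the choice of HashMap and Locale.ROOT are illustrative assumptions, not part of the original example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class for illustration only.
class CaseInsensitiveTable {
    // Original keys, for ordinary case-sensitive access.
    private final Map<String, Object> hash = new HashMap<>();
    // Second table, keyed by the uppercased form of each key.
    private final Map<String, Object> uppercasedHash = new HashMap<>();

    public void put(String key, Object value) {
        hash.put(key, value);
        // Uppercase once at insertion time rather than on every search.
        // Locale.ROOT is an assumption that avoids locale-specific case rules.
        uppercasedHash.put(key.toUpperCase(Locale.ROOT), value);
    }

    public Object get(String key) {
        return hash.get(key);
    }

    public Object quicklyGet(String key) {
        // One uppercase conversion of the search key, then a normal lookup.
        return uppercasedHash.get(key.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        CaseInsensitiveTable table = new CaseInsensitiveTable();
        table.put("Content-Type", "text/plain");
        System.out.println(table.quicklyGet("content-type"));
    }
}
```

Note that in this sketch two keys differing only in case share one entry in the uppercased table, so the later insertion wins for case-insensitive lookups.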
However, note that String.toUpperCase and String.toLowerCase each create a complete copy of the String object with a new char array. Unlike String.substring, String.toUpperCase has a processing time that is linearly dependent on the size of the string and also creates an extra object (a new char array). This means that repeatedly using String.toUpperCase and String.toLowerCase can impose a heavy overhead on an application. For each particular problem, you need to ensure that the extra temporary objects created and the extra processing overheads still provide a performance benefit rather than causing a new bottleneck in the application.

5.6 Sorting Internationalized Strings
One big advantage you get with Strings is that they are built almost from the ground up to support internationalization. This means that the Unicode character set is the lingua franca in Java. Unfortunately, because Unicode uses two-byte characters, many string libraries based on one-byte characters that can be ported into Java do not work so well. Most string-search optimizations use tables to assist string searches, but the table size is related to the size of the character set. For example, traditional Boyer-Moore string search takes much memory and a long initialization phase to use with Unicode.
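
To make the table-size problem concrete, here is a minimal sketch of a skip-table search. It implements the simplified Horspool variant (one skip table, bad-character shifts only), not the full Boyer-Moore algorithm, and it stores skip distances in a HashMap holding only the characters of the pattern, so that no 65,536-entry array has to be allocated and initialized. The class and method names are hypothetical, and the map costs boxing and hashing on each lookup, so whether it beats a plain array depends on the workload.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class for illustration only.
class HorspoolSearch {
    // Returns the index of the first occurrence of pattern in text, or -1.
    static int indexOf(String text, String pattern) {
        int m = pattern.length();
        int n = text.length();
        if (m == 0) return 0;

        // Skip table: for each pattern character except the last, the
        // distance from its rightmost occurrence to the end of the pattern.
        Map<Character, Integer> skip = new HashMap<>();
        for (int i = 0; i < m - 1; i++) {
            skip.put(pattern.charAt(i), m - 1 - i);
        }

        int pos = 0;
        while (pos <= n - m) {
            // Compare right to left within the current alignment.
            int j = m - 1;
            while (j >= 0 && text.charAt(pos + j) == pattern.charAt(j)) {
                j--;
            }
            if (j < 0) return pos;

            // Shift according to the text character under the pattern's
            // last position; characters not in the table skip the whole pattern.
            Integer shift = skip.get(text.charAt(pos + m - 1));
            pos += (shift == null) ? m : shift;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOf("xxabcdxx", "abcd"));
    }
}
```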

The Boyer-Moore String-Search Algorithm

Boyer-Moore string search uses a table of characters to skip comparisons. Here's a simple example with none of the complexities. Assume you are matching "abcd" against a string. The "abcd" is aligned against the first four characters of the string. The fourth character of the string is checked first. If that fourth character is none of a, b, c, or d, the "abcd" can be skipped to be matched against the fifth to eighth characters, and the matching proceeds in the same way. If instead the fourth character of the string is b, the "abcd" can be skipped to align the b against the fourth character, and the matching proceeds as before. For optimum speed, this algorithm requires several arrays giving skip distances for each possible character in the character set. For more detail, see the Knuth book listed in Chapter 15, or the paper "Fast Algorithms for Sorting and Searching Strings," by Jon Bentley and Robert Sedgewick, Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms, January 1997. There is also a web site that describes a large number of string-searching algorithms at http://www-igm.univ-mlv.fr/~lecroq/string/.

Furthermore, sorting international Strings requires the ability to handle many kinds of localization issues, such as the sorted location for accented characters, characters that can be treated as character pairs, and so on. In these cases, it is difficult and usually impossible to handle the general case yourself. It is almost always easier to use the String helper classes Java provides, for example, the java.text.Collator class.[10]

[10] The code that handles this type of work didn't really start to get integrated in Java until 1.1, and did not start to be optimized until JDK 1.2. An article by Laura Werner of IBM in the February 1999 issue of the Java Report, "Efficient Text Searching in Java," covers the optimizations added to the java.text.Collator class for JDK 1.2. There is also a useful StringSearch class available at the IBM alphaWorks site (http://www.alphaworks.ibm.com).
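
As a small illustration of why the helper classes matter, the following sketch sorts the same words once with plain String.compareTo ordering and once with a Collator. The class name and word list are made up for the example. Because String.compareTo orders by raw char values, a word starting with an accented e lands after "zebra", whereas the Collator places it among the other e words.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical class for illustration only.
class CollatorDemo {
    // Sorts a copy using String.compareTo, i.e., plain char-value order.
    static String[] sortNaive(String[] words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Sorts a copy using the locale-sensitive rules of a Collator.
    static String[] sortCollated(String[] words, Locale locale) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        // \u00e9 is the letter e with an acute accent.
        String[] words = { "zebra", "\u00e9cole", "apple" };
        System.out.println(Arrays.toString(sortNaive(words)));
        System.out.println(Arrays.toString(sortCollated(words, Locale.US)));
    }
}
```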

Using the java.text.CollationKey object to represent each string is a standard optimization for repeated comparisons of internationalized Strings. You can use this when sorting an array of Strings, for example. CollationKeys perform more than twice as fast as using java.text.Collator.compare. It is probably easiest to see how to use collation keys with a particular example, so let's look at tuning an internationalized String sort. For this, I use a standard quicksort algorithm (the quicksort implementation can be found in Section 11.7). The only modification to the standard quicksort is that for each optimization, the quicksort needs to be adjusted to use the appropriate comparison method and the appropriate data type. For example, the generic quicksort that sorts an array of Comparable objects has the signature:

    public static void quicksort(Comparable[] arr, int lo, int hi)

and uses the Comparable.compareTo(Object) method when comparing two Comparable objects. On the other hand, a generic quicksort that sorts objects based on a java.util.Comparator has the signature:

    public static void quicksort(Object[] arr, int lo, int hi, Comparator c)

and uses the java.util.Comparator.compare(Object, Object) method when comparing any two objects. (See java.util.Arrays.sort for a specific example.) In each case the underlying algorithm is the same. Only the comparison method changes (and in general the data type too, though not in these examples where the data type was Object).

The obvious first test, to get a performance baseline, is the straightforward internationalized sort:

    public void runsort() {
      quicksort(stringArray, 0, stringArray.length-1, Collator.getInstance());
    }

    public static void quicksort(String[] arr, int lo, int hi, java.text.Collator c)
    {
      ...
      int mid = (lo + hi) / 2;
      String middle = arr[ mid ]; //String data type
      ...
      //uses Collator.compare(String, String)
      if (c.compare(arr[ lo ], middle) < 0)
      ...
    }

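
The listings in this section elide the body of the quicksort with "...". For readers who want something runnable, here is a minimal sketch of one conventional way to fill it in, using a middle-element pivot and a Comparator. It is an assumption for illustration, not the Section 11.7 implementation, and the class name is hypothetical. A Collator can be passed directly as the comparator because Collator implements java.util.Comparator.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical class for illustration only.
class QuicksortSketch {
    // Sorts arr[lo..hi] (inclusive bounds) in place using comparator c.
    static <T> void quicksort(T[] arr, int lo, int hi, Comparator<? super T> c) {
        if (lo >= hi) return;
        T middle = arr[(lo + hi) >>> 1]; // pivot value from the middle element
        int i = lo;
        int j = hi;
        while (i <= j) {
            // Advance past elements already on the correct side of the pivot.
            while (c.compare(arr[i], middle) < 0) i++;
            while (c.compare(arr[j], middle) > 0) j--;
            if (i <= j) {
                T tmp = arr[i];
                arr[i] = arr[j];
                arr[j] = tmp;
                i++;
                j--;
            }
        }
        // Recurse into the two partitions.
        quicksort(arr, lo, j, c);
        quicksort(arr, i, hi, c);
    }

    public static void main(String[] args) {
        String[] words = { "pear", "apple", "fig", "banana" };
        quicksort(words, 0, words.length - 1, Collator.getInstance());
        System.out.println(Arrays.toString(words));
    }
}
```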
I use a large dictionary of words for the array of strings, inserted in random order, and I use the same random order for each of the tests. The first test took longer than expected. Looking at the Collator class, I can see that it does a huge amount, and I cannot possibly bypass its internationalized support if I want to support internationalized strings.[11]

[11] The kind of investment made in building such global support is beyond most projects; it is almost always much cheaper to buy the support. In this case, Taligent put a huge number of man years into the globalization you get for free with the JDK.

However, as previously mentioned, the Collator class comes with the java.text.CollationKey class specifically to provide for this type of speedup. It is simple to convert the sort in order to use this. You still need the Collator to generate the CollationKeys, so add a conversion method. The sort now looks like:

    public void runsort() {
      quicksort(stringArray, 0, stringArray.length-1, Collator.getInstance());
    }

    public static void quicksort(String[] arr, int lo, int hi, Collator c)
    {
      //convert to an array of CollationKeys
      CollationKey keys[] = new CollationKey[arr.length];
      for (int i = arr.length-1; i >= 0; i--)
        keys[i] = c.getCollationKey(arr[i]);
      //Run the sort on the collation keys
      quicksort_collationKey(keys, 0, arr.length-1);
      //and unwrap so that we get our Strings in sorted order
      for (int i = arr.length-1; i >= 0; i--)
        arr[i] = keys[i].getSourceString();
    }

    public static void quicksort_collationKey(CollationKey[] arr, int lo, int hi)
    {
      ...
      int mid = (lo + hi) / 2;
      CollationKey middle = arr[ mid ]; //CollationKey data type
      ...
      //uses CollationKey.compareTo(CollationKey)
      if (arr[ lo ].compareTo(middle) < 0)
      ...
    }

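
The same convert, sort, and unwrap pattern can be tried without a hand-written quicksort. The following minimal sketch substitutes java.util.Arrays.sort for the quicksort, which works because CollationKey implements Comparable. The class name is hypothetical, and the substitution is a convenience for experimentation, not the approach timed in this section.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical class for illustration only.
class CollationKeySort {
    // Sorts arr in place in the order defined by the given Collator.
    static void sort(String[] arr, Collator c) {
        // Convert each String to its CollationKey once, up front.
        CollationKey[] keys = new CollationKey[arr.length];
        for (int i = 0; i < arr.length; i++) {
            keys[i] = c.getCollationKey(arr[i]);
        }
        // Each comparison is now a key-to-key comparison
        // instead of a full Collator.compare call.
        Arrays.sort(keys);
        // Unwrap the keys to recover the Strings in sorted order.
        for (int i = 0; i < arr.length; i++) {
            arr[i] = keys[i].getSourceString();
        }
    }

    public static void main(String[] args) {
        String[] words = { "zebra", "\u00e9cole", "apple", "echo" };
        sort(words, Collator.getInstance(Locale.US));
        System.out.println(Arrays.toString(words));
    }
}
```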
Normalizing the time for the first test to 100, this test is much faster and takes half the time (see Table 5-8). This is despite the extra cost imposed by a whole new populated array of CollationKey objects, one for each string. Can it do better? Well, there is nothing further in the java.text package that suggests so. Instead, look at the String class, and consider its implementation of the String.compareTo method. This is a simple lexicographic ordering, basically treating the char array as a sequence of numbers and ordering sequence pairs as if there is no meaning to the objects being Strings. Obviously, this is useless for internationalized support, but it is much faster. A quick test shows that sorting the test String array using the String.compareTo method takes just 3% of the time of the first test, which seems much more reasonable.

But is this test incompatible with the desired internationalized sort? Well, maybe not. Sort algorithms usually execute faster if they operate on a partially sorted array. Perhaps using the String.compareTo sort first might bring the array considerably closer to the final ordering of the internationalized sort, and at a fairly low cost. Testing this is straightforward:

    public void runsort() {
      quicksort(stringArray, 0, stringArray.length-1, Collator.getInstance());
    }

    public static void quicksort(String[] arr, int lo, int hi, Collator c)
    {
      //simple sort using String.compareTo()
      simple_quicksort(arr, lo, hi);
      //Full international sort on a hopefully partially sorted array
      intl_quicksort(arr, lo, hi, c);
    }

    public static void simple_quicksort(String[] arr, int lo, int hi)
    {
      ...
      int mid = (lo + hi) / 2;
      String middle = arr[ mid ]; //uses String data type
      ...
      //uses String.compareTo(String)
      if (arr[ lo ].compareTo(middle) < 0)
      ...
    }

    public static void intl_quicksort(String[] arr, int lo, int hi, Collator c)
    {
      //convert to an array of CollationKeys
      CollationKey keys[] = new CollationKey[arr.length];
      for (int i = arr.length-1; i >= 0; i--)
        keys[i] = c.getCollationKey(arr[i]);
      //Run the sort on the collation keys
      quicksort_collationKey(keys, 0, arr.length-1);
      //and unwrap so that we get our Strings in sorted order
      for (int i = arr.length-1; i >= 0; i--)
        arr[i] = keys[i].getSourceString();
    }

    public static void quicksort_collationKey(CollationKey[] arr, int lo, int hi)
    {
      ...
      int mid = (lo + hi) / 2;
      CollationKey middle = arr[ mid ]; //CollationKey data type
      ...
      //uses CollationKey.compareTo(CollationKey)
      if (arr[ lo ].compareTo(middle) < 0)
      ...
    }

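
Whether the presort pays off depends on the JVM, the sort algorithm, and the data, so it is worth measuring on your own setup. The following is a minimal sketch of such a check. It assumes java.util.Arrays.sort in place of the quicksort, randomly generated lowercase words in place of a dictionary, and a single timed run per strategy, which is only a rough indication and not a rigorous benchmark. All names are hypothetical.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Arrays;
import java.util.Random;

// Hypothetical class for illustration only.
class TwoPassSort {
    // International sort using collation keys only.
    static void keySort(String[] arr, Collator c) {
        CollationKey[] keys = new CollationKey[arr.length];
        for (int i = 0; i < arr.length; i++) keys[i] = c.getCollationKey(arr[i]);
        Arrays.sort(keys);
        for (int i = 0; i < arr.length; i++) arr[i] = keys[i].getSourceString();
    }

    // Cheap String.compareTo pass first, then the full international sort
    // on the (hopefully) partially sorted array.
    static void twoPassSort(String[] arr, Collator c) {
        Arrays.sort(arr);
        keySort(arr, c);
    }

    // Generates pseudo-random lowercase words; a stand-in for a dictionary.
    static String[] randomWords(int count, long seed) {
        Random r = new Random(seed);
        String[] words = new String[count];
        for (int i = 0; i < count; i++) {
            char[] cs = new char[3 + r.nextInt(8)];
            for (int j = 0; j < cs.length; j++) cs[j] = (char) ('a' + r.nextInt(26));
            words[i] = new String(cs);
        }
        return words;
    }

    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        Collator c = Collator.getInstance();
        String[] a = randomWords(200_000, 42L);
        String[] b = a.clone(); // same random order for both tests
        System.out.println("keys only: " + timeMillis(() -> keySort(a, c)) + " ms");
        System.out.println("two pass:  " + timeMillis(() -> twoPassSort(b, c)) + " ms");
    }
}
```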
This double-sorting implementation reduces the international sort time to a quarter of the original test time (see Table 5-8). Partially sorting the list first using a much simpler and quicker comparison test has doubled the speed of the total sort as compared to using only the CollationKeys optimization.

Table 5-8. Timings Using Different Sorting Strategies

Sort Using:         1.2     1.3     HotSpot 1.0     1.1.6
Collator            100     55      42              1251
CollationKeys       49      25      36              117
Sorted twice        22      11      15              58
String.compareTo    3       2       4               3

Of course, these optimizations have improved the situation only for the particular locale I have tested (my default locale is set for US English). However, running the test in a sampling of other locales (European and Asian locales), I find similar relative speedups. Without using locale-specific dictionaries, this locale variation test may not be fully valid. But the speedup will likely hold across all Latinized alphabets. You can also create a simple partial-ordering sort class specific to some locales, which provides a similar speedup. For example, by duplicating the effect of using String.compareTo, you can provide the basis for a customized partial sorter:

    public class PartialSorter {
      String source;
      char[] stringArray;

      public PartialSorter(String s)
      {
        //retain the original string
        source = s;
        //and get the array of characters for our customized comparison
        stringArray = new char[s.length()];
        s.getChars(0, stringArray.length, stringArray, 0);
      }

      /* This compare method should be customized for different locales */
      public static int compare(char[] arr1, char[] arr2)
      {
        //basically the String.compareTo() algorithm
        int n = Math.min(arr1.length, arr2.length);
        for (int i = 0; i < n; i++)
        {
          if (arr1[i] != arr2[i])
            return arr1[i] - arr2[i];
        }
        return arr1.length - arr2.length;
      }

      public static void quicksort(String[] arr, int lo, int hi)
      {
        //convert to an array of PartialSorters
        PartialSorter keys[] = new PartialSorter[arr.length];
        for (int i = arr.length-1; i >= 0; i--)
          keys[i] = new PartialSorter(arr[i]);
        quicksort_mysorter(keys, 0, arr.length-1);
        //and unwrap so that we get our Strings in sorted order
        for (int i = arr.length-1; i >= 0; i--)
          arr[i] = keys[i].source;
      }

      public static void quicksort_mysorter(PartialSorter[] arr, int lo, int hi)
      {
        ...
        int mid = (lo + hi) / 2;
        PartialSorter middle = arr[ mid ]; //PartialSorter data type
        ...
        //Use the PartialSorter.compare method to compare the char arrays
        if (compare(arr[ lo ].stringArray, middle.stringArray) < 0)
        ...
      }
    }

This PartialSorter class works similarly to the CollationKey class, wrapping a string and providing its own comparison method. The particular comparison method shown here is just an implementation of the String.compareTo method. It is pointless to use it exactly as defined here, because object-creation overhead means that using the PartialSorter is twice as slow as using String.compareTo directly. But customizing the PartialSorter.compare method for any particular locale is a reasonable task: remember, we are only interested in a simple algorithm that handles a partial sort, not the full intricacies of completely accurate locale-specific comparison.
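
As one illustration of what such a customization might look like, the following sketch folds each character before comparing: it lowercases the character and maps a handful of accented Latin-1 letters to their base letters. The folding table is a deliberately small, hypothetical example, not a complete or authoritative rule set for any locale, and the class name is made up. It is only intended to bring an array closer to the collated order before the full international sort runs.

```java
// Hypothetical class for illustration only.
class FoldingCompare {
    // Lowercases ch and maps a few accented letters to their base letter.
    // The set of letters handled here is an illustrative assumption.
    static char fold(char ch) {
        char lower = Character.toLowerCase(ch);
        switch (lower) {
            case '\u00e0': case '\u00e1': case '\u00e2': case '\u00e4':
                return 'a';
            case '\u00e7':
                return 'c';
            case '\u00e8': case '\u00e9': case '\u00ea': case '\u00eb':
                return 'e';
            case '\u00ec': case '\u00ed': case '\u00ee': case '\u00ef':
                return 'i';
            case '\u00f1':
                return 'n';
            case '\u00f2': case '\u00f3': case '\u00f4': case '\u00f6':
                return 'o';
            case '\u00f9': case '\u00fa': case '\u00fb': case '\u00fc':
                return 'u';
            default:
                return lower;
        }
    }

    // Same shape as the String.compareTo algorithm, but on folded characters.
    static int compare(char[] arr1, char[] arr2) {
        int n = Math.min(arr1.length, arr2.length);
        for (int i = 0; i < n; i++) {
            char c1 = fold(arr1[i]);
            char c2 = fold(arr2[i]);
            if (c1 != c2) return c1 - c2;
        }
        return arr1.length - arr2.length;
    }

    public static void main(String[] args) {
        char[] ecole = "\u00e9cole".toCharArray();
        char[] zebra = "zebra".toCharArray();
        System.out.println(compare(ecole, zebra) < 0);
    }
}
```

Because this comparison treats some distinct strings as equal, it is suitable only as the first, approximate pass; the final ordering still has to come from the Collator or its CollationKeys.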
Generally, you cannot expect to support internationalized strings and retain the performance of simple one-byte-per-character strings. But, as shown here, you can certainly improve the
performance by some useful amounts.