String Comparisons and Searches

        return startIdx+1;
      }

      public static int indexOfChars(char[] buf, int startIdx, int bufLen, char[] match)
      {
        // Simple linear search: try each start position in turn
        for (int i = startIdx; i < bufLen; i++)
        {
          if (matches(buf, i, bufLen, match))
            return i;
        }
        return -1;
      }

      public static boolean matches(char[] buf, int startIdx, int bufLen, char[] match)
      {
        // Not enough characters left in the buffer to hold the match
        if (startIdx + match.length > bufLen)
          return false;
        else
        {
          // Compare from the last character of the match back to the first
          for (int j = match.length-1; j >= 0; j--)
            if (buf[startIdx+j] != match[j])
              return false;
          return true;
        }
      }

The individual methods listed here are fairly basic. As with the JDK methods, I assume a line termination is indicated by a newline or return character. Otherwise, the main effort comes in writing efficient array-matching methods. In this example, I did not try hard to look for the very best array-matching algorithms. Instead, I used straightforward algorithms for clarity, since these are fast enough for the example. There are many sources describing more sophisticated array-matching algorithms; for example, the University of Rouen in France has a nice site listing Exact String Matching Algorithms at http://www-igm.univ-mlv.fr/~lecroq/string.
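As a quick check of how the two array-matching methods behave, here is a minimal, self-contained sketch. The class name ArrayMatchDemo and the sample buffer are hypothetical, chosen only for illustration; the two methods repeat the straightforward linear search from the listing.

```java
final class ArrayMatchDemo {
    // Returns the first index >= startIdx at which match occurs
    // within buf[0..bufLen), or -1 if it does not occur.
    static int indexOfChars(char[] buf, int startIdx, int bufLen, char[] match) {
        for (int i = startIdx; i < bufLen; i++) {
            if (matches(buf, i, bufLen, match)) {
                return i;
            }
        }
        return -1;
    }

    // True if match occurs in buf starting exactly at startIdx.
    static boolean matches(char[] buf, int startIdx, int bufLen, char[] match) {
        if (startIdx + match.length > bufLen) {
            return false;
        }
        for (int j = match.length - 1; j >= 0; j--) {
            if (buf[startIdx + j] != match[j]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        char[] buf = "one two three".toCharArray();
        char[] match = "two".toCharArray();
        System.out.println(indexOfChars(buf, 0, buf.length, match)); // prints 4
        System.out.println(indexOfChars(buf, 5, buf.length, match)); // prints -1
    }
}
```

Each candidate position costs at most match.length character comparisons, which is why this approach is adequate for short patterns but not for the more demanding cases that the specialized algorithms address.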

5.5 String Comparisons and Searches

String comparison performance is highly dependent on both the string data and the comparison algorithm (this is really a truism about collections in general). The methods that come with the String class have a performance advantage in being able to directly access the underlying char collection. So if you need to make String comparisons, String methods usually provide better performance than your own methods, provided that you can make your desired comparison fit in with one of the String methods. Another necessary consideration is whether comparisons are case-sensitive or case-insensitive, and I will consider this in more detail shortly.

To optimize for string comparisons, you need to look at the source of the comparison methods so you know exactly how they work. As an example, consider the String.equals and String.equalsIgnoreCase methods from the Java 2 distribution.

String.equals(Object) runs in a fairly straightforward way: it first checks for object identity, then for null, then for String type, then for same-size strings, and then character by character, running from the first characters to the last. Efficient and complete.

String.equalsIgnoreCase(String) is a little more complex. It checks for null, and then for strings being the same size (the String type check is not needed, since this method accepts only String objects). Then, using a case-insensitive comparison, regionMatches is applied. regionMatches runs a character-by-character test from the first character to the last, converting characters to uppercase before comparing them.

Immediately, you see that the more differences there are between the two strings, the faster these methods return. This behavior is common for collection comparisons, and the order of the comparison is crucial. In these two cases, the strings are compared starting with the first character, so the earlier the difference occurs, the faster the methods return. However, equals returns faster still if the two String objects are identical.
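The order of checks described for the two methods can be sketched as follows. This is an illustrative reconstruction from the description in the text, not the JDK source: the class and method names (EqualsOrderSketch, equalsSketch, equalsIgnoreCaseSketch) are hypothetical, and the real regionMatches implementation may perform additional case checks beyond uppercasing.

```java
final class EqualsOrderSketch {
    // Follows the order described for equals:
    // identity, null, type, size, then character by character.
    static boolean equalsSketch(String a, Object other) {
        if (a == other) return true;                  // 1. object identity
        if (other == null) return false;              // 2. null
        if (!(other instanceof String)) return false; // 3. String type
        String b = (String) other;
        if (a.length() != b.length()) return false;   // 4. same size
        for (int i = 0; i < a.length(); i++) {        // 5. first character to last
            if (a.charAt(i) != b.charAt(i)) return false;
        }
        return true;
    }

    // Follows the order described for equalsIgnoreCase:
    // null, size, then an uppercased character-by-character test.
    // Note that there is no identity shortcut here.
    static boolean equalsIgnoreCaseSketch(String a, String b) {
        if (b == null) return false;
        if (a.length() != b.length()) return false;
        for (int i = 0; i < a.length(); i++) {
            char ca = Character.toUpperCase(a.charAt(i));
            char cb = Character.toUpperCase(b.charAt(i));
            if (ca != cb) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String s = "performance";
        System.out.println(equalsSketch(s, s));                     // returns at step 1
        System.out.println(equalsSketch(s, "perf"));                // returns at step 4
        System.out.println(equalsIgnoreCaseSketch(s, "PERFORMANCE")); // full scan
    }
}
```

Laying the checks out this way makes the cost of each case visible: an identical object is resolved by the first test in equalsSketch, while the same input forces equalsIgnoreCaseSketch to walk the whole string.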
It is unusual to check Strings by identity, but there are a number of situations where it is useful, for example, when you are using a set of canonical Strings (see Chapter 4). Another example is when an application has enough time during string input to intern [8] the strings, so that later comparisons by identity are possible.

[8] String.intern returns the String object that is being stored in the internal VM string pool. If two Strings are equal, then their intern results are identical; for example, if s1.equals(s2) is true, then s1.intern() == s2.intern() is also true.

In any case, equals returns immediately if the two strings are identical, but equalsIgnoreCase does not even check for identity (which may be reasonable given what it does). This results in equals running an order of magnitude faster than equalsIgnoreCase if the two strings are identical; identical strings are the fastest case for equals, but the slowest case for equalsIgnoreCase.

On the other hand, if the two strings are different in size, equalsIgnoreCase has only two tests to make before it returns, whereas equals makes four tests before it returns. This can make equalsIgnoreCase run 20% faster than equals for what may be the most common difference between strings.

There are more differences between these two methods. In almost every possible case of string data, equals runs faster (often several times faster) than equalsIgnoreCase. However, in a test against the words from a particular dictionary, I found that over 90% of the words were different in size from a randomly chosen word. When comparing the performance of these two methods for a comparison of a randomly chosen word against the entire dictionary, the total comparison time taken by each of the two methods was about the same. The many cases in which strings had different lengths compensated almost exactly for the slower comparison of equalsIgnoreCase when the strings were similar or equal.
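The relationship between equality and identity after interning can be seen in a small sketch. The class name InternIdentityDemo and the helper identicalAfterIntern are hypothetical; the strings are created with new String(...) purely to guarantee two distinct objects with equal content.

```java
final class InternIdentityDemo {
    // True if the two strings resolve to the same pooled object.
    static boolean identicalAfterIntern(String a, String b) {
        return a.intern() == b.intern();
    }

    public static void main(String[] args) {
        String s1 = new String("tuning");
        String s2 = new String("tuning");

        System.out.println(s1 == s2);                     // prints false: distinct objects
        System.out.println(s1.equals(s2));                // prints true: same characters
        System.out.println(identicalAfterIntern(s1, s2)); // prints true: same pooled object
    }
}
```

Whether interning pays off depends on how often the strings are compared afterward, since each intern call has its own lookup cost.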
This illustrates how the data and the algorithm interplay with each other to affect performance.

Even though String methods have access to the internal chars, it can be faster to use your own methods if there are no String methods appropriate for your test. You can build methods that are tailored to the data you have. One way to optimize an equality test is to look for ways to make the strings identical. An alternative that can actually be better for performance is to change the search strategy to reduce search time. For example, a linear search through a large array of Strings is slower than a binary search through the same size array if the array is sorted. This, in turn, is slower than a straight access to a hashed table. Note that when you are able and willing to deploy changes to JDK classes (e.g., for servlets), you can add methods directly to the String class. However, altering JDK classes can lead to maintenance problems. [9]

[9] Several of my colleagues have emphasized their view that changes to the JDK sources lead to severe maintenance problems.

When case-insensitive searches are required, one standard optimization is to use a second collection containing all the strings uppercased. This second collection is used for comparisons, thus avoiding the need to repeatedly uppercase each character in the search methods. For example, if you have a hash table containing String keys, you need to iterate over all the keys to match keys case-insensitively. But, if you have a second hash table with all the keys uppercased, retrieving the key simply requires you to uppercase the element being searched for. The slow version iterates through all the keys, ignoring case, until a key matches:
    // hash is a Hashtable
    public Object slowlyGet(String key)
    {
      Enumeration e = hash.keys();
      String hkey;
      while (e.hasMoreElements())
      {
        if (key.equalsIgnoreCase(hkey = (String) e.nextElement()))
          return hash.get(hkey);
      }
      return null;
    }

    // The fast version assumes that a second hashtable was created
    // with all the keys uppercased. Access is straightforward.
    public Object quicklyGet(String key)
    {
      return uppercasedHash.get(key.toUpperCase());
    }

However, note that String.toUpperCase and String.toLowerCase create a complete copy of the String object with a new char array. Unlike String.substring, String.toUpperCase has a processing time that is linearly dependent on the size of the string and also creates an extra object (a new char array). This means that repeatedly using String.toUpperCase and String.toLowerCase can impose a heavy overhead on an application. For each particular problem, you need to ensure that the extra temporary objects created and the extra processing overheads still provide a performance benefit rather than causing a new bottleneck in the application.
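The fast version assumes the second table already exists. A minimal sketch of building and using it might look like the following. The class name CaseInsensitiveLookup, its put method, and the use of Locale.ROOT (to keep uppercasing independent of the default locale) are assumptions for illustration, not part of the original example.

```java
import java.util.Enumeration;
import java.util.Hashtable;
import java.util.Locale;

final class CaseInsensitiveLookup {
    private final Hashtable<String, Object> hash = new Hashtable<>();
    // Second table: same values, keyed by the uppercased form of each key.
    // Keys that differ only by case collide here; the last one stored wins.
    private final Hashtable<String, Object> uppercasedHash = new Hashtable<>();

    void put(String key, Object value) {
        hash.put(key, value);
        uppercasedHash.put(key.toUpperCase(Locale.ROOT), value);
    }

    // Slow version: scans every key, comparing while ignoring case.
    Object slowlyGet(String key) {
        Enumeration<String> e = hash.keys();
        while (e.hasMoreElements()) {
            String hkey = e.nextElement();
            if (key.equalsIgnoreCase(hkey)) {
                return hash.get(hkey);
            }
        }
        return null;
    }

    // Fast version: one uppercase conversion, then a single hashed access.
    Object quicklyGet(String key) {
        return uppercasedHash.get(key.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        CaseInsensitiveLookup lookup = new CaseInsensitiveLookup();
        lookup.put("Content-Type", "text/plain");
        System.out.println(lookup.slowlyGet("content-type"));  // prints text/plain
        System.out.println(lookup.quicklyGet("CONTENT-TYPE")); // prints text/plain
        System.out.println(lookup.quicklyGet("Accept"));       // prints null
    }
}
```

In this sketch the uppercasing cost is paid once per key at insertion and once per lookup, instead of once per stored key on every lookup. The trade-off is the memory for the second table and the temporary String created by each toUpperCase call.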

5.6 Sorting Internationalized Strings