- 119 -
return startIdx+1; }
public static int indexOfCharschar[] buf, int startIdx, int bufLen, char[] match
{ Simple linear search
int j; for int i = startIdx; i bufLen; i++
{ if matchesbuf, i, bufLen, match
return i; }
return -1; }
public static boolean matcheschar[] buf, int startIdx, int bufLen, char[] match
{ if startIdx + match.length bufLen
return false; else
{ forint j = match.length-1; j = 0 ; j--
ifbuf[startIdx+j] = match[j] return false;
return true; }
}
The individual methods listed here are fairly basic. As with the JDK methods, I assume a line termination is indicated by a newline or return character. Otherwise, the main effort comes in
writing efficient array-matching methods. In this example, I did not try hard to look for the very best array-matching algorithms . Instead, I used straightforward algorithms for clarity, since these
are fast enough for the example. There are many sources describing more sophisticated array- matching algorithms; for example, the University of Rouen in France has a nice site listing Exact
String Matching Algorithms at
http:www-igm.univ-mlv.fr~lecroqstring .
5.5 String Comparisons and Searches
String
comparison performance is highly dependent on both the string data and the comparison algorithm this is really a truism about collections in general. The methods that come with the
String
class have a performance advantage in being able to directly access the underlying
char
collection. So if you need to make
String
comparisons,
String
methods usually provide better performance than your own methods, provided that you can make your desired comparison fit in
with one of the
String
methods. Another necessary consideration is whether comparisons are case- sensitive or -insensitive, and I will consider this in more detail shortly.
To optimize for string comparisons, you need to look at the source of the comparison methods so you know exactly how they work. As an example, consider the
String.equals
and
String.equalsIgnoreCase
methods from the Java 2 distribution.
String.equalsObject
runs in a fairly straightforward way: it first checks for object identity, then for
null
, then for
String
type, then for same-size strings, and then character by character, running from the first characters to the last. Efficient and complete.
- 120 -
String.equalsIgnoreCaseString
is a little more complex. It checks for
null
, and then for strings being the same size the
String
type check is not needed, since this method accepts only
String
objects. Then, using a case-insensitive comparison,
regionMatches
is applied.
regionMatches
runs a character-by-character test from the first character to the last, converting characters to uppercase before comparing them.
Immediately, you see that the more differences there are between the two strings, the faster these methods return. This behavior is common for collection comparisons, and the order of the
comparison is crucial. In these two cases, the strings are compared starting with the first character, so the earlier the difference occurs, the faster the methods return. However,
equals
returns faster if the two
String
objects are identical. It is unusual to check
String
s by identity, but there are a number of situations where it is useful, for example, when you are using a set of canonical
String
s see Chapter 4
. Another example is when an application has enough time during string input to
intern
[8]
the strings, so that later comparisons by identity are possible.
[8]
String.intern
returns the
String
object that is being stored in the internal VM string pool. If two
String
s are equal, then their
intern
results are identical; for example, if
s1.equalss2
is
true
, then
s1.intern == s2.intern
is also
true
.
In any case,
equals
returns immediately if the two strings are identical, but
equalsIgnoreCase
does not even check for identity which may be reasonable given what it does. This results in
equals
running an order of magnitude faster than
equalsIgnoreCase
if the two strings are identical; identical strings is the fastest test case resolvable for
equals
, but the slowest case for
equalsIgnoreCase
. On the other hand, if the two strings are different in size,
equalsIgnoreCase
has only two tests to make before it returns, whereas
equals
makes four tests before it returns. This can make
equalsIgnoreCase
run 20 faster than
equals
for what may be the most common difference between strings.
There are more differences between these two methods. In almost every possible case of string data,
equals
runs faster often several times faster than
equalsIgnoreCase
. However, in a test against the words from a particular dictionary, I found that over 90 of the words were different in
size from a randomly chosen word. When comparing the performance of these two methods for a comparison of a randomly chosen word against the entire dictionary, the total comparison time
taken by each of the two methods was about the same. The many cases in which strings had different lengths compensated almost exactly for the slower comparison of
equalsIgnoreCase
when the strings were similar or equal. This illustrates how the data and the algorithm interplay with each other to affect performance.
Even though
String
methods have access to the internal
char
s, it can be faster to use your own methods if there are no
String
methods appropriate for your test. You can build methods that are tailored to the data you have. One way to optimize an equality test is to look for ways to make the
strings identical. An alternative that can actually be better for performance is to change the search strategy to reduce search time. For example, a linear search through a large array of
String
s is slower than a binary search through the same size array if the array is sorted. This, in turn, is slower
than a straight access to a hashed table. Note that when you are able and willing to deploy changes to JDK classes e.g., for servlets, you can add methods directly to the
String
class. However, altering JDK classes can lead to maintenance problems.
[9] [9]
Several of my colleagues have emphasized their view that changes to the JDK sources lead to severe maintenance problems.
- 121 - When case-insensitive searches are required, one standard optimization is to use a second collection
containing all the strings uppercased. This second collection is used for comparisons, thus avoiding the need to repeatedly uppercase each character in the search methods. For example, if you have a
hash table containing
String
keys, you need to iterate over all the keys to match keys case- insensitively. But, if you have a second hash table with all the keys uppercased, retrieving the key
simply requires you to uppercase the element being searched for:
The slow version, iterating through all the keys ignoring case until the key matches. hash is a Hashtable
public Object slowlyGetString key {
Enumeration e = hash.keys ; String hkey;
whilee.hasMoreElements {
if key.equalsIgnoreCasehkey = String e.getNext return hash.gethkey;
} return null;
} The fast version assumes that a second hashtable was created
with all the keys uppercased. Access is straightforward. public Object quicklyGetString key
{ return uppercasedHash.getkey.toUppercase ;
}
However, note that
String.toUppercase
and
String.toLowercase
creates a complete copy of the
String
object with a new
char
array. Unlike
String.substring
,
String.toUppercase
has a processing time that is linearly dependent on the size of the string and also creates an extra object a new
char
array. This means that repeatedly using
String.toUppercase
and
String.toLowercase
can impose a heavy overhead on an application. For each particular problem, you need to ensure that the extra temporary objects created
and the extra processing overheads still provide a performance benefit rather than causing a new bottleneck in the application.
5.6 Sorting Internationalized Strings