5.4 Strings Versus char Arrays
In one of my first programming courses, in the language C, our instructor made an interesting comment. He said, "C has lightning-fast string handling because it has no string type." He went on to explain this oxymoron by pointing out that in C, any null-terminated sequence of bytes can be considered a string: this convention is supported by all string-handling functions. The point is that since the convention is adhered to fairly rigorously, there is no need to use only the standard string-handling functions. Any string manipulation you want to do can be executed directly on the byte array, allowing you to bypass or rewrite any string-handling functions you need to speed up. Because you are not forced to run through a restricted set of manipulation functions, it is always possible to optimize code using your own hand-crafted functions. Furthermore, some string-manipulating functions operate directly on the original byte array rather than creating a copy of that array. This can be a source of bugs, but is another reason speed can be optimized.
In Java, the inability to subclass String or access its internal char array means you cannot use the techniques applied in C. Even if you could subclass String, this would not avoid the second problem: many other methods operate on or return copies of a String. Generally, there is no way to avoid using String objects for code external to your application classes. But internally, you can provide your own char array type that allows you to manipulate strings according to your needs. As an example, let's look at a couple of simple text-parsing problems: first, counting the words in a body of text, and second, using a filter to select lines of a file based on whether they contain a particular string.
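To make the idea concrete, here is a minimal sketch of the sort of internal char array type you might provide. The class and method names (CharWrapper, indexOf) are illustrative placeholders, not anything defined later in this chapter; the point is simply that the array is directly accessible, so you are free to hand-craft whatever manipulation functions you need:
  //A minimal sketch of an application-internal "string" type backed by
  //a directly accessible char array. Unlike String, nothing stops you
  //from scanning or modifying the characters in place.
  public class CharWrapper
  {
    public char[] chars;  //deliberately accessible: direct access is the point
    public int length;    //number of valid characters in chars

    public CharWrapper(int capacity)
    {
      chars = new char[capacity];
      length = 0;
    }

    //An example hand-crafted operation working directly on the array:
    //find the first occurrence of a character without creating any objects
    public int indexOf(char c)
    {
      for (int i = 0; i < length; i++)
        if (chars[i] == c)
          return i;
      return -1;
    }

    //Convert to a String only where code external to the application
    //requires a real String object (this copies the characters)
    public String toString()
    {
      return new String(chars, 0, length);
    }
  }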
5.4.1 Word-Counting Example
Let's look at the typical Java approach to counting words in a text. I use the StreamTokenizer for the word count, as that class is tailor-made for this kind of problem.
The word count is fairly easy to implement. The only difficulty comes in defining what a word is and coaxing the StreamTokenizer to agree with that definition. To keep things simple, I define a word as any contiguous sequence of alphanumeric characters. This means that words with apostrophes and numbers with decimal points count as two words, but I'm more interested in the performance than the niceties of word definitions here, and I want to keep the implementation simple. The implementation looks like this:
  public static void wordcount(String filename)
    throws IOException
  {
    int count = 0;
    //create the tokenizer, and initialize it
    FileReader r = new FileReader(filename);
    StreamTokenizer rdr = new StreamTokenizer(r);
    rdr.resetSyntax();
    rdr.wordChars('a', 'z'); //words include any lowercase character
    rdr.wordChars('A', 'Z'); //words include any uppercase character
    rdr.wordChars('0', '9'); //words include any digit
    //everything else is whitespace
    rdr.whitespaceChars(0, '0'-1);
    rdr.whitespaceChars('9'+1, 'A'-1);
    rdr.whitespaceChars('z'+1, '\uffff');
    int token;
    //loop getting each token (word) from the tokenizer
    //until we reach the end of the file
    while ( (token = rdr.nextToken()) != StreamTokenizer.TT_EOF)
    {
      //If the token is a word, count it, otherwise it is whitespace
      if (token == StreamTokenizer.TT_WORD)
        count++;
    }
    System.out.println(count + " words found.");
    r.close();
  }
Now, for comparison, let's implement a more efficient version using char arrays. The word-count algorithm is relatively straightforward: test for sequences of alphanumerics and skip anything else. The only slight complication comes when you refill the buffer with the next chunk from the file. You need to avoid counting one word as two if it falls across the junction of two reads into the buffer, but this turns out to be easy to handle. You simply need to remember the last character of the last chunk, and skip any alphanumeric characters at the beginning of the next chunk if that last character was alphanumeric (i.e., continue with the word until it terminates). The implementation looks like this:
  public static void cwordcount(String filename)
    throws IOException
  {
    int count = 0;
    FileReader rdr = new FileReader(filename);
    //buffer to hold read in characters
    char[] buf = new char[8192];
    int len;
    int idx = 0;
    //initialize so that our current character is in whitespace
    char c = ' ';
    //read in each chunk as much as possible,
    //until there is nothing left to read
    while ( (len = rdr.read(buf, 0, buf.length)) != -1)
    {
      idx = 0;
      int start;
      //if we are already in a word, then skip the rest of it
      if (Character.isLetterOrDigit(c))
        while ( (idx < len) && Character.isLetterOrDigit(buf[idx]) )
          {idx++;}
      while (idx < len)
      {
        //skip non alphanumeric
        while ( (idx < len) && !Character.isLetterOrDigit(buf[idx]) )
          {idx++;}
        //skip word
        start = idx;
        while ( (idx < len) && Character.isLetterOrDigit(buf[idx]) )
          {idx++;}
        if (start < len)
        {
          count++; //count word
        }
      }
      //get last character so we know whether to carry on a word
      c = buf[idx-1];
    }
    System.out.println(count + " words found.");
    rdr.close();
  }
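If you want to run the two methods head-to-head yourself, a small harness along the following lines is enough. This is only a sketch: it assumes wordcount() and cwordcount() are pasted into the same class, and the default file name is a placeholder rather than the actual test file used for the timings below:
  import java.io.IOException;

  //A minimal sketch of a harness for timing the two word counters on the
  //same file; not the measurement setup used for Table 5-5
  public class WordCountTest
  {
    public static void main(String[] args) throws IOException
    {
      String filename = (args.length > 0) ? args[0] : "large.txt"; //placeholder

      long start = System.currentTimeMillis();
      wordcount(filename);   //StreamTokenizer version
      System.out.println("wordcount: " +
          (System.currentTimeMillis() - start) + " ms");

      start = System.currentTimeMillis();
      cwordcount(filename);  //char array version
      System.out.println("cwordcount: " +
          (System.currentTimeMillis() - start) + " ms");
    }

    //... wordcount() and cwordcount() methods as listed above ...
  }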
You can compare the char array implementation with the one using the StreamTokenizer. All tests use the same large text file for counting the words. I normalize to 100 the time taken by the StreamTokenizer using JDK 1.2 with the JIT compiler (see Table 5-5). Interestingly, the test takes almost the same amount of time when I run using the StreamTokenizer without the JIT compiler running. Depending on the file I run with, sometimes the JIT VM turns out slower than the non-JIT VM with the StreamTokenizer test.
Table 5-5. Word Counter Timings Using wordcount or cwordcount Methods
VM    1.2    1.2 no JIT    1.3    HotSpot 1.0