

5.4 Strings Versus char Arrays

In one of my first programming courses, in the language C, our instructor made an interesting comment. He said, "C has lightning-fast string handling because it has no string type." He went on to explain this oxymoron by pointing out that in C, any null-terminated sequence of bytes can be considered a string: this convention is supported by all the string-handling functions. The point is that since the convention is adhered to fairly rigorously, there is no need to use only the standard string-handling functions. Any string manipulation you want to do can be executed directly on the byte array, allowing you to bypass or rewrite any string-handling functions you need to speed up. Because you are not forced to run through a restricted set of manipulation functions, it is always possible to optimize code using your own hand-crafted functions. Furthermore, some string-manipulating functions operate directly on the original byte array rather than creating a copy of it. This can be a source of bugs, but it is another reason speed can be optimized.

In Java, the inability to subclass String or access its internal char array means you cannot use the techniques applied in C. Even if you could subclass String, this would not avoid the second problem: many other methods operate on or return copies of a String. Generally, there is no way to avoid using String objects for code external to your application classes. But internally, you can provide your own char array type that allows you to manipulate strings according to your needs.

As an example, let's look at a couple of simple text-parsing problems: first, counting the words in a body of text, and second, using a filter to select lines of a file based on whether they contain a particular string.
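To make the contrast concrete, here is a minimal sketch (my own illustration, not from the text) of the kind of in-place manipulation Java does allow when you hold your own char array. The class and method names are hypothetical. Uppercasing a String always creates a new object, but uppercasing a char[] you own can mutate the buffer directly, with no copy:

```java
public class InPlaceUpper {
    // Uppercase ASCII letters directly in the array: no copy is made,
    // unlike String.toUpperCase(), which must allocate a new String.
    static void upperInPlace(char[] buf, int off, int len) {
        for (int i = off; i < off + len; i++) {
            char c = buf[i];
            if (c >= 'a' && c <= 'z')
                buf[i] = (char) (c - ('a' - 'A'));
        }
    }

    public static void main(String[] args) {
        char[] buf = "hello world".toCharArray();
        upperInPlace(buf, 0, buf.length);
        System.out.println(new String(buf)); // prints HELLO WORLD
    }
}
```

The same buffer can then be reused for the next operation, which is the essence of the C technique the instructor was describing.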

5.4.1 Word-Counting Example

Let's look at the typical Java approach to counting words in a text. I use the StreamTokenizer for the word count, as that class is tailor-made for this kind of problem. The word count is fairly easy to implement. The only difficulty comes in defining what a word is and coaxing the StreamTokenizer to agree with that definition. To keep things simple, I define a word as any contiguous sequence of alphanumeric characters. This means that words with apostrophes and numbers with decimal points count as two words, but I'm more interested in the performance than in the niceties of word definitions here, and I want to keep the implementation simple. The implementation looks like this:

    public static void wordcount(String filename)
        throws IOException
    {
      int count = 0;
      //create the tokenizer, and initialize it
      FileReader r = new FileReader(filename);
      StreamTokenizer rdr = new StreamTokenizer(r);
      rdr.resetSyntax();
      rdr.wordChars('a', 'z');  //words include any lowercase character
      rdr.wordChars('A', 'Z');  //words include any uppercase character
      rdr.wordChars('0', '9');  //words include any digit
      //everything else is whitespace
      rdr.whitespaceChars(0, '0'-1);
      rdr.whitespaceChars('9'+1, 'A'-1);
      rdr.whitespaceChars('Z'+1, 'a'-1);
      rdr.whitespaceChars('z'+1, '\uffff');
      int token;
      //loop, getting each token (word) from the tokenizer
      //until we reach the end of the file
      while( (token = rdr.nextToken()) != StreamTokenizer.TT_EOF )
      {
        //If the token is a word, count it; otherwise it is whitespace
        if (token == StreamTokenizer.TT_WORD)
          count++;
      }
      System.out.println(count + " words found.");
      r.close();
    }

Now, for comparison, implement a more efficient version using char arrays. The word-count algorithm is relatively straightforward: test for sequences of alphanumerics, and skip anything else. The only slight complication comes when you refill the buffer with the next chunk from the file. You need to avoid counting one word as two if it falls across the junction of the two reads into the buffer, but this turns out to be easy to handle.
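To see that junction problem concretely before looking at the full implementation, here is a minimal, self-contained sketch (my own, not from the text) of the carry-over idea, using a StringReader and a deliberately tiny buffer so that a word is guaranteed to straddle two reads. The class and method names are hypothetical:

```java
import java.io.*;

public class ChunkBoundaryCheck {
    // Count words from a reader using a buffer of the given size.
    // The char c remembers the last character of the previous chunk,
    // so a word split across two read() calls is counted only once.
    static int countWords(Reader rdr, int bufSize) throws IOException {
        int count = 0;
        char[] buf = new char[bufSize];
        int len;
        char c = ' '; //start as if we are in whitespace
        while ((len = rdr.read(buf, 0, buf.length)) != -1) {
            int idx = 0;
            //if the previous chunk ended mid-word, finish that word first
            if (Character.isLetterOrDigit(c))
                while (idx < len && Character.isLetterOrDigit(buf[idx])) idx++;
            while (idx < len) {
                //skip non-alphanumeric characters
                while (idx < len && !Character.isLetterOrDigit(buf[idx])) idx++;
                //skip the word
                int start = idx;
                while (idx < len && Character.isLetterOrDigit(buf[idx])) idx++;
                if (start < len) count++; //count the word
            }
            //remember the last character for the next chunk
            c = buf[len - 1];
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        //with a 4-char buffer, "counting" straddles two read() calls
        System.out.println(countWords(new StringReader("word counting test"), 4)); // prints 3
    }
}
```

Without the initial carry-over check, the 4-char buffer would split "counting" into "cou" and "nting" and count four words instead of three.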
You simply need to remember the last character of the last chunk, and skip any alphanumeric characters at the beginning of the next chunk if that last character was alphanumeric (i.e., continue with the word until it terminates). The implementation looks like this:

    public static void cwordcount(String filename)
        throws IOException
    {
      int count = 0;
      FileReader rdr = new FileReader(filename);
      //buffer to hold the characters read in
      char[] buf = new char[8192];
      int len;
      int idx = 0;
      //initialize so that our current character is in whitespace
      char c = ' ';
      //read in each chunk as much as possible,
      //until there is nothing left to read
      while( (len = rdr.read(buf, 0, buf.length)) != -1)
      {
        idx = 0;
        int start;
        //if we are already in a word, then skip the rest of it
        if (Character.isLetterOrDigit(c))
          while( (idx < len) && Character.isLetterOrDigit(buf[idx]) ) {idx++;}
        while(idx < len)
        {
          //skip non-alphanumeric characters
          while( (idx < len) && !Character.isLetterOrDigit(buf[idx]) ) {idx++;}
          //skip the word
          start = idx;
          while( (idx < len) && Character.isLetterOrDigit(buf[idx]) ) {idx++;}
          if (start < len)
          {
            count++;  //count the word
          }
        }
        //get the last character so we know whether to carry on a word
        c = buf[idx-1];
      }
      System.out.println(count + " words found.");
      rdr.close();
    }

You can compare this implementation with the one using the StreamTokenizer. All tests use the same large text file for counting the words. I normalize to 100 the time taken by StreamTokenizer using JDK 1.2 with the JIT compiler (see Table 5-5). Interestingly, the test takes almost the same amount of time when I run using the StreamTokenizer without the JIT compiler running. Depending on the file I run with, sometimes the JIT VM turns out slower than the non-JIT VM with the StreamTokenizer test.

Table 5-5, Word Counter Timings Using wordcount() or cwordcount() Methods

VM    1.2    1.2 no JIT    1.3    HotSpot 1.0