7.1 Java.io.Reader Converter

In the java.io package, the Reader and Writer classes provide character-based I/O, as opposed to byte-based I/O. The InputStreamReader provides a bridge from byte to character streams: it reads bytes and translates them into characters according to a specified character encoding. If no encoding is specified, a default converter class is used. For applications that spend a significant amount of time reading, it is not unusual to see the convert() method of this encoding class high up in a profile of how the application time is spent. It is instructive to examine how this particular conversion method functions, and to see the effect of a tuning exercise.

Examining the bytecodes of the convert() method[2] where most of the time is being spent, you can see that the bytecodes correspond to the following method (the Exception used is different; I have just used the generic Exception class):

[2] The convert() method is a method in one of the sun.* packages, so the source code is not available. I have chosen the convert() method from the default class used in some ASCII environments, the ISO 8859_1 conversion class.

    public int convert(byte input[], int byteStart, int byteEnd,
                       char output[], int charStart, int charEnd)
        throws Exception
    {
        int charOff = charStart;
        for (int byteOff = byteStart; byteOff < byteEnd;)
        {
            if (charOff >= charEnd)
                throw new Exception();
            int i1 = input[byteOff++];
            if (i1 >= 0)
                output[charOff++] = (char) i1;
            else
                output[charOff++] = (char) (256 + i1);
        }
        return charOff - charStart;
    }

Basically, the method takes a byte array (input) and converts the elements from byteStart to byteEnd of that array into characters. The conversion of bytes to chars is straightforward: positive byte values map to the same char value, and negative byte values map to the char with value (byte value + 256). These chars are put into the passed char array (output) at indexes from charStart up to charEnd.

It doesn't seem that there is too much scope for tuning. There is the obvious first test, which is performed every time through the loop; you can certainly move that. But let's start by trying to tune the data conversion itself. First, be sure that casts on data types are efficient. It's only a quick test to find out. Add a static char array to the class, which contains the char values 0 to 127 at elements 0 to 127 in the array. Calling this array MAP1, test the following altered method:

    public int convert(byte input[], int byteStart, int byteEnd,
                       char output[], int charStart, int charEnd)
        throws Exception
    {
        int charOff = charStart;
        for (int byteOff = byteStart; byteOff < byteEnd;)
        {
            if (charOff >= charEnd)
                throw new Exception();
            int i1 = input[byteOff++];
            if (i1 >= 0)
                output[charOff++] = MAP1[i1];
            else
                output[charOff++] = (char) (256 + i1);
        }
        return charOff - charStart;
    }

On the basis of the original method taking a normalized 100.0 seconds in test runs, this alternative takes an average of 111.8 seconds over a set of test runs. That says that casts are not so slow, but it hasn't helped make this method any faster. However, the second cast involves an addition as well, and perhaps you can do better there. Unfortunately, there is no obvious way to use a negative value as an index into the array without executing some offset operation, so you won't gain time. For completeness, test this with an index offset given by i1+128, and find that the average time is at the 110.7-second mark. This is not significantly better than the last test, and definitely worse than the original.
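The initialization of MAP1 is not shown in the original; a minimal sketch, assuming a simple static initializer in the converter class, might be:

    //Hypothetical initialization for MAP1 as described above:
    //the char values 0 to 127 stored at elements 0 to 127.
    static final char[] MAP1 = new char[128];
    static
    {
        for (int i = 0; i < MAP1.length; i++)
            MAP1[i] = (char) i;
    }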
Array-lookup speeds are highly dependent on the processor and the memory-access instructions available from the processor. The lookup speed is also dependent on the compiler taking advantage of the fastest memory-access instructions available. It is possible that other processors, VMs, or compilers will produce lookups faster than the cast. But you have gained an extra option from these two tests. It is now clear that you can map all the bytes to chars through an array. Perhaps you can eliminate the test for positiveness applied to the byte (i.e., if (i1 >= 0)) and use a char array to map all the bytes directly. And indeed you can. Use the index conversion from the second test (an index offset given by i1+128), with a static char array that contains the char values 128 to 255 at elements 0 to 127 in the array, and the char values 0 to 127 at elements 128 to 255 in the array. The method now looks like:

    public int convert(byte input[], int byteStart, int byteEnd,
                       char output[], int charStart, int charEnd)
        throws Exception
    {
        int charOff = charStart;
        for (int byteOff = byteStart; byteOff < byteEnd;)
        {
            if (charOff >= charEnd)
                throw new Exception();
            int i1 = input[byteOff++];
            output[charOff++] = MAP3[128 + i1];
        }
        return charOff - charStart;
    }

You have eliminated one boolean test each time through the loop, at the expense of using a slightly more expensive data-conversion method (an array access rather than the cast). The average test result is now slightly faster than before, but still over the 100 seconds (some VMs show a speedup at this stage, but not the JDK 1.2 VM).
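As with MAP1, the construction of the full mapping table is not shown in the original. A plausible sketch follows; MAP5, the int-array variant used in the later versions below, is built the same way:

    //Hypothetical initialization for the full byte-to-char mapping:
    //chars 128 to 255 at elements 0 to 127, and chars 0 to 127 at
    //elements 128 to 255, so that MAP3[128 + b] is the correct char
    //for every byte b in the range -128 to 127.
    static final char[] MAP3 = new char[256];
    //MAP5 is the int-array variant used in the later versions below.
    static final int[] MAP5 = new int[256];
    static
    {
        for (int i = 0; i < 256; i++)
        {
            MAP3[i] = (char) ((i + 128) % 256);
            MAP5[i] = (i + 128) % 256;
        }
    }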
Cleaning up the method slightly, you can see that the temporary variable i1, which was previously required for the test, is no longer needed. Being an assiduous tuner and clean coder, you eliminate it and retest so that you have a new baseline to start from. Astonishingly (to me at least), this speeds the test up measurably. The average test time is still slightly above 100 seconds (again, some VMs do show a speedup at this stage, greater than before, but not the JDK 1.2 VM). There was a definite overhead from the redundant temporary variable in the loop: a lesson to keep in mind for general tuning.

It may be worth testing to see if an int array performs better than the char array (MAP3) previously used, since ints are the faster data type. And indeed, changing the type of this array and putting a char cast in the loop improves times, so that you are now very slightly, but consistently, faster than 100 seconds for JDK 1.2. (Not all VMs are faster at this stage, though all are close to the 100-second mark. For example, JDK 1.1.6 shows timings slightly larger than 100 seconds.) More to the point, after all this effort, you have not really managed a speedup consistent enough or good enough to justify the time spent on this tuning exercise.

Now I'm out of original ideas, but we have yet to apply the standard optimizations. Start[3] by eliminating expressions from the loop that do not need to be repeatedly evaluated, and move the other boolean test (the one for the out-of-range Exception) out of the loop. The method now looks like this (MAP5 is the int array mapping bytes to chars):

[3] Although the tuning optimizations I've tried so far have not provided a significant speedup, I will continue tuning with the most recent implementation discussed, instead of starting again from the beginning. There is no particular reason why I should not restart from the original implementation.

    public int convert(byte input[], int byteStart, int byteEnd,
                       char output[], int charStart, int charEnd)
        throws Exception
    {
        int max = byteEnd;
        boolean throwException = false;
        if (byteEnd - byteStart > charEnd - charStart)
        {
            max = byteStart + (charEnd - charStart);
            throwException = true;
        }
        int charOff = charStart;
        for (int byteOff = byteStart; byteOff < max;)
        {
            output[charOff++] = (char) MAP5[input[byteOff++] + 128];
        }
        if (throwException)
            throw new Exception();
        return charOff - charStart;
    }

I am taking the trouble to make the method functionally identical to the original. The original version filled in the output array until the actual out-of-range exception was encountered, so I do the same. If you throw the exception as soon as you establish that the index is out of range, the code will be slightly more straightforward. Other than that, the loop is the same as before, but without the out-of-range test and without the temporary assignment. The average test result is now a very useful 83.3 seconds. You've shaved off nearly a fifth of the time spent in this loop, mainly by eliminating a test that was originally run on each loop iteration. This speedup applied to all VMs tested (many had a better speedup, i.e., a lower time relative to the 100-second mark).
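For comparison, the slightly more straightforward variant mentioned above, which throws the exception as soon as the range check fails, might look like the following sketch. Note that it is not functionally identical to the original, since it fills in nothing at all when the output array is too small:

    //Sketch of the simpler variant: reject an undersized output
    //array up front instead of filling it before throwing.
    public int convert(byte input[], int byteStart, int byteEnd,
                       char output[], int charStart, int charEnd)
        throws Exception
    {
        if (byteEnd - byteStart > charEnd - charStart)
            throw new Exception();
        int charOff = charStart;
        for (int byteOff = byteStart; byteOff < byteEnd;)
            output[charOff++] = (char) MAP5[input[byteOff++] + 128];
        return charOff - charStart;
    }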
Loop unrolling is another standard optimization that eliminates some more tests. Let's partially unroll the loop and see what sort of gain we get. In practice, the optimal amount of loop unrolling corresponds to the way the application uses the convert() method, for example, the size of the typical array being converted. But in any case, we use a particular example of 10 loop iterations to see the effect.

Optimal loop unrolling depends on a number of factors, including the underlying operating system and hardware. Loop unrolling is ideally achieved by an optimizing compiler rather than by hand. HotSpot interacts with manual loop unrolling in a highly variable way: sometimes HotSpot makes the unoptimized loop faster, sometimes the manually unrolled loop comes out faster. An example can be seen in Table 8-1 and Table 8-2, which show HotSpot producing both faster and slower times for the same manually unrolled loop, depending on the data being processed. These two tables show the results from the same optimized program being run against files with long lines (Table 8-1) and files with short lines (Table 8-2). Of all the VMs tested, only the HotSpot VM produces inconsistent results, with a speedup when processing the long-line files but a slowdown when processing the short-line files (the last two lines of each table show the difference between the original loop and the manually unrolled loop). The method now looks like this:

    public int convert(byte input[], int byteStart, int byteEnd,
                       char output[], int charStart, int charEnd)
        throws Exception
    {
        //Set the maximum index of the input array to wind to
        int max = byteEnd;
        boolean throwException = false;
        if (byteEnd - byteStart > charEnd - charStart)
        {
            //If the byte array length is larger than the char array
            //length, then we will throw an exception when we get to
            //the adjusted max
            max = byteStart + (charEnd - charStart);
            throwException = true;
        }
        //charOff is the current index into 'output'
        int charOff = charStart;
        //Check that we have at least 10 elements for
        //our unrolled part of the loop
        if (max - byteStart > 10)
        {
            //Shift max down by 10 so that we have some elements
            //left over before we run out of groups of 10
            max -= 10;
            int byteOff = byteStart;
            //The loop test only executes every 10th iteration
            //compared to the normal loop. All the increments are
            //done in the loop body: each line increments byteOff
            //by 1, until it has been incremented by 10 after 10
            //lines. Then the test checks that we are still under
            //max; if so, loop again.
            for (; byteOff < max;)
            {
                output[charOff++] = (char) MAP5[input[byteOff++] + 128];
                output[charOff++] = (char) MAP5[input[byteOff++] + 128];
                output[charOff++] = (char) MAP5[input[byteOff++] + 128];
                output[charOff++] = (char) MAP5[input[byteOff++] + 128];
                output[charOff++] = (char) MAP5[input[byteOff++] + 128];
                output[charOff++] = (char) MAP5[input[byteOff++] + 128];
                output[charOff++] = (char) MAP5[input[byteOff++] + 128];
                output[charOff++] = (char) MAP5[input[byteOff++] + 128];
                output[charOff++] = (char) MAP5[input[byteOff++] + 128];
                output[charOff++] = (char) MAP5[input[byteOff++] + 128];
            }
            //We exited the loop because byteOff went over max.
            //Fortunately we kept back 10 elements so that we didn't
            //go too far past max. Now add the 10 back, and go into
            //the normal loop for the last few elements.
            max += 10;
            for (; byteOff < max;)
            {
                output[charOff++] = (char) MAP5[input[byteOff++] + 128];
            }
        }
        else
        {
            //If we're in this conditional, then there aren't even
            //10 elements to process, so obviously we don't want to
            //do the unrolled part of the method.
            for (int byteOff = byteStart; byteOff < max;)
            {
                output[charOff++] = (char) MAP5[input[byteOff++] + 128];
            }
        }
        //Finally, if we indicated that the method needs an
        //exception thrown, we do it now.
        if (throwException)
            throw new Exception();
        return charOff - charStart;
    }

The average test result is now a very good 72.6 seconds. You've now shaved off over one quarter of the time compared to the original loop in JDK 1.2; other VMs give an even larger speedup, some taking as little as 60% of the time of the original loop. It is worth repeating that this is mainly a result of eliminating tests that were originally run in each loop iteration. For tight loops (i.e., loops that have only a small amount of actual work to execute on each iteration), the overhead of the tests is definitely significant.

It is also important during the tuning exercise to run the various improvements under different VMs and determine that the improvements are generally applicable. My tests indicate that these improvements are generally valid for all runtime environments. (One development environment with a very slow VM, an order of magnitude slower than the Sun VM without JIT, showed only a small improvement. However, it is not generally a good idea to base performance tests on development environments.)

For a small Java program that does simple filtering or conversion of data from text files, this convert() method can take 40% of the total program time. Improving this one method as shown can shave 10% from the time of the whole program, which is a good gain for a relatively small amount of work (it took me longer to write this section than to tune the convert() method).
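The normalized timings quoted throughout this section come from a test harness that is not shown. A minimal sketch of the sort of harness you could use to compare the variants follows; the class name, array size, and repeat count are all illustrative assumptions, and the method body shown is the original cast-based version:

    //Hypothetical timing harness: repeatedly runs a convert()
    //implementation over a fixed byte array and reports the elapsed
    //time. Swap in each tuned variant of convert() to compare.
    public class ConvertTest
    {
        static final int SIZE = 10000;     //illustrative array size
        static final int REPEATS = 10000;  //illustrative repeat count

        public static void main(String[] args) throws Exception
        {
            byte[] input = new byte[SIZE];
            for (int i = 0; i < SIZE; i++)
                input[i] = (byte) i;   //covers negative and positive bytes
            char[] output = new char[SIZE];

            ConvertTest test = new ConvertTest();
            long start = System.currentTimeMillis();
            for (int i = 0; i < REPEATS; i++)
                test.convert(input, 0, SIZE, output, 0, SIZE);
            System.out.println("convert() took " +
                (System.currentTimeMillis() - start) + " ms");
        }

        //The original cast-based version; replace with each tuned
        //variant in turn to compare timings.
        public int convert(byte input[], int byteStart, int byteEnd,
                           char output[], int charStart, int charEnd)
            throws Exception
        {
            int charOff = charStart;
            for (int byteOff = byteStart; byteOff < byteEnd;)
            {
                if (charOff >= charEnd)
                    throw new Exception();
                int i1 = input[byteOff++];
                if (i1 >= 0)
                    output[charOff++] = (char) i1;
                else
                    output[charOff++] = (char) (256 + i1);
            }
            return charOff - charStart;
        }
    }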

7.2 Exception-Terminated Loops