Compression, IO, Logging, and Console Output

and look at the actual method names of the native methods, you will find that in almost every case, the only classes applicable to you are the FileInputStream, FileOutputStream, and RandomAccessFile classes. Now the difficult part is wrapping these calls so that you can monitor them. Native methods that are declared private are straightforward to handle: just redefine the java.io class to count the times they are called internally. Native methods that are protected or have no access modifier are handled similarly: just ensure you do the same redefinition for subclasses and package members. But the methods defined with the public modifier need to be tracked for any classes that call these native methods, which can be difficult and tiresome, but not impossible.[17]

[17] Ultimately, it is the number of low-level IO operations that matters. But if you reduce the high-level IO operations, the low-level ones are generally reduced by the same proportion. The Java read/write/open/close operations at the native level are also the OS read/write/open/close operations for all the Java runtimes I've investigated.

The simplest alternative would be to use the debug interface to count the number of hits on the method. Unfortunately, you cannot set a breakpoint on a native method, so this is not possible. The result is that it takes some effort to identify every IO call in an application. If you have consistently used your own IO classes, the java.io buffered classes, and the java.io Reader and Writer classes, it may be enough to wrap the IO calls to FileOutputStream and FileInputStream from these classes. If you have done nonstandard things, you need to put in more effort.

One other way to determine how many IO operations you have used is to execute Runtime.getRuntime().traceMethodCalls(true) before the test starts, capture the method trace, and filter out the native calls you have identified. Unfortunately, this is optional functionality in the JDK: Java specifies that the traceMethodCalls() method must exist in Runtime, but it does not have to do anything. So you are lucky if you use a system that supports it. The only one I am aware of is the Symantec development environment, and in that case, you have to be in the IDE and running in debug mode. Running the Symantec VM outside the IDE does not seem to enable this feature. Some profilers (see also Chapter 2) may also help to produce a trace of all IO operations.

I would recommend that all basic IO calls have logging statements next to them, capable of reporting the amount of IO performed: both the number of IO operations and the number of bytes transferred. IO is typically so costly that one null call or if statement (when logging is not turned on) for each IO performed is not at all significant. On the other hand, it is incredibly useful to be able to determine at any time whether IO is causing a performance problem. Typically, IO performance depends on the configuration of the system and on resources outside the application. So if an unusual configuration causes IO to be dramatically more expensive, this can be easily missed in testing and difficult to determine (especially remotely) unless you have an IO-monitoring capability built into your application.
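As an illustration, here is a minimal sketch of such a wrapper (the class name CountingFileInputStream and its counters are hypothetical, not part of the JDK). It subclasses FileInputStream so it can be used in place of the original class, counts both the number of read operations and the number of bytes transferred, and shows how cheap the logging guard is when logging is turned off:

    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.IOException;

    // Hypothetical drop-in replacement for FileInputStream that counts
    // IO operations and bytes transferred. A real version would use
    // AtomicLong for thread safety and wrap FileOutputStream similarly.
    public class CountingFileInputStream extends FileInputStream {

        private static long readOps = 0;     // number of read operations
        private static long bytesRead = 0;   // number of bytes transferred
        public static volatile boolean logging = false;

        public CountingFileInputStream(String name) throws FileNotFoundException {
            super(name);
        }

        public int read() throws IOException {
            int b = super.read();
            readOps++;
            if (b != -1) bytesRead++;
            // When logging is off, the only per-IO overhead is this test.
            if (logging) System.err.println("read: 1 byte, " + report());
            return b;
        }

        public int read(byte[] buf) throws IOException {
            return read(buf, 0, buf.length);  // funnel into the counted method
        }

        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            readOps++;
            if (n > 0) bytesRead += n;
            if (logging) System.err.println("read: " + n + " bytes, " + report());
            return n;
        }

        public static String report() {
            return "total ops=" + readOps + ", total bytes=" + bytesRead;
        }
    }

Calling CountingFileInputStream.report() at any point in a run then tells you immediately whether IO volume is a plausible cause of a performance problem, even on a remote system.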

8.6 Compression

A colleague of mine once installed a compression utility on his desktop machine that compressed the entire disk. The utility worked as a type of disk driver: accesses to the disk went through the utility, and every read and write was decompressed or compressed transparently to the rest of the system, and to the user. My colleague was expecting the system to run slower, but needed the extra disk space and was willing to put up with a slower system.

What he actually found was that his system ran faster! It turned out that the major bottleneck in his system was disk throughput, and by making most files smaller (averaging half the previous size), everything was moving between memory and disk much more quickly. The CPU had plenty of spare cycles to handle the compression and decompression procedures because it was otherwise waiting for disk transfers to complete.

This illustrates how the overhead of compression can be outweighed by the benefits of reducing IO. The system described obviously had a disk that was relatively too slow in comparison to the CPU's processing power. But this is quite common. Disk throughput has not improved nearly as fast as CPUs have increased in speed, and this divergent trend is set to continue for some time. The same is true for networks. Although networks do tend to have a huge jump in throughput with each generation, this jump tends to be offset by the much larger volumes of data being transferred. Furthermore, network-mounted disks are increasingly common, and the double performance hit from accessing a disk over a network is surely a prime candidate for increasing speed using compression.

On the other hand, if a system has a fully loaded CPU, adding compression can make things worse. This means that when you control the environment (servers, servlets, etc.), you can probably specify precisely, by testing, whether or not to use compression in your application to improve performance. When the environment is unknown, the situation is more complex. One suggestion is to write IO wrapper classes that handle compressed and uncompressed IO automatically on the fly. Your application can then test whether any particular IO destination has better performance using compression, and automatically use compression when called for.

One final thing to note about compressed data is that it is not always necessary to decompress the data in order to work with it. As an example, if you are using 2-Ronnies compression,[18] the text "Hello. Have you any eggs? No, we haven't any eggs" is compressed into "LO. F U NE X? 9, V FN NE X."

[18] The Two Ronnies was a British comedy show that featured very inventive comedy sketches, many based on word play. One such sketch involved a restaurant scene where all the characters spoke only in letters and numbers, joining the letters up in such a way that they sounded like words. The mapping for some of the words to letters was as follows: have = F, you = U, any = NE, eggs = X, hello = LO, no = 9, yes = S, we = V, haven't = FN, ham = M, and = N.

Now, if I want to search the text to see if it includes the phrase "any eggs", I do not actually need to decompress the compressed text. Instead, I compress the search string "any eggs" using 2-Ronnies compression into "NE X", and I can now use that compressed search string to search directly on the compressed text.

When applied to objects or data, this technique requires some effort. You need to ensure that any small data chunk compresses in the same way both on its own and as part of a larger volume of data containing that data chunk. If this is not the case, you may need to break objects and searchable data into fields that are individually compressed.
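To make the idea concrete, here is a toy sketch of the 2-Ronnies search (the class and method names are my own). Because each word is encoded independently, a phrase compresses identically on its own and inside a longer text, which is exactly the property the technique needs:

    import java.util.HashMap;
    import java.util.Map;

    // Toy word-by-word "2-Ronnies" compressor. A compressed query can be
    // matched directly against the compressed text, with no decompression.
    public class TwoRonniesSearch {

        private static final Map<String, String> CODES = new HashMap<>();
        static {
            CODES.put("hello", "LO");  CODES.put("have", "F");
            CODES.put("havent", "FN"); CODES.put("you", "U");
            CODES.put("any", "NE");    CODES.put("eggs", "X");
            CODES.put("no", "9");      CODES.put("yes", "S");
            CODES.put("we", "V");      CODES.put("ham", "M");
            CODES.put("and", "N");
        }

        // Encode each word; words without a code pass through unchanged.
        static String compress(String text) {
            StringBuilder out = new StringBuilder();
            for (String word : text.toLowerCase().split("\\s+")) {
                if (out.length() > 0) out.append(' ');
                out.append(CODES.getOrDefault(word, word));
            }
            return out.toString();
        }

        public static void main(String[] args) {
            String text  = compress("hello have you any eggs no we havent any eggs");
            String query = compress("any eggs");       // "NE X"
            System.out.println(text);                  // LO F U NE X 9 V FN NE X
            System.out.println(text.contains(query));  // true: no decompression needed
        }
    }

Note that the match works only because word boundaries survive compression; a stream compressor such as GZIP does not preserve them, which is why the keys-compressed-individually scheme described next is the more realistic application.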
There are several advantages to this technique of searching directly against compressed data:

• There is no need to decompress a large amount of data.
• Searches are actually quicker because the search is against a smaller volume of data.
• More data can be held in memory simultaneously, since it is compressed, which can be especially important when searching through large volumes of disk-stored data.

It is rarely possible to search for compressed substrings directly in compressed data, because of the way most compression algorithms use tables covering the whole dataset. However, this scheme has been used to selectively query for data locations. For this usage, unique data keys are compressed separately from the rest of the data, and a pointer is stored next to each compressed key. This produces a compressed index table that can be searched without decompressing the keys, because the compression algorithm is applied separately to each key. This scheme allows compressed keys to be searched directly to identify the location of the corresponding data.
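Here is a minimal in-memory sketch of such a compressed index table (all names are hypothetical; I use java.util.zip.Deflater as the per-key compressor, on the assumption that it produces identical output for identical input and settings). Each key is deflated on its own and mapped to the location of its data; a lookup deflates the probe key and compares compressed bytes to compressed bytes:

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.zip.Deflater;

    // Sketch of a compressed index table: each key is compressed on its
    // own (never as part of a larger stream), so a probe key compressed
    // the same way can be compared byte-for-byte against stored keys
    // without ever decompressing them.
    public class CompressedKeyIndex {

        // Maps compressed key bytes to the location of the data record.
        private final Map<ByteKey, Long> index = new HashMap<>();

        public void put(String key, long dataOffset) {
            index.put(new ByteKey(compress(key)), dataOffset);
        }

        // Lookup compresses the probe key, then compares compressed to compressed.
        public Long find(String key) {
            return index.get(new ByteKey(compress(key)));
        }

        private static byte[] compress(String key) {
            Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
            deflater.setInput(key.getBytes(StandardCharsets.UTF_8));
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[64];
            while (!deflater.finished()) {
                out.write(buf, 0, deflater.deflate(buf));
            }
            deflater.end();
            return out.toByteArray();
        }

        // Gives byte arrays value-based equals/hashCode for use as map keys.
        private static final class ByteKey {
            private final byte[] bytes;
            ByteKey(byte[] bytes) { this.bytes = bytes; }
            public boolean equals(Object o) {
                return o instanceof ByteKey && Arrays.equals(bytes, ((ByteKey) o).bytes);
            }
            public int hashCode() { return Arrays.hashCode(bytes); }
        }

        public static void main(String[] args) {
            CompressedKeyIndex idx = new CompressedKeyIndex();
            idx.put("customer-00123", 40960L);
            System.out.println(idx.find("customer-00123")); // 40960
            System.out.println(idx.find("customer-00124")); // null
        }
    }

In the disk-based scheme the text describes, the compressed key and its pointer would be stored side by side in the index file; the HashMap here simply stands in for that table.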

8.7 Performance Checklist