Appropriate Data Structures and Algorithms

[2] The classic reference is The Art of Computer Programming by Donald Knuth (Addison Wesley). A more Java-specific book is Data Structures and Algorithm Analysis in Java by Mark Weiss (Peachpit Press).

When tuning, you often need to switch one implementation of a class for another, more optimal implementation. Switching data structures is easier because you are in an object-oriented environment, so you can usually replace one or a few classes with different implementations while keeping all the interfaces and signatures the same.

When tuning algorithms, one factor that should pop to the front of your mind concerns the scaling characteristics of the algorithms you use. For example, bubblesort is an O(n²) algorithm, while quicksort is O(n log n) on average. (The concept of order-of-magnitude statistics is described in Section 9.3.) This tells you nothing about absolute times for using either of these algorithms for sorting elements, but it does clearly tell you that quicksort has the better scaling characteristics, and so is likely to be the better candidate as your collections increase in size. Similarly, hash tables have an O(1) searching algorithm, where an array requires O(n) searching.
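To make the scaling point concrete, here is a minimal sketch that counts the comparisons a plain bubblesort performs as the input grows. The class and method names (ScalingSketch, bubbleSort, descending) are hypothetical, chosen only for this illustration, and the sort deliberately has no early-exit optimization so that the count depends only on the number of elements.

```java
class ScalingSketch {
    /** Sorts the array in place with a plain bubblesort and returns the number of comparisons made. */
    static long bubbleSort(int[] a) {
        long comparisons = 0;
        for (int end = a.length - 1; end > 0; end--) {
            for (int i = 0; i < end; i++) {
                comparisons++;
                if (a[i] > a[i + 1]) {
                    int tmp = a[i];
                    a[i] = a[i + 1];
                    a[i + 1] = tmp;
                }
            }
        }
        return comparisons;
    }

    /** Builds a reverse-ordered input: n, n-1, ..., 1. */
    static int[] descending(int n) {
        int[] a = new int[n];
        for (int i = 0; i < n; i++) {
            a[i] = n - i;
        }
        return a;
    }

    public static void main(String[] args) {
        // Double the input size each time and report the comparison count.
        for (int n = 1000; n <= 4000; n *= 2) {
            System.out.println(n + " elements: " + bubbleSort(descending(n)) + " comparisons");
        }
    }
}
```

This version always makes n(n-1)/2 comparisons, so 1,000 elements need 499,500 comparisons and 2,000 elements need 1,999,000: doubling the collection roughly quadruples the work. The count says nothing about absolute time, only about how the cost grows.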

11.1 Collections

Collections are the data structures that are most easily altered for performance-tuning purposes. Using the correct or most appropriate collection class can improve performance with little change to code. For example, if a large ordered collection has elements frequently deleted or inserted throughout it, it can usually provide better performance if based on a linked list rather than an array. On the other hand, a static (unchanging) collection that needs to be accessed by index performs better with an underlying implementation that is an array.

If the data is large and insertions are allowed (for example, a text buffer), then a common halfway measure is to use a linked list of arrays. This structure copies data within a single array when data is inserted or deleted. When an array gets filled, the collection inserts a new empty array immediately after the full array, and moves some data from the full to the empty array so that both old and new arrays have space.

A converse structure provides optimized indexed access to a linked-list structure by holding an array of a subset of the link nodes (e.g., every 20th node). This structure allows for quick navigation to the indexed nodes, and then slower nodal access to nodes in between.[3] The result is a linked-list implementation that is much faster at index access, though it occupies more space.

[3] Skip lists are an implementation of this concept. See "The Elegant and Fast Skip List" by T. Wenger, Java Pro, April-May 1998.

It is sometimes useful to provide two collections holding the same data, so that the data can be accessed using the most appropriate and fastest procedure. This is common for indexed data (database-type indexes as opposed to array indexes), but entails extra overhead at the build stage. In a similar way, it may be that a particular data set is best held in two or more different collections over its lifetime, but with only one collection being used at any one time.
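As a rough sketch of holding one data set in different collections over its lifetime, the hypothetical class below (BuildThenConvert, with helper names buildSorted and freeze invented for this illustration) uses a java.util.LinkedList while many insertions are happening, then copies the data once into an array-backed java.util.ArrayList for indexed reads. It assumes the build phase really is finished before the copy is made.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.ListIterator;

class BuildThenConvert {
    /** Build phase: insert each value at its sorted position in a linked list. */
    static LinkedList<Integer> buildSorted(int[] values) {
        LinkedList<Integer> list = new LinkedList<>();
        for (int v : values) {
            ListIterator<Integer> it = list.listIterator();
            while (it.hasNext()) {
                if (it.next() > v) {
                    it.previous(); // step back so add() inserts before the larger element
                    break;
                }
            }
            it.add(v); // inserting through the iterator shifts no other elements
        }
        return list;
    }

    /** Access phase: a single copy into an array-backed list for fast indexed reads. */
    static List<Integer> freeze(LinkedList<Integer> built) {
        return new ArrayList<>(built);
    }

    public static void main(String[] args) {
        List<Integer> readOnlyView = freeze(buildSorted(new int[] {5, 3, 9, 4}));
        System.out.println(readOnlyView.get(2)); // indexed access on the array-backed copy
    }
}
```

The single conversion costs one pass over the data, so it only pays off when the collection is read by index many times afterward.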
For example, you may use a linked-list implementation of a vector-type collection while building the collection because your collection requires many insertions while it is being built. However, this provides suboptimal random access. After the build is completed, the collection can be converted into one based on an array, thus speeding up access.

It can be difficult to identify optimal algorithms for particular data structures. For example, in the Java 2 java.util.Collections.sort() method, a linked list is first converted to an array in order to sort it. This is detrimental to performance, and it would be significantly faster to sort a linked list directly using a merge sort.[4] In any case, frequently converting between collections and arrays is likely to cause performance problems.

[4] See "Sorting and Searching Linked Lists in Java" by John Boyer, Dr. Dobb's Journal, May 1998.

The fastest ordered collections available in Java are plain arrays (e.g., int[], Object[], etc.). The drawback to using these directly is the lack of object-oriented methodology you can apply. Arrays are not proper classes that can be extended. However, I occasionally find that there are situations when I want to pass these raw arrays directly between several classes, rather than wrap the arrays in a class with the behavior required. This is unfortunate in design terms, but does provide speed. An example would be in some communications layers. Here, there are several layers of protocols you need to pass your message through before it is transmitted: for example, a compression layer and an encryption layer. If you use an object as a message being passed through these layers, each layer has to request the message contents (copying it), change the contents, and then assign back the new contents (copying again). An alternative is to implement the content-manipulation methods in the message object itself, which is not a very extensible architecture.
Assuming that you use an array to hold the contents, you can allow the message-contents array itself to be passed directly to the other compression and encryption layer objects. This provides a big speedup, avoiding several copies.

String objects also illustrate the point. If you want to iterate over the characters in a String, you must either repeatedly call String.charAt() or copy the characters into your own array using String.getChars(), and then iterate over them. Depending on the size of the String and how many times you iterate through the characters, one or the other of these methods is quicker, but if you could iterate directly on the underlying char array, you would avoid the repeated method calls and the copy (see Chapter 5).

A final point is that the collections that come with Java and other packages are usually not type-specific. This generality comes at the cost of performance. For example, if you are using java.util.Vector to hold only String objects, then you have to keep casting to String each time you access elements. If you reimplement the Vector class yourself using an underlying String[] array, and then change signature parameters and return types of methods from Object to String, the reimplemented class is faster. It is also clearer to use: you get rid of all those casts from your code. The cost is that you lose the general collection interface (see Section 3.2 for an example). It is straightforward to test the performance costs of generalized collections compared to specialized collections.
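A type-specific reimplementation of the kind described above might look something like the following minimal sketch. The class name StringVector and its initial capacity are assumptions made for this illustration; the method names addElement and elementAt mirror java.util.Vector, but unlike Vector this sketch is not synchronized and implements none of the collection interfaces.

```java
class StringVector {
    // Typed backing array: elements come back as String with no cast.
    private String[] elements = new String[10];
    private int size;

    /** Appends a String, doubling the backing array when it is full. */
    public void addElement(String s) {
        if (size == elements.length) {
            String[] bigger = new String[size * 2];
            System.arraycopy(elements, 0, bigger, 0, size);
            elements = bigger;
        }
        elements[size++] = s;
    }

    /** Returns the element at the given index as a String. */
    public String elementAt(int index) {
        if (index >= size) {
            throw new ArrayIndexOutOfBoundsException(index);
        }
        return elements[index]; // a negative index is rejected by the array access itself
    }

    public int size() {
        return size;
    }

    public static void main(String[] args) {
        StringVector names = new StringVector();
        names.addElement("alpha");
        names.addElement("beta");
        String first = names.elementAt(0); // no (String) cast needed
        System.out.println(first);
    }
}
```

Because the signatures use String rather than Object, calling code cannot pass this class to anything expecting a general collection, which is the tradeoff the text describes.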
Access that does not involve a cast takes place at essentially the same speed, i.e., all the following accesses take the same time:

    int i = integerArrayList.get(someIndex);
    String s = stringArrayList.get(someIndex);
    Object o = objectArrayList.get(someIndex);

But the cost of a cast can make the access take 50% longer:

    // It can take 50% longer to access the string because of the cast
    String s = (String) objectArrayList.get(someIndex);

Update time can also be significantly faster. Updates to underlying arrays of primitive data types can be 40% faster than to object arrays.[5] The biggest difference is when a primitive data type needs to be wrapped and unwrapped in order to store into an array:

    // Simpler and much faster using a specialized IntArrayList
    integerArrayList.set(someIndex, someNum);
    int num = integerArrayList.get(someIndex);

    // Using a generalized ArrayList requires wrapping, casting, and unwrapping
    integerArrayList.set(someIndex, new Integer(someNum));
    int num = ((Integer) integerArrayList.get(someIndex)).intValue();

[5] Even updating a typed object array with objects of the given type (e.g., Strings into an underlying String[] array of an array list) seems to be faster by about 10%. The only reason I can think of for this is that the JIT compiler manages to optimize the update to the specialized array.

For this example, the cost of creating a new Integer object to wrap the int makes setting values take more than ten times longer when using the generalized array. Accessing is not as bad, taking only twice as long after including the extra cast and method access to get to the int.
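The text refers to a specialized IntArrayList without showing one, so the following is only a minimal sketch of what such a class could look like, assuming an int[] backing array that doubles when full. The field and method layout is hypothetical. The main method contrasts it with a generalized java.util.ArrayList; it uses generics and Integer.valueOf, which postdate the Java 2 code quoted above and remove the explicit cast but not the wrapping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class IntArrayList {
    private int[] data;
    private int size;

    IntArrayList(int initialCapacity) {
        data = new int[Math.max(1, initialCapacity)];
    }

    /** Appends a value, doubling the backing array when it is full. */
    void add(int value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, size * 2);
        }
        data[size++] = value;
    }

    /** Reads an int directly from the array: no cast, no unwrapping. */
    int get(int index) {
        checkIndex(index);
        return data[index];
    }

    /** Writes an int directly into the array: no wrapper object involved. */
    void set(int index, int value) {
        checkIndex(index);
        data[index] = value;
    }

    int size() {
        return size;
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
    }

    public static void main(String[] args) {
        IntArrayList specialized = new IntArrayList(4);
        List<Integer> generalized = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            specialized.add(i);                  // stores the primitive itself
            generalized.add(Integer.valueOf(i)); // stores a reference to an Integer wrapper
        }
        specialized.set(3, 42);
        generalized.set(3, Integer.valueOf(42));
        int a = specialized.get(3);
        int b = generalized.get(3).intValue();   // unwrap to get back to the int
        System.out.println(a + " " + b);         // prints "42 42"
    }
}
```

Timing the two set and get paths in a loop on your own JVM is the straightforward way to check how large the difference is in practice; the ratios quoted in the text come from the author's measurements and will vary by VM.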

11.2 Java 2 Collections