
        for (int j = 0; j < CAPACITY; j++)
          l.set(j, Boolean.TRUE);
      System.out.println(ltype + " took " + (System.currentTimeMillis() - time));
    }
  }

The normalized results from running this test are shown in Table 10-3.

Table 10-3, Timings of the Various Array-Manipulation Tests, Normalized to the JDK 1.2 Vector Test

                       1.2    1.2 no JIT    1.3       HotSpot 1.0    HotSpot 2nd Run
  Vector               100    179           25 [2]    64             32
  ArrayList            23     382 [3]       22        17             24
  Wrapped ArrayList    170    797           36        72             39

[2] The 1.3 VM manages to execute the initial Vector test slightly faster than the ArrayList. But unfortunately, the VM then appears to unoptimize the Vector test, making all subsequent test runs slower.

[3] I have no idea why the non-JIT VM runs the ArrayList slower. The ArrayList methods are defined with slightly more testing, but I would not have thought there was enough to make such a difference.

There are some reports that the latest VMs have negligible overheads for synchronized methods; however, my own tests show that synchronized methods continue to incur significant overheads in VMs up to and including JDK 1.2. HotSpot has at times shown different behavior. My tests using HotSpot show that synchronized methods can sometimes be optimized to run faster than unsynchronized versions. However, by varying the order and type of tests, it becomes clear that HotSpot is very inconsistent in its optimizations. This variation can exist for a number of different reasons: profiler overheads, an aggressive compiler cutting in, deoptimizations that are occasionally necessary, etc. The variability in this particular test probably comes from speculatively inlining methods and sometimes having to undo the speculative inline. This can result in tests where a synchronized method apparently gets optimized more effectively than a nonsynchronized method.

The results from running the ListTesting class just defined in a HotSpot VM show how difficult it can be to get consistent results from HotSpot. For my test results, I take the first three and next three results, but I also find that altering the order of the tests can make a big difference to the times:

  Vector took 5548
  sync ArrayList took 6239
  ArrayList took 1472
  Vector took 2734
  ArrayList took 2103
  sync ArrayList took 3385
  Vector took 7811
  ArrayList took 6469
  sync ArrayList took 3696
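The sensitivity to test order described above can be explored with a small harness that times each list type in every position of the run. The following is only a sketch and is not the ListTesting class from the text: the class name ListOrderTest, the constants, and the helper methods make and time are hypothetical, and it assumes the "sync ArrayList" case is an ArrayList wrapped with Collections.synchronizedList. Absolute timings depend entirely on the VM and machine used.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Vector;

// Hypothetical harness: times the same set() loop on three list types,
// rotating the order so that each type is measured first, second, and third.
class ListOrderTest {
    static final int CAPACITY = 100_000;   // assumed size, chosen for a quick run
    static final int REPEAT = 20;          // passes over the list per timing
    static final String[] NAMES = {"Vector", "ArrayList", "sync ArrayList"};

    // Create a fresh, empty list of the requested kind (index into NAMES).
    static List<Boolean> make(int kind) {
        switch (kind) {
            case 0:  return new Vector<Boolean>(CAPACITY);
            case 1:  return new ArrayList<Boolean>(CAPACITY);
            default: return Collections.synchronizedList(new ArrayList<Boolean>(CAPACITY));
        }
    }

    // Fill the list, then time REPEAT passes of set(); returns elapsed milliseconds.
    static long time(List<Boolean> l) {
        for (int j = 0; j < CAPACITY; j++)
            l.add(Boolean.FALSE);
        long start = System.currentTimeMillis();
        for (int i = 0; i < REPEAT; i++)
            for (int j = 0; j < CAPACITY; j++)
                l.set(j, Boolean.TRUE);
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        // Rotate the starting point so each list type is timed in each position.
        for (int rotation = 0; rotation < NAMES.length; rotation++) {
            for (int k = 0; k < NAMES.length; k++) {
                int kind = (rotation + k) % NAMES.length;
                System.out.println(NAMES[kind] + " took " + time(make(kind)));
            }
        }
    }
}
```

Comparing the three timings reported for each list type gives a rough indication of how much the position in the run affects the measurement on a given VM.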

10.4.2 Avoiding Serialized Execution

One way of completely avoiding the requirement to synchronize methods is to use separate objects and storage structures for different threads. Care must be taken to avoid calling synchronized methods from your own methods, or you will lose all your carefully built benefits. For example, Hashtable access and update methods are synchronized, so using one in your storage structure can eliminate any desired benefit. Prior to JDK 1.2, there is no unsynchronized hash table in the JDK, and you have to build or buy your own unsynchronized version. From JDK 1.2, unsynchronized collection classes are available, including Map classes.

As an example of implementing this framework, I look at a simple set of global counters, keyed on a numeric identifier. Basically, the concept is a global counter to which any thread can add a number. This concept is extended slightly to allow for multiple counters, each counter having a different key. String keys are more useful, but for simplicity I use integer keys in this example. To use String keys, an unsynchronized Map replaces the arrays. The simple, straightforward version of the class looks like this:

  package tuning.threads;

  public class Counter1
  {
    //For simplicity make just 10 counters
    static long[] vec = new long[10];

    public static void initialize(int key)
    {
      vec[key] = 0;
    }

    //And also just make key the index into the array
    public static void addAmount(int key, long amount)
    {
      //This is not atomically synchronized since we do an array access
      //together with an update, which are two operations.
      vec[key] += amount;
    }

    public static long getAmount(int key)
    {
      return vec[key];
    }
  }

This class is basic and easy to understand. Unfortunately, it is not thread-safe, and leads to corrupt counter values when used. A test run on a particular single-CPU configuration with four threads running simultaneously, each adding the number 1 to the same key 10 million times, gives a final counter value of around 26 million instead of the correct 40 million.
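The lost updates described above can be reproduced with a short driver. This is a minimal sketch, not the test program used for the results in the text: the class name RaceDemo and the method run are hypothetical, and the unsynchronized addAmount simply mirrors the one in Counter1. The final value varies from run to run and from machine to machine, so no particular output should be expected.

```java
// Hypothetical driver showing lost updates from an unsynchronized += on shared storage.
class RaceDemo {
    static final long[] vec = new long[10];

    // Same unsynchronized read-modify-write as Counter1.addAmount.
    static void addAmount(int key, long amount) {
        vec[key] += amount;
    }

    // Start the given number of threads, each adding 1 to key 0 'adds' times,
    // wait for them all to finish, and return the final counter value.
    static long run(int threads, final int adds) {
        vec[0] = 0;
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int n = 0; n < adds; n++)
                    addAmount(0, 1);
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers)
                t.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return vec[0];
    }

    public static void main(String[] args) {
        // Four threads, 10 million additions each, as in the test described in the text.
        long total = run(4, 10_000_000);
        System.out.println("expected 40000000, got " + total);
    }
}
```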
On the positive side, the test is blazingly fast, taking very little time to complete and get the wrong answer. [4]

[4] The results discussed are for one particular test run. On other test runs, the final value is different, but it is almost never the correct 40 million value. If I use a faster CPU or a lower total count, the threads can get serialized by the operating system by finishing quickly enough, leading to consistently correct results for the total count. But those correct results are an artifact of the environment, and are not guaranteed to be produced. Other system loads and environments generate corrupt values.

To get the correct behavior, you need to synchronize the update methods in the class. Here is Counter2, which is just Counter1 with the methods synchronized:

  package tuning.threads;

  public class Counter2
  {
    //For simplicity make just 10 counters
    static long[] vec = new long[10];

    public static synchronized void initialize(int key)
    {
      vec[key] = 0;
    }

    //And also just make key the index into the array
    public static synchronized void addAmount(int key, long amount)
    {
      //Now the method is synchronized, so we will always
      //complete any particular update
      vec[key] += amount;
    }

    public static synchronized long getAmount(int key)
    {
      return vec[key];
    }
  }

Now you get the correct answer of 40 million for the same test as before. Unfortunately, the test takes 20 times longer to execute (see Table 10-4). Avoiding the synchronization is going to be more work. To do this, create a set of counters, one for each thread, and update each thread's counter separately. [5] When you want to see the global total, you need to sum the counters across the threads. The class definition follows:

[5] Although ThreadLocal variables might seem ideal to ensure the allocation of different counters for different threads, they are of no use here.
The underlying implementation for ThreadLocal objects uses a synchronized map to allocate per-thread objects, and that defeats the intention to avoid synchronization completely.

  package tuning.threads;

  public class Counter3
  {
    //support up to 10 threads of 10 counters
    static long vec[][] = new long[10][];

    public static synchronized void initialize(CounterTest t)
    {
      //For simplicity make just 10 counters per thread
      vec[t.num] = new long[10];
    }

    public static void addAmount(int key, long amount)
    {
      //Use our own threads to make the mapping easier,
      //and to illustrate the technique of customizing threads.
      //For generic Thread objects, could use an unsynchronized
      //HashMap or other Map,
      //Or use ThreadLocal if JDK 1.2 is available

      //We use the num instance variable of the CounterTest
      //object to determine which array we are going to increment.
      //Since each thread is different, there is no conflict.
      //Each thread updates its own counter.
      long[] arr = vec[((CounterTest) Thread.currentThread()).num];
      arr[key] += amount;
    }

    public static synchronized long getAmount(int key)
    {
      //The current amount must be aggregated across the thread
      //storage arrays. This needs to be synchronized, but
      //does not matter here as I just call it at the end.
      long amount = 0;
      for (int threadnum = vec.length-1; threadnum >= 0; threadnum--)
      {
        long[] arr = vec[threadnum];
        if (arr != null)
          amount += arr[key];
      }
      return amount;
    }
  }

Using Counter3, you get the correct answer for the global counter, and the test is quicker than Counter2. The relative timings for a range of VMs are listed in Table 10-4.

Table 10-4, Timings of the Various Counter Tests, Normalized to the JDK 1.2 Counter2 Test

1.2 1.2 no JIT