
          for (int j = 0; j < CAPACITY; j++)
            l.set(j, Boolean.TRUE);
        System.out.println(ltype + " took " +
          (System.currentTimeMillis() - time));
      }
    }

The normalized results from running this test are shown in Table 10-3.

Table 10-3, Timings of the Various Array-Manipulation Tests, Normalized to the JDK 1.2 Vector Test

                         1.2    1.2 no JIT    1.3       HotSpot 1.0    HotSpot 2nd Run
    Vector               100    179           25 [2]    64             32
    ArrayList            23     382 [3]       22        17             24
    Wrapped ArrayList    170    797           36        72             39

[2] The 1.3 VM manages to execute the initial Vector test slightly faster than the ArrayList. But unfortunately, the VM then appears to unoptimize the Vector test, making all subsequent test runs slower.

[3] I have no idea why the non-JIT VM runs the ArrayList slower. The ArrayList methods are defined with slightly more testing, but I wouldn't have thought there was enough to make such a difference.

There are some reports that the latest VMs have negligible overheads for synchronized methods; however, my own tests show that synchronized methods continue to incur significant overheads in VMs up to and including JDK 1.2. HotSpot has at times shown different behavior. My tests using HotSpot show that synchronized methods can sometimes be optimized to run faster than unsynchronized versions. However, by varying the order and type of tests, it becomes clear that HotSpot is very inconsistent in its optimizations.

This variation can exist for a number of different reasons: profiler overheads, the aggressive compiler cutting in, deoptimizations that are occasionally necessary, etc. The variability in this particular test probably comes from speculatively inlining methods and sometimes having to undo the speculative inline. This can result in tests where a synchronized method apparently gets optimized more effectively than a nonsynchronized method.

The results from running the ListTesting class just defined in a HotSpot VM show how difficult it can be to get consistent results from HotSpot. For my test results, I take the first three and next three results, but I also find that altering the order of the tests can make a big difference to the times:

    Vector took 5548
    sync ArrayList took 6239
    ArrayList took 1472
    Vector took 2734
    ArrayList took 2103
    sync ArrayList took 3385
    Vector took 7811
    ArrayList took 6469
    sync ArrayList took 3696
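The order sensitivity described above is easy to probe with a small harness that rotates the test order on each pass. The following is only a sketch under stated assumptions, not the ListTesting class itself: the class name ListTimingSketch, the helper newList, and the CAPACITY and REPEAT values are all invented for illustration, and the "sync ArrayList" case is assumed to be an ArrayList wrapped with Collections.synchronizedList.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Vector;

class ListTimingSketch {
    static final int CAPACITY = 10_000;  // assumed list size, not the book's value
    static final int REPEAT = 500;       // assumed number of passes over the list

    // Build a pre-filled list of the named kind so that set(j, ...) is legal.
    static List<Boolean> newList(String ltype) {
        List<Boolean> l;
        if (ltype.equals("Vector")) {
            l = new Vector<Boolean>(CAPACITY);
        } else if (ltype.equals("ArrayList")) {
            l = new ArrayList<Boolean>(CAPACITY);
        } else {
            // Assumed meaning of "sync ArrayList": a synchronized wrapper.
            l = Collections.synchronizedList(new ArrayList<Boolean>(CAPACITY));
        }
        for (int j = 0; j < CAPACITY; j++) {
            l.add(Boolean.FALSE);
        }
        return l;
    }

    // Time REPEAT passes of set() over every element; returns elapsed milliseconds.
    static long time(List<Boolean> l) {
        long start = System.currentTimeMillis();
        for (int i = REPEAT; i > 0; i--) {
            for (int j = 0; j < CAPACITY; j++) {
                l.set(j, Boolean.TRUE);
            }
        }
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        String[] ltypes = {"Vector", "ArrayList", "sync ArrayList"};
        // Rotate the starting test on each pass so order effects become visible.
        for (int run = 0; run < ltypes.length; run++) {
            for (int k = 0; k < ltypes.length; k++) {
                String ltype = ltypes[(k + run) % ltypes.length];
                System.out.println(ltype + " took " + time(newList(ltype)));
            }
        }
    }
}
```

Comparing the timings for the same list type across the three passes gives a rough feel for how much of a measured difference comes from test order rather than from synchronization.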

10.4.2 Avoiding Serialized Execution

One way of completely avoiding the requirement to synchronize methods is to use separate objects and storage structures for different threads. Care must be taken to avoid calling synchronized methods from your own methods, or you will lose all your carefully built benefits. For example, Hashtable access and update methods are synchronized, so using one in your storage structure can eliminate any desired benefit. Prior to JDK 1.2, there is no unsynchronized hash table in the JDK, and you have to build or buy your own unsynchronized version. From JDK 1.2, unsynchronized collection classes are available, including Map classes.

As an example of implementing this framework, I look at a simple set of global counters, keyed on a numeric identifier. Basically, the concept is a global counter to which any thread can add a number. This concept is extended slightly to allow for multiple counters, each counter having a different key. String keys are more useful, but for simplicity I use integer keys in this example. To use String keys, an unsynchronized Map replaces the arrays.

The simple, straightforward version of the class looks like this:

    package tuning.threads;

    public class Counter1
    {
      //For simplicity make just 10 counters
      static long[] vec = new long[10];

      public static void initialize(int key)
      {
        vec[key] = 0;
      }

      //And also just make key the index into the array
      public static void addAmount(int key, long amount)
      {
        //This is not atomically synchronized since we do an array access
        //together with an update, which are two operations.
        vec[key] += amount;
      }

      public static long getAmount(int key)
      {
        return vec[key];
      }
    }

This class is basic and easy to understand. Unfortunately, it is not thread-safe, and leads to corrupt counter values when used. A test run on a particular single-CPU configuration with four threads running simultaneously, each adding the number 1 to the same key 10 million times, gives a final counter value of around 26 million instead of the correct 40 million.
[4] On the positive side, the test is blazingly fast, taking very little time to complete and get the wrong answer.

[4] The results discussed are for one particular test run. On other test runs, the final value is different, but it is almost never the correct 40 million value. If I use a faster CPU or a lower total count, the threads can get serialized by the operating system by finishing quickly enough, leading to consistently correct results for the total count. But those correct results are an artifact of the environment, and are not guaranteed to be produced. Other system loads and environments generate corrupt values.

To get the correct behavior, you need to synchronize the update methods in the class. Here is Counter2, which is just Counter1 with the methods synchronized:

    package tuning.threads;

    public class Counter2
    {
      //For simplicity make just 10 counters
      static long[] vec = new long[10];

      public static synchronized void initialize(int key)
      {
        vec[key] = 0;
      }

      //And also just make key the index into the array
      public static synchronized void addAmount(int key, long amount)
      {
        //Now the method is synchronized, so we will always
        //complete any particular update
        vec[key] += amount;
      }

      public static synchronized long getAmount(int key)
      {
        return vec[key];
      }
    }

Now you get the correct answer of 40 million for the same test as before. Unfortunately, the test takes 20 times longer to execute (see Table 10-4). Avoiding the synchronization is going to be more work. To do this, create a set of counters, one for each thread, and update each thread's counter separately. [5] When you want to see the global total, you need to sum the counters across the threads. The class definition follows:

[5] Although ThreadLocal variables might seem ideal to ensure the allocation of different counters for different threads, they are of no use here.
The underlying implementation for ThreadLocal objects uses a synchronized map to allocate per-thread objects, and that defeats the intention to avoid synchronization completely.

    package tuning.threads;

    public class Counter3
    {
      //support up to 10 threads of 10 counters
      static long vec[][] = new long[10][];

      public static synchronized void initialize(CounterTest t)
      {
        //For simplicity make just 10 counters per thread
        vec[t.num] = new long[10];
      }

      public static void addAmount(int key, long amount)
      {
        //Use our own threads to make the mapping easier,
        //and to illustrate the technique of customizing threads.
        //For generic Thread objects, could use an unsynchronized
        //HashMap or other Map,
        //Or use ThreadLocal if JDK 1.2 is available

        //We use the num instance variable of the CounterTest
        //object to determine which array we are going to increment.
        //Since each thread is different, there is no conflict.
        //Each thread updates its own counter.
        long[] arr = vec[((CounterTest) Thread.currentThread()).num];
        arr[key] += amount;
      }

      public static synchronized long getAmount(int key)
      {
        //The current amount must be aggregated across the thread
        //storage arrays. This needs to be synchronized, but
        //does not matter here as I just call it at the end.
        long amount = 0;
        for (int threadnum = vec.length-1; threadnum >= 0; threadnum--)
        {
          long[] arr = vec[threadnum];
          if (arr != null)
            amount += arr[key];
        }
        return amount;
      }
    }

Using Counter3, you get the correct answer for the global counter, and the test is quicker than Counter2. The relative timings for a range of VMs are listed in Table 10-4.

Table 10-4, Timings of the Various Counter Tests, Normalized to the JDK 1.2 Counter2 Test

                                   1.2    1.2 no JIT    1.3    HotSpot 1.0    HotSpot 2nd Run    1.1.6
    Counter2                       100    397           383    191            755                180
    Counter3                       70     384           175    156            190                95
    Counter1 (incorrect result)    5      116           10     78             17                 5

The serialized-execution avoidance class is a significant improvement on the synchronized case. The Counter2 timings can be extremely variable. This variation is generated from the nature of multithreaded context switching, together with the fact that the activity taking much of the time in this test is lock management. Switching is essentially unpredictable, and the amount of switching and where it occurs affects how often the VM has to release and reacquire locks in different threads. Nevertheless, across a number of measurements, Counter3 was always faster than Counter2, often several times faster.

The listed times were measured on a single-processor machine. Consider what happens on a multiprocessor machine where the threads can run on different CPUs (i.e., where the Java runtime and operating system support preemptive thread scheduling on separate CPUs). Counter3 (the serialized-execution avoidance class) is parallelized automatically and scales very nicely. This same test with Counter3, running on a four-CPU machine, tends towards one-quarter of the single-CPU time, assuming that the four CPUs have the same power as the single CPU we tested earlier. On the other hand, the synchronized version of the counter, Counter2, always has serialized execution (that's what synchronized does). Consequently, it does not scale and generally performs no better than in the single-CPU test (except for the advantage of running the OS on another CPU).
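The reason the per-thread approach can scale is that no thread ever touches another thread's storage while the updates are running; the only coordination happens when the totals are summed. A stripped-down, self-contained sketch of that idea follows. It is an illustration rather than the Counter3 class: the name PerThreadCounterSketch and the method total are hypothetical, each worker is handed its own row directly instead of looking it up through a custom Thread subclass, and the lambda syntax assumes a Java 8 or later VM.

```java
class PerThreadCounterSketch {
    // Run `threads` workers, each adding 1 to counter 0 of its own row
    // `repeat` times, then return the sum across all rows.
    static long total(int threads, final int repeat) {
        final long[][] vec = new long[threads][10]; // one row of 10 counters per thread
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final long[] mine = vec[t]; // this row is written by one thread only
            workers[t] = new Thread(() -> {
                for (int i = repeat; i > 0; i--) {
                    mine[0] += 1; // unsynchronized, but never contended
                }
            });
            workers[t].start();
        }
        try {
            // join() guarantees each worker's writes are visible before we sum.
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        long sum = 0;
        for (long[] row : vec) {
            sum += row[0];
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println("total: " + total(4, 10000000));
    }
}
```

Because the aggregation happens only after every worker has been joined, the sum is exact without any lock on the update path, which is the property that lets the updates proceed in parallel on separate CPUs.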

10.5 Timing Multithreaded Tests

I measured timings of the three Counter classes in the previous section using another class, CounterTest. This timing class illustrates some pitfalls you need to avoid when timing multithreaded applications, so I'll go into a little detail about the CounterTest definition.

The first naive implementation of CounterTest is quite simple. Just create a Thread subclass with the run() method running timed tests of the classes you are measuring. You need an extra instance variable for the Counter3 class, so the class can be defined as:

    package tuning.threads;

    public class CounterTest
      extends Thread
    {
      //instance variable to specify which thread we are.
      int num;

      //number of updates per thread; a static field so that run() can see it
      static int REPEAT = 10000000;

      public CounterTest(int threadnum)
      {
        super();
        num = threadnum;
      }

      //main forks four threads
      public static void main(String[] args)
      {
        if (args.length > 0)
          REPEAT = Integer.parseInt(args[0]);
        for (int i = 0; i < 4; i++)
          (new CounterTest(i)).start();
      }

      public void run()
      {
        Counter1.initialize(0);
        long time = System.currentTimeMillis();
        for (int i = REPEAT; i > 0; i--)
          Counter1.addAmount(0, 1);
        System.out.println("Counter1 count: " + Counter1.getAmount(0)
          + " time: " + (System.currentTimeMillis() - time));

        Counter2.initialize(0);
        time = System.currentTimeMillis();
        for (int i = REPEAT; i > 0; i--)
          Counter2.addAmount(0, 1);
        System.out.println("Counter2 count: " + Counter2.getAmount(0)
          + " time: " + (System.currentTimeMillis() - time));

        Counter3.initialize(this);
        time = System.currentTimeMillis();
        for (int i = REPEAT; i > 0; i--)
          Counter3.addAmount(0, 1);
        System.out.println("Counter3 count: " + Counter3.getAmount(0)
          + " time: " + (System.currentTimeMillis() - time));
      }
    }

Unfortunately, this class has two big problems. First, there is no way of knowing that the four threads are running the same test at the same time. With this implementation, it is perfectly possible that one thread is running the Counter1 test, while another has already finished that test and is now running the Counter2 test concurrently. This gives incorrect times for both tests, since the CPU is being used by another test while you measure the first test. And the synchronization costs are not measured properly, since the intention is to test the synchronization costs of running four threads using the same methods at the same time.

The second problem is with the times you are measuring. The timings are for each thread running its own threaded update to the Counter class. But you should be measuring the time from the first update in any thread to the last update in any thread.

One way to avoid the first pitfall is to synchronize the tests so that they are not started until all the threads are ready. Then all threads can be started at the same time. The second pitfall can be avoided by setting a global time at the start of the first update, then printing the time difference when the last thread finishes.

The full tuning.threads.CounterTest implementation with the correct handling for measurements can be found along with all the other classes from this book by clicking on the Examples link from the book's catalog page, http://www.oreilly.com/catalog/javapt.
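To make the two fixes concrete, here is one possible shape for a corrected harness. It is a sketch under stated assumptions and not the book's CounterTest implementation: it uses java.util.concurrent.CountDownLatch and lambdas, which require a much later JDK than the ones measured in this chapter, and the names SyncStartTimingSketch and runTest are invented. The shared counter stands in for Counter2. The clock starts just before all threads are released, which approximates "the start of the first update", and stops when the last thread has finished.

```java
import java.util.concurrent.CountDownLatch;

class SyncStartTimingSketch {
    private static long counter; // shared counter, guarded by the class lock

    static synchronized void addAmount(long amount) { counter += amount; }
    static synchronized long getAmount() { return counter; }
    static synchronized void reset() { counter = 0; }

    // Returns elapsed milliseconds from the common start to the last finish.
    static long runTest(int threads, final int repeat) {
        reset();
        final CountDownLatch ready = new CountDownLatch(threads); // all threads are up
        final CountDownLatch start = new CountDownLatch(1);       // common starting gun
        final CountDownLatch done = new CountDownLatch(threads);  // all threads finished
        for (int t = 0; t < threads; t++) {
            new Thread(() -> {
                ready.countDown();
                try {
                    start.await(); // do not begin until every thread is ready
                    for (int i = repeat; i > 0; i--) {
                        addAmount(1);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            }).start();
        }
        try {
            ready.await();
            long begin = System.currentTimeMillis(); // one global start time
            start.countDown();                       // release all threads together
            done.await();                            // wait for the last thread
            return System.currentTimeMillis() - begin;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while timing", e);
        }
    }

    public static void main(String[] args) {
        long ms = runTest(4, 10000000);
        System.out.println("count: " + getAmount() + " time: " + ms);
    }
}
```

Running each Counter class through a separate call of this kind also addresses the first pitfall, since no thread can move on to the next test while others are still in the current one.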

10.6 Atomic Access and Assignment