    for (int j = 0; j < CAPACITY; j++)
      l.set(j, Boolean.TRUE);
    System.out.println(ltype + " took " + (System.currentTimeMillis() - time));
  }
}
The normalized results from running this test are shown in Table 10-3.

Table 10-3, Timings of the Various Array-Manipulation Tests, Normalized to the JDK 1.2 Vector Test

                     1.2   1.2 no JIT   1.3      HotSpot 1.0   HotSpot 2nd Run
Vector               100   179          25 [2]   64            32
ArrayList            23    382 [3]      22       17            24
Wrapped ArrayList    170   797          36       72            39

[2] The 1.3 VM manages to execute the initial Vector test slightly faster than the ArrayList. But unfortunately, the VM then appears to unoptimize the Vector test, making all subsequent test runs slower.
[3] I have no idea why the non-JIT VM runs the ArrayList slower. The ArrayList methods are defined with slightly more testing, but I wouldn't have thought there was enough to make such a difference.
There are some reports that the latest VMs have negligible overheads for synchronized methods; however, my own tests show that synchronized methods continue to incur significant overheads in
VMs up to and including JDK 1.2. HotSpot has at times shown different behavior. My tests using HotSpot show that synchronized methods can sometimes be optimized to run faster than
unsynchronized versions. However, by varying the order and type of tests, it becomes clear that HotSpot is very inconsistent in its optimizations. This variation can exist for a number of different
reasons: profiler overheads, aggressive compiler cutting in, deoptimizations occasionally necessary, etc. The variability in this particular test probably comes from speculatively inlining methods and
sometimes having to undo the speculative inline. This can result in tests where a synchronized method apparently gets optimized more effectively than a nonsynchronized method.
The results from running the ListTesting class just defined in a HotSpot VM show how difficult it can be to get consistent results from HotSpot. For my test results, I take the first three and next three results, but I also find that altering the order of the tests can make a big difference to the times:
Vector took 5548
ArrayList took 1472
sync ArrayList took 6239
Vector took 2734
ArrayList took 2103
sync ArrayList took 3385
Vector took 7811
ArrayList took 6469
sync ArrayList took 3696
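One way to explore the order sensitivity described above is to run the same tests in every rotation of their order and compare the timings. The following is a minimal sketch, not the ListTesting class itself: the class name OrderedListTiming, the CAPACITY value, and the helper methods are assumptions made for illustration, and "sync ArrayList" is assumed to mean an ArrayList wrapped with Collections.synchronizedList.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Vector;

// Hypothetical harness: runs the list tests in each rotation of their
// order so that order-dependent timing differences become visible.
class OrderedListTiming {
    static final int CAPACITY = 100_000;  // assumed size, chosen for illustration

    // Fills the list, then times CAPACITY set() calls; returns elapsed milliseconds.
    static long timeSets(List<Object> l) {
        for (int j = 0; j < CAPACITY; j++) l.add(Boolean.FALSE);
        long time = System.currentTimeMillis();
        for (int j = 0; j < CAPACITY; j++) l.set(j, Boolean.TRUE);
        return System.currentTimeMillis() - time;
    }

    // Creates the list implementation that matches a test name.
    static List<Object> make(String type) {
        switch (type) {
            case "Vector": return new Vector<>();
            case "ArrayList": return new ArrayList<>();
            default: return Collections.synchronizedList(new ArrayList<>());
        }
    }

    // Returns a copy of names rotated left by shift positions.
    static List<String> rotated(List<String> names, int shift) {
        List<String> copy = new ArrayList<>(names);
        Collections.rotate(copy, -shift);
        return copy;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Vector", "ArrayList", "sync ArrayList");
        for (int shift = 0; shift < names.size(); shift++) {
            for (String name : rotated(names, shift)) {
                System.out.println(name + " took " + timeSets(make(name)));
            }
            System.out.println("---");
        }
    }
}
```

Comparing the same test across the printed groups gives a rough idea of how much of a measured difference comes from test order rather than from the class under test.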
10.4.2 Avoiding Serialized Execution
One way of completely avoiding the requirement to synchronize methods is to use separate objects and storage structures for different threads. Care must be taken to avoid calling synchronized methods from your own methods, or you will lose all your carefully built benefits. For example, Hashtable access and update methods are synchronized, so using one in your storage structure can eliminate any desired benefit. Prior to JDK 1.2, there is no unsynchronized hash table in the JDK, and you have to build or buy your own unsynchronized version. From JDK 1.2, unsynchronized collection classes are available, including Map classes.
As an example of implementing this framework, I look at a simple set of global counters, keyed on a numeric identifier. Basically, the concept is a global counter to which any thread can add a number. This concept is extended slightly to allow for multiple counters, each counter having a different key. String keys are more useful, but for simplicity I use integer keys in this example. To use String keys, an unsynchronized Map replaces the arrays. The simple, straightforward version of the class looks like this:
package tuning.threads;

public class Counter1
{
  //For simplicity make just 10 counters
  static long[] vec = new long[10];

  public static void initialize(int key)
  {
    vec[key] = 0;
  }

  //And also just make key the index into the array
  public static void addAmount(int key, long amount)
  {
    //This is not atomically synchronized since we do an array access
    //together with an update, which are two operations.
    vec[key] += amount;
  }

  public static long getAmount(int key)
  {
    return vec[key];
  }
}
This class is basic and easy to understand. Unfortunately, it is not thread-safe, and leads to corrupt counter values when used. A test run on a particular single-CPU configuration with four threads running simultaneously, each adding the number 1 to the same key 10 million times, gives a final counter value of around 26 million instead of the correct 40 million.[4] On the positive side, the test is blazingly fast, taking very little time to complete and get the wrong answer.
[4] The results discussed are for one particular test run. On other test runs, the final value is different, but it is almost never the correct 40 million value. If I use a faster CPU or a lower total count, the threads can get serialized by the operating system (by finishing quickly enough), leading to consistently correct results for the total count. But those correct results are an artifact of the environment, and are not guaranteed to be produced. Other system loads and environments generate corrupt values.
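The lost updates come from the fact that an unsynchronized update is a read followed by a write, and another thread can slip in between the two. The following sketch replays one such interleaving deterministically in a single thread. It is an illustration only: the class LostUpdateDemo and its method are hypothetical names, and a real run interleaves unpredictably rather than in this fixed order.

```java
// Illustrative only: replays, in one thread, an interleaving that two
// threads could produce when both execute vec[key] += amount.
class LostUpdateDemo {
    static long[] vec = new long[10];

    // Simulates threads A and B each adding 1 to vec[key], with B's
    // read and write falling between A's read and A's write.
    static long interleavedAdds(int key) {
        vec[key] = 0;
        long readByA = vec[key];   // A reads 0
        long readByB = vec[key];   // B reads 0 before A has written
        vec[key] = readByB + 1;    // B writes 1
        vec[key] = readByA + 1;    // A writes 1, overwriting B's update
        return vec[key];
    }

    public static void main(String[] args) {
        // Two additions of 1 were attempted, but only one survives.
        System.out.println("after two adds of 1: " + interleavedAdds(0));  // prints 1, not 2
    }
}
```

Each time this pattern occurs in the real test, one increment disappears, which is how a run can end well short of the expected total.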
To get the correct behavior, you need to synchronize the update methods in the class. Here is Counter2, which is just Counter1 with the methods synchronized:
package tuning.threads;

public class Counter2
{
  //For simplicity make just 10 counters
  static long[] vec = new long[10];

  public static synchronized void initialize(int key)
  {
    vec[key] = 0;
  }

  //And also just make key the index into the array
  public static synchronized void addAmount(int key, long amount)
  {
    //Now the method is synchronized, so we will always
    //complete any particular update
    vec[key] += amount;
  }

  public static synchronized long getAmount(int key)
  {
    return vec[key];
  }
}
Now you get the correct answer of 40 million for the same test as before. Unfortunately, the test takes 20 times longer to execute (see Table 10-4). Avoiding the synchronization is going to be more work. To do this, create a set of counters, one for each thread, and update each thread's counter separately.[5] When you want to see the global total, you need to sum the counters across the threads. The class definition follows:
[5] Although ThreadLocal variables might seem ideal to ensure the allocation of different counters for different threads, they are of no use here. The underlying implementation for ThreadLocal objects uses a synchronized map to allocate per-thread objects, and that defeats the intention to avoid synchronization completely.
package tuning.threads;

public class Counter3
{
  //support up to 10 threads of 10 counters
  static long vec[][] = new long[10][];

  public static synchronized void initialize(CounterTest t)
  {
    //For simplicity make just 10 counters per thread
    vec[t.num] = new long[10];
  }

  public static void addAmount(int key, long amount)
  {
    //Use our own threads to make the mapping easier,
    //and to illustrate the technique of customizing threads.
    //For generic Thread objects, could use an unsynchronized
    //HashMap or other Map,
    //Or use ThreadLocal if JDK 1.2 is available

    //We use the num instance variable of the CounterTest
    //object to determine which array we are going to increment.
    //Since each thread is different, there is no conflict.
    //Each thread updates its own counter.
    long[] arr = vec[((CounterTest) Thread.currentThread()).num];
    arr[key] += amount;
  }

  public static synchronized long getAmount(int key)
  {
    //The current amount must be aggregated across the thread
    //storage arrays. This needs to be synchronized, but
    //does not matter here as I just call it at the end.
    long amount = 0;
    for (int threadnum = vec.length-1; threadnum >= 0; threadnum--)
    {
      long[] arr = vec[threadnum];
      if (arr != null)
        amount += arr[key];
    }
    return amount;
  }
}
Using Counter3, you get the correct answer for the global counter, and the test is quicker than Counter2. The relative timings for a range of VMs are listed in Table 10-4.

Table 10-4, Timings of the Various Counter Tests, Normalized to the JDK 1.2 Counter2 Test

                              1.2   1.2 no JIT   1.3   HotSpot 1.0   HotSpot 2nd Run   1.1.6
Counter2                      100   397          383   191           755               180
Counter3                      70    384          175   156           190               95
Counter1 (incorrect result)   5     116          10    78            17                5

The serialized execution avoidance class is a significant improvement on the synchronized case. The Counter2 timings can be extremely variable. This variation is generated from the nature of multithreaded context switching, together with the fact that the activity taking much of the time in this test is lock management. Switching is essentially unpredictable, and the amount of switching and where it occurs affects how often the VM has to release and reacquire locks in different threads. Nevertheless, across a number of measurements, Counter3 was always faster than Counter2, often several times faster.
The listed times were measured on a single-processor machine. Consider what happens on a multiprocessor machine where the threads can run on different CPUs (i.e., where the Java runtime and operating system support preemptive thread scheduling on separate CPUs). Counter3 (the serialized execution avoidance class) is parallelized automatically and scales very nicely. This same test with Counter3, running on a four-CPU machine, tends towards one-quarter of the single-CPU time, assuming that the four CPUs have the same power as the single CPU we tested earlier. On the other hand, the synchronized version of the counter, Counter2, always has serialized execution (that's what synchronized does). Consequently, it does not scale and generally performs no better than in the single-CPU test (except for the advantage of running the OS on another CPU).
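The per-thread storage technique can also be tried in isolation. The following is a minimal, self-contained sketch, assuming a fixed number of threads that each know their own index. The names PerThreadCounterDemo, Worker, and runTest are hypothetical and are not part of the tuning.threads package. Each thread adds only to its own row, and the rows are summed after all threads have been joined.

```java
// A sketch of per-thread counters: no synchronization on the update path,
// because each thread writes only to its own row of the array.
class PerThreadCounterDemo {
    static final int THREADS = 4;                       // assumed thread count
    static final long[][] vec = new long[THREADS][10];  // one row of counters per thread

    static class Worker extends Thread {
        final int num;      // index of this thread's private row
        final int repeat;   // number of additions to perform
        Worker(int num, int repeat) { this.num = num; this.repeat = repeat; }
        public void run() {
            long[] arr = vec[num];   // only this thread writes this row
            for (int i = repeat; i > 0; i--) arr[0] += 1;
        }
    }

    // Runs THREADS workers against key 0 and returns the aggregated total.
    static long runTest(int repeat) {
        Worker[] workers = new Worker[THREADS];
        for (int t = 0; t < THREADS; t++) {
            vec[t] = new long[10];   // reset this thread's counters
            workers[t] = new Worker(t, repeat);
        }
        for (Worker w : workers) w.start();
        try {
            // join() guarantees each worker's writes are visible before summing
            for (Worker w : workers) w.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        long total = 0;
        for (long[] arr : vec) total += arr[0];
        return total;
    }

    public static void main(String[] args) {
        System.out.println("total: " + runTest(1_000_000));
    }
}
```

Because no two threads ever touch the same array element, the total is exact without any lock on the update, which is what allows the work to spread across CPUs.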
10.5 Timing Multithreaded Tests
I measured timings of the three Counter classes in the previous section using another class, CounterTest. This timing class illustrates some pitfalls you need to avoid when timing multithreaded applications, so I'll go into a little detail about the CounterTest definition. The first naive implementation of CounterTest is quite simple. Just create a Thread subclass with the run() method running timed tests of the classes you are measuring. You need an extra instance variable for the Counter3 class, so the class can be defined as:
package tuning.threads;

public class CounterTest
  extends Thread
{
  //instance variable to specify which thread we are.
  int num;
  //static so that run() can see the value set in main()
  static int REPEAT = 10000000;

  public CounterTest(int threadnum)
  {
    super();
    num = threadnum;
  }

  //main forks four threads
  public static void main(String[] args)
  {
    REPEAT = (args.length > 0) ? Integer.parseInt(args[0]) : 10000000;
    for (int i = 0; i < 4; i++)
      new CounterTest(i).start();
  }

  public void run()
  {
    Counter1.initialize(0);
    long time = System.currentTimeMillis();
    for (int i = REPEAT; i > 0; i--)
      Counter1.addAmount(0, 1);
    System.out.println("Counter1 count: " + Counter1.getAmount(0)
      + " time: " + (System.currentTimeMillis()-time));

    Counter2.initialize(0);
    time = System.currentTimeMillis();
    for (int i = REPEAT; i > 0; i--)
      Counter2.addAmount(0, 1);
    System.out.println("Counter2 count: " + Counter2.getAmount(0)
      + " time: " + (System.currentTimeMillis()-time));

    Counter3.initialize(this);
    time = System.currentTimeMillis();
    for (int i = REPEAT; i > 0; i--)
      Counter3.addAmount(0, 1);
    System.out.println("Counter3 count: " + Counter3.getAmount(0)
      + " time: " + (System.currentTimeMillis()-time));
  }
}
Unfortunately, this class has two big problems. First, there is no way of knowing that the four threads are running the same test at the same time. With this implementation, it is perfectly possible that one thread is running the Counter1 test, while another has already finished that test and is now running the Counter2 test concurrently. This gives incorrect times for both tests, since the CPU is being used by another test while you measure the first test. And the synchronization costs are not measured properly, since the intention is to test the synchronization costs of running four threads using the same methods at the same time.
The second problem is with the times you are measuring. The timings are for each thread running its own threaded update to the Counter class. But you should be measuring the time from the first update in any thread to the last update in any thread.
One way to avoid the first pitfall is to synchronize the tests so that they are not started until all the threads are ready. Then all threads can be started at the same time. The second pitfall can be
avoided by setting a global time at the start of the first update, then printing the time difference when the last thread finishes.
The full tuning.threads.CounterTest implementation with the correct handling for measurements can be found along with all the other classes from this book by clicking on the Examples link from the book's catalog page, http://www.oreilly.com/catalog/javapt.
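The two fixes described above can be sketched briefly. This is not the book's CounterTest implementation. It is one possible approach, assuming the use of java.util.concurrent.CountDownLatch, which comes from a later JDK than the ones measured in this chapter. The class GatedTimingDemo and its methods are hypothetical names, and the start time is taken just before the threads are released, which only approximates "the start of the first update".

```java
import java.util.concurrent.CountDownLatch;

// Sketch: hold all threads at a gate, release them together, and time
// from the release until the last thread has finished.
class GatedTimingDemo {
    static long count;   // shared counter, guarded by the class lock
    static synchronized void addAmount(long amount) { count += amount; }
    static synchronized long getAmount() { return count; }
    static synchronized void reset() { count = 0; }

    // Returns elapsed milliseconds from the common start to the last finish.
    static long timeTest(int threads, final int repeat) {
        final CountDownLatch ready = new CountDownLatch(threads);
        final CountDownLatch go = new CountDownLatch(1);
        final CountDownLatch done = new CountDownLatch(threads);
        reset();
        for (int t = 0; t < threads; t++) {
            new Thread(() -> {
                ready.countDown();   // signal that this thread is running
                try {
                    go.await();      // wait for the common start
                    for (int i = repeat; i > 0; i--) addAmount(1);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            }).start();
        }
        try {
            ready.await();                             // every thread has started
            long start = System.currentTimeMillis();   // single global start time
            go.countDown();                            // release all threads at once
            done.await();                              // the last thread has finished
            return System.currentTimeMillis() - start;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        long millis = timeTest(4, 1_000_000);
        System.out.println("count: " + getAmount() + " time: " + millis);
    }
}
```

Running one gated measurement per counter class, instead of all three tests back to back in each thread, keeps one test from stealing CPU time from another while it is being measured.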
10.6 Atomic Access and Assignment