
This sampling technique can be difficult to get right. It is not enough to simply sample the stack. The profiler must also ensure that it has a coherent stack state, so the call must be synchronized across the stack activities, possibly by temporarily stopping the thread. The profiler also needs to make sure that multiple threads are treated consistently, and that the time taken by its own activities is accounted for without distorting the regular sample interval. Also, too short a sample interval causes the program to become extremely slow, while too long an interval results in many method calls being missed and hence in unrepresentative profile results.

The JDK comes with a minimal profiler, obtained by running a program using the java executable with the -Xrunhprof option (-prof before JDK 1.2, -Xprof with HotSpot). The result of running with this option is a file containing the profile data. The default name of the file is java.hprof.txt (java.prof before 1.2). The filename can be specified by using the modified option -Xrunhprof:file=filename (-prof:filename before 1.2). The output using these options is discussed in detail shortly.
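As a rough illustration of the sampling idea described above, the following sketch polls one thread's stack at a fixed interval and tallies the method found at the top of the stack. It is not how the JDK profiler is implemented. The class name StackSampler and its sample method are hypothetical, and the code assumes Java 8 or later (it uses lambdas and Thread.getStackTrace, both of which postdate the JDK versions discussed here). It also ignores the harder problems mentioned above, such as treating multiple threads consistently and correcting for the sampler's own overhead.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a stack sampler; not the hprof implementation. */
class StackSampler {

    /**
     * Takes up to `samples` snapshots of the target thread's stack,
     * `intervalMillis` apart, and counts how often each method is
     * the top (currently executing) frame.
     */
    static Map<String, Integer> sample(Thread target, int samples, long intervalMillis) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            // getStackTrace returns a snapshot; element 0 is the executing method.
            // The array is empty if the thread has not started or has finished.
            StackTraceElement[] stack = target.getStackTrace();
            if (stack.length > 0) {
                String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                counts.merge(top, 1, Integer::sum);
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt, stop sampling
                break;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // A busy worker thread to sample; it runs until interrupted.
        Thread worker = new Thread(() -> {
            double x = 0;
            while (!Thread.currentThread().isInterrupted()) {
                x += Math.sqrt(x + 1);
            }
        });
        worker.setDaemon(true);
        worker.start();

        Map<String, Integer> counts = sample(worker, 50, 10);
        worker.interrupt();
        counts.forEach((method, n) -> System.out.println(n + "\t" + method));
    }
}
```

Changing the interval passed to sample shows the trade-off in miniature: a very short interval spends most of the time taking snapshots, while a long one yields few samples and a coarse picture.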

2.3.1 Profiling Methodology

When using a method profiler, the most useful technique is to target the top five to ten methods and choose the quickest to fix. The reason for this is that once you make one change, the profile tends to be different the next time, sometimes markedly so. This way, you can get the quickest speedup for a given effort. However, it is also important to consider what you are changing, so you know what results to expect. If you select a method that is taking up 10% of the execution time, then if you halve the time that method takes, you have speeded up your application by 5%. On the other hand, targeting a method that takes up only 1% of execution time is going to give you a maximum of only 1% speedup to the application, no matter how much effort you put in to speed up that method. Similarly, if you have a method that takes 10% of the time but is called a huge number of times, so that each individual method call is quite short, you are less likely to speed up that method. On the other hand, if you can eliminate some significant fraction of the calling methods (the methods that call the method that takes 10% of the time), you might gain a good speedup in that way.
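The arithmetic behind these estimates can be written down directly. The sketch below is only an illustration. The class SpeedupEstimate and the method percentSaved are hypothetical names, and the formula assumes that the rest of the program is unaffected by the change.

```java
/** Hypothetical helper for estimating the payoff of tuning one method. */
class SpeedupEstimate {

    /**
     * @param fraction share of total run time spent in the method (0.10 for 10%)
     * @param factor   how many times faster the method becomes (2.0 means its time is halved)
     * @return the percentage of total run time saved
     */
    static double percentSaved(double fraction, double factor) {
        return 100.0 * fraction * (1.0 - 1.0 / factor);
    }

    public static void main(String[] args) {
        // A method taking 10% of the time, made twice as fast.
        System.out.println(percentSaved(0.10, 2.0));
        // A method taking 1% of the time, made infinitely fast: the upper bound.
        System.out.println(percentSaved(0.01, Double.POSITIVE_INFINITY));
    }
}
```

The second call shows why a method taking only 1% of the time is a poor target: even removing its cost entirely saves no more than 1% of the run time.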
Let's look at the profile output from a short program that repeatedly converts some numbers to strings and also inserts them into a hash table:

    package tuning.profile;
    import java.util.*;

    public class ProfileTest
    {
      public static void main(String[] args)
      {
        //Repeat the loop this many times
        int repeat = 2000;

        //Two arrays of numbers, eight doubles and ten longs
        double[] ds = {Double.MAX_VALUE, -3.14e-200D,
          Double.NEGATIVE_INFINITY, 567.89023D, 123e199D,
          -0.000456D, -1.234D, 1e55D};
        long[] ls = {2283911683699007717L, -8007630872066909262L,
          4536503365853551745L, 548519563869L,
          45L, Long.MAX_VALUE, 1L, -9999L, 7661314123L, 0L};

        //initializations
        long time;
        StringBuffer s = new StringBuffer();
        Hashtable h = new Hashtable();

        System.out.println("Starting test");
        time = System.currentTimeMillis();

        //Repeatedly add all the numbers to a stringbuffer,
        //and also put them into a hash table
        for (int i = repeat; i > 0; i--)
        {
          s.setLength(0);
          for (int j = ds.length-1; j >= 0; j--)
          {
            s.append(ds[j]);
            h.put(new Double(ds[j]), Boolean.TRUE);
          }
          for (int j = ls.length-1; j >= 0; j--)
          {
            s.append(ls[j]);
            h.put(new Long(ls[j]), Boolean.FALSE);
          }
        }
        time = System.currentTimeMillis() - time;
        System.out.println("  The test took " + time + " milliseconds");
      }
    }

The relevant output from running this program with the JDK 1.2 method profiling option follows. (See Section 2.3.2 for a detailed explanation of the 1.2 profiling option and its output.)
    CPU SAMPLES BEGIN (total = 15813) Wed Jan 12 11:26:47 2000
    rank   self  accum   count trace method
       1 54.79% 54.79%    8664   204 java/lang/FloatingDecimal.dtoa
       2 11.67% 66.46%    1846   215 java/lang/Double.equals
       3 10.18% 76.64%    1609   214 java/lang/FloatingDecimal.dtoa
       4  3.10% 79.74%     490   151 java/lang/FloatingDecimal.dtoa
       5  2.90% 82.63%     458   150 java/lang/FloatingDecimal.<init>
       6  2.11% 84.74%     333   213 java/lang/FloatingDecimal.<init>
       7  1.23% 85.97%     194   216 java/lang/Double.doubleToLongBits
       8  0.97% 86.94%     154   134 sun/io/CharToByteConverter.convertAny
       9  0.94% 87.88%     148   218 java/lang/FloatingDecimal.<init>
      10  0.82% 88.69%     129   198 java/lang/Double.toString
      11  0.78% 89.47%     123   200 java/lang/Double.hashCode
      12  0.70% 90.17%     110   221 java/lang/FloatingDecimal.dtoa
      13  0.66% 90.83%     105   155 java/lang/FloatingDecimal.multPow52
      14  0.62% 91.45%      98   220 java/lang/Double.equals
      15  0.52% 91.97%      83   157 java/lang/FloatingDecimal.big5pow
      16  0.46% 92.44%      73   158 java/lang/FloatingDecimal.constructPow52
      17  0.46% 92.89%      72   133 java/io/OutputStreamWriter.write

In this example, I have extracted only the top few lines from the profile summary table. The methods are ranked according to the percentage of time they take. Note that the trace does not identify actual method signatures, only method names. The top three methods take, respectively, 54.79%, 11.67%, and 10.18% of the time taken to run the full program. [4] The fourth method in the list takes 3.10% of the time, so clearly you need look no further than the top three methods to optimize the program.

[4] The samples that count toward a particular method's execution time are those where the method itself is executing at the time of the sample. If method foo was calling another method when the sample was taken, that other method would be at the top of the stack instead of foo. So you do not need to worry about the distinction between foo's execution time and the time spent executing foo's callees. Only the method at the top of the stack is tallied.

The methods ranked first, third, and fourth are the same method, possibly called in different ways. Obtaining the traces for these three entries from the relevant section of the profile output (trace 204 for the first entry, and traces 214 and 151 for the third and fourth entries), you get:

    TRACE 204:
    java/lang/FloatingDecimal.dtoa(FloatingDecimal.java:Compiled method)
    java/lang/FloatingDecimal.<init>(FloatingDecimal.java:Compiled method)
    java/lang/Double.toString(Double.java:Compiled method)
    java/lang/String.valueOf(String.java:Compiled method)
    TRACE 214:
    java/lang/FloatingDecimal.dtoa(FloatingDecimal.java:Compiled method)
    TRACE 151:
    java/lang/FloatingDecimal.dtoa(FloatingDecimal.java:Compiled method)
    java/lang/FloatingDecimal.<init>(FloatingDecimal.java:Compiled method)
    java/lang/Double.toString(Double.java:132)
    java/lang/String.valueOf(String.java:2065)

In fact, both traces 204 and 151 are the same stack, but trace 151 provides line numbers for two of the methods. Trace 214 is a truncated entry, and is probably the same stack as the other two (these differences are one of the limitations of the JDK profiler, i.e., that information is sometimes lost). So all three entries refer to the same stack: an inferred call from the StringBuffer to append a double, which calls String.valueOf, which calls Double.toString, which in turn creates a FloatingDecimal object. (<init> is the standard way to write a constructor call; <clinit> is the standard way to show a class initializer being executed. These are also the actual names for constructors and static initializers in the class file.) FloatingDecimal is a class that is private to the java.lang package, which handles most of the logic involved in converting floating-point numbers. FloatingDecimal.dtoa is the method called by the FloatingDecimal constructor that converts the binary floating-point representation of a number into its various parts: the digits before the decimal point, the digits after the decimal point, and the exponent.
FloatingDecimal stores the digits of the floating-point number as an array of chars when the FloatingDecimal is created; no strings are created until the floating-point number is converted to a string. Since this stack includes a call to a constructor, it is worth checking the object-creation profile to see whether you are generating an excessive number of objects: object creation is expensive, and a method that generates many new objects is often a performance bottleneck. I show the object-creation profile and how to generate it in Section 2.4. The object-creation profile shows that a large number of extra objects are being created, including a large number of FDBigInt objects that are created by the new FloatingDecimal objects.

Clearly, FloatingDecimal.dtoa is the primary method to try to optimize in this case. Almost any improvement in this one method translates directly to a similar improvement in the overall program. However, normally only Sun can modify this method, and even if you want to modify it, it is long and complicated and takes an excessive amount of time to optimize unless you are already familiar with both floating-point binary representation and converting that representation to a string format.

Normally when tuning, the first alternative to optimizing FloatingDecimal.dtoa is to examine the other significant bottleneck method, Double.equals, which came second in the summary. Even though this entry takes up only 11.67% compared to over 68% for the FloatingDecimal.dtoa method, it may be an easier optimization target. But note that while a small 10% improvement in the FloatingDecimal.dtoa method translates into more than a 6% improvement for the program as a whole, the Double.equals method needs to be speeded up to be more than twice as fast to get a similar 6% improvement for the full program.
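The "over 68%" figure comes from adding up every summary row that names the same method, since the profiler reports each distinct trace separately. The following sketch shows that aggregation, using the self percentages from the summary table above. The class SelfTimeByMethod and its members are hypothetical names, and the method names are shortened to class and method for readability.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that totals self time per method across summary rows. */
class SelfTimeByMethod {

    // Method and self-time columns of the 17 summary rows, in rank order.
    static final String[] METHODS = {
        "FloatingDecimal.dtoa", "Double.equals", "FloatingDecimal.dtoa",
        "FloatingDecimal.dtoa", "FloatingDecimal.<init>", "FloatingDecimal.<init>",
        "Double.doubleToLongBits", "CharToByteConverter.convertAny",
        "FloatingDecimal.<init>", "Double.toString", "Double.hashCode",
        "FloatingDecimal.dtoa", "FloatingDecimal.multPow52", "Double.equals",
        "FloatingDecimal.big5pow", "FloatingDecimal.constructPow52",
        "OutputStreamWriter.write"
    };
    static final double[] SELF = {
        54.79, 11.67, 10.18, 3.10, 2.90, 2.11, 1.23, 0.97, 0.94,
        0.82, 0.78, 0.70, 0.66, 0.62, 0.52, 0.46, 0.46
    };

    /** Sums the self percentages of all rows that share a method name. */
    static Map<String, Double> totals(String[] methods, double[] selfPercent) {
        Map<String, Double> totals = new LinkedHashMap<>();
        for (int i = 0; i < methods.length; i++) {
            totals.merge(methods[i], selfPercent[i], Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        totals(METHODS, SELF).forEach(
            (method, percent) -> System.out.println(method + "\t" + percent));
    }
}
```

With these rows, the four FloatingDecimal.dtoa entries add up to about 68.77% and the two Double.equals entries to about 12.29%, which is the comparison the text relies on.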
The trace corresponding to this second entry in the summary example turns out to be another truncated trace, but the example shows the same method in 14th position, and the trace for that entry identifies the Double.equals call as coming from the Hashtable.put call. Unfortunately for tuning purposes, the Double.equals method itself is already quite fast and cannot be optimized further.

When methods cannot be directly optimized, the next best choice is to reduce the number of times they are called or even avoid the methods altogether. (In fact, eliminating method calls is actually the better tuning choice, but it is often considerably more difficult to achieve and so is not a first-choice tactic for optimization.) The object-creation profile and the method profile together point to the FloatingDecimal class as being a huge bottleneck, so avoiding this class is the obvious tuning tactic here. In Chapter 5, I employ this technique, avoiding the default call through the FloatingDecimal class for the case of converting floating-point numbers to Strings, and I obtain an order-of-magnitude improvement. Basically, the strategy is to create a more efficient routine to run the equivalent conversion functionality, and then replace the calls to the underperforming FloatingDecimal methods with calls to the more efficient optimized methods.

The best way to avoid the Double.equals method is to replace the hash table with another implementation that stores double primitive data types directly rather than requiring the doubles to be wrapped in a Double object. This allows the == operator to make the comparison in the put method, thus completely avoiding the Double.equals call. This is another standard tuning tactic, where a data structure is replaced with a more appropriate and faster one for the task.

The 1.1 profiling output is quite different and much less like a standard profiler's output.
Running the 1.1 profiler with this program (details of this output are given in Section 2.3.4) gives:

    count callee caller time
    21 java/lang/System.gc()V java/lang/FloatingDecimal.dtoa(IJI)V 760
    8 java/lang/System.gc()V java/lang/Double.equals(Ljava/lang/Object;)Z 295
    2 java/lang/Double.doubleToLongBits(D)J java/lang/Double.equals(Ljava/lang/Object;)Z 0

I have shown only the top four lines from the output. This output actually identifies both the FloatingDecimal.dtoa and the Double.equals methods as taking the vast majority of the time, and the percentages given by the reported times are around 70% and 25% of the total program time for the two methods, respectively. Since the callee for these methods is listed as System.gc, this also identifies that the methods are significantly involved in allocating memory and suggests that the next tuning step might be to analyze the object-creation output for this program.
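Returning to the tactic suggested earlier for avoiding Double.equals, the following is a minimal sketch of a hash table that stores double keys as primitives and compares them with ==. It is only an illustration under stated assumptions, not the implementation used in this book. The class DoubleKeyMap and all its members are hypothetical; it uses open addressing with linear probing, rejects null values and NaN keys (NaN never compares equal to itself with ==), and treats -0.0 and 0.0 as the same key because == does.

```java
/** Hypothetical map from primitive double keys to objects; no Double wrappers. */
class DoubleKeyMap {
    private double[] keys = new double[16];    // capacity is always a power of two
    private Object[] values = new Object[16];  // a null value marks an empty slot
    private int size;

    private static int slotHash(double key) {
        long bits = Double.doubleToLongBits(key);
        int h = (int) (bits ^ (bits >>> 32));
        return h ^ (h >>> 16);
    }

    // Rejects NaN and folds -0.0 into 0.0 so that hashing agrees with ==.
    private static double normalize(double key) {
        if (key != key) {
            throw new IllegalArgumentException("NaN keys are not supported");
        }
        return key == 0.0 ? 0.0 : key;
    }

    /** Stores the value and returns the previous value for the key, or null. */
    Object put(double key, Object value) {
        if (value == null) {
            throw new IllegalArgumentException("null values are not supported");
        }
        key = normalize(key);
        if ((size + 1) * 2 > keys.length) {
            resize(); // keep the table at most half full so probing always ends
        }
        int mask = keys.length - 1;
        int i = slotHash(key) & mask;
        while (values[i] != null) {
            if (keys[i] == key) {              // primitive comparison, no equals call
                Object old = values[i];
                values[i] = value;
                return old;
            }
            i = (i + 1) & mask;
        }
        keys[i] = key;
        values[i] = value;
        size++;
        return null;
    }

    /** Returns the value stored for the key, or null if there is none. */
    Object get(double key) {
        key = normalize(key);
        int mask = keys.length - 1;
        int i = slotHash(key) & mask;
        while (values[i] != null) {
            if (keys[i] == key) {
                return values[i];
            }
            i = (i + 1) & mask;
        }
        return null;
    }

    int size() {
        return size;
    }

    private void resize() {
        double[] oldKeys = keys;
        Object[] oldValues = values;
        keys = new double[oldKeys.length * 2];
        values = new Object[oldKeys.length * 2];
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldValues[i] != null) {
                put(oldKeys[i], oldValues[i]);
            }
        }
    }

    public static void main(String[] args) {
        double[] ds = {Double.MAX_VALUE, -3.14e-200D, Double.NEGATIVE_INFINITY,
            567.89023D, 123e199D, -0.000456D, -1.234D, 1e55D};
        DoubleKeyMap h = new DoubleKeyMap();
        for (double d : ds) {
            h.put(d, Boolean.TRUE);
        }
        System.out.println(h.size() + " keys stored");
    }
}
```

In the test program, such a table could take the place of the Hashtable for the double values, so that h.put(ds[j], Boolean.TRUE) neither creates a Double object nor calls Double.equals. Whether this is faster in practice would need to be confirmed by profiling again.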

2.3.2 Java 2 cpu=samples Profile Output