Primary Merge Effectiveness and Optimizations

Task             CM5 time   AP1000 time
Idle                 0.22          0.23
Communicating        0.97          0.99
Merging              0.75          1.24
Serial Sorting       3.17          4.57
Rearranging          0.38          0.59
Total                5.48          7.62

Table 1.1: Sort times (in seconds) for 8 million integers on the CM5 and AP1000

The total sort time on the AP1000 is considerably higher than on the CM5, which largely reflects the difference in the clock speeds of the two machines. This explains most of the performance difference, but not all. The remainder of the difference is due to the fact that sorting a large number of elements is a very memory-intensive operation, and a major bottleneck in the sorting procedure is the memory bandwidth of the nodes. When operating on blocks much larger than the cache, performance depends strongly on how often a cache line must be refilled from memory and how costly each refill is. Thus, the remainder of the difference between the two machines may be explained by the fact that cache lines on the CM5 consist of 32 bytes whereas they consist of 16 bytes on the AP1000, which means a cache line load must occur only half as often on the CM5 as on the AP1000. The results illustrate how important minor architectural differences can be for the performance of complex algorithms. At the same time, the vastly different network structures of the two machines are not reflected in significantly different communication times. This suggests that the parallel sorting algorithm presented here can perform well on a variety of parallel machine architectures with different communication topologies.
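To make the cache line effect concrete, the following microbenchmark is a minimal sketch of my own (not code from the sorting implementation) that reads one word per stride from an array far larger than any cache. With 4 byte integers, a stride of 4 touches one word per 16 byte line and a stride of 8 one word per 32 byte line, so the timing difference approximates the cost of the extra line refills. The array size and timing method are illustrative assumptions.

/* Sketch only: strided reads over a large array to expose the cost of
 * cache line refills. Assumes 4 byte ints. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (8 * 1024 * 1024)   /* 32 MB of integers, much larger than any cache */

static double timed_sum(const int *a, size_t stride)
{
    clock_t start = clock();
    volatile long sum = 0;    /* volatile keeps the loop from being optimized away */
    for (size_t i = 0; i < N; i += stride)
        sum += a[i];
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main(void)
{
    int *a = malloc(N * sizeof(int));
    if (!a) return 1;
    for (size_t i = 0; i < N; i++)
        a[i] = (int)i;

    /* stride 1: sequential reads; stride 4: one read per 16 byte line;
     * stride 8: one read per 32 byte line */
    size_t strides[] = { 1, 4, 8 };
    for (int s = 0; s < 3; s++)
        printf("stride %zu: %.3f s\n", strides[s], timed_sum(a, strides[s]));

    free(a);
    return 0;
}

On a machine with 16 byte lines the stride-4 and stride-8 runs refill a line on every access; with 32 byte lines the stride-4 run refills only every other access, which is the effect discussed above.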

1.3.6 Primary Merge Effectiveness

The efficiency of the parallel sorting algorithm relies on the fact that after the primary merge phase most elements are in their correct final positions. This leaves only a small amount of work for the cleanup phase. Figure 1.10 shows the percentage of elements which are in their correct final positions after the primary merge when sorting random 4 byte integers on a 64 and a 128 processor machine. Also shown is $100\left(1 - P/\sqrt{N}\right)$, which provides a very good approximation to the observed results for large $N$.

[Figure 1.10: Percentage completion after the primary merge. Percent completion versus number of elements ($10^5$ to $10^8$) for P = 64 and P = 128, with the approximation $100\left(1 - P/\sqrt{N}\right)$.]

It is probably possible to analytically derive this form for the primary merge percentage completion but unfortunately I have been unable to do so thus far. Nevertheless it is clear from the numerical results that the proportion of unsorted elements remaining after the primary merge becomes very small for large $N$. This is important as it implies that the parallel sorting algorithm is asymptotically optimal.
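For concreteness, the short C program below (a sketch of my own, not part of the sorting code) evaluates the approximation $100\left(1 - P/\sqrt{N}\right)$ at the points plotted in Figure 1.10.

/* Evaluate the empirical approximation 100*(1 - P/sqrt(N)) for the
 * percentage of elements already in place after the primary merge. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double Ns[] = { 1e5, 1e6, 1e7, 1e8 };
    const int Ps[] = { 64, 128 };

    for (int p = 0; p < 2; p++)
        for (int n = 0; n < 4; n++) {
            double pct = 100.0 * (1.0 - Ps[p] / sqrt(Ns[n]));
            printf("P=%3d  N=%.0e  completion ~ %.1f%%\n", Ps[p], Ns[n], pct);
        }
    return 0;
}

For P = 128 this gives roughly 59.5% at $N = 10^5$ and 98.7% at $N = 10^8$, matching the range shown in the figure and illustrating how quickly the unsorted fraction shrinks as $N$ grows.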

1.3.7 Optimizations

Several optimization “tricks” have been used to obtain faster performance. It was found that these optimizations played a surprisingly large role in the speed of the algorithm, producing an overall speed improvement of about 50%.

The first optimization was to replace the standard C library routine memcpy with a much faster version. At first a faster version written in C was used, but this was eventually replaced by a version written in SPARC assembler (a sketch of the idea is given at the end of this section).

The second optimization was the tuning of the block size of the sends performed when elements are exchanged between nodes. This optimization is hidden on the CM5 in the CMMD send and receive routines, but is under the programmer’s control on the AP1000.

The value of the B parameter in the block-wise merge routine is important. If it is
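As an illustration of the first optimization above, the following is a minimal word-at-a-time copy of the kind a faster C memcpy might use. The name fast_memcpy is hypothetical, the sketch ignores strict-aliasing subtleties for clarity, and the thesis's actual replacement was ultimately written in SPARC assembler rather than C.

/* Illustrative sketch only -- not the thesis's routine. Copies machine
 * words when source and destination share the same alignment, falling
 * back to byte copies for the unaligned head and tail. */
#include <stddef.h>
#include <stdint.h>

void *fast_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Word copies are only possible if both pointers have the same
     * misalignment relative to a word boundary. */
    if (((uintptr_t)d & (sizeof(long) - 1)) == ((uintptr_t)s & (sizeof(long) - 1))) {
        /* Byte-copy until the pointers reach a word boundary. */
        while (n > 0 && ((uintptr_t)d & (sizeof(long) - 1))) {
            *d++ = *s++;
            n--;
        }
        /* Bulk of the copy: one machine word per iteration. */
        while (n >= sizeof(long)) {
            *(long *)d = *(const long *)s;
            d += sizeof(long);
            s += sizeof(long);
            n -= sizeof(long);
        }
    }
    /* Remaining (or wholly unaligned) bytes. */
    while (n--)
        *d++ = *s++;

    return dst;
}

Copying a word at a time reduces loop overhead and issues wider memory accesses than a naive byte loop, which is why a hand-tuned copy can substantially outperform a generic library memcpy on this kind of workload.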