
too small then overheads slow down the program, but if it is too large then too many copies must be performed and the system might run out of memory. The value finally chosen was $4\sqrt{L_1 + L_2}$.

The method of rearranging blocks in the block-wise merge routine can have a big influence on the performance, as a small change in the algorithm can mean that data is far more likely to be in cache when referenced, thus giving a large performance boost.

A very tight kernel for the merging routine is important for good performance. With loop unrolling and good use of registers this routine can be improved enormously over the obvious simple implementation. It is quite conceivable that further optimizations to the code are possible and would lead to further improvements in performance.
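To make the point about kernel tuning concrete, the sketch below contrasts an obvious element-by-element merge loop with a tighter variant that holds the current head elements in local variables (so the compiler can keep them in registers) and bulk-copies the exhausted tail. It only illustrates the kind of micro-optimization described above, not the routine actually used; the function names are invented for the example, and genuine loop unrolling would further replicate the main loop body.

```c
#include <stddef.h>
#include <string.h>

/* Obvious implementation: one compare, one array load and one copy
 * per output element. */
static void merge_simple(const int *a, size_t na,
                         const int *b, size_t nb, int *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
}

/* Tighter variant: keep the current head of each input in a local
 * variable and copy the remaining tail with memcpy instead of a
 * per-element loop. */
static void merge_tight(const int *a, size_t na,
                        const int *b, size_t nb, int *out)
{
    size_t i = 0, j = 0, k = 0;

    if (na && nb) {
        int x = a[0], y = b[0];
        for (;;) {
            if (x <= y) {
                out[k++] = x;
                if (++i == na) break;
                x = a[i];
            } else {
                out[k++] = y;
                if (++j == nb) break;
                y = b[j];
            }
        }
    }
    /* One input is exhausted; copy whatever remains of each in bulk. */
    memcpy(out + k, a + i, (na - i) * sizeof(int));
    k += na - i;
    memcpy(out + k, b + j, (nb - j) * sizeof(int));
}
```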

1.4 Comparison with other algorithms

There has been a lot of research into parallel sorting algorithms. Despite this, when I started my work on sorting in 1992 I was unable to find any general-purpose, efficient parallel sorting algorithms suitable for distributed memory machines. My research attempted to fill that gap. This section examines a number of parallel sorting algorithms for which performance results are available on machines comparable with those that my research used.

1.4.1 Thearling and Smith

In [Thearling and Smith 1992] Kurt Thearling and Stephen Smith propose a parallel integer sorting benchmark and give results for an implementation of an algorithm on the CM5. Their algorithm is based on a radix sort, and their paper concentrates on the effects of the distribution of the keys within the key space using an entropy-based measure.

The use of a radix-based algorithm means that the algorithm is not comparison-based and thus is not general purpose. Additionally, the sending phase of the algorithm consumes temporary memory of O(N), which reduces the number of elements that can be sorted on a machine of a given size by a factor of 2.

The results given in their paper are interesting nonetheless. The time given for sorting 64 million uniformly distributed 32-bit elements on 64 CM5 processors was 17 seconds. To provide a comparison with their results I used an in-place forward radix sort based on the American flag sort [McIlroy et al. 1993] for the local sorting phase of our algorithm and ran a test of sorting 64 million 32-bit elements on 64 cells of the AP1000. I then scaled the timing results using the ratio of sorting speed for the AP1000 to the CM5 as observed in earlier results (the CM5 at ANU had been decommissioned by the time the Thearling paper came out, so I could not test directly on the same hardware). This gave a comparative sorting time of 17.5 seconds. By using a more standard, memory-consuming radix sort a time of 15.8 seconds was obtained.

The interesting part of this result is how different the two algorithms are. Their algorithm uses special prefix operations of the CM5 to compute the radix histograms in parallel, then uses a single all-to-all communication stage to send the data directly to the destination cell. About 85% of the total sort time is spent in the communication stage. Despite the totally different algorithm the results are quite similar.
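As an illustration of the histogram step that a radix-based sort of this kind relies on, the sketch below counts how many keys fall into each radix bucket and turns the counts into starting offsets with an exclusive prefix sum; on the CM5 the per-processor histograms would be combined with the machine's parallel prefix operations before the single all-to-all exchange. The 8-bit bucket width and the function names are illustrative only, not taken from Thearling and Smith's implementation.

```c
#include <stdint.h>
#include <stddef.h>

#define RADIX_BITS 8
#define NBUCKETS   (1u << RADIX_BITS)

/* Count how many keys fall into each bucket for one radix digit. */
static void radix_histogram(const uint32_t *keys, size_t n,
                            unsigned shift, size_t counts[NBUCKETS])
{
    for (size_t b = 0; b < NBUCKETS; b++)
        counts[b] = 0;
    for (size_t i = 0; i < n; i++)
        counts[(keys[i] >> shift) & (NBUCKETS - 1)]++;
}

/* Exclusive prefix sum over the counts: offsets[b] is where the first
 * key of bucket b should be placed (or, in the parallel case, the
 * offset at which it should be sent to its destination cell). */
static void bucket_offsets(const size_t counts[NBUCKETS],
                           size_t offsets[NBUCKETS])
{
    size_t running = 0;
    for (size_t b = 0; b < NBUCKETS; b++) {
        offsets[b] = running;
        running += counts[b];
    }
}
```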

1.4.2 Helman, Bader and JaJa

In [Helman et al. 1996] David Helman, David Bader and Joseph JaJa detail a sample-sort-based parallel sorting algorithm and give results for a 32-node CM5, directly comparing their results with the results for our algorithm as presented in [Tridgell and Brent 1995; Tridgell and Brent 1993]. They also compare their results with a wide range of other parallel sorting algorithms and show results indicating that their algorithm is the fastest in most cases.

Like the Thearling algorithm, Helman's algorithm uses a radix sort, although in this case it is applied independently to the elements in each node. This still leaves them with an O(N) memory overhead and the problem that the algorithm is not general purpose, because a comparison function cannot be used in a radix sort.

The table they show comparing their algorithm with ours (which they call the TB algorithm) gives results for sorting 8 million uniformly distributed integers on a 32-node CM5. They give the results as 4.57 seconds for their algorithm and 5.48 seconds for ours.
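For contrast with the radix-based local phase, the following sketch shows the splitter-selection step that characterizes a generic sample sort: each node contributes a sample of its keys, a regular subsample of the gathered (and sorted) samples becomes the global splitters, and every key is then routed to the node whose splitter range contains it. This is a generic illustration of the sample-sort idea under those assumptions, not Helman, Bader and JaJa's algorithm; the helper names and sampling scheme are invented for the example.

```c
#include <stdlib.h>
#include <stddef.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Choose p-1 splitters from the gathered samples (s samples from each
 * of p nodes) by sorting them and taking every s-th element. */
static void choose_splitters(int *samples, int p, int s, int *splitters)
{
    qsort(samples, (size_t)p * s, sizeof(int), cmp_int);
    for (int i = 1; i < p; i++)
        splitters[i - 1] = samples[i * s];
}

/* Destination node for one key: the first range whose upper splitter
 * exceeds the key (linear scan; a binary search would do equally well). */
static int dest_node(int key, const int *splitters, int p)
{
    int node = 0;
    while (node < p - 1 && key >= splitters[node])
        node++;
    return node;
}
```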