
it needed a fast, scalable, general-purpose parallel sorting routine. The rest of this chapter details the design, implementation and performance of just such a routine.

1.1 How fast can it go?

A question that arises when considering a new algorithm is "How fast can it go?". It helps greatly when designing an algorithm to understand the limits on the algorithm's efficiency. In the case of parallel sorting we can perform some simple calculations which are very revealing and which provide a great deal of guidance in the algorithm design.

It is well known that comparison-based sorting algorithms on a single CPU require log N! time¹, which is well approximated by N log N [Knuth 1981]. This limitation arises from the fact that the unsorted data can be in one of N! possible arrangements and that each individual comparison eliminates at most half of the arrangements from consideration.

An ideal parallel sorting algorithm which uses P processors would reduce this time by at most a factor of P, simply because any deterministic parallel algorithm can be simulated by a single processor with a time cost of P. This leads us to the observation that an ideal comparison-based parallel sorting algorithm would take time (N/P) log N.

Of course, this totally ignores the constant computational factors, communication costs and the memory requirements of the algorithm. All those factors are very important in the design of a parallel sorting algorithm and they will be looked at carefully in a later part of this chapter.
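To make the approximation concrete, a short numerical check (my illustration, not part of the thesis) compares the exact bound log₂ N! with the approximation N log₂ N; the ratio approaches 1 as N grows:

```python
import math

# log2(N!) is the exact comparison lower bound; N*log2(N) is the usual
# approximation (via Stirling's formula). The ratio tends to 1 as N grows.
for n in (10**3, 10**6):
    exact = math.lgamma(n + 1) / math.log(2)  # log2(N!) via ln(Gamma(n+1))
    approx = n * math.log2(n)
    print(f"N={n}: log2(N!) / (N log2 N) = {exact / approx:.3f}")
```

For N = 1000 the ratio is already about 0.86, and it rises slowly towards 1.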

1.1.1 Divide and conquer

We now consider parallel sorting algorithms which are divided into two stages. In the first stage each processor sorts the data that happens to start in its local memory, and in the second stage the processors exchange elements until the final sorted result is obtained. It seems natural to consider dividing the algorithm in this manner as efficient algorithms for the first stage are well known.

How long would we expect each stage to take? The first stage should take O((N/P) log(N/P)) simply by using an optimally efficient serial sorting algorithm on each processor². This clearly cannot be improved upon³.

The more interesting problem is how long the second stage should take. We want the overall parallel sorting algorithm to take O((N/P) log N) time, which means we would ideally like the second stage to take O((N/P) log P) time. If it turns out that this is not achievable then we might have to revisit the decision to split the algorithm into the two stages proposed above.

¹ Throughout this thesis log x will be used to mean ⌈log₂ x⌉.
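The first stage can be sketched in a few lines (a toy simulation of my own; the function name and layout are assumptions, and stage two is deliberately left out since it is the subject of the rest of this section):

```python
def stage_one(data, P):
    """Stage one: each of P simulated processors sorts its local block.
    Assumes len(data) is divisible by P for simplicity."""
    n = len(data) // P
    return [sorted(data[p * n:(p + 1) * n]) for p in range(P)]

blocks = stage_one([5, 2, 9, 1, 7, 3, 8, 4], P=4)
print(blocks)  # each block is locally sorted; stage two must merge the P blocks
```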

1.1.2 The parallel merging problem

The second stage of the parallel sorting algorithm that is now beginning to take shape is to merge P lists of N/P elements each, with one list stored on each of the P processors. We would like this to be done in O((N/P) log P) time.

1.1.2.1 The two processor problem

Let us first consider the simplest possible parallel merging problem, where we have just two processors and each processor starts with N/2 elements. The result we want is that the first processor ends up with all the small elements and the second processor ends up with all the large elements. Of course, "small" and "large" are defined by reference to a supplied comparison function.

We will clearly need some communication between the two processors in order to transmit the elements that will end up in a different processor to the one they start in, and we will need some local movement of elements to ensure that each processor ends up with a locally sorted list of elements⁴.

[Figure 1.1: The parallelisation of an eight way hypercube merge (diagram: three steps of merges between cells 1-8)]

Both of these operations will have a linear time cost with respect to the number of elements being dealt with. Section 1.2.8 gives a detailed description of an algorithm that performs these operations efficiently. The basis of the algorithm is to first work out which elements will end up in each processor using an O(log N) bisection search, and then to transfer those elements in a single block. A block-wise two-way merge is then used to obtain a sorted list within each processor.

² I shall initially assume that the data is evenly distributed between the processors. The problem of balancing will be considered in the detailed description of the algorithm.
³ It turns out that the choice of serial sorting algorithm is in fact very important. Although there are numerous "optimal" serial sorting algorithms their practical performance varies greatly depending on the data being sorted and the architecture of the machine used.
⁴ As is noted in Section 1.2.12 we don't strictly need to obtain a sorted list in each cell when this two processor merge is being used as part of a larger parallel merge, but it does simplify the discussion.
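A serial sketch of this two-processor merge (my own illustration; the thesis's efficient block-wise, low-memory version is the subject of Section 1.2.8) locates the split with a bisection search and then merges:

```python
import heapq

def split_point(a, b, n_keep):
    """Binary search for i such that a[:i] and b[:n_keep - i] together
    hold the n_keep smallest elements of the sorted lists a and b."""
    lo, hi = max(0, n_keep - len(b)), min(n_keep, len(a))
    while lo < hi:
        i = (lo + hi) // 2
        if b[n_keep - i - 1] > a[i]:  # too few elements taken from a
            lo = i + 1
        else:
            hi = i
    return lo

def two_processor_merge(low, high):
    """low keeps the len(low) smallest elements of the union; high keeps
    the rest. Both results are locally sorted."""
    i = split_point(low, high, len(low))
    j = len(low) - i
    new_low = list(heapq.merge(low[:i], high[:j]))
    new_high = list(heapq.merge(low[i:], high[j:]))
    return new_low, new_high

print(two_processor_merge([1, 4, 5, 9], [2, 3, 7, 8]))
# -> ([1, 2, 3, 4], [5, 7, 8, 9])
```

In the real algorithm only the crossing elements are transferred, in one block each way, rather than rebuilding both lists as this sketch does.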
The trickiest part of this algorithm is minimizing the memory requirements. A simple local merge algorithm typically requires O(N) additional memory, which would restrict the overall parallel sorting algorithm to dealing with data sets of less than half the total memory in the parallel machine. Section 1.2.12 shows how to achieve the same result with O(√N) additional memory while retaining a high degree of efficiency. Merging algorithms that require less memory are possible but they are quite computationally expensive [Ellis and Markov 1998; Huang and Langston 1988; Kronrod 1969].

1.1.2.2 Extending to P processors

Can we now produce a P processor parallel merge using a series of two processor merges conducted in parallel? In order to achieve the aim of an overall cost of O((N/P) log P) we would need to use O(log P) parallel two processor merges spread across the P processors. The simplest arrangement of two processor merges that will achieve this time cost is a hypercube arrangement, as shown for eight processors in Figure 1.1.

This seems ideal: we have a parallel merge algorithm that completes in log P parallel steps, with each step taking O(N/P) time to complete. There is only one problem: it doesn't work!
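The failure is easy to demonstrate with a toy simulation (my own sketch; a naive full merge stands in for the two-processor merge). With four cells the hypercube steps pair (0,2),(1,3) and then (0,1),(2,3), so cells 1 and 2 are never merged against each other:

```python
def hypercube_merge(cells):
    """Hypercube merge over P = 2^k cells: at each step with distance d,
    cells i and i ^ d merge, the lower cell keeping the smaller half."""
    P = len(cells)
    d = P // 2
    while d >= 1:
        for i in range(P):
            j = i ^ d
            if i < j:
                merged = sorted(cells[i] + cells[j])  # stands in for the 2-proc merge
                h = len(merged) // 2
                cells[i], cells[j] = merged[:h], merged[h:]
        d //= 2
    return cells

cells = hypercube_merge([[1, 2], [7, 8], [3, 4], [5, 6]])
print(cells)  # -> [[1, 2], [5, 6], [3, 4], [7, 8]]: cells 1 and 2 are out of order
```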

1.1.3 Almost sorting

This brings us to a central idea in the development of this algorithm. We have so far developed an algorithm which very naturally arises out of a simple analysis of the lower limit of the sort time. The algorithm is simple to implement, will clearly be very scalable due to its hypercube arrangement and is likely to involve minimal load balancing problems. All these features make it very appealing. The fact that the final result is not actually sorted is an annoyance that must be overcome.

Once the algorithm is implemented, it is immediately noticeable that although the final result is not sorted, it is "almost" sorted. By this I mean that nearly all elements are in their correct final processors and that most of the elements are in fact in their correct final positions within those processors. This will be looked at in Section 1.3.6, but for now it is good enough to know that, for large N, the proportion of elements that are in their correct final position is well approximated by 1 − P/√N.

This is quite extraordinary. It means that we can use this very efficient algorithm to do nearly all the work, leaving only a very small number of elements which are not sorted. Then we just need to find another algorithm to complete the job. This "cleanup" algorithm can be designed to work for quite small data sets relative to the total size of the data being sorted and doesn't need to be nearly as efficient as the initial hypercube based algorithm.

The cleanup algorithm chosen for this algorithm is Batcher's merge-exchange algorithm [Batcher 1968], applied to the processors so that comparison-exchange operations are replaced with merge operations between processors. Batcher's algorithm is a sorting network [Knuth 1981], which means that the processors do not need to communicate in order to determine the order in which merge operations will be performed. The sequence of merge operations is predetermined without regard to the data being sorted.
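Since the network is predetermined, the merge schedule can be generated up front. The sketch below is my own rendering of Batcher's merge-exchange network (Algorithm M in Knuth); in the parallel sort each compare-exchange pair becomes a merge between two processors:

```python
def merge_exchange_pairs(n):
    """Yield the compare-exchange pairs (i, j), i < j, of Batcher's
    merge-exchange sorting network for n elements (Knuth's Algorithm M)."""
    t = max(1, (n - 1).bit_length())  # ceil(log2 n)
    p = 1 << (t - 1)
    while p > 0:
        q, r, d = 1 << (t - 1), 0, p
        while True:
            for i in range(n - d):
                if (i & p) == r:
                    yield i, i + d
            if q == p:
                break
            d, q, r = q - p, q >> 1, p
        p >>= 1

print(list(merge_exchange_pairs(4)))
# -> [(0, 2), (1, 3), (0, 1), (2, 3), (1, 2)]
```

Because the schedule depends only on n, every processor can compute it locally with no communication, which is exactly the property exploited here.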

1.1.4 Putting it all together