
is obtained. It seems natural to consider dividing the algorithm in this manner, as efficient algorithms for the first stage are well known. How long would we expect each stage to take? The first stage should take $O(\frac{N}{P}\log\frac{N}{P})$ time, simply by using an optimally efficient serial sorting algorithm on each processor[2]. This clearly cannot be improved upon[3]. The more interesting problem is how long the second stage should take. We want the overall parallel sorting algorithm to take $O(\frac{N}{P}\log N)$ time, which means we would ideally like the second stage to take $O(\frac{N}{P}\log P)$ time. If it turns out that this is not achievable then we might have to revisit the decision to split the algorithm into the two stages proposed above.
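To see why $O(\frac{N}{P}\log P)$ is the natural target for the second stage, note that the two stage costs would then sum to exactly the desired overall bound:

$$
O\!\left(\frac{N}{P}\log\frac{N}{P}\right) + O\!\left(\frac{N}{P}\log P\right)
 = O\!\left(\frac{N}{P}\bigl(\log N - \log P + \log P\bigr)\right)
 = O\!\left(\frac{N}{P}\log N\right).
$$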

1.1.2 The parallel merging problem

The second stage of the parallel sorting algorithm that is now beginning to take shape is to merge P lists of $\frac{N}{P}$ elements, each stored on one of the P processors. We would like this to be done in $O(\frac{N}{P}\log P)$ time.
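One way to see why such a bound is plausible is sketched by Figure 1.1 below: the P lists can be combined in $\log_2 P$ rounds of pairwise two-processor merges (the pattern shown for eight processors in the figure). If each round can be made to cost $O(\frac{N}{P})$ per processor, the total cost of the merge stage would be

$$
\log_2 P \times O\!\left(\frac{N}{P}\right) = O\!\left(\frac{N}{P}\log P\right),
$$

which is the target set above. This is only a cost sketch; whether each round really can be done in linear local time, and whether the result is properly sorted, is examined in the rest of this chapter.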

1.1.2.1 The two processor problem

Let us first consider the simplest possible parallel merging problem, where we have just two processors and each processor starts with $\frac{N}{2}$ elements. The result we want is that the first processor ends up with all the small elements and the second processor ends up with all the large elements. Of course, "small" and "large" are defined by reference to a supplied comparison function. We will clearly need some communication between the two processors in order to transmit the elements that will end up in a different processor to the one they start in, and we will need some local movement of elements to ensure that each processor ends up with a locally sorted list of elements[4].

[2] I shall initially assume that the data is evenly distributed between the processors. The problem of balancing will be considered in the detailed description of the algorithm.
[3] It turns out that the choice of serial sorting algorithm is in fact very important. Although there are numerous "optimal" serial sorting algorithms, their practical performance varies greatly depending on the data being sorted and the architecture of the machine used.
[4] As is noted in Section 1.2.12, we don't strictly need to obtain a sorted list in each cell when this two-processor merge is being used as part of a larger parallel merge, but it does simplify the discussion.

[Figure 1.1: The parallelisation of an eight-way hypercube merge]

Both of these operations will have a linear time cost with respect to the number of elements being dealt with. Section 1.2.8 gives a detailed description of an algorithm that performs these operations efficiently. The basis of the algorithm is to first work out which elements will end up in each processor using an $O(\log N)$ bisection search, and then to transfer those elements in a single block. A block-wise two-way merge is then used to obtain a sorted list within each processor. The trickiest part of this algorithm is minimizing the memory requirements. A simple local merge algorithm typically requires order $N$ additional memory, which would restrict the overall parallel sorting algorithm to dealing with data sets of less than half the total memory in the parallel machine. Section 1.2.12 shows how to achieve the same result with $O(\sqrt{N})$ additional memory while retaining a high degree of efficiency. Merging algorithms that require less memory are possible, but they are quite computationally expensive [Ellis and Markov 1998; Huang and Langston 1988; Kronrod 1969].
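The following is a minimal sketch in C of the two-processor merge just described, with the communication step elided: both "processors" are simply local arrays. The integer element type, the function names (split_point, merge2, two_cell_merge) and the use of plain $O(N)$ scratch buffers, rather than the $O(\sqrt{N})$ block-wise scheme of Section 1.2.12, are illustrative assumptions and not the original implementation.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Bisection search: find k such that the n smallest of the 2n elements
     * are a[0..k-1] together with b[0..(n-k)-1].  Costs O(log n) comparisons. */
    static size_t split_point(const int *a, const int *b, size_t n)
    {
        size_t lo = 0, hi = n;
        while (lo < hi) {
            size_t k = (lo + hi) / 2;   /* keep k elements of a...        */
            size_t m = n - k;           /* ...and m = n - k elements of b */
            if (k < n && m > 0 && a[k] < b[m - 1])
                lo = k + 1;             /* need to keep more of a   */
            else if (k > 0 && m < n && b[m] < a[k - 1])
                hi = k;                 /* need to keep less of a   */
            else
                return k;               /* both boundary tests hold */
        }
        return lo;
    }

    /* Ordinary two-way merge of the sorted runs x[0..nx-1] and y[0..ny-1]. */
    static void merge2(const int *x, size_t nx, const int *y, size_t ny, int *out)
    {
        size_t i = 0, j = 0, o = 0;
        while (i < nx && j < ny)
            out[o++] = (y[j] < x[i]) ? y[j++] : x[i++];
        while (i < nx) out[o++] = x[i++];
        while (j < ny) out[o++] = y[j++];
    }

    /* After the call, a[] holds the n smallest elements in sorted order and
     * b[] the n largest, mimicking the roles of the two cells. */
    static void two_cell_merge(int *a, int *b, size_t n)
    {
        size_t k = split_point(a, b, n);        /* a keeps k, b keeps n-k */
        int *small_half = malloc(n * sizeof *small_half);
        int *large_half = malloc(n * sizeof *large_half);
        if (!small_half || !large_half)
            exit(EXIT_FAILURE);

        /* Cell 1: merge its kept prefix a[0..k-1] with the single block
         * b[0..n-k-1] that would be received from cell 2. */
        merge2(a, k, b, n - k, small_half);

        /* Cell 2: merge its kept suffix b[n-k..n-1] with the single block
         * a[k..n-1] that would be received from cell 1. */
        merge2(a + k, n - k, b + (n - k), k, large_half);

        memcpy(a, small_half, n * sizeof *a);
        memcpy(b, large_half, n * sizeof *b);
        free(small_half);
        free(large_half);
    }

    int main(void)
    {
        int a[] = { 1, 4, 7, 9, 12, 15 };   /* sorted list held by cell 1 */
        int b[] = { 2, 3, 5, 8, 20, 30 };   /* sorted list held by cell 2 */
        size_t n = sizeof a / sizeof a[0];

        two_cell_merge(a, b, n);

        for (size_t i = 0; i < n; i++) printf("%d ", a[i]);
        printf("| ");
        for (size_t i = 0; i < n; i++) printf("%d ", b[i]);
        printf("\n");   /* prints: 1 2 3 4 5 7 | 8 9 12 15 20 30 */
        return 0;
    }

In a real two-processor setting the two merge2 calls would run on different processors, and the blocks a[k..n-1] and b[0..n-k-1] would each be sent in a single communication; the sketch merely makes the split-then-block-merge structure concrete.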

1.1.2.2 Extending to P processors