1.1.4 Putting it all together
We are now in a position to describe the algorithm as a whole. The steps in the algorithm are:

• distribute the data over the P processors
• sort the data within each processor using the best available serial sorting algorithm for the data
• perform log P merge steps along the edges of a hypercube
• find which elements are unfinished (this can be done in log(N/P) time)
• sort these unfinished elements using a convenient algorithm

A sketch of these steps is given below.
Note that this algorithm arose naturally out of a simple consideration of a lower bound on the sort time. By developing the algorithm in this fashion we have guaranteed that the algorithm is optimal in the average case.
1.2 Algorithm Details
The remainder of this chapter covers the implementation details of the algorithm, showing how it can be implemented with minimal memory overhead. Each stage of the algorithm is analyzed more carefully, resulting in a more accurate estimate of the expected running time of the algorithm.
The algorithm was first presented in [Tridgell and Brent 1993]. It was developed by Andrew Tridgell and Richard Brent, and was implemented by Andrew Tridgell.
1.2.1 Nomenclature
P is the number of nodes (also called cells or processors) available on the parallel machine, and N is the total number of elements to be sorted. N_p is the number of elements in a particular node p (0 ≤ p < P). To avoid double subscripts, N_{p_j} may be written as N_j where no confusion should arise.

Elements within each node of the machine are referred to as E_{p,i}, for 0 ≤ i < N_p and 0 ≤ p < P. E_{j,i} may be used instead of E_{p_j,i} if no confusion will arise.
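As a concrete illustration of this nomenclature (a sketch; the type and field names are ours, not the thesis's), the elements held by a single node might be laid out as follows:

    typedef int element_t;   /* any fixed-size element type supporting "<" */

    struct node_elements {
        element_t *E;        /* E[i] holds E_{p,i}, for 0 <= i < n */
        int n;               /* N_p: the number of elements on this node */
    };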
When giving “big O” time bounds the reader should assume that P is fixed, so that O(N) and O(N/P) are the same.

The only operation assumed for elements is binary comparison, written with the usual comparison symbols. For example, A < B means that element A precedes element B. The elements are considered sorted when they are in non-decreasing order in each node, and in non-decreasing order between nodes. More precisely, this means that E_{p,i} ≤ E_{p,j} for all relevant i < j and all p, and that E_{p,i} ≤ E_{q,j} for 0 ≤ p < q < P and all relevant i, j.
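This condition can be checked directly. In the sketch below (with names of our own choosing, and integer elements standing in for the generic comparison), it suffices, once order within each node has been verified, to compare the last element of each node with the first element of the next non-empty node; transitivity covers all remaining pairs.

    #include <stdbool.h>

    /* E[p] points to the n[p] elements held by node p. */
    static bool is_globally_sorted(int **E, const int *n, int P)
    {
        for (int p = 0; p < P; p++) {
            /* E_{p,i} <= E_{p,j} for i < j: non-decreasing within a node */
            for (int i = 0; i + 1 < n[p]; i++)
                if (E[p][i] > E[p][i + 1])
                    return false;

            /* boundary check against the next non-empty node q > p */
            for (int q = p + 1; q < P; q++) {
                if (n[q] == 0)
                    continue;
                if (n[p] > 0 && E[p][n[p] - 1] > E[q][0])
                    return false;
                break;
            }
        }
        return true;
    }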
The speedup offered by a parallel algorithm for sorting N elements is defined as the ratio of the time to sort N elements with the fastest known serial algorithm on one node of the parallel machine to the time taken by the parallel algorithm on the parallel machine.
1.2.2 Aims of the Algorithm
The design of the algorithm had several aims:

• Speed.
• Good memory utilization. The number of elements that can be sorted should closely approach the physical limits of the machine.
• Flexibility, so that no restrictions are placed on N and P. In particular, N should not need to be a multiple of P or a power of two. These are common restrictions in parallel sorting algorithms [Ajtai et al. 1983; Akl 1985].

In order for the algorithm to be truly general purpose, the only operator that will be assumed is binary comparison. This rules out methods such as radix sort [Blelloch et al. 1991; Thearling and Smith 1992].
It is also assumed that elements are of a fixed size, because of the difficulties of pointer representations between nodes in a MIMD machine.
To obtain good memory utilization when sorting small elements, linked lists are avoided. Thus, the lists of elements referred to below are implemented using arrays, without any storage overhead for pointers.
The algorithm starts with a number of elements N assumed to be distributed over P
processing nodes. No particular distribution of elements is assumed and the only restrictions on the size of N and P are the physical constraints of the machine.
The algorithm presented here is similar in some respects to parallel shellsort [Fox et al. 1988], but contains a number of new features. For example, the memory overhead of the algorithm is considerably reduced.
1.2.3 Infinity Padding