Prefix Sums on a Tree
13.2.3 Prefix Sums on a Tree
We now describe a parallel algorithm for computing the prefix sums that combines the advantages of those in the previous two sections without their disadvantages. First, the algorithm is designed to run on a (binary) tree of processors operating
synchronously: A tree is not only less specialized than the network in section 13.2.1, but in addition is a simpler interconnection than the perfect unshuffle. Second, the algorithm involves no mask computation and hence requires very simple processors.
Let the inputs $x_0, x_1, \ldots, x_{n-1}$ reside in the $n$ leaf processors $P_0, P_1, \ldots, P_{n-1}$ of a binary tree, one input to a leaf. When the algorithm terminates, it is required that $P_i$ hold $s_i = x_0 + x_1 + \cdots + x_i$.
During the algorithm, the root, intermediate, and leaf processors are required to perform very simple operations. These are described for each processor type.
Root Processor
(1) if an input is received from the left child
    then send it to the right child
    end if.
(2) if an input is received from the right child
    then discard it
    end if.
Intermediate Processor
(1) if an input is received from the left and right children
    then (i) send the sum of the two inputs to the parent
         (ii) send the left input to the right child
    end if.
(2) if an input is received from the parent
    then send it to the left and right children
    end if.
Leaf Processor $P_i$
(1) $s_i \leftarrow x_i$.
(2) send the value of $x_i$ to the parent.
(3) if an input is received from the parent
    then add it to $s_i$
    end if.
Note that the root and intermediate processors are triggered to action when they receive an input. Similarly, after having assigned $x_i$ to $s_i$ and sent it to its parent, a leaf processor is also triggered to action by an input received from its parent. After the rightmost leaf processor has received $\log n$ inputs, the values of $s_0, s_1, \ldots, s_{n-1}$ are the prefix sums of $x_0, x_1, \ldots, x_{n-1}$.
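The three procedures above can be traced with a short sequential simulation. This is only a sketch, not a parallel implementation: it assumes that n is a power of two (so the tree is complete), and the names `tree_prefix_sums`, `subtree`, and `send_down` are hypothetical helpers introduced here for illustration.

```python
def tree_prefix_sums(x):
    """Sequentially simulate the tree algorithm on the leaf inputs x.

    Assumption: len(x) is a positive power of two (complete binary tree).
    Returns the prefix sums and the number of inputs each leaf received.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("len(x) must be a positive power of two")
    s = list(x)            # leaf step (1): s_i <- x_i
    received = [0] * n     # inputs each leaf has received from its parent

    def send_down(lo, hi, value):
        # Intermediate processors forward a parent's input to both children,
        # so the value reaches every leaf in lo..hi-1, which adds it to s_i.
        for i in range(lo, hi):
            s[i] += value
            received[i] += 1

    def subtree(lo, hi):
        # Returns what the root of the subtree over leaves lo..hi-1 sends up.
        if hi - lo == 1:
            return x[lo]                 # leaf step (2)
        mid = (lo + hi) // 2
        left = subtree(lo, mid)
        right = subtree(mid, hi)
        send_down(mid, hi, left)         # the left input goes to the right child
        return left + right              # the sum goes to the parent

    subtree(0, n)                        # the root discards the returned sum
    return s, received


sums, received = tree_prefix_sums([3, 1, 4, 1, 5, 9, 2, 6])
print(sums)       # prefix sums held by the leaves
print(received)   # the rightmost leaf receives log n inputs
```

In this simulation the rightmost leaf receives exactly log n inputs, one from each level of the tree, which matches the termination condition stated above.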
13.2 Computing Prefix Sums
Example 13.1
The algorithm is illustrated in Fig. 13.3 for the input sequence X =
Analysis. The number of steps required by the algorithm is the distance between the leftmost and rightmost leaves, which is $2 \log n$. Thus $t(n) = O(\log n)$. Since $p(n) = 2n - 1$, $c(n) = O(n \log n)$. This cost is not optimal, since the prefix sums can be computed sequentially in $O(n)$ time. It is not difficult, however, to obtain a cost-optimal algorithm by increasing the capabilities of the leaf processors.
Let a processor tree with $N$ leaves $P_0, P_1, \ldots, P_{N-1}$ be available, where $n \geq N$. We assume for simplicity that $n$ is a multiple of $N$, although the algorithm can easily be adapted to work for all values of $n$. Given the input sequence $X = \{x_0, x_1, \ldots, x_{n-1}\}$, leaf processor $P_i$ initially contains the elements $x_{im}, x_{im+1}, \ldots, x_{im+m-1}$, where $m = n/N$. The root and intermediate processors behave exactly as before, whereas the leaves now execute the steps given in the next procedure. In what follows, $v_i$ denotes the number of 1 bits in the binary representation of $i$.
Leaf Processor $P_i$
(1) Compute all prefix sums of $x_{im}, x_{im+1}, \ldots, x_{im+m-1}$, store the results in $s_{im}, s_{im+1}, \ldots, s_{im+m-1}$, and send $s_{im+m-1}$ to the parent processor.
(2) Set a temporary sum $r_i$ to zero.
(3) if an input is received from the parent
    then add it to $r_i$
    end if.
(4) if $r_i$ is the sum of exactly $v_i$ inputs received from the parent
    then add $r_i$ to each of $s_{im}, s_{im+1}, \ldots, s_{im+m-1}$
    end if.
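The leaf procedure with several elements per leaf can be sketched in the same sequential style. The following is an illustrative simulation only, assuming that N is a power of two and that n is a positive multiple of N; the function name `blocked_tree_prefix_sums` and the variable names `r` and `count` are hypothetical and chosen here for readability.

```python
from itertools import accumulate


def blocked_tree_prefix_sums(x, N):
    """Simulate the tree algorithm with N leaves, each holding n/N elements.

    Assumptions: N is a power of two and len(x) is a positive multiple of N.
    Returns the n prefix sums and the number of inputs each leaf received.
    """
    n = len(x)
    if N <= 0 or N & (N - 1) or n == 0 or n % N:
        raise ValueError("N must be a power of two dividing len(x) > 0")
    m = n // N
    # Step (1): every leaf computes the prefix sums of its own m elements.
    blocks = [list(accumulate(x[i * m:(i + 1) * m])) for i in range(N)]
    r = [0] * N        # step (2): temporary sums, initially zero
    count = [0] * N    # number of inputs each leaf received from its parent

    def subtree(lo, hi):
        # Returns what the root of the subtree over leaves lo..hi-1 sends up.
        if hi - lo == 1:
            return blocks[lo][-1]        # last local prefix sum goes to the parent
        mid = (lo + hi) // 2
        left = subtree(lo, mid)
        right = subtree(mid, hi)
        for i in range(mid, hi):         # step (3): leaves on the right add the input
            r[i] += left
            count[i] += 1
        return left + right

    subtree(0, N)
    # Step (4): each leaf adds its temporary sum to all of its local sums.
    result = [value + r[i] for i in range(N) for value in blocks[i]]
    return result, count


sums, count = blocked_tree_prefix_sums(list(range(1, 9)), N=4)
print(sums)
print(count == [bin(i).count("1") for i in range(4)])
```

The second printed value checks that leaf i received as many inputs as there are 1 bits in the binary representation of i, which is the condition tested in step (4).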
In order to understand the termination condition in step (4), note that $v_i$, the number of 1 bits in the binary representation of $i$, is precisely the number of roots of subtrees to the left of $P_i$ that will send input to $P_i$.
Analysis. The number of data that are required by the algorithm to travel up and down the tree is independent of the number of elements stored in each leaf processor. It follows that the running time of the algorithm is the sum of
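1. the time required by leaf $P_i$ to compute $s_{im}, s_{im+1}, \ldots, s_{im+m-1}$, the prefix sums of its $m = n/N$ elements, and then send $s_{im+m-1}$ to its parent [$O(n/N)$ time], since all leaves execute this step simultaneously;
2. the time required by the rightmost leaf $P_{N-1}$ to receive its final input [$O(\log N)$ time]; and
3. the time required by the rightmost leaf $P_{N-1}$ (the last processor to terminate) to add its temporary sum $r_{N-1}$ to each of the sums it contains [$O(n/N)$ time].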
Decision and Optimization Chap. 13
Figure 13.3 Computing prefix sums on a tree of processors.
Thus $t(n) = O(n/N) + O(\log N)$. Since $p(n) = 2N - 1$, $c(n) = O(n + N \log N)$. It follows that the algorithm is cost optimal if $N \log N = O(n)$. For example, $N = O(n/\log n)$ will suffice to achieve cost optimality. It should be noted here that the algorithm's cost optimality is due primarily to the fact that the time taken by computations within the leaves dominates the time required by the processors to communicate among themselves. This was achieved by partitioning the prefix sum problem into disjoint subproblems that require only a
small amount of communication. As a result, the model's limited communication ability (subtrees are connected only through their roots) is overcome.
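The effect of the number of leaves on the cost can be illustrated numerically. The sketch below uses an assumed unit-cost model that is not part of the text: each leaf spends n/N steps computing its local sums and n/N steps adding the temporary sum, and a value takes about 2 log N steps to travel up and down the tree. The function name `tree_cost` is hypothetical.

```python
def tree_cost(n, N):
    """Processors times steps for n inputs on a tree with N leaves.

    Assumed unit-cost model (illustrative only): 2 * (n / N) steps of work
    inside each leaf plus 2 * log2(N) steps of communication.
    """
    if N <= 0 or N & (N - 1) or n <= 0 or n % N:
        raise ValueError("N must be a power of two that divides n")
    m = n // N
    log_leaves = N.bit_length() - 1     # exact log2 of a power of two
    steps = 2 * m + 2 * log_leaves
    processors = 2 * N - 1
    return processors * steps


n = 2 ** 16
one_per_leaf = tree_cost(n, n)          # N = n, one input per leaf
blocked = tree_cost(n, n // 16)         # N = n / log2(n)
print(one_per_leaf / n, blocked / n)    # cost per input element
```

Under these assumptions the cost per input is roughly 68 when N = n and roughly 7 when N = n / log n. The constants depend entirely on the assumed model; only the trend (a cost that grows like n log n versus one that grows like n) reflects the analysis above.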