Prefix Sums on a Tree

13.2.3 Prefix Sums on a Tree

We now describe a parallel algorithm for computing the prefix sums that combines the advantages of those in the previous two sections without their disadvantages. First, the algorithm is designed to run on a (binary) tree of processors operating synchronously: a tree is not only less specialized than the network in section 13.2.1, but is also a simpler interconnection than the perfect unshuffle. Second, the algorithm involves no mask computation and hence requires very simple processors.

Let the inputs $x_0, x_1, \ldots, x_{n-1}$ reside in the $n$ leaf processors $P_0, P_1, \ldots, P_{n-1}$ of a binary tree, one input to a leaf. When the algorithm terminates, it is required that $P_i$ hold $s_i = x_0 + x_1 + \cdots + x_i$.

During the algorithm, the root, intermediate, and leaf processors are required to perform very simple operations. These are described for each processor type.

Root Processor

(1) if an input is received from the left child
    then send it to the right child
    end if.

(2) if an input is received from the right child
    then discard it
    end if.

Intermediate Processor

(1) if an input is received from the left and right children
    then (i) send the sum of the two inputs to the parent
         (ii) send the left input to the right child
    end if.

(2) if an input is received from the parent
    then send it to the left and right children
    end if.

Leaf Processor $P_i$

(1) $s_i \leftarrow x_i$.

(2) send the value of $x_i$ to the parent.

(3) if an input is received from the parent
    then add it to $s_i$
    end if.

Note that the root and intermediate processors are triggered to action when they receive an input. Similarly, after having assigned $x_i$ to $s_i$ and sent $x_i$ to its parent, a leaf processor is also triggered to action by an input received from its parent. After the rightmost leaf processor has received $\log n$ inputs, the values of $s_0, s_1, \ldots, s_{n-1}$ are the prefix sums of $x_0, x_1, \ldots, x_{n-1}$.
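The three processor rules can be checked with a short sequential simulation. The sketch below is not a synchronous parallel implementation: it assumes $n$ is a power of 2, replaces message passing with recursion over leaf ranges, and uses the hypothetical names `tree_prefix_sums`, `upward`, and `broadcast`. It also counts how many inputs each leaf receives from its parent.

```python
def tree_prefix_sums(x):
    """Sequentially simulate the tree algorithm (assumes len(x) is a power of 2).

    Returns (s, received): the final leaf values and, for each leaf,
    the number of inputs it received from its parent.
    """
    n = len(x)
    assert n >= 1 and n & (n - 1) == 0, "this sketch assumes n is a power of 2"
    s = list(x)          # leaf step (1): s_i <- x_i
    received = [0] * n   # inputs received from the parent, per leaf

    def broadcast(lo, hi, value):
        # An input arriving from a parent is forwarded to both children,
        # so it reaches every leaf in [lo, hi); leaf step (3) adds it to s_i.
        for i in range(lo, hi):
            s[i] += value
            received[i] += 1

    def upward(lo, hi):
        # Value sent to the parent by the subtree covering leaves [lo, hi).
        if hi - lo == 1:
            return x[lo]               # leaf step (2): send x_i to the parent
        mid = (lo + hi) // 2
        left = upward(lo, mid)
        right = upward(mid, hi)
        broadcast(mid, hi, left)       # send the left input to the right child
        return left + right            # sum goes to the parent

    upward(0, n)                       # at the root the returned sum is discarded
    return s, received


print(tree_prefix_sums([1, 2, 3, 4]))  # → ([1, 3, 6, 10], [0, 1, 1, 2])
```

The input `[1, 2, 3, 4]` is an arbitrary illustration. In this run the rightmost leaf receives $\log 4 = 2$ inputs, in line with the termination condition stated above.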

13.2 Computing Prefix Sums

Example 13.1

The algorithm is illustrated in Fig. 13.3 for the input sequence X =

Analysis. The number of steps required by the algorithm is the distance between the leftmost and rightmost leaves, which is $2 \log n$. Thus $t(n) = O(\log n)$. Since $p(n) = 2n - 1$, $c(n) = O(n \log n)$. This cost is not optimal, since the prefix sums can be computed sequentially in $O(n)$ time. It is not difficult, however, to obtain a cost-optimal algorithm by increasing the capabilities of the leaf processors. Let a processor tree with $N$ leaves $P_0, P_1, \ldots, P_{N-1}$ be available, where $n \geq N$. We assume for simplicity that $n$ is a multiple of $N$, although the algorithm can easily be adapted to work for all values of $n$. Given the input sequence $X = \{x_0, x_1, \ldots, x_{n-1}\}$, leaf processor $P_i$ initially contains the elements $x_{i(n/N)}, x_{i(n/N)+1}, \ldots, x_{(i+1)(n/N)-1}$.

The root and intermediate processors behave exactly as before, whereas the leaves now execute the steps given in the next procedure. In what follows, $v_i$ denotes the number of 1 bits in the binary representation of $i$, that is, $v_0 = 0$, $v_1 = 1$, $v_2 = 1$, $v_3 = 2$, and so on, and $m = n/N$.

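Leaf Processor $P_i$

(1) Compute all prefix sums of $x_{im}, x_{im+1}, \ldots, x_{im+m-1}$, store the results in $s_{im}, s_{im+1}, \ldots, s_{im+m-1}$, and send $s_{im+m-1}$ to the parent processor.

(2) Set a temporary sum $r_i$ to zero.

(3) if an input is received from the parent
    then add it to $r_i$
    end if.

(4) if $r_i$ is the sum of exactly $v_i$ inputs received from the parent
    then add $r_i$ to each of $s_{im}, s_{im+1}, \ldots, s_{im+m-1}$
    end if.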
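The modified leaf procedure can be sketched in the same sequential style. The code below assumes $N$ is a power of 2 and $n$ is a multiple of $N$; the function name `blocked_tree_prefix_sums` and the variables `blocks`, `totals`, `r`, and `count` are illustrative choices, and recursion again stands in for message passing. The number of 1 bits in $i$ is computed as `bin(i).count("1")`.

```python
from itertools import accumulate


def blocked_tree_prefix_sums(x, N):
    """Sequential sketch of the tree algorithm with m = n/N elements per leaf."""
    n = len(x)
    assert N >= 1 and N & (N - 1) == 0, "this sketch assumes N is a power of 2"
    assert n >= N and n % N == 0, "n is assumed to be a multiple of N"
    m = n // N

    # Leaf step (1): each leaf computes the prefix sums of its own block
    # and sends the last one (the block total) to its parent.
    blocks = [list(accumulate(x[i * m:(i + 1) * m])) for i in range(N)]
    totals = [block[-1] for block in blocks]

    r = [0] * N      # leaf step (2): temporary sum, initially zero
    count = [0] * N  # number of inputs each leaf has received from its parent

    def upward(lo, hi):
        # Value sent to the parent by the subtree covering leaves [lo, hi).
        if hi - lo == 1:
            return totals[lo]
        mid = (lo + hi) // 2
        left = upward(lo, mid)
        right = upward(mid, hi)
        for i in range(mid, hi):  # left input travels down the right subtree
            r[i] += left          # leaf step (3): add the input to r_i
            count[i] += 1
        return left + right

    upward(0, N)

    result = []
    for i in range(N):
        # Leaf step (4): leaf i is done once it has received exactly as many
        # inputs as there are 1 bits in the binary representation of i.
        assert count[i] == bin(i).count("1")
        result.extend(value + r[i] for value in blocks[i])
    return result


print(blocked_tree_prefix_sums([1, 2, 3, 4, 5, 6, 7, 8], 4))
# → [1, 3, 6, 10, 15, 21, 28, 36]
```

Only the block totals travel through the tree here, so the amount of communication depends on $N$ and not on the block size.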

In order to understand the termination condition in step 4, note that $v_i$ is precisely the number of roots of subtrees to the left of $P_i$ that will send input to $P_i$.

Analysis. The number of data that are required by the algorithm to travel up and down the tree is independent of the number of elements stored in each leaf processor. It follows that the running time of the algorithm is the sum of

1. the time required by leaf $P_i$ to compute $s_{im}, s_{im+1}, \ldots, s_{im+m-1}$ and then send $s_{im+m-1}$ to its parent [i.e., $O(n/N)$ time], since all leaves execute this step simultaneously;

2. the time required by the rightmost leaf $P_{N-1}$ to receive its final input [i.e., $O(\log N)$ time]; and

3. the time required by the rightmost leaf (the last processor to terminate) to add $r_{N-1}$ to each of the sums it contains [i.e., $O(n/N)$ time].

Decision and Optimization Chap. 13

Figure 13.3 Computing prefix sums on tree of processors.

Thus $t(n) = O(n/N) + O(\log N)$. Since $p(n) = 2N - 1$, $c(n) = O(n + N \log N)$. It follows that the algorithm is cost optimal if $N \log N = O(n)$. For example, $N = O(n/\log n)$ will suffice to achieve cost optimality.

It should be noted here that the algorithm's cost optimality is due primarily to the fact that the time taken by computations within the leaves dominates the time required by the processors to communicate among themselves. This was achieved by partitioning the prefix sum problem into disjoint subproblems that require only a small amount of communication. As a result, the model's limited communication ability (subtrees are connected only through their roots) is overcome.
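The cost bound can be spelled out as a short derivation. This is a sketch under stated assumptions: logarithms are taken base 2, $n \geq 2$, the cost is taken to be the product $c(n) = p(n)\,t(n)$, and the choice $N = n/\log n$ is treated as exact.

```latex
\begin{align*}
c(n) &= p(n)\, t(n) = (2N - 1)\bigl(O(n/N) + O(\log N)\bigr) = O(n + N \log N), \\
N \log N &= \frac{n}{\log n}\,\log\frac{n}{\log n}
          \le \frac{n}{\log n}\,\log n = n
          \quad \text{for } N = \frac{n}{\log n}, \\
c(n) &= O(n), \qquad t(n) = O(\log n) + O(\log N) = O(\log n).
\end{align*}
```

Under these assumptions the cost matches the $O(n)$ operations needed to compute the prefix sums sequentially, while the running time stays logarithmic.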
