14.17 Consider the following algorithm for sorting the sequence A = {a_1, a_2, ..., a_n} of b-bit integers. Two arrays of n entries each are created in memory. These two arrays are called bucket 0 and bucket 1. The algorithm consists of b iterations. At the beginning of each iteration, all positions of both buckets contain zeros. During iteration j, each element a_i of A, where a_i = a_i(b-1) a_i(b-2) ... a_i(0), is examined: A 1 is placed in position i of either bucket 0 or bucket 1 depending on whether a_i(j) is 0 or 1, respectively. The values in bucket 0, followed by those in bucket 1, form a sequence of 0s and 1s of length 2n. The prefix sums {s_1, s_2, ..., s_2n} of this sequence are now computed. Finally, element a_i is placed in position s_i or s_(n+i) of A (depending on whether bucket 0 or bucket 1 contains a 1 in position i), concluding this iteration. Show how this algorithm can be implemented in parallel and analyze its running time and cost.
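The bucket-and-prefix-sum iteration of problem 14.17 can be made concrete with a short sequential sketch. It is only an illustration, not a solution to the exercise: the function name `bucket_prefix_sort` is hypothetical, the bits are assumed to be processed least significant first, and the 1-based positions of the problem statement are mapped to 0-based Python indices. The two loops over i touch independent positions, which is where a parallel implementation would presumably assign one processor per element and replace the call to `accumulate` with a parallel prefix-sums computation.

```python
from itertools import accumulate


def bucket_prefix_sort(a, b):
    """Sort non-negative integers that fit in b bits (sequential sketch)."""
    n = len(a)
    a = list(a)
    for j in range(b):  # assumption: iteration j examines bit j, LSB first
        bucket0 = [0] * n
        bucket1 = [0] * n
        for i, x in enumerate(a):
            # Mark position i in the bucket selected by bit j of a[i].
            if (x >> j) & 1:
                bucket1[i] = 1
            else:
                bucket0[i] = 1
        # Prefix sums of bucket 0 followed by bucket 1 (length 2n).
        s = list(accumulate(bucket0 + bucket1))
        out = [None] * n
        for i, x in enumerate(a):
            # s[i] or s[n + i] is the 1-based destination of a[i].
            pos = s[i] if bucket0[i] else s[n + i]
            out[pos - 1] = x
        a = out
    return a
```

Because elements with equal bits keep their relative order in every iteration, the b passes together behave like a stable radix sort.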

14.18 The networks in sections 14.2-14.6 receive their inputs and produce their outputs least significant bit first. By contrast, the networks in sections 14.7 and 14.8 receive their inputs and produce their outputs most significant bit first.

This may be a problem if the output of one network (of the first type) is to serve as the input to another network (of the second type), or vice versa. Suggest ways to overcome this difficulty.

14.19 Let us define (i) clock cycle as the time elapsed from the moment one input bit arrives at a network to the moment the following bit arrives and (ii) gate delay as the time taken by a gate to produce its output. Show that, for the networks in this chapter to operate properly, it is important that

clock cycle > gate delay.

14.20 Argue that the running time analyses in this chapter are correct provided that the ratio of clock cycle to gate delay is constant.

14.21 Show that the process of computing the majority of fundamental statistical quantities, such as the mean, standard deviation, and moment, can be speeded up using the networks described in this chapter.

14.22 Design a network for computing the greatest common divisor of two b-bit integers.

14.10 BIBLIOGRAPHICAL REMARKS

As mentioned in the introduction, most models of computation assume that the word size of the input data is fixed and that each data word is available in its entirety when needed; see, for example, [Aho], [Horowitz], and [Knuth]. In this section, we briefly review some of the algorithms that were designed to solve the problems addressed in sections 14.2-14.8 based on these two assumptions. When comparing those algorithms to the networks of this chapter, one should keep in mind that the latter do not make the preceding two assumptions and can therefore be used (if needed) in situations where these assumptions apply (as well as in situations where they do not).

The fastest known algorithm for adding two b-bit integers is the carry-look-ahead adder [Kuck]. It runs in O(log b) time and uses O(b log b) gates with arbitrarily large fan-out. The algorithm's cost is therefore O(b log^2 b). This is to be contrasted with the O(b) cost of the SA-box.

The sum of n b-bit integers can be computed by a tree of carry-look-ahead adders [Ullman]. This requires O((log n)(log b)) time and O(nb log b) gates for a cost of O((n log n)(b log^2 b)). By comparison, the tree of SA-boxes described in section 14.3 uses fewer gates, has a lower cost, and is faster for b = O(log n). Another algorithm superior to the tree of carry-look-ahead adders is described in problem 14.2.

Two solutions are given in [Kuck] to the problem of multiplying two b-bit integers. The first one uses carry-look-ahead adders and requires O(log^2 b) time and O(b^2 log b) gates. The second and more elaborate solution is based on a combination of carry-save and carry-look-ahead adders. It uses O(b^2) gates and runs in O(log^2 b) time (when the fan-out of the gates is constant) and O(log b) time (when the fan-out is equal to b) for costs of O(b^2 log^2 b) and O(b^2 log b), respectively. Both of these costs are larger than the O(b^2) cost of the multiplication tree and multiplication mesh of section 14.4.

If carry-look-ahead adders are used in section 13.2.3 for computing the prefix sums of a sequence of n integers, then the tree algorithm described therein would require O((log n)(log b)) time and O(nb log b) gates for a cost of O((n log n)(b log^2 b)). Assume for concreteness that b = O(log n). Then the preceding expressions describing the running time, number of gates, and cost become O((log n)(log log n)), O((n log n)(log log n)), and O((n log^2 n)(log^2 log n)), respectively. The corresponding expressions for the networks of section 14.5 are O(log n), O(n log n), and O(n log^2 n).

Procedure CUBE MATRIX MULTIPLICATION of section 7.3.2 uses n^3 processors and runs in O(log n) time. If the processors are based on the integer multiplier given in [Kuck], whose gate and time requirements are O(b^2) and O(log^2 b), respectively, then the product of two n x n matrices of b-bit integers can be obtained in O((log n)(log^2 b)) time using O(n^3 b^2) gates. This yields a cost of O((n^3 log n)(b^2 log^2 b)). Again, let b = O(log n). The cost of procedure CUBE MATRIX MULTIPLICATION in this case is O(n^3 log^3 n log^2 log n). This is larger than the O(n^3 log^2 n) cost of the network described in section 14.6. Note also that the product of the solution time by the number of gates used for any sequential matrix multiplication algorithm of the type described, for example, in [Coppersmith] and [Gonnet], can be improved from O(n^x b^2 log^2 b), where x < 3 (using the integer multiplier in [Kuck]), to O(n^x b^2) (using the multiplication tree or mesh of section 14.4).

Many tree algorithms exist for selecting the kth smallest element of a sequence of n b-bit integers (assuming that all bits are available simultaneously). Some of these are reviewed in [Aggarwal].

The best such algorithm runs in O(log^2 n) time. Counting bit operations, this running time becomes O(b log^2 n). Unlike (the modified) procedure TREE SELECTION described in section 14.7, this algorithm is not cost optimal.

A cost-optimal algorithm for sorting n b-bit integers is described in [Leighton]. It uses O(n) processors and runs in O(b log n) time (counting bit operations), for an optimal cost of O(nb log n). Using the bit comparators described in section 14.8 and in [Knuth], sorting can be performed in O(b + log n) time.
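To illustrate what a bit comparator does, the following sketch simulates a compare-exchange element that reads two integers one bit per step, most significant bit first, and emits one bit of the minimum and one bit of the maximum in the same step. It is a minimal sketch under stated assumptions, not the design of section 14.8 or of [Knuth]: the three-state machine, the function `bit_serial_compare_exchange`, and the helpers `to_bits` and `from_bits` are all hypothetical names introduced here.

```python
def to_bits(value, b):
    """Return the b-bit representation of value, most significant bit first."""
    return [(value >> k) & 1 for k in range(b - 1, -1, -1)]


def from_bits(bits):
    """Rebuild an integer from a most-significant-bit-first list of bits."""
    result = 0
    for bit in bits:
        result = (result << 1) | bit
    return result


def bit_serial_compare_exchange(x_bits, y_bits):
    """Consume two MSB-first bit streams; return (min_bits, max_bits)."""
    state = "equal"  # assumed states: "equal", "x_smaller", "y_smaller"
    min_bits, max_bits = [], []
    for xb, yb in zip(x_bits, y_bits):
        # The first position where the bits differ decides the order for good.
        if state == "equal" and xb != yb:
            state = "x_smaller" if xb < yb else "y_smaller"
        if state == "y_smaller":
            min_bits.append(yb)
            max_bits.append(xb)
        else:
            # While undecided the two bits are identical, so routing x to the
            # minimum output is safe.
            min_bits.append(xb)
            max_bits.append(yb)
    return min_bits, max_bits
```

A usage example: `bit_serial_compare_exchange(to_bits(6, 3), to_bits(5, 3))` yields the bit streams of 5 and 6. Since each output bit depends only on the current input bits and a constant amount of state, comparators of this kind can be chained so that one stage starts working before the previous one has seen all b bits, which is one way to see how a running time of the form O(b + log n) can arise.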

The networks in this chapter are mostly from [Akl] and [Cooper]. Other algorithms concerned with bit operations are described in [Aggarwal], [Akl], [Batcher], [Brent], [Kannan], [Luk], [Reeves], [Siegel], and [Yu] for a variety of computational problems.

The Bit Complexity of Parallel Computations Chap. 14