14.6 MATRIX MULTIPLICATION
It is required to compute the product of two n x n matrices of b-bit integers. We begin by showing how the networks of the previous sections can be used for the
computation of the inner product of two vectors of integers. A matrix multiplier is then viewed as a collection of networks for inner-product computation.
Figure 14.13 Computing prefix sums on network with constant fan-out.
Let u = (u_0, u_1, ..., u_{n-1}) and v = (v_0, v_1, ..., v_{n-1}) be two vectors of b-bit integers whose inner product, that is, u_0 v_0 + u_1 v_1 + ... + u_{n-1} v_{n-1}, is to be computed. The n products u_i v_i, for i = 0, 1, ..., n - 1, can be computed in parallel using n multiplication trees. This requires [...] time and [...] processors. These n products are now fed into an addition tree with n leaves to obtain the final sum. This second stage runs in O(log n) + [...] time on [...] processors. Consequently, the inner product requires O(log n) + [...] time and [...] processors. The inner-product network is illustrated in Fig. 14.14, where the small triangles represent multiplication trees and the large triangle an addition tree.
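The two-stage structure of the inner-product network can be mimicked in software. The following Python sketch is only an illustration: the function name `inner_product_tree` is hypothetical, ordinary integer multiplication stands in for each multiplication tree, and rounds of pairwise addition stand in for the levels of the addition tree. It models the data flow of Fig. 14.14, not its bit-level timing.

```python
def inner_product_tree(u, v):
    """Inner product computed in two stages, mirroring Fig. 14.14.

    Returns (inner product, number of pairwise-addition rounds).
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    # Stage 1: the n products u_i * v_i are independent of one another,
    # so a network could form them all at the same time.
    level = [ui * vi for ui, vi in zip(u, v)]
    rounds = 0
    # Stage 2: each round adds neighbouring pairs, halving the number of
    # partial sums, like one level of an addition tree.
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad so every value has a partner
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        rounds += 1
    total = level[0] if level else 0
    return total, rounds
```

For two vectors of length 8 the sketch performs three rounds of pairwise addition, one per level of a tree with 8 leaves.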
The product of two n x n matrices consists of n^2 inner products of vectors (each row of the first matrix is multiplied by each column of the second). Suppose that we have a vector multiplier that multiplies two vectors in q time units using p processors. Then n^2 copies of this multiplier can be used to multiply two n x n matrices in q time units using n^2 p processors. In general, n^a copies, where 0 ≤ a ≤ 2, will do the job in time n^(2-a) q and use n^a p processors.
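The trade-off between the number of copies and the running time can be tabulated directly. The Python sketch below is an illustration under the assumptions stated in the text (each copy takes q time units and p processors, and the n^2 inner products are shared evenly among the copies). The function name `matrix_multiplier_budget` is hypothetical, and a is restricted to the integers 0, 1, and 2 to keep the arithmetic exact.

```python
def matrix_multiplier_budget(n, a, q, p):
    """Time and processors when n**a copies share the n**2 inner products."""
    if a not in (0, 1, 2):
        raise ValueError("this sketch only handles a = 0, 1 or 2")
    copies = n ** a
    # Each copy handles n**(2 - a) inner products, one after another.
    time = n ** (2 - a) * q
    processors = copies * p
    return time, processors
```

In every case the product of time and processors is n^2 q p, so the choice of a shifts the balance between time and hardware without changing the cost.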
Our vector multiplier of Fig. 14.14 has [...]. Thus [...] copies of this multiplier will compute the matrix product in time [...].
Figure 14.14 Inner-product network.
14.7 SELECTION
Given a randomly ordered sequence A = {a_1, a_2, ..., a_n} of n b-bit integers and an integer k, where 1 ≤ k ≤ n, it is required to determine the kth smallest element of A. In chapter 2 we called this the selection problem and presented a parallel algorithm for its solution that runs on the EREW SM SIMD model, namely, procedure PARALLEL SELECT. Assuming that each integer fits in a word of fixed size b, the procedure uses n^(1-x) processors, where 0 < x < 1, and runs in O(n^x) time, when counting operations on words. When bit operations are counted, the procedure requires O(bn^x) time for a cost of O(bn). This cost is optimal in view of the Ω(bn) operations required to simply read the input. We now describe an algorithm for the selection problem with the following properties:
1. The algorithm operates on b-bit integers where b is a variable, and the bits of each word arrive one every time unit.
2. It runs on a tree-connected parallel computer, which is significantly weaker than the SM SIMD model.
3. It matches the performance of procedure PARALLEL SELECT while being conceptually much simpler.
We begin by describing a simple version of the algorithm whose cost is not optimal. It is based on the following observation. If a set M consisting of the m largest members of A can be found, then either
(i) the kth smallest is included in M, in which case we discard from further consideration those elements of A that are not in M, thus reducing the length of the sequence by n - m, or
(ii) the kth smallest is not in M, in which case the m elements of M are removed from A.
In order to determine M, we look at the most significant bit of the elements of A. If the binary representation of element a_i of A, where 1 ≤ i ≤ n, is a_i(b - 1) a_i(b - 2) ... a_i(0), then a_i is in M if a_i(b - 1) = 1; otherwise a_i is not in M, that is, when a_i(b - 1) = 0.
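One round of this observation can be sketched as follows. The Python function below is a minimal illustration that operates on ordinary integers rather than on a network; the name `split_on_bit` and its parameters are hypothetical. It forms M from the elements whose bit j is 1 and returns the reduced sequence together with the adjusted rank.

```python
def split_on_bit(values, k, j):
    """Discard part of `values` using bit j; return (remaining, new_k).

    `k` is the 1-based rank (kth smallest) sought among `values`, all of
    which are assumed to agree on every bit above position j.
    """
    in_m = [x for x in values if (x >> j) & 1]          # bit j is 1: the set M
    not_in_m = [x for x in values if not (x >> j) & 1]  # bit j is 0
    if k <= len(not_in_m):
        # Case (ii): the kth smallest is not in M, so M is removed.
        return not_in_m, k
    # Case (i): the kth smallest is in M; the elements outside M are
    # discarded and the rank is reduced by their number.
    return in_m, k - len(not_in_m)
```

Applying the function once for each bit position, from the most significant down to the least significant, leaves only elements equal to the kth smallest.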
If this process is repeated, by considering successive bits and rejecting a portion of the original sequence each time, the kth smallest will be left. (Of course more than one integer may be left if the elements of A are not all distinct.) For ease of presentation, we assume that n, the size of the input sequence, is a power of 2. The algorithm runs on a tree-connected network of simple processors with n leaves P_1, P_2, ..., P_n. Leaf processor P_i can
(i) receive the bits of a_i serially, most significant bit first, from some input medium;
(ii) send the bits of a_i to its parent serially;
(iii) send its own index i to its parent, if requested; and
(iv) switch itself "off" if told to do so.
Initially, all leaf processors are "on." Once a leaf has been switched off, it is excluded from the remainder of the algorithm's execution: It stops reading input and no longer sends or receives messages to and from its parent.
Each of the n - 2 intermediate processors can
(i) relay messages of fixed size from its two children to its parent and vice versa;
(ii) behave as an SA-box; and
(iii) compare two (log n)-bit values.
Finally, the root processor can
(i) send and receive messages of fixed size to and from its two children;
(ii) compare two (log n)-bit values;
(iii) behave as an SA-box; and
(iv) store and update three (log n)-bit values.
The algorithm is given in what follows as procedure TREE SELECTION. When the procedure terminates, the index of the kth smallest element of A is contained in the root. If several elements of A qualify for being the kth smallest, the one with the smallest index is selected.
procedure TREE SELECTION (A, k)
Step 1: {Initialization}
    (1.1) The root processor reads n and k
    (1.2) l ← n {l is the length of the sequence remaining}
    (1.3) q ← k {the qth smallest element is to be selected}
    (1.4) finished ← false.
Step 2: while not finished do
    (2.1) for i = 1 to n do in parallel
              P_i reads the next bit of a_i
          end for
    (2.2) The sum s of the n bits just read is computed by the intermediate and root processors acting as an addition tree
    (2.3) if l - q - s ≥ 0
          then {qth not in M}
              (i) l ← l - s
              (ii) the intermediate processors relay to all leaves the root's message:
                   if latest bit read was 1 then switch "off"
                   end if
          else if l - q - s = -1 and s = 1
               then {qth element found}
                   (i) the intermediate processors relay to all leaves the root's message:
                       if latest bit read was 1 then send index to root
                       end if
                   (ii) the intermediate processors relay to the root the index of the leaf containing the qth smallest element
                   (iii) finished ← true
               else {qth in M}
                   (i) q ← q - (l - s)
                   (ii) l ← s
                   (iii) the intermediate processors relay to all leaves the root's message:
                         if latest bit read was 0 then switch "off"
                         end if
               end if
          end if
    (2.4) if l = 1 then
              (i) the intermediate processors relay to all leaves the root's message:
                  if still "on" then send index to root
                  end if
              (ii) the intermediate processors relay to the root the index of the only remaining integer
              (iii) finished ← true
          end if
    (2.5) if (there are no more input bits) and (not finished) then
              (i) the intermediate processors relay to all leaves the root's message:
                  if still "on" then send index to root
                  end if
              (ii) the intermediate processors relay to the root the index of the smallest-numbered leaf that is still "on"
              (iii) finished ← true
          end if
end while.
Note that none of the processors (root, intermediate, or leaf) is required at any stage of the algorithm's execution to store all b bits of an input integer. Therefore, the network's storage requirements are independent of b.
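The procedure can be simulated sequentially to check its logic. The Python sketch below is an illustration, not the network itself: the function name `tree_selection`, the explicit word length `b`, and the list `on` of leaf switches are assumptions introduced here, and the addition tree is replaced by a plain sum. As in the procedure, the root only ever sees the sum of the current bits and keeps the two counters l and q.

```python
def tree_selection(a, k, b):
    """Return the 1-based index of the kth smallest of the b-bit integers in a."""
    n = len(a)
    on = [True] * n          # leaf switches; every leaf starts "on"
    l, q = n, k              # root state: remaining length and rank sought
    for j in range(b - 1, -1, -1):            # most significant bit first
        # (2.1) each leaf that is still on reads its next bit
        bit = [(a[i] >> j) & 1 if on[i] else 0 for i in range(n)]
        # (2.2) the tree adds the bits; only the sum reaches the root
        s = sum(bit)
        if l - q - s >= 0:                    # qth smallest is not in M
            l -= s
            for i in range(n):
                if on[i] and bit[i] == 1:
                    on[i] = False
        elif l - q - s == -1 and s == 1:      # qth smallest is the lone 1
            return bit.index(1) + 1
        else:                                 # qth smallest is in M
            q -= l - s
            l = s
            for i in range(n):
                if on[i] and bit[i] == 0:
                    on[i] = False
        if l == 1:                            # (2.4) a single leaf remains
            return on.index(True) + 1
    # (2.5) input exhausted: ties remain, take the smallest-numbered leaf
    return on.index(True) + 1
```

The simulation stores whole integers only because it plays the role of the input media; the values it keeps on behalf of the root and the leaves are a bit, a switch, and two counters, in line with the remark that storage does not depend on b.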
Example 14.1
Assume that we want to find the fourth smallest value in {[...], 15, 12, 3, 7, 6, 13}, a sequence of eight 4-bit integers. Initially, l = 8 and q = 4. During the first iteration of step 2, the most significant bit of each input integer is read by one leaf, as shown in Fig. 14.15. The sum of these bits, s = 4, is computed at the root. Since l - q - s = 0, the four leaf processors that read a 1 are switched off, and l = 4. During the second iteration, the second most significant bits are read by the processors that are still on. This is also shown in Fig. 14.15, where the processors that were switched off are marked with an x. Since s = 2, l - q - s = -2, and the two processors that read a 0 are switched off. Now l = 2 and q = 2. In the third iteration, the sum of the third most significant bits, read by the two processors that are still on, is s = 2. Since l - q - s = -2 and both input bits were 1, no processor is switched off. Again, l = 2 and q = 2. In the fourth (and last) iteration, s = 1 and l - q - s = -1: The index of the processor holding 7 is sent to the root, signifying that the fourth smallest value in the input sequence is 7.
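The root's bookkeeping in this example can be replayed with a few lines of Python. This is a sketch only: the function name `trace_root` is hypothetical, and the first two elements of the sequence used in the usage note below, 9 and 2, are assumed stand-in values chosen to be consistent with the sums quoted in the example (one has most significant bit 1, the other has both leading bits 0). The remaining six values are those of the example.

```python
def trace_root(a, k, b):
    """Record (l, q, s) as seen by the root at the start of each iteration.

    Returns the list of recorded triples and the value finally selected.
    """
    alive = list(a)              # integers held by leaves that are still on
    l, q = len(a), k
    history = []
    for j in range(b - 1, -1, -1):
        ones = [x for x in alive if (x >> j) & 1]
        zeros = [x for x in alive if not (x >> j) & 1]
        s = len(ones)
        history.append((l, q, s))
        if l - q - s >= 0:                  # keep the leaves that read 0
            alive, l = zeros, l - s
        elif l - q - s == -1 and s == 1:    # the lone 1 is the answer
            return history, ones[0]
        else:                               # keep the leaves that read 1
            alive, q, l = ones, q - (l - s), s
        if l == 1:
            return history, alive[0]
    return history, alive[0]
```

With the assumed sequence [9, 15, 12, 2, 3, 7, 6, 13], k = 4, and b = 4, the recorded triples (l, q, s) are (8, 4, 4), (4, 4, 2), (2, 2, 2), and (2, 2, 1), and the selected value is 7, matching the four iterations described in the example.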
Analysis. Step 1 takes constant time. There are at most b iterations of step 2. During each iteration the sum of the n bits read by the leaves can be obtained by the root in O(log n) time by letting the n - 2 intermediate nodes and root simulate an addition tree with n one-bit numbers as input. Unlike the root of the addition tree, however, the root processor here retains the log n bits of the sum. Thus the time required is O(b log n). Since the number of processors is 2n - 1, the algorithm's cost is O(bn log n), which is not optimal.
Figure 14.15 Selecting fourth smallest in sequence of eight numbers.
An algorithm with optimal cost can be obtained as follows. Let N be a power of 2 such that N log n ≤ n, and assume that 2N - 1 processors are available to select the kth smallest element. These processors are arranged in a tree with N leaves. The leaf processors are required to be more powerful than the ones used by procedure TREE SELECTION: They should be able to compute the sum of n/N bits. Each leaf processor is "in charge" of n/N elements of the sequence A. These n/N integers arrive on n/N input media that the leaf examines sequentially. The parallel algorithm consists of b iterations. For j = b - 1, b - 2, ..., 0, iteration j consists of three stages.
(i) Every leaf processor finds the sum of the jth bits of (at most) n/N integers.
(ii) These sums are added by the remaining processors, and the root indicates which elements must be discarded.
(iii) Every leaf processor "marks" the discarded inputs.
Stages (i) and (iii) require O(n/N) operations. There are O(log n) operations involved in stage (ii) to go up and down the tree. Since N log n ≤ n, the time per iteration is O(n/N), for a total running time of O(bn/N). Since the number of processors is 2N - 1, the cost is O(bn), and this is optimal.
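The effect of putting several elements in the charge of each leaf can be checked numerically. The Python sketch below is an illustration only: the function names are hypothetical, every constant factor hidden by the O-notation is taken to be 1, and n is assumed to be a power of 2 with n ≥ 2.

```python
import math

def largest_leaf_count(n):
    """Largest power of two N such that N * log2(n) <= n (requires n >= 2)."""
    if n < 2:
        raise ValueError("n must be at least 2")
    log_n = math.log2(n)
    big_n = 1
    while 2 * big_n * log_n <= n:
        big_n *= 2
    return big_n

def simple_cost(n, b):
    """(2n - 1) processors times b iterations of log2(n) steps each."""
    return (2 * n - 1) * b * math.log2(n)

def grouped_cost(n, b):
    """(2N - 1) processors times b iterations of n/N steps each."""
    big_n = largest_leaf_count(n)
    return (2 * big_n - 1) * b * (n / big_n)
```

For n = 1024 and b = 8 the sketch picks N = 64. Under these simplifications the grouped cost is 16256, below 2bn = 16384, whereas the simple tree's cost is 163760 because it carries the extra log n factor.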