MATRIX MULTIPLICATION

14.6 MATRIX MULTIPLICATION

It is required to compute the product of two n x n matrices of b-bit integers. We begin by showing how the networks of the previous sections can be used for the computation of the inner product of two vectors of integers. A matrix multiplier is then viewed as a collection of networks for inner-product computation.


Figure 14.13 Computing prefix sums on network with constant fan-out.

Let u = (u_0, u_1, ..., u_{n-1}) and v = (v_0, v_1, ..., v_{n-1}) be two vectors of b-bit integers whose inner product, that is,

u_0 v_0 + u_1 v_1 + ... + u_{n-1} v_{n-1},

is to be computed. The n products u_i v_i, for i = 0, 1, ..., n - 1, can be computed in parallel using n multiplication trees. This takes the time of a single multiplication tree and n times its processors. These n products are now fed into an addition tree with n leaves to obtain the final sum. This second stage runs in O(log n) + O(b) time on O(n) processors. Consequently, the inner product requires O(log n) + O(b) time, and its processor count is dominated by the n multiplication trees. The inner-product network is illustrated in Fig. 14.14, where the small triangles represent multiplication trees and the large triangle an addition tree.
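The two stages of the inner-product network can be imitated sequentially. The following is a minimal sketch, not the network itself: the function name `inner_product_tree` is hypothetical, ordinary integer arithmetic stands in for the multiplication trees, and each pass of the loop models one level of the addition tree.

```python
def inner_product_tree(u, v):
    """Return (inner product, number of addition-tree levels used)."""
    if len(u) != len(v) or not u:
        raise ValueError("vectors must be non-empty and of equal length")
    # Stage 1: the n products u_i * v_i are independent of one another,
    # so n multiplication trees could form them simultaneously.
    level = [ui * vi for ui, vi in zip(u, v)]
    # Stage 2: addition tree. Each pass models one level of adders.
    levels = 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad an odd level with a neutral element
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        levels += 1
    return level[0], levels


total, depth = inner_product_tree([1, 2, 3, 4], [5, 6, 7, 8])
```

For the four-element vectors above, the sum is 70 and the addition stage uses two levels, which matches the logarithmic depth of a tree with four leaves.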

The product of two n x n matrices consists of n^2 inner vector products (each row of the first matrix is multiplied by each column of the second). Suppose that we have a multiplier for vectors that multiplies two vectors in q time units using p processors. Then n^2 copies of this multiplier can be used to multiply two n x n matrices in q time units using n^2 p processors. In general, n^a copies, where 0 ≤ a ≤ 2, will do the job in time n^(2-a) q and use n^a p processors.
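The trade-off between copies and time can be checked with a small simulation. This is a sketch under stated assumptions: `matmul_with_multipliers` and its `copies` parameter are hypothetical names, and a "round" stands for one use of the whole bank of vector multipliers, that is, q time units.

```python
def matmul_with_multipliers(A, B, copies):
    """Multiply two n x n matrices with a bank of `copies` vector multipliers.

    Returns (C, rounds), where rounds counts sequential uses of the bank.
    """
    if copies < 1:
        raise ValueError("at least one multiplier is needed")
    n = len(A)
    columns = [[B[r][c] for r in range(n)] for c in range(n)]
    tasks = [(i, j) for i in range(n) for j in range(n)]  # n^2 inner products
    C = [[0] * n for _ in range(n)]
    rounds = 0
    for start in range(0, len(tasks), copies):
        # Every task in this batch would run on its own multiplier at once.
        for i, j in tasks[start:start + copies]:
            C[i][j] = sum(x * y for x, y in zip(A[i], columns[j]))
        rounds += 1
    return C, rounds


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
product, rounds = matmul_with_multipliers(A, B, copies=2)
```

With n = 2 and 2 = n^1 copies, the bank is used n^(2-1) = 2 times; with n^2 = 4 copies a single round suffices.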

Our vector multiplier of Fig. 14.14 has q = O(log n) + O(b), with p equal to the processor count of the inner-product network.

376 The Bit Complexity of Parallel Computations Chap.

Figure 14.14 Inner-product network.

Thus n^a copies of this multiplier will compute the matrix product in time O(n^(2-a) (log n + b)).

14.7 SELECTION

Given a randomly ordered sequence A = {a_1, a_2, ..., a_n} of n b-bit integers and an integer k, where 1 ≤ k ≤ n, it is required to determine the kth smallest element of A. In chapter 2 we called this the selection problem and presented a parallel algorithm for its solution that runs on the EREW SM SIMD model, namely, procedure PARALLEL SELECT. Assuming that each integer fits in a word of fixed size b, the procedure uses n^(1-x) processors, where 0 < x < 1, and runs in O(n^x) time, when counting operations on words. When bit operations are counted, the procedure requires O(bn^x) time for a cost of O(bn). This cost is optimal in view of the Ω(bn) operations required to simply read the input. We now describe an algorithm for the selection problem with the following properties:

1. The algorithm operates on b-bit integers where b is a variable, and the bits of each word arrive one every time unit.

2. It runs on a tree-connected parallel computer, which is significantly weaker than the SM SIMD model.

3. It matches the performance of procedure PARALLEL SELECT while being conceptually much simpler.

We begin by describing a simple version of the algorithm whose cost is not optimal. It is based on the following observation. If a set M consisting of the m largest members of A can be found, then either

(i) the kth smallest is included in M, in which case we discard from further consideration those elements of A that are not in M, thus reducing the length of the sequence by n - m, or

(ii) the kth smallest is not in M, in which case the m elements of M are removed from A.

In order to determine M, we look at the most significant bit of the elements of A. If the binary representation of element a_i of A, where 1 ≤ i ≤ n, is

a_i(b - 1) a_i(b - 2) ... a_i(0),

then a_i is in M if a_i(b - 1) = 1; otherwise, when a_i(b - 1) = 0, a_i is not in M.
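One filtering step of this observation can be sketched as follows. The helper `filter_by_bit` and the sample data are hypothetical; q is the rank still being sought among the surviving values, and the step assumes that all survivors agree on every bit above position j, so that the values with a 1 in position j are indeed the largest ones.

```python
def filter_by_bit(values, q, j):
    """Apply one filtering step on bit j; return (survivors, new rank q)."""
    ones = [x for x in values if ((x >> j) & 1) == 1]   # the set M
    zeros = [x for x in values if ((x >> j) & 1) == 0]
    if q <= len(zeros):
        # Case (ii): the q-th smallest is not in M, so M is removed.
        return zeros, q
    # Case (i): the q-th smallest is in M; the smaller values are dropped
    # and the rank is reduced by the number of values discarded.
    return ones, q - len(zeros)


# Hypothetical 4-bit inputs; look for the fourth smallest.
survivors, q = filter_by_bit([5, 9, 14, 2, 11, 7, 4, 8], 4, 3)
```

Here the four values with a leading 1 are discarded, leaving `[5, 2, 7, 4]` with the rank unchanged at 4. Repeating the step on bit 2 keeps `[5, 7, 4]` and lowers the rank to 3.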

If this process is repeated, by considering successive bits and rejecting a portion of the original sequence each time, the kth smallest will be left. (Of course more than one integer may be left if all the elements of A are not distinct.) For ease of presentation, we assume that n, the size of the input sequence, is a power of 2. The algorithm runs on a tree-connected network of simple processors with n leaves P_1, P_2, ..., P_n. Leaf processor P_i can

(i) receive the bits of a_i serially, most significant bit first, from some input medium;

(ii) send the bits of a_i to its parent serially;

(iii) send its own index i to its parent, if requested; and

(iv) switch itself "off" if told to do so.

Initially, all leaf processors are "on." Once a leaf has been switched off, it is excluded from the remainder of the algorithm's execution: It stops reading input and no longer sends or receives messages to and from its parent.

Each of the n - 2 intermediate processors can

(i) relay messages of fixed size from its two children to its parent and vice versa;

(ii) behave as an SA-box; and

(iii) compare two O(log n)-bit values.

Finally, the root processor can

(i) send and receive messages of fixed size to and from its two children;

(ii) compare two O(log n)-bit values;

(iii) behave as an SA-box; and

(iv) store and update three O(log n)-bit values.

The algorithm is given in what follows as procedure TREE SELECTION. When the procedure terminates, the index of the kth smallest element of A is contained in the root. If several elements of A qualify for being the kth smallest, the one with the smallest index is selected.

procedure TREE SELECTION (A, k)

Step 1: {Initialization}
    (1.1) The root processor reads n and k
    (1.2) l ← n {l is the length of the sequence remaining}
    (1.3) q ← k {the qth smallest element is to be selected}
    (1.4) finished ← false.

Step 2: while not finished do
    (2.1) for i = 1 to n do in parallel
              P_i reads the next bit of a_i
          end for
    (2.2) The sum s of the n bits just read is computed by the intermediate and root processors acting as an addition tree
    (2.3) if l - q - s ≥ 0
          then {qth not in M}
              (i) l ← l - s
              (ii) the intermediate processors relay to all leaves the root's message:
                   if latest bit read was 1 then switch "off"
                   end if
          else if l - q - s = -1 and s = 1
               then {qth element found}
                   (i) the intermediate processors relay to all leaves the root's message:
                       if latest bit read was 1 then send index to root
                       end if
                   (ii) the intermediate processors relay to the root the index of the leaf containing the qth smallest element
                   (iii) finished ← true
               else {qth in M}
                   (i) q ← q - (l - s)
                   (ii) l ← s
                   (iii) the intermediate processors relay to all leaves the root's message:
                         if latest bit read was 0 then switch "off"
                         end if
               end if
          end if
    (2.4) if l = 1 then
              (i) the intermediate processors relay to all leaves the root's message:
                  if still "on" then send index to root
                  end if
              (ii) the intermediate processors relay to the root the index of the only remaining integer
              (iii) finished ← true
          end if
    (2.5) if (there are no more input bits) and (not finished) then
              (i) the intermediate processors relay to all leaves the root's message:
                  if still "on" then send index to root
                  end if
              (ii) the intermediate processors relay to the root the index of the smallest-numbered leaf that is still "on"
              (iii) finished ← true
          end if
end while.
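The control flow of procedure TREE SELECTION can be followed in a sequential simulation. The sketch below is an illustration only: the name `tree_selection`, the variable names l, q, and s (length remaining, rank sought, and sum of the bits just read), and the sample input are assumptions, a Python list stands in for the leaves, and `sum` stands in for the addition tree.

```python
def tree_selection(a, k, b):
    """Return the 1-based index of the k-th smallest of the b-bit integers a."""
    n = len(a)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    on = [True] * n                  # every leaf starts "on"
    l, q = n, k                      # steps (1.2) and (1.3)
    for j in range(b - 1, -1, -1):   # most significant bit first
        # (2.1) each leaf that is still on reads its next bit
        bits = [(a[i] >> j) & 1 if on[i] else 0 for i in range(n)]
        s = sum(bits)                # (2.2) addition tree
        if l - q - s >= 0:           # q-th smallest is not in M
            l -= s
            on = [on[i] and bits[i] == 0 for i in range(n)]
        elif l - q - s == -1 and s == 1:
            return bits.index(1) + 1  # the single leaf that read a 1
        else:                        # q-th smallest is in M
            q -= l - s
            l = s
            on = [on[i] and bits[i] == 1 for i in range(n)]
        if l == 1:                   # (2.4) one integer remains
            return on.index(True) + 1
    # (2.5) no more bits: equal values remain, report the smallest index
    return on.index(True) + 1


data = [5, 9, 14, 2, 11, 7, 4, 8]   # hypothetical 4-bit inputs
index = tree_selection(data, 4, 4)
```

For this input the call returns index 6, and `data[5]` is 7, the fourth smallest value. When all remaining values are equal, the final line mirrors step (2.5) by reporting the smallest-numbered leaf that is still on.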

Note that none of the processors (root, intermediate, or leaf) is required at any stage of the algorithm's execution to store all b bits of an input integer. Therefore, the network's storage requirements are independent of b.

Example 14.1

Assume that we want to find the fourth smallest value in {..., 15, 12, 3, 7, 6, 13}, a sequence of eight 4-bit integers. Initially, l = 8 and q = 4. During the first iteration of step 2, the most significant bit of each input integer is read by one leaf, as shown in Fig. 14.15. The sum of these bits, s = 4, is computed at the root. Since l - q - s = 0, the four leaf processors that read a 1 are switched off, and l = 4. During the second iteration, the second most significant bits are read by the processors that are still on. This is shown in Fig. 14.15, where the processors that were switched off are marked with an x. Since s = 2, l - q - s = -2, and the two processors that read a 0 are switched off. Now l = 2 and q = 2.

In the third iteration, the sum of the third most significant bits, read by the two processors that are still on, is s = 2. Since l - q - s = -2 and both input bits were 1, no processor is switched off. Again, l = 2 and q = 2. In the fourth (and last) iteration, s = 1 and l - q - s = -1: The index of the processor that read a 1 is sent to the root, signifying that the fourth smallest value in the input sequence is 7.

Figure 14.15 Selecting fourth smallest in sequence of eight numbers.

Analysis. Step 1 takes constant time. There are at most b iterations of step 2. During each iteration the sum s of the n bits read by the leaves can be obtained by the root in O(log n) time by letting the n - 2 intermediate nodes and root simulate an addition tree with n one-bit numbers as input. Unlike the root of the addition tree, however, the root processor here retains the log n bits of the sum. Thus the time required is O(b log n). Since the number of processors is 2n - 1, the algorithm's cost is O(bn log n), which is not optimal.

An algorithm with optimal cost can be obtained as follows. Let N be a power of 2 such that N log n ≤ n, and assume that 2N - 1 processors are available to select the kth smallest element. These processors are arranged in a tree with N leaves. The leaf processors are required to be more powerful than the ones used by procedure TREE SELECTION: They should be able to compute the sum of n/N bits. Each leaf processor is "in charge" of n/N elements of the sequence A. These n/N integers arrive on n/N input media that the leaf examines sequentially. The parallel algorithm consists of b iterations. For j = b - 1, b - 2, ..., 0, iteration j consists of three stages.

(i) Every leaf processor finds the sum of the jth bits of (at most) n/N integers.

(ii) These sums are added by the remaining processors, and the root indicates which elements must be discarded.

(iii) Every leaf processor "marks" the discarded inputs.

Stages (i) and (iii) require O(n/N) operations. There are O(log n) operations involved in stage (ii) to go up and down the tree. Since log n ≤ n/N, the time per iteration is O(n/N), for a total running time of O(bn/N). Since the number of processors is 2N - 1, the cost is O(N) x O(bn/N) = O(bn), and this is optimal.
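The three stages of each iteration can be sketched in the same sequential style. This is a simplified illustration rather than the algorithm as specified: the name `blocked_selection` is hypothetical, the early-termination tests of procedure TREE SELECTION are omitted, the function returns the selected value instead of a leaf index, and the block size is rounded up when N does not divide n.

```python
def blocked_selection(a, k, b, N):
    """Return the k-th smallest of the b-bit integers a using N simulated leaves."""
    n = len(a)
    if not 1 <= k <= n or N < 1:
        raise ValueError("need 1 <= k <= n and N >= 1")
    size = -(-n // N)  # ceil(n / N) inputs per leaf
    blocks = [range(p * size, min(n, (p + 1) * size)) for p in range(N)]
    marked = [False] * n  # True once an input has been discarded
    l, q = n, k
    for j in range(b - 1, -1, -1):
        # Stage (i): each leaf sums the j-th bits of its unmarked inputs.
        local = [sum((a[i] >> j) & 1 for i in blk if not marked[i])
                 for blk in blocks]
        # Stage (ii): the tree adds the N local sums; the root decides
        # which bit value identifies the inputs to discard.
        s = sum(local)
        if q <= l - s:
            discard, l = 1, l - s
        else:
            discard, q, l = 0, q - (l - s), s
        # Stage (iii): each leaf marks its discarded inputs.
        for blk in blocks:
            for i in blk:
                if not marked[i] and ((a[i] >> j) & 1) == discard:
                    marked[i] = True
    # All unmarked inputs now hold the same value, the k-th smallest.
    return next(a[i] for i in range(n) if not marked[i])


value = blocked_selection([5, 9, 14, 2, 11, 7, 4, 8], 4, 4, 2)
```

With the hypothetical input above and N = 2 leaves, each leaf handles four inputs per iteration and the call returns 7. Stages (i) and (iii) each touch at most n/N inputs per leaf, which is the source of the O(n/N) term in the analysis.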