3.4 MERGING ON THE EREW MODEL

As we saw in the previous section, concurrent-read operations are performed at several places of procedure CREW MERGE. We now show how this procedure can be adapted to run on an N-processor EREW SM SIMD computer that, by definition, disallows any attempt by more than one processor to read from a memory location at the same time. The idea of the adaptation is quite simple: All we have to do is find a way to simulate multiple-read operations. Once such a simulation is found, it can be used by the parallel merge algorithm (and in general by any algorithm with multiple-read operations) to perform every read operation from the EREW memory.

Of course, we require the simulation to be efficient. Simply queuing all the requests to read from a given memory location and serving them one after the other is surely inadequate: It can increase the running time by a factor of N in the worst case. On the other hand, using procedure BROADCAST of chapter 2 is inappropriate: A multiple-read operation from a memory location may not necessarily involve all processors. Typically, several arbitrary subsets of the set of processors attempt to gain access to different locations, one location per subset. In chapter 1 we described a method for performing the simulation in this general case. This is now presented more formally as procedure MULTIPLE BROADCAST in what follows.

Assume that an algorithm designed to run on a CREW SM SIMD computer requires a total of M locations of shared memory. In order to simulate this algorithm on the EREW model with N processors, where N = 2^q for q >= 1, we increase the size of the memory from M to M(2N - 1). Thus, each of the M locations is thought of as the root of a binary tree with N leaves. Such a tree has q + 1 levels and a total of 2N - 1 nodes, as shown in Fig. 3.5 for N = 16. The nodes of the tree represent consecutive locations in memory. Thus if location D is the root, then its left and right children are D + 1 and D + 2, respectively. In general, the left and right children of D + x are D + 2x + 1 and D + 2x + 2, respectively.

Assume that processor P_i wishes at some point to read from some location d(i) in memory. It places its request at location d(i) + (N - 1) + (i - 1), a leaf of the tree rooted at d(i). This is done by initializing two variables local to P_i:

1. level(i), which stores the current level of the tree reached by P_i's request, is initialized to 0, and

2. loc(i), which stores the current node of the tree reached by P_i's request, is initialized to (N - 1) + (i - 1). Note that loc(i) need only store the position in the tree relative to d(i) that P_i's request has reached and not the memory location d(i) + (N - 1) + (i - 1).
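The index arithmetic of this memory layout can be checked with a few lines of Python. This is only a sketch of the addressing scheme described above; the function names (children, parent, leaf_offset, expanded_memory_size) are ours and are not part of the procedure.

```python
def children(x):
    """Offsets of the left and right children of the node at offset x."""
    return 2 * x + 1, 2 * x + 2


def parent(x):
    """Offset of the parent of the node at offset x (x >= 1)."""
    return (x - 1) // 2


def leaf_offset(i, N):
    """Offset, relative to the root, of the leaf used by processor i (1 <= i <= N)."""
    return (N - 1) + (i - 1)


def expanded_memory_size(M, N):
    """Number of cells needed when each of M locations becomes a tree of 2N - 1 nodes."""
    return M * (2 * N - 1)


if __name__ == "__main__":
    N = 16
    print(children(0))                                  # children of the root
    print(leaf_offset(1, N), leaf_offset(N, N))         # first and last leaf offsets
    print(expanded_memory_size(1, N))                   # nodes in one tree
```

For N = 16 the leaves occupy offsets 15 through 30, so one tree fills 31 consecutive cells, in agreement with the count 2N - 1.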

Figure 3.5 Memory organization for multiple broadcasting.

The simulation consists of two stages: the ascent stage and the descent stage. During the ascent stage, the processors proceed as follows: At each level a processor P_i occupying a left child is first given priority to advance its request one level up the tree. It does so by marking the parent location with a special marker, say, [i]. It then updates its level and location. In this case, a request at the right child is immobilized for the remainder of the ascent. Otherwise (i.e., if there was no processor occupying the left child) a processor occupying the right child can now "claim" the parent location. This continues until at most two processors reach level (log N) - 1. They each in turn read the value stored in the root, and the descent stage commences. The value just read goes down the tree of memory locations until every request to read by a processor has been honored. Procedure MULTIPLE BROADCAST follows.

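procedure MULTIPLE BROADCAST (d(1), d(2), ..., d(N))

Step 1: for i = 1 to N do in parallel
          {P_i initializes level(i) and loc(i)}
          (1.1) level(i) <- 0
          (1.2) loc(i) <- N + i - 2
          (1.3) store [i] in location d(i) + loc(i)
        end for.

Step 2: for v = 0 to (log N) - 2 do
          (2.1) for i = 1 to N do in parallel
                  {P_i at a left child advances up its tree}
                  (2.1.1) x <- floor((loc(i) - 1)/2)
                  (2.1.2) if loc(i) is odd and level(i) = v
                          then (i) loc(i) <- x
                               (ii) store [i] in location d(i) + loc(i)
                               (iii) level(i) <- level(i) + 1
                          end if
                end for
          (2.2) for i = 1 to N do in parallel
                  {P_i at a right child advances up its tree if possible}
                  if d(i) + x does not already contain a marker [j] for some 1 <= j <= N
                  then (i) loc(i) <- x
                       (ii) store [i] in location d(i) + loc(i)
                       (iii) level(i) <- level(i) + 1
                  end if
                end for
        end for.

Step 3: for v = (log N) - 1 down to 0 do
          (3.1) for i = 1 to N do in parallel
                  {P_i at a left child reads from its parent and then moves down the tree}
                  (3.1.1) x <- floor((loc(i) - 1)/2)
                  (3.1.2) y <- (2 x loc(i)) + 1
                  (3.1.3) if loc(i) is odd and level(i) = v
                          then (i) read the contents of d(i) + x
                               (ii) write the contents of d(i) + x in location d(i) + loc(i)
                               (iii) level(i) <- level(i) - 1
                               (iv) if location d(i) + y contains [i]
                                    then loc(i) <- y
                                    else loc(i) <- y + 1
                                    end if
                          end if
                end for
          (3.2) for i = 1 to N do in parallel
                  {P_i at a right child reads from its parent and then moves down the tree}
                  if loc(i) is even and level(i) = v
                  then (i) read the contents of d(i) + x
                       (ii) write the contents of d(i) + x in location d(i) + loc(i)
                       (iii) level(i) <- level(i) - 1
                       (iv) if location d(i) + y contains [i]
                            then loc(i) <- y
                            else loc(i) <- y + 1
                            end if
                  end if
                end for
        end for.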
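To see the ascent and descent stages at work, the procedure can be simulated sequentially. The following Python sketch is an illustration under our own assumptions, not a definitive implementation: memory is modeled as a dictionary, a marker [i] is modeled as the tuple ("marker", i), the names level, loc, and d mirror the two local variables and the requested root of each processor, each "in parallel" loop is executed one processor at a time, and tree roots are assumed to be spaced 2N - 1 cells apart. Two guards are added by this sketch: an explicit parity-and-level test in the right-child sub-step of the ascent, so that only the processor actually waiting at a right child inspects the parent cell, and a bounds test before a processor at a leaf inspects a nonexistent child.

```python
import math


def is_marker(cell):
    """True if a memory cell holds a marker [i] (modeled as a tuple)."""
    return isinstance(cell, tuple) and len(cell) == 2 and cell[0] == "marker"


def multiple_broadcast(memory, d, N):
    """Sequential simulation of the ascent/descent scheme.

    memory: dict mapping address -> contents; d: dict i -> root address
    requested by processor i (1 <= i <= N); N: a power of 2.
    Returns a dict i -> value obtained by processor i.
    """
    q = int(math.log2(N))
    size = 2 * N - 1                      # nodes per tree
    procs = range(1, N + 1)
    level = {i: 0 for i in procs}
    loc = {i: N + i - 2 for i in procs}
    got = {}

    # Step 1: each processor marks its own leaf.
    for i in procs:
        memory[d[i] + loc[i]] = ("marker", i)

    # Step 2: ascent, left children first, then right children.
    for v in range(0, q - 1):
        x = {i: (loc[i] - 1) // 2 for i in procs}
        for i in procs:
            if loc[i] % 2 == 1 and level[i] == v:
                loc[i] = x[i]
                memory[d[i] + loc[i]] = ("marker", i)
                level[i] += 1
        for i in procs:
            # Parity/level guard is this sketch's addition (see lead-in).
            if (loc[i] % 2 == 0 and level[i] == v
                    and not is_marker(memory.get(d[i] + x[i]))):
                loc[i] = x[i]
                memory[d[i] + loc[i]] = ("marker", i)
                level[i] += 1

    # Step 3: descent, left children (odd loc) first, then right children.
    for v in range(q - 1, -1, -1):
        x = {i: (loc[i] - 1) // 2 for i in procs}
        y = {i: 2 * loc[i] + 1 for i in procs}
        for parity in (1, 0):
            for i in procs:
                if loc[i] % 2 == parity and level[i] == v:
                    value = memory[d[i] + x[i]]       # read the parent
                    memory[d[i] + loc[i]] = value     # copy it one level down
                    got[i] = value
                    level[i] -= 1
                    # Bounds test is this sketch's addition (see lead-in).
                    if y[i] < size and memory.get(d[i] + y[i]) == ("marker", i):
                        loc[i] = y[i]
                    else:
                        loc[i] = y[i] + 1
    return got


if __name__ == "__main__":
    N = 16
    memory = {0: "Q", 31: "R"}            # two roots, 2N - 1 = 31 cells apart
    readers_of_Q = {1, 2, 3, 4, 8, 11, 14, 16}   # hypothetical choice
    d = {i: (0 if i in readers_of_Q else 31) for i in range(1, N + 1)}
    print(multiple_broadcast(memory, d, N))
```

In the usage example every processor is given some root to read from, and the set of processors reading Q is a hypothetical choice made only for illustration. Because the sub-steps never have two processors touching the same cell, running the processors one after another gives the same result as running them in lockstep.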

Step 1 of the procedure consists of three constant-time operations. Each of the ascent and descent stages in steps 2 and 3, respectively, requires O(log N) time. The overall running time of procedure MULTIPLE BROADCAST is therefore O(log N).


Figure 3.6 Memory contents after step 2 of procedure MULTIPLE BROADCAST.

Example 3.2

Let N = 16 and assume that at a given moment during the execution of a CREW parallel algorithm several processors need to read a quantity Q from a location D in memory. When simulating this multiple-read operation on an EREW computer using MULTIPLE BROADCAST, the processors place their requests at the appropriate leaves of a tree of locations rooted at D during step 1, as shown in Fig. 3.5. Figure 3.6 shows the positions of the various processors and the contents of memory locations at the end of step 2. The contents of the memory locations at the end of step 3 are shown in Fig. 3.7.

Note that:

1. The markers [i] are chosen so that they can be easily distinguished from data values such as Q.

2. If during a multiple-read step of the CREW algorithm being simulated, a processor P_i does not wish to read from memory, then the location d(i) it nominally requests may be chosen arbitrarily among the M memory locations used by the algorithm.

3. When the procedure terminates, the value of the level variable level(i) is negative and that of the node variable loc(i) is out of bounds. These values are meaningless. This is of no consequence, however, since level(i) and loc(i) are always initialized in step 1.

Figure 3.7 Memory contents at end of procedure MULTIPLE BROADCAST.

We are now ready to analyze the running time of an adaptation of procedure CREW MERGE for the EREW model. Since every read operation (simple or multiple) is simulated using procedure MULTIPLE BROADCAST in O(log N) time, the adapted procedure is at most O(log N) times slower than procedure CREW MERGE, that is, its running time is

O(log N) x O(n/N + log n) = O((n/N) log n + log^2 n)

when N <= n, so that log N <= log n. The algorithm has a cost of

N x O((n/N) log n + log^2 n) = O(n log n + N log^2 n)

which is not optimal. Furthermore, since procedure CREW MERGE uses O(n) locations of shared memory, the storage requirements of its adaptation for the EREW model are O(Nn). In the following section an algorithm for merging on the EREW model is described that is cost optimal and uses only O(N + n) shared-memory locations.
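The size of these bounds can be illustrated numerically. The Python sketch below evaluates the leading terms only, with all hidden constants taken as 1; that choice, the function name adapted_merge_bounds, and the assumption that the CREW procedure needs on the order of n memory cells are ours and serve purely as illustration.

```python
import math


def adapted_merge_bounds(n, N):
    """Leading-order terms (constants ignored) for the EREW adaptation.

    n: length of each input sequence; N: number of processors, N <= n,
    both assumed to be powers of 2 for simplicity.
    """
    log_N = math.log2(N)
    log_n = math.log2(n)
    time = log_N * (n / N + log_n)            # O(log N) x O(n/N + log n)
    cost = N * time                           # processors x time
    cost_bound = n * log_n + N * log_n ** 2   # O(n log n + N log^2 n)
    storage = n * (2 * N - 1)                 # each of ~n cells becomes a tree
    return {"time": time, "cost": cost,
            "cost_bound": cost_bound, "storage": storage}


if __name__ == "__main__":
    b = adapted_merge_bounds(n=1024, N=16)
    for name, value in b.items():
        print(name, value)
    # Cost per input element: stays above 1 and grows with log N,
    # which is why the adaptation is not cost optimal.
    print(b["cost"] / 1024)
```

Under these simplifications, the cost per input element grows roughly like log N instead of staying constant, and the storage grows with the product of N and n.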