
Data-Parallel Load Balancing Strategies

Cyril Fonlupt
Laboratoire d'Informatique du Littoral
Université du Littoral, France

Philippe Marquet
Jean-Luc Dekeyser
Laboratoire d'Informatique Fondamentale de Lille
Université de Lille, France

December 17, 1996

Abstract

Programming irregular and dynamic data-parallel algorithms requires taking data
distribution into account. The implementation of a load balancing algorithm is quite a
difficult task for the programmer. However, a load balancing strategy may be developed
independently of the application. The integration of such a strategy into the data-parallel
algorithm may be provided by a library or by a data-parallel compiler run-time. We propose
load distribution data-parallel algorithms for a class of irregular data-parallel algorithms
called stack algorithms. Our algorithms allow the use of regular and/or irregular
communication patterns to exchange work between processors. The results of a theoretical
analysis of these algorithms are presented. They allow a comparison of the different load
balancing algorithms and the identification of criteria for the choice of a load balancing
algorithm.

Keywords

Load Balancing; Data Distribution; Data-Parallelism; SPMD.


Résumé

Programming irregular and dynamic data-parallel algorithms requires taking data
distribution into account. Implementing a load balancing algorithm is an arduous task for
the programmer. However, a load balancing strategy may be developed independently of the
application. The integration of such a strategy within the data-parallel algorithm may be
handled by a library or by the run-time of a data-parallel compiler. We propose data-parallel
load distribution algorithms for a whole class of irregular data-parallel algorithms called
stack algorithms. Our algorithms use regular and/or irregular communications to exchange
work between processors. The results of a theoretical analysis of our algorithms are
presented. They allow a comparison of the different algorithms and the establishment of
criteria for the choice of a load balancing algorithm.

Mots-clés

Load Balancing; Data Distribution; Data-Parallelism; SPMD.


1 Introduction
Two parallel models coexist: data parallelism and task parallelism. The good qualities of
the data-parallel model are recognized [2]. Nevertheless, this model is sometimes reproached
for its simplicity and its rigidity. In particular, applications handling irregular data
structures do not seem to be able to benefit from data-parallel programming. Our definition
of data-parallelism goes against this idea.
Data-parallelism is the expression of single control-flow algorithms. These algorithms
consist of a sequence of elementary instructions applied to scalar or parallel data. Most
data-parallel languages confuse parallel data structures with regular data structures such as
arrays; Fortran 90 [3, 19] and High Performance Fortran (HPF [11]) are examples of such
languages. However, new language proposals attempt to widen this notion to irregular
structures [7] and to take into account algorithms such as molecular dynamics simulations or
computational fluid dynamics solvers. It now clearly turns out that data-parallelism does not
confine itself to array parallelism.
A data-parallel structure defines a virtual machine that consists of:

- elemental virtual processors. Each processor owns an instance of each parallel variable.
  Data-parallel operations are locally applied on each processor;
- communication links between processors. A communication link represents a dependence
  between the elemental data owned by two different processors. These links allow the
  exchange of values between processors.
We study the data-parallel implementation of a class of irregular algorithms, called "stack
algorithms". A stack algorithm simulates the temporal evolution of a set of independent
objects constituting the stack. Every object of the stack is handled with the same elemental
algorithm. The treatment of an object modifies its characteristics; it may lead to the
creation of a child object or to the destruction of the object. The irregularity of these
algorithms comes from this dynamic variation of the stack during execution.

The tracking of particles in a detector, as in Monte Carlo simulation methods, is a
prototype of stack algorithms [28]. Particles are followed step by step in the detector. At
every step, a particle may interact and generate secondary particles, or disappear. New
particles are pushed onto the stack and their treatment is delayed. The tracking algorithm
can easily be parallelized [14].

Tree search algorithms are another example of stack algorithms. Tree search is a central
problem for solving different problems in Artificial Intelligence or Operations Research [16, 26].
Powley et al. [25, 24] have parallelized tree search using stack algorithms. A root node
expands child nodes, which themselves expand other nodes. Successive nodes are generated
in parallel and are stacked in a distributed way.
We implement a stack algorithm by distributing the stack of objects on the P processors
of a virtual SPMD machine. The local stacks of the processors evolve independently of
one another. Object creations make the local stack grow; object deletions make the local
stack shrink. By definition of the stack algorithms, there are no dependences between the
processors.

Nevertheless, this first parallelization leads to an unbalanced distribution of the load on the
processors. The dynamic nature of our algorithms prohibits the use of static load distribution
schemes, a problem which is known to be NP-complete [9].

We propose here dynamic data-parallel algorithms to ensure the load balance of the different
local stacks on the processors. The implementation of a load balancing algorithm within
a stack algorithm is driven by a triggering policy. The importance of the triggering mechanism
is essential [15]. The triggering mechanisms for our load balancing algorithms have been
studied in [8].

Content of the Paper


We introduce our data-parallel model of computation in the next section. We then
present two families of load balancing algorithms: irregular and regular algorithms. Irregular
algorithms trigger load transfers between any pair of processors, while regular algorithms
restrict load transfers to neighboring processors. A comparison of our algorithms based on a
theoretical analysis of their behavior allows the identification of criteria for the choice of a
load balancing algorithm. The extension of our algorithms to other data-parallel applications
and their integration in data-parallel language compilers are finally examined.

2 Our Parallel Model of Computation
This section presents the parallel machine and the data-parallel language used to describe
our algorithms. Some useful notations are also introduced.

2.1 A Parallel Machine

We are working on a parallel virtual machine of P processors supporting the SPMD execution
model. Each of the P processors is numbered by a unique index between 0 and P − 1.
Communication links connect processors. Two processors can communicate through two
virtual networks:

1. A virtual irregular communication network. It is able to manage any communication
   pattern, independently of the communication links.
2. A virtual regular communication network, described later in this article. It is exclusively
   supported by the communication links.

2.2 Processor Workload

In order to estimate the imbalance of the system, we need a function to compute the load of
the processors. In our case, the load of a processor is the number of data elements it owns.
The load function is denoted w.
The total workload of the system is

    W = Σ_{i=0}^{P−1} w[i]

The average workload of the system, W/P, is denoted w̄.
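
As a minimal illustration of these notations (not part of the original formalism), the workload measures can be computed in Python from the local stack sizes; the list-of-lists representation of the stacks is an assumption of this sketch.

    # Hypothetical local stacks of a 4-processor machine, one Python list each.
    stacks = [[1, 2], [], [3, 4, 5], [6]]
    w = [len(s) for s in stacks]      # load w[i]: number of data owned by processor i
    P = len(stacks)
    W = sum(w)                        # total workload W
    w_bar = W / P                     # average workload W / P, denoted w-bar in the text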


2.3 A Data-Parallel Language

In order to describe our algorithms, we use an extension of the data-parallel language proposed
by Hillis and Steele [13]. The instructions are classified into two types:

1. General data-parallel instructions.
2. Data-parallel instructions managing the parallel stack. These instructions realize load
   transfers between processors.

2.3.1 Variables

Our data-parallel language distinguishes two types of variables. Furthermore, the parallel
stack is used in our algorithms:

A parallel variable is distributed on all processors. It is made of P elementary values. The
instance of the parallel variable x[ ] on the processor k is denoted x[k].

A scalar variable consists of a single value. It is accessible on each processor.

The parallel stack is an implicit parallel object distributed on all the processors. On a
given processor, the stack contains the local objects. For a given processor k, w[k] is the
size of the local stack.

2.3.2 Data-Parallel Instructions

The data-parallel instructions of the language are described in the following paragraphs.

Data-Parallel Section  The instruction

    for all k in parallel do
        c
    od

triggers a data-parallel activity. It makes all the processors active; they thus execute the
same data-parallel code c. This code may include any of the data-parallel instructions
described below. In this code c, k denotes the processor index.

At any time the processor activity can be modified by an if then else construction
on a parallel variable. Furthermore, a data-parallel while modifies the activity for the
execution of the while body. Only those processors that evaluate the controlling expression
to true are active for the current iteration of the body of the while. The data-parallel
while iterates while at least one processor is active.

Data-Parallel Reductions  The instruction

    s := sum(par[k])

computes the sum of a parallel variable par[ ] over all active processors. The result s is a
scalar variable.

The instruction

    s := count()

counts the number of active processors. The result ranges from 0 to P.


    result[k] := rank(i[k])

    k (index)   0  1  2  3  4
    Activity    1  1  0  1  1
    i[ ]        4  4  2  2  0
    result[ ]   2  3  -  1  0

Figure 1: The rank instruction

    result[k] := a[b[k]]

    k (index)   0  1  2  3  4
    Activity    1  1  0  1  1
    b[ ]        2  1  2  4  1
    a[ ]        6  8  0  7  7
    result[ ]   0  8  -  7  8

Figure 2: Indirection in communications

Data-Parallel Enumeration  The instruction

    par[k] := enumerate()

associates each active processor with a unique integer in the range [0, count() − 1]. The
enumeration starts with the processor whose index is the lowest.

Data-Parallel Prefix  The instruction

    par2[k] := scanAdd(par1[k])

returns for each active processor the cumulative sum of the par1[ ] variable.

Data-Parallel Sort  The instruction

    par1[k] := rank(par2[k])

sorts the active processors according to the local value of the par2[ ] variable. An example
is displayed in Figure 1.

Data-Parallel Communications  The communications are implicitly expressed in our language.
For example, the operation x[k] := y[k + 1] causes every active processor k to fetch the value
of y[ ] from its successor (the successor by index). A communication may be embedded in
another instruction, allowing indirect communications (see Figure 2 for an example).
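
To make the semantics of these primitives concrete, the following Python emulation over plain lists may help; the activity mask and the helper names (p_sum, p_count, p_enumerate, p_scan_add, p_rank) are assumptions of this sketch, not part of the language of [13].

    # Hedged emulation of the data-parallel primitives on P virtual processors.
    # 'active' is the current activity mask; inactive positions yield None.

    def p_sum(par, active):
        return sum(v for v, a in zip(par, active) if a)

    def p_count(active):
        return sum(active)

    def p_enumerate(active):
        # active processors receive 0, 1, 2, ... in increasing index order
        out, n = [None] * len(active), 0
        for k, a in enumerate(active):
            if a:
                out[k], n = n, n + 1
        return out

    def p_scan_add(par, active):
        # cumulative (inclusive) sum over the active processors
        out, acc = [None] * len(active), 0
        for k, a in enumerate(active):
            if a:
                acc += par[k]
                out[k] = acc
        return out

    def p_rank(par, active):
        # rank of each active processor when sorted by its local value (ascending)
        order = sorted((k for k, a in enumerate(active) if a), key=lambda k: par[k])
        out = [None] * len(active)
        for r, k in enumerate(order):
            out[k] = r
        return out

    # Example reproducing Figure 1: i = [4, 4, 2, 2, 0], processor 2 inactive.
    print(p_rank([4, 4, 2, 2, 0], [True, True, False, True, True]))
    # -> [2, 3, None, 1, 0]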
2.3.3 Load Transfers
We introduce two new operations allowing the transfer of work between processors. These
operations implicitly work on the parallel stack, moving objects between the different local
stacks.


Data-Parallel Send  The instruction

    send(dest[k], size[k])

sends size[k] elements from the local stack of each active processor k to the processor indexed
by the dest[ ] parallel variable. These elements are removed from the local stack and stored
on the remote stack.

Data-Parallel Get  The instruction

    receive(from[k], size[k])

is the reverse operation of the send() instruction. The data are removed from the remote
stack and stored in the local stack. Note that the activity is associated with the receivers.

These transfers are parallel; the number of moved objects is not necessarily the same
for any two processors.
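
As an illustration only, the effect of these two operations can be mimicked on Python lists; the sequential loop below is an assumption of the sketch and ignores the conflict-free parallel execution of the real instructions.

    def p_send(stacks, dest, size, active):
        # Mimic send(dest[k], size[k]): each active processor k moves size[k]
        # objects from its own stack to the stack of processor dest[k].
        for k, a in enumerate(active):
            if a and dest[k] is not None:
                n = min(size[k], len(stacks[k]))
                moved = [stacks[k].pop() for _ in range(n)]
                stacks[dest[k]].extend(moved)

    def p_receive(stacks, source, size, active):
        # Mimic receive(from[k], size[k]): the activity is on the receiver side.
        for k, a in enumerate(active):
            if a and source[k] is not None:
                n = min(size[k], len(stacks[source[k]]))
                moved = [stacks[source[k]].pop() for _ in range(n)]
                stacks[k].extend(moved)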

3 Irregular Load Balancing Redistribution Strategies
The load balancing mechanism sends stack elements from some processors to other processors.
For a given processor, when the elements can migrate anywhere in the system independently of
the physical links, the strategy is called irregular. As Baumgartner et al. [1] pointed out,
an irregular communication pattern allows a better redistribution of work. As a rule, the
communication cost depends heavily on the complexity of the communication pattern.

In the following paragraphs, we present several irregular redistribution mechanisms and
we give an upper bound on the complexity of these algorithms.

3.1 The Central Algorithm

The Central redistribution algorithm is based on the works of Powley [24] and Hillis [12].
First, the average workload of the system is computed and broadcast to every processor
in the system. Then the processors can be classified into three classes:

- the idle processors (they have no data to compute);
- the overloaded processors (their load is above the average workload w̄);
- the other ones.

The algorithm tries to match each overloaded processor with an idle peer. The Central policy
is a two-threshold policy (0, w̄). This means that the ping-pong effect of the one-threshold
policy is avoided [10, 27].

The pseudo-code of the Central algorithm is presented in Figure 3. Figure 4 presents
the system before the redistribution phase and the different steps of an execution. It is not
difficult to prove that, as a rule, the Central scheme is not asymptotically convergent towards
a uniform redistribution of work [8]: consider a distribution where each processor owns at
least one data element; the Central algorithm will not improve the load balance.


for all k in parallel do
  Initializations:
    W := sum(w[k])
    threshold := W/P
    rcv[k] := nil
    rendez_vous[k] := nil
  Idle processor enumeration:
    if w[k] = 0 then
      dest[k] := enumerate()
      rcv[dest[k]] := k
    fi
  Overloaded processor enumeration:
    if w[k] > threshold then
      friend[k] := enumerate()
      rendez_vous[k] := rcv[friend[k]]
    fi
  Redistribution:
    if rendez_vous[k] ≠ nil then
      send(rendez_vous[k], w[k]/2)
    fi
od

Figure 3: Pseudo-code of the Central algorithm


Figure 4: The different steps of an execution of the Central algorithm.

    Initial workload:
    k      0  1  2  3  4  5  6  7   8  9  10  11
    w[k]   0  7  4  2  1  8  7  12  9  0   0   7

[The remainder of the figure shows the idle processor enumeration (dest[ ], rcv[ ]), the
overloaded processor enumeration (friend[ ], rendez_vous[ ]), the redistribution, and the
final workload.]

Let us focus on processor #9. It receives number 1 in the enumeration of the idle
processors: it sends its index to processor #1. Processor #5 receives number 1 in the
overloaded processor enumeration: it fetches the index of processor #9 from processor #1
and yields some of its work to processor #9.
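
One Central iteration may be sketched in Python as follows (an illustration, not the paper's pseudo-code: the sequential matching loop and the integer halving of w[k]/2 are assumptions of this sketch).

    def central_step(w):
        # One iteration of the Central scheme on a load vector w (sketch).
        P = len(w)
        threshold = sum(w) / P
        idle = [k for k in range(P) if w[k] == 0]
        overloaded = [k for k in range(P) if w[k] > threshold]
        # match the i-th overloaded processor with the i-th idle processor
        for k, dest in zip(overloaded, idle):
            give = w[k] // 2          # assumed integer halving of w[k]/2
            w[k] -= give
            w[dest] += give
        return w

    # Initial workload of Figure 4
    print(central_step([0, 7, 4, 2, 1, 8, 7, 12, 9, 0, 0, 7]))
    # -> [3, 4, 4, 2, 1, 4, 4, 12, 9, 4, 3, 7]: three overloaded processors stay unmatched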


for all k in parallel do
  Initialization:
    W := sum(w[k])
    threshold := W/P
    rcv[k] := 0
    rendez_vous[k] := nil
    value[k] := 0
  Underloaded processor sort:
    if w[k] < threshold then
      value[k] := w[k] − threshold
      dest[k] := rank(value[k])
      rcv[dest[k]] := k
    fi
  Overloaded processor sort:
    if w[k] > threshold then
      value[k] := w[k] − threshold
      friend[k] := rank(−value[k])
      rendez_vous[k] := rcv[friend[k]]
    fi
  Redistribution:
    if rendez_vous[k] ≠ nil then
      send(rendez_vous[k], (w[k] − w[rendez_vous[k]])/2)
    fi
od

Figure 5: Pseudo-code of the Rendez-Vous algorithm

3.2 The Rendez-Vous Algorithm

The Rendez-Vous redistribution algorithm we propose has a more powerful matching scheme
than the Central algorithm. Unlike the Central mechanism, the Rendez-Vous scheme allows a
matching between extreme processors: the most heavily loaded nodes are matched with the most
lightly loaded nodes and, in the same way, processors just above the average load exchange
work with processors just under the average load.

In order to realize this exchange of work, irregular communications occur in the system.
Some of the nodes play the role of mailboxes and allow the other processors to exchange their
addresses. The aim of this technique is to achieve a very smooth repartition of the overall
load.

The pseudo-code of the Rendez-Vous algorithm is summed up in Figure 5. Note that the
two rank() instructions can be factorized into one single rank() operation. Figure 6 presents
the initial step of the algorithm and the exchange of work among the processors for a given
execution of the Rendez-Vous algorithm.

With an analytical method, we show that the Rendez-Vous scheme converges towards
the asymptotical optimum [6]. Furthermore, on a system of P processors, at most log2(P)
Rendez-Vous iterations are needed to reach a balanced steady state.

Figure 6: Execution of the Rendez-Vous algorithm.

    Initial workload (same as in Figure 4):
    k      0  1  2  3  4  5  6  7   8  9  10  11
    w[k]   0  7  4  2  1  8  7  12  9  0   0   7

[The remainder of the figure shows the underloaded processor sort (value[ ], dest[ ], rcv[ ]),
the overloaded processor sort (value[ ], friend[ ], rendez_vous[ ]), the redistribution, and
the final workload.]

Let us focus on processor #9. It receives number 1 in the enumeration of the underloaded
processors: it sends its index to processor #1. Processor #8 receives number 1 in the
overloaded processor enumeration: it fetches the index of processor #9 from processor #1.
Processors #9 and #8 share their work.
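
The matching rule can be sketched in Python as follows (a simplified, sequential rendition; the tie-breaking of equal loads and the integer division are assumptions of this sketch).

    def rendez_vous_step(w):
        # One Rendez-Vous iteration: pair the most underloaded processor with
        # the most overloaded one, and so on, then share the difference (sketch).
        P = len(w)
        threshold = sum(w) / P
        under = sorted((k for k in range(P) if w[k] < threshold), key=lambda k: w[k])
        over = sorted((k for k in range(P) if w[k] > threshold), key=lambda k: -w[k])
        for src, dst in zip(over, under):
            give = (w[src] - w[dst]) // 2     # send half of the load difference
            w[src] -= give
            w[dst] += give
        return w

    # Initial workload of Figure 6
    print(rendez_vous_step([0, 7, 4, 2, 1, 8, 7, 12, 9, 0, 0, 7]))
    # -> [6, 4, 5, 4, 4, 4, 5, 6, 5, 4, 4, 6]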


3.3 The Random Algorithm

The Random algorithm is based on a simple mechanism. Each time an element is created
on a processor, it is sent to a randomly selected node anywhere in the system. For each
node, the probability of receiving a part of the load is the same regardless of its place in
the system.

This redistribution scheme leads to a permanently balanced state as long as the local stacks
only grow. The destruction of elements in the stacks leads to a permanently unbalanced
state that is never corrected by the Random algorithm: as a matter of fact, in the case of
deletion of elements in the stack, the Random algorithm is unable to react.

Furthermore, even if this scheme seems to have a good theoretical behavior [30], it shows
experimentally a "not-so-good" behavior due to the high communication volume it induces. In
fact, if the program tends to generate lots of data, the redistribution wastes more and more
time communicating.
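
A minimal sketch of this policy (the random placement of newly created objects) is given below; random.randrange and the list representation of the stacks are implementation choices of this sketch.

    import random

    def place_new_objects(stacks, new_objects):
        # Random policy: every newly created object is appended to the stack of a
        # uniformly chosen processor, wherever the object was created (sketch).
        P = len(stacks)
        for obj in new_objects:
            stacks[random.randrange(P)].append(obj)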

4 Regular Pattern Communication Strategies

For strategies with regular communications, a neighborhood notion is defined over the network
topology. The neighborhood is the set of linked processors for a given virtual topology.
For a multi-grid topology, a regular communication is characterized by a parallel
communication at the same distance in the same direction. For example, the instruction

    x[k] := y[(k + a) mod P]

where a is a scalar value, denotes a regular communication on a ring of P processors.
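
Such a regular shift on a ring amounts to a single cyclic rotation of the parallel variable, as in the following Python sketch (the list representation is an assumption of the sketch).

    def regular_shift(y, a):
        # x[k] := y[(k + a) mod P] for all k: every processor fetches from the
        # processor at the same distance a, in the same direction, on the ring.
        P = len(y)
        return [y[(k + a) % P] for k in range(P)]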
For a multi-grid topology, regular communications use the neighborhood network. A regular
communication is, as a rule, more efficient than an irregular communication. However, the
number of regular communication schemes is smaller than the total number of permutations
(irregular patterns): a priori, one step of a regular load balancing scheme will improve the
system less. Figure 7 presents an example of the difference between regular and irregular
communications. For the same topology, only one step is necessary for an irregular
communication pattern to redistribute the workload, while two steps are needed for a regular
communication pattern.
In the following paragraphs, we introduce several regular communication algorithms.

4.1 The Tiling Algorithm

The Tiling algorithm divides the system into small, disjoint sub-domains of processors
called windows. A perfect load balance is realized in each window using regular
communications. In order to propagate the work over the entire system, the window is shifted
for the next load balancing phase.

We describe here the algorithm for a 1D ring of processors. In that case, a window consists
of a group of 2 connected processors. A perfect load balance is realized in each window with
two regular communications. For an n-dimensional grid, the window consists of 2n processors,
2n regular communications ensure a perfect load balance in each window, and then a shift is
made in each of the n dimensions of the multi-grid.

Figure 7: Difference between a regular and an irregular communication pattern.
[The figure plots the load w[k] of each processor k for an initial workload, after one
irregular communication, and after two regular communication steps.]

Figure 8: The Tiling algorithm windows for a ring of 8 processors (k = 0, ..., 7).
[The first windows pair the processors {0,1}, {2,3}, {4,5}, {6,7}; the second windows are
shifted by one processor.]


for all k in parallel do
  First windows:
    if (k mod 2) = 0 then
      average[k] := w[k] + w[k + 1]
      average[k] := average[k]/2
      if w[k] > average[k] then
        First regular communication:
          send(k + 1, w[k] − average[k])
      else
        Second regular communication:
          receive(k + 1, average[k] − w[k])
      fi
    fi
  Second windows:
    if (k mod 2) = 1 then
      average[k] := w[k] + w[k + 1]
      average[k] := average[k]/2
      if w[k] > average[k] then
        First regular communication:
          send(k + 1, w[k] − average[k])
      else
        Second regular communication:
          receive(k + 1, average[k] − w[k])
      fi
    fi
od

Figure 9: Pseudo-code of the Tiling algorithm on a ring of processors


Figure 10: The X-Tiling algorithm: increasing the neighborhood cardinality.
[The figure shows the first link, the second link, the addition of a third link, and the
addition of a fourth link.]
Figure 8 shows the division of the processors into windows on a ring of processors. Windows
of 2 processors are created and the load is evenly balanced inside these "first" windows with
2 regular communications. After that, the windows are slightly shifted and the process is
iterated. The pseudo-code of the Tiling algorithm can be found in Figure 9.

We prove that the load of each processor converges towards the uniform work distribution.
For each processor, a sequence represents the load the processor is computing. It can be shown
that this sequence is increasing and bounded. This implies that the load of each processor
converges towards the average load of the system.
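
One full Tiling iteration on a ring (both window phases) can be sketched as follows; the integer split of odd window sums and the even value of P are assumptions of this sketch.

    def tiling_step(w):
        # Balance the pairs (k, k+1) for even k, then for odd k, on a ring (sketch).
        # Assumes an even number of processors P.
        P = len(w)
        for offset in (0, 1):                 # first windows, then shifted windows
            for k in range(offset, P, 2):
                right = (k + 1) % P
                total = w[k] + w[right]
                w[k] = total // 2             # assumed rounding for odd window sums
                w[right] = total - w[k]
        return w

    print(tiling_step([0, 7, 4, 2, 1, 8, 7, 12]))
    # -> [7, 3, 4, 3, 4, 7, 7, 6]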

4.2 The X-Tiling Algorithm

Lüling et al. [18] note that strategies using neighborhood communications are not very
efficient for systems with a high number of processors. Note in particular that when the
distribution of work is partitioned into underloaded areas and overloaded areas, it takes
quite a long time for the Tiling algorithm to propagate the work from overloaded processors
to underloaded ones.

We propose to increase the cardinality of the neighborhood in order to increase the number
of regular communication schemes. We add to the topology some links to increase the
number of processors in the neighborhood. The links are added so that the topology is
embedded into a hypercube topology (Figure 10). The hypercube is the best trade-off
between the number of links and the number of steps needed to connect any two processors.

The X-Tiling algorithm is similar to the Tiling algorithm. Successively, all processors in
the neighborhood are balanced. The extension of the topology allows the regularity of the
communications to be maintained: we keep a very regular communication pattern characterized
by communications in the same direction at the same distance.

In order to analyze the asymptotic behavior of the Tiling and X-Tiling algorithms, we
use Cybenko's method [4, 33, 34]. The load balancing scheme is described by a matrix
called the matrix of exchange, and the processor load by a vector. The load of each processor
at time t + 1 is the result of the product of the matrix of exchange and the load vector at
time t. As a rule, the matrix has strong properties and the convergence can be proven.
Furthermore, the speed of convergence of the algorithm is directly governed by the second
largest eigenvalue of the matrix of exchange. This property allows us to evaluate and bound
the speed of convergence. We show that, using only regular communications, the X-Tiling
scheme converges in less than log2(P) calls to the load balancing algorithm [6].
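
As a hedged illustration of this analysis, the exchange matrix of one iteration can be built explicitly and its second largest eigenvalue modulus inspected; the example below (NumPy) uses the pairwise-averaging windows of the Tiling scheme on a ring, and the function names are ours.

    import numpy as np

    def tiling_exchange_matrix(P):
        # Exchange matrix M of one Tiling iteration on a ring of P processors
        # (P even): w(t+1) = M @ w(t), averaging pairs (k, k+1) for even then odd k.
        def pair_average(offset):
            A = np.eye(P)
            for k in range(offset, P, 2):
                r = (k + 1) % P
                A[[k, r], :] = 0.0
                A[k, k] = A[k, r] = A[r, k] = A[r, r] = 0.5
            return A
        return pair_average(1) @ pair_average(0)

    M = tiling_exchange_matrix(16)
    moduli = np.sort(np.abs(np.linalg.eigvals(M)))[::-1]
    print(moduli[0], moduli[1])   # 1.0, and the second largest modulus governing convergence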

for all k in parallel do
  Initializations:
    W := sum(w[k])
    w̄ := W/P
    r := W mod P
  First transfers (excess over w̄), during P − r iterations:
    for i := 1 to P − r do
      if w[k] > w̄ then
        send((k + 1) mod P, w[k] − w̄)
      fi
    od
  Second transfers (excess over w̄ + 1), during r iterations:
    for i := 1 to r do
      if w[k] > w̄ + 1 then
        send((k + 1) mod P, w[k] − (w̄ + 1))
      fi
    od
od

Figure 11: Pseudo-code of the Rake algorithm on a ring of processors

4.3 The Rake Algorithm

The Tiling and X-Tiling algorithms are regular algorithms using "elementary" exchanges of
load between processors. We now consider regular algorithms that use multiple regular
communications to achieve a better load balance of the system.

The aim of the Rake algorithm is to redistribute the work evenly over all the processors.
It is well suited to multi-dimensional grids.

The Rake algorithm uses only regular communications with the processors in its neighborhood
set. A number of regular communications are realized to achieve a good redistribution of work.
We describe here the Rake algorithm for a 1D ring of processors (Figure 11). In that case, the
neighborhood set of a processor consists of two processors: the right processor and the left
processor.

The total workload of the P processors is

    W = w̄P + r,   with 0 ≤ r < P.

Figure 12: An execution of the Rake algorithm on a ring of processors
(P = 5, W = 13, w̄ = 2, r = 3).
[The figure shows the initial workload, the sends of the first transfer phase, the sends of
the second transfer phase, and the final workload.]


Figure 13: The implementation of the local stacks as a "FIFO" allows a neighborhood
conservation.
[The figure shows data elements a-g distributed over a ring before and after the
redistribution: neighboring elements stay on the same processor or on a neighboring one.]
After one application of the Rake algorithm, P − r processors own w̄ data and r processors
own w̄ + 1 data.

First, the average load w̄ is computed. In a first transfer phase, during P − r iterations,
each processor gives to its right neighbor the data over the average workload. After this
phase, at least P − r processors own w̄ loads. In a second transfer phase, during r iterations,
each processor gives to its right neighbor the data over w̄ + 1. This ensures a perfect
distribution of the load. Figure 12 shows the main steps of an execution of the Rake algorithm
on a ring of processors.
In the case of a multi-dimensional grid, the previous step is done in each of the dimensions.
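
On a ring, the two transfer phases can be sketched in Python as follows (a sequential simulation of the SPMD loop; the simultaneous sends of an iteration are emulated by computing them all before applying them).

    def rake_ring(w):
        # Rake on a ring (sketch): P - r sweeps push the excess over w_bar to the
        # right neighbour, then r sweeps push the excess over w_bar + 1.
        P, W = len(w), sum(w)
        w_bar, r = W // P, W % P
        for bound, sweeps in ((w_bar, P - r), (w_bar + 1, r)):
            for _ in range(sweeps):
                excess = [max(w[k] - bound, 0) for k in range(P)]   # all sends of this sweep
                for k in range(P):
                    w[k] -= excess[k]
                    w[(k + 1) % P] += excess[k]
        return w

    print(rake_ring([0, 5, 2, 6, 0]))
    # -> [3, 2, 2, 3, 3]: with P = 5 and W = 13, r = 3 processors end with w_bar + 1 = 3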
In numerous data-parallel algorithms, especially in image processing, the neighborhood
notion has to be kept: a picture cannot be cut into several parts and redistributed anywhere
in the system. For example, in skeletonization algorithms on a parallel machine, some
processors become idle while others own all the remaining work. A load balancing algorithm
can greatly improve the performance of the parallel machine, but in the case of a work
redistribution the locality has to be preserved. If the local stack on each processor is
managed as a "FIFO" structure, we assert that "neighbor" data remain on the same processor
or on a neighboring processor in the one-dimensional case. An example of such an execution
is displayed in Figure 13.
We theoretically show [8] that, in the case of the Rake algorithm, the load of the processors
converges to the average workload of the system in the one-dimensional case. In the case of
the multi-grid, the Rake algorithm is also convergent, but the difference of load between any
two processors in the system is bounded by the number of dimensions.

4.4 The Pre-Computed Sliding Algorithm

The Pre-Computed Sliding algorithm is an improvement over the Rake algorithm. Instead of
transferring the data over the average workload of the system like the Rake scheme, it
computes the minimal number of data exchanges needed to balance the load of the system.
Unlike the Rake algorithm, the Pre-Computed Sliding may send data in both directions. We
describe here the Pre-Computed Sliding for a linear arrangement of processors.

After one application of the Pre-Computed Sliding algorithm, the first r processors own
w̄ + 1 loads and the last P − r processors own w̄ loads.

In a first step, each processor computes the (possibly negative) number of loads to be
received from its right neighbor (transfer[k]). A first communication phase realizes all the
needed load transfers in the left direction. A second communication phase ensures the right
transfers.

for all k in parallel do
  Initializations:
    W := sum(w[k])
    w̄ := W/P
    r := W mod P
    sum[k] := scanAdd(w[k])
    goal[k] := (k + 1) · w̄
    if k < r then
      goal[k] := goal[k] + k + 1
    else
      goal[k] := goal[k] + r
    fi
    transfer[k] := goal[k] − sum[k]
  Left transfer phase:
    while transfer[k] > 0 do
      possible[k] := min(transfer[k], w[k + 1])
      if possible[k] > 0 then
        receive(k + 1, possible[k])
        transfer[k] := transfer[k] − possible[k]
      fi
    od
  Right transfer phase:
    while transfer[k] < 0 do
      possible[k] := min(−transfer[k], w[k])
      if possible[k] > 0 then
        send(k + 1, possible[k])
        transfer[k] := transfer[k] + possible[k]
      fi
    od
od

Figure 14: Pseudo-code of the Pre-Computed Sliding algorithm


Figure 15: An execution of the Pre-Computed Sliding algorithm
(P = 5, W = 13, w̄ = 2, r = 3).

    k               0   1   2   3   4
    scanAdd(w[k])   0   0   5   7  13
    goal[k]         3   6   9  11  13
    transfer[k]     3   6   4   4   0

[The figure then shows two iterations of the left transfer phase, after which every
transfer[k] is 0; the right transfer phase requires no iteration.]

The pseudo-code of the Pre-Computed Sliding algorithm is presented in Figure 14. Figure 15
presents an example of execution of the Pre-Computed Sliding algorithm.

As for the Rake algorithm, the implementation of the local stacks as a "FIFO" structure
allows the Pre-Computed Sliding algorithm to keep a neighborhood topology for the loads.
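
The pre-computation of the transfers can be sketched in Python for a linear arrangement (the inclusive-scan convention, the integer arithmetic, and the example workload are assumptions of this sketch).

    def precomputed_transfers(w):
        # transfer[k]: net number of loads that must cross the link between
        # processors k and k+1 from right to left (negative: to the right).
        P, W = len(w), sum(w)
        w_bar, r = W // P, W % P
        transfer, acc = [], 0
        for k in range(P):
            acc += w[k]                              # inclusive prefix sum of w
            goal = (k + 1) * w_bar + min(k + 1, r)   # target load of processors 0..k
            transfer.append(goal - acc)
        return transfer

    print(precomputed_transfers([0, 0, 5, 2, 6]))
    # -> [3, 6, 4, 4, 0]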

4.5 The Neighbor Algorithm

The Neighbor algorithm is an adaptation of several MIMD load balancing mechanisms [32, 29].
The target architecture is cut into elementary domains called islands. An island is made of
a center processor and all the processors in its neighborhood. The aim of the Neighbor
algorithm is to make the load of every node in the island equal. The partial overlapping
between the islands allows the load to propagate.

Each center node computes the average load of its island. Overloaded suburban nodes try
to give some of their work to the center node. In a 2D topology, each node belongs to 4
islands as a suburban node, in addition to its own island.

With an analytical method, we have shown that the load of each node tends asymptotically
towards the average load of the system. Unfortunately, we were not able to evaluate an upper
bound on the number of iterations needed to reach this state. Nevertheless, we have
empirically shown that convergence is very slow for systems having a large number of
processors.

As for the Tiling algorithm, the Neighbor algorithm may be extended by adding links between
processors so that the processor topology is embedded in a hypercube.

5 Cost Prediction
When implementing a load balancing scheme, the programmer faces the following dilemma:
the trade-off between cost and quality has to be optimized.

The cost is the complexity of one iteration of the algorithm.

The quality is the product of the cost of one iteration by the number of iterations needed
to reach a steady state where each virtual processor owns the average load of the system.

Even if experimental results are important, it is still difficult to compare any two
strategies [17]. In order to evaluate our strategies on a common basis, we have chosen to
make a mathematical analysis [8]. It allows us to calculate the number of iterations (quality)
needed to reach a steady state. The mathematical method evaluates the asymptotic behavior of
the algorithms. In order to evaluate the cost versus quality dilemma, we use some parameters
(t_x) presented below.

The t_x define the cost of some basic parallel instructions regardless of the target
architecture. These values depend on the topology and the network of the virtual machine.
For example, the cost t_s of a parallel sort of P × P values on a P × P grid is in O(P).

t_i is the cost of an irregular parallel communication.
t_u is the cost of a regular communication.
t_r is the cost of a reduction or a parallel prefix (sum() or scanAdd()).
t_s is the cost of a parallel sort, like the rank() instruction.
p_c is the probability of creation of a data element in a given period of time; it is used
in the Random algorithm.

Table 1: Cost and quality for the different load balancing schemes on a virtual grid of
P × P processors

    Algorithm              Cost for 1 iteration          Cost for convergence
    Central                3·t_r + 3·t_i                 (convergence not guaranteed)
    Rendez-Vous            t_r + t_s + 3·t_i             2·log2(P)·(t_r + t_s + 3·t_i)
    Tiling                 8·t_u                         > 4·P·t_u
    X-Tiling               8·t_u + 4·(log2(P) − 2)·t_u   8·t_u + 4·(log2(P) − 2)·t_u
    Random                 t_i·p_c                       (convergence not guaranteed)
    Rake                   t_r + P·t_u                   t_r + 2·P·t_u
    Pre-Computed Sliding   ≤ t_r + P·t_u                 ≤ t_r + 2·P·t_u

Table 1 sums up our main results and points out the cost and quality of our algorithms
on a P × P processor grid. Some points are worth mentioning.

- One iteration of the X-Tiling algorithm leads to a perfect load balance.
- The Tiling and X-Tiling algorithms, which may seem similar, have highly different
  qualities (O(log2 P) for X-Tiling and linear for Tiling).
- Even if the Central and Rendez-Vous algorithms look very similar, we have proven that
  the first one is not convergent, whereas the second one converges in O(log2 P).
- The Pre-Computed Sliding algorithm is a real improvement over the Rake algorithm; the
  cost and quality of Pre-Computed Sliding are never greater than those of Rake.

The choice of a good load balancing algorithm is not an easy task. Two questions may
arise. For a given architecture, which is the best redistribution scheme? For a given stack
algorithm, which is the best load balancing algorithm?

A Load Balancing Scheme for a Given Architecture
An architecture is characterized by one (or several) communication networks. The relative
cost of regular and irregular communications will be a predominant guide in the choice of an
algorithm (especially for parallel machines that simulate the irregular communications on a
regular communication network). Once this ratio is known, an algorithm from the regular or
irregular communication class can be selected.

The ratio of network to CPU performance can also be taken into account to implement a
load balancing scheme. As a matter of fact, this ratio gives some guidance for choosing the
time to trigger the redistribution algorithm [8].

A Load Balancing Scheme for a Given Stack Algorithm
We empirically define the spatial and temporal disorders. They characterize the behavior of
the stack algorithms.

The spatial disorder refers to the distribution of the workload throughout the system.
Either most of the load is spatially localized (low spatial disorder), or the load is "more"
evenly distributed in the system (high spatial disorder). In the case of high spatial
disorder, a strategy with a global exchange of data will be the best choice. Diffusion
schemes like Tiling will not be very efficient; techniques like X-Tiling will be better
suited to respond to high spatial disorder.
The temporal disorder describes the evolution of the load between processors. A system
with a high temporal disorder is characterized by great variations (creations and
destructions of data) in the processors' work (i.e., the standard deviation between any two
processors is high). A Rendez-Vous-like algorithm may be used if there is a high temporal
disorder: a "clever" matching policy allows a quick decrease of the spatial disorder. For
example, a single application of the Rendez-Vous algorithm leads to a good distribution of
the work, even if it is not optimal. To sum up, in case of high temporal disorder, a load
balancing algorithm with a good cost/quality ratio has to be selected.
Furthermore, the cost predictions presented in this section have been empirically verified
by experiments on a MasPar parallel computer [5].

6 Conclusion
We have studied the data-parallel implementation of stack algorithms. We have proposed
data-parallel algorithms to load balance the parallel implementation of the stack algorithms.
We have defined a parallel model of computation and have proposed a number of algorithms
characterized by regular or irregular communication patterns. The cost and quality of each
of our algorithms have been examined. They allow us to identify some criteria for choosing
a redistribution scheme.

Among our algorithms, the Rake and Pre-Computed Sliding algorithms allow a neighborhood
conservation in one dimension. Other authors have proposed algorithms allowing neighborhood
conservation in the multi-dimensional case [21, 20]. We are working towards the definition of
such algorithms for which optimality may be proven [23, 22].

We are implementing our algorithms in a parallel library. This library is aimed at parallel
programmers. A data-parallel language compiler may systematically generate calls to this
load balancing library [31].

References
[1] Katherine M. Baumgartner, Ralph Kling, and Benjamin Wah. A global load balancing
strategy for a distributed system. In IEEE Conference on Distributed Computing Systems,
pages 93–102, Hong Kong, 1988.

[2] Luc Bougé. The data parallel programming model: A semantic perspective. In The Data
Parallel Programming Model, pages 4–26. Lecture Notes in Computer Science, Tutorial Series,
vol. 1132, 1996.

[3] Walt Brainerd, Charlie Goldberg, and Jeanne Adams. Programmer's Guide to Fortran 90.
Springer-Verlag, third edition, 1995.

[4] George Cybenko. Dynamic load balancing for distributed memory multiprocessors. Journal
of Parallel and Distributed Computing, 7, 1989.

[5] Jean-Luc Dekeyser, Cyril Fonlupt, and Philippe Marquet. A data-parallel view of the
load balancing, experimental results on MasPar MP-1. In Wolfgang Gentzsch and Uwe Harms,
editors, High Performance Computing and Networking Conference, volume 797 of Lecture Notes
in Computer Science, pages 338–343, Munich, Germany, April 1994.

[6] Jean-Luc Dekeyser, Cyril Fonlupt, and Philippe Marquet. Analysis of synchronous dynamic
load balancing algorithms. In Parallel Computing: State-of-the-Art Perspective (ParCo'95),
volume 11 of Advances in Parallel Computing, pages 455–462, Gent, Belgium, September 1995.
Elsevier Science Publishers.

[7] Jean-Luc Dekeyser and Philippe Marquet. Supporting irregular and dynamic computations
in data-parallel languages. In The Data Parallel Programming Model, pages 197–219. Lecture
Notes in Computer Science, Tutorial Series, vol. 1132, 1996.

[8] Cyril Fonlupt. Distribution Dynamique de Données sur Machines SIMD. Thèse de doctorat
(PhD thesis), Laboratoire d'Informatique Fondamentale de Lille, Université de Lille 1,
December 1994. (In French).

[9] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of
NP-Completeness. Freeman, 1979.

[10] R. S. Harbus. Dynamic process migration: To migrate or not to migrate. Technical
Report CSRI-42, University of Toronto, Toronto, Canada, July 1986.

[11] High Performance Fortran Forum. High Performance Fortran language specification.
Scientific Programming, 2(1-2):1–170, 1993.

[12] W. Daniel Hillis. The Connection Machine. The MIT Press, Cambridge, MA, 1985. French
translation: Masson, Paris, 1988.

[13] W. Daniel Hillis and Guy L. Steele Jr. Data parallel algorithms. Communications of the
ACM, 29(12):1170–1183, December 1986.

[14] D. M. Jones and J. M. Goodfellow. Parallelization strategies for molecular simulation
using the Monte Carlo algorithm. Journal of Computational Chemistry, 14(2):127–137, 1993.

[15] George Karypis and Vipin Kumar. Unstructured tree search on SIMD parallel computers:
A summary of results. In Supercomputing '92, pages 453–462, Minneapolis, MN, November 1992.

[16] Vipin Kumar and V. Nageshwara Rao. Parallel depth first search (part II). Int'l Journal
of Parallel Programming, 16(6), 1987.

[17] Peter Kok Keong Loh, Wen Jing Hsu, Cai Wentong, and Nadarajah Sriskanthan. How network
topology affects dynamic load balancing. IEEE Parallel and Distributed Technology, pages
25–35, Fall 1996.

[18] R. Lüling, B. Monien, and F. Ramme. Load balancing in large networks: A comparative
study. In 3rd IEEE Symposium on Parallel and Distributed Processing, Dallas, 1991.

[19] Mike Metcalf and John Reid. Fortran 90 Explained. Oxford University Press, 1990.

[20] Serge Miguet and Jean-Marc Pierson. Dynamic load balancing in a parallel particle
simulation. In High Performance Computing Symposium, pages 420–431, Montreal, Canada,
July 1995.

[21] Serge Miguet and Yves Robert. Elastic load-balancing for image-processing algorithms.
In H. P. Zima, editor, First Int'l ACPC Conf., pages 438–451, Salzburg, Austria, 1991.
Springer-Verlag.

[22] David Nicol and David O'Hallaron. Improved algorithms for mapping pipelined and
parallel computations. IEEE Transactions on Computers, 40(3):119–134, March 1994.

[23] David M. Nicol. Rectilinear partitioning of irregular data parallel computations.
Journal of Parallel and Distributed Computing, 23(2):119–134, November 1994.

[24] Curt Powley, Chris Ferguson, and Richard Korf. Depth-first heuristic search on a SIMD
machine. Artificial Intelligence, 60, 1993.

[25] Curt Powley, Chris Ferguson, and Richard E. Korf. Parallel tree search on a SIMD
machine. In Third IEEE Symposium on Parallel and Distributed Processing, Dallas, TX,
December 1991.

[26] V. Nageshwara Rao and Vipin Kumar. Parallel depth first search (part I). Int'l Journal
of Parallel Programming, 16(6), 1987.

[27] A. Ross and B. McMillin. Experimental comparison of bidding and drafting load sharing
protocols. In Proceedings of the 5th Distributed Memory Computing Conference, pages 968–974,
April 1990.

[28] R. Y. Rubinstein. Simulation and the Monte-Carlo Method. John Wiley & Sons, New York,
1981.

[29] V. A. Saletore. A distributive and adaptive dynamic load balancing scheme for parallel
processing of medium-grain tasks. In Proceedings of the 5th Distributed Memory Conference,
pages 995–999, April 1990.

[30] Raghu Subramanian and Isaac D. Scherson. An analysis of diffusive load-balancing. In
ACM Symposium on Parallel Algorithms and Architectures (SPAA '94), pages 220–225, June 1994.

[31] Gil Utard and Guy Cuvillier. Compilation of data-parallel program for a network of
workstation. Research Report 94-31, LIP, ENS Lyon, France, 1994.

[32] Mark H. Willebeek-LeMair and Anthony P. Reeves. Strategies for dynamic load balancing
on highly parallel computers. IEEE Transactions on Parallel and Distributed Systems,
4(9):979–993, September 1993.

[33] Cheng-Zhong Xu and Francis C. M. Lau. Analysis of the generalized dimension exchange
method for dynamic load balancing. Journal of Parallel and Distributed Computing,
16:385–393, 1992.

[34] Cheng-Zhong Xu and Francis C. M. Lau. The generalized dimension exchange method for
load balancing in k-ary n-cubes and variants. Journal of Parallel and Distributed Computing,
24(1), January 1995.