
Asynchronous Dynamic Load Balancing of Tiles
Tung Nguyen Michelle Mills Strout Larry Carter Jeanne Ferrante
Computer Science and Engineering Department, UC San Diego
9500 Gilman Drive, San Diego CA 92093-0114
t21nguye@ucsd.edu, {mstrout, carter, ferrante}@cs.ucsd.edu

Extended Abstract
Many scientific computations have work-intensive kernels consisting of nested loops with a regular stencil of data dependences. For such loop nests, tiling [W96] is a well-known compiler optimization that can help achieve efficient parallelism. A loop nest of depth K can be represented as a K-dimensional Iteration Space Graph (ISG) [W96], where each point is an iteration, and edges between the points represent true, value-flow dependences between iterations. Tiling is a partitioning of the points of the ISG into units of regular size and shape. Tiles can be allocated to different processors to support parallelism. In this paper, we consider the simple but common case, shown in Figure 1(a), of a 2-dimensional ISG with a stencil of three dependences of distance 1. These point dependences give rise to tile dependences, which require synchronization and communication between tiles. We address the specific case where a tile needs data computed in the tile to its left and the tile below it in order to execute. Execution of tiles proceeds in a wavefront fashion: each processor will be at least one row ahead of its neighbor to the right. Rebalancing can prevent much larger imbalances.
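To make the dependence structure concrete, the following is a minimal sketch (our own illustration, not code from the paper) of how the tile-level dependences induce anti-diagonal wavefronts: tile (r, c) can execute only after the tile to its left, (r, c-1), and the tile below it, (r-1, c), have completed, so all tiles with the same value of r + c may execute concurrently.

def wavefronts(num_rows, num_cols):
    """Group tiles (row, col) into wavefronts that may execute concurrently,
    assuming tile (r, c) depends on (r, c-1) to its left and (r-1, c) below it."""
    fronts = []
    for k in range(num_rows + num_cols - 1):
        fronts.append([(r, k - r) for r in range(num_rows) if 0 <= k - r < num_cols])
    return fronts

# For a 3x4 grid of tiles, wavefront k holds the tiles on anti-diagonal r + c = k.
for k, front in enumerate(wavefronts(3, 4)):
    print("wavefront", k, ":", front)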
Our target architecture is a distributed-address-space machine such as an IBM SP2 or a network of workstations. Initially, each processor is allocated the same fixed number of columns of tiles in a block distribution. Each processor executes rows of tiles in its allocated columns a row at a time. If each processor has the same capacity and load, then this allocation should be efficient. However, if processor loads can vary over time, then dynamically adjusting the allocation of columns to achieve a load balance suited to the individual loads is desirable. Varying processor loads can arise because of an evolving irregular application or because of contention. We present the Left-Right Handoff Protocol, a simple dynamic load balancing algorithm which does not require centralized control and is independent of network configuration. Using our algorithm, a given processor communicates only with its left and right neighbors in successive steps to determine whether some columns of tiles should be reallocated, as illustrated in Figure 1(a). When the amount of rebalancing is below a certain threshold, no idle time is introduced; that is, the faster processors do not need to wait at barriers for the slower processors.
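As an illustration of the kind of local decision such a protocol enables, here is a hypothetical rescheduling rule (the function name, the rate-based formula, and the 10% threshold are our own assumptions, not taken from the paper): a left/right pair compares how quickly each side finished its last row and shifts whole columns toward the faster neighbor only when the imbalance exceeds a threshold.

def columns_to_shift(cols_left, cols_right, time_left, time_right, threshold=0.10):
    """Return n > 0 to move n columns from the left processor to the right one,
    n < 0 to move them the other way, and 0 when the pair is balanced enough."""
    rate_left = cols_left / time_left      # columns completed per unit time
    rate_right = cols_right / time_right
    total = cols_left + cols_right
    target_left = round(total * rate_left / (rate_left + rate_right))
    shift = cols_left - target_left        # positive: left side is overloaded
    if abs(shift) <= threshold * total:    # small imbalances cause no transfer
        return 0
    return shift

# The right processor ran twice as fast on the last row, so it takes on more columns.
print(columns_to_shift(8, 8, time_left=2.0, time_right=1.0))   # prints 3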
There is an enormous amount of literature on dynamic load balancing [XL97]. Our scheme uses a local rescheduler; this has an advantage over methods using a global scheduler, such as [SS94], which can require a dedicated processor and global synchronization, and may not scale well to a large number of processors. Our technique falls in the class of deterministic iterative load-balancing techniques, more specifically diffusion methods, in which work can "diffuse" over time from processors with high "concentrations" of work to a less busy neighbor [XL94]. Diffusion methods can be further classified with respect to communication synchronization. Optimal solutions for synchronous diffusion techniques in the context of non-changing loads have been determined [Cy89], and in the case of changing loads, the variance can be bounded. However, the required synchronization may be costly. Asynchronous schemes have the potential to be faster, making them preferable in many cases. In the asynchronous case, convergence with non-varying loads can be guaranteed if communication delays are bounded; however, the rate of convergence and the optimal solution are not theoretically known [XL94]. Experimental results are therefore useful in determining the practicality of asynchronous schemes. In this paper, we present experiments to determine the efficiency and time to convergence of our asynchronous load-balancing scheme on an application that fits our ISG assumptions.
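For contrast with our asynchronous scheme, the sketch below shows the classical synchronous diffusion update in the style of [Cy89] on a chain of processors (the loads, the diffusion parameter alpha, and the sweep count are illustrative choices of ours): in every sweep each processor exchanges a fixed fraction of its load difference with each neighbor, and all processors must advance in lockstep.

def diffusion_sweeps(load, alpha=0.25, sweeps=40):
    """Synchronous diffusion on a chain: every sweep, processor i exchanges
    alpha * (load[j] - load[i]) with each neighbor j; total load is conserved."""
    for _ in range(sweeps):
        new = list(load)
        for i in range(len(load)):
            for j in (i - 1, i + 1):
                if 0 <= j < len(load):
                    new[i] += alpha * (load[j] - load[i])
        load = new
    return load

# Starting from an unbalanced chain, the loads diffuse toward the average (8.0).
print(diffusion_sweeps([16.0, 4.0, 2.0, 10.0]))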
The Left-Right Handoff Protocol is given by the finite state machine diagram in Figure 1(b). Both a processor, L, and the processor to its right, R, use the same diagram, although they may be in different states and carry out different actions in each state. Processors are numbered successively starting with 0; initially the left processors are even-numbered and the right processors are odd, but on the next row the left processors are odd and the right processors are even, and so on. When the Left-Right Data Transfer state is reached, L sends the data needed by R for its next row, and R waits to receive the data. The processors then return to the Listening state. Each processor has a checkpoint in its current row. Prior to reaching the checkpoint, a processor can service a request to reschedule. However, if a processor reaches its checkpoint before a request is received, execution always continues to the last tile of that row. The states of the finite state machine in Figure 1(b) are partitioned according to whether the processor has neither sent nor received a message; has either sent or received a message, but not both; or, finally, has both sent and received.
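A small helper (our own, not from the paper) makes the alternating pairing explicit, assuming P processors numbered 0 through P-1: on even-numbered rows processor 2k is the left partner of processor 2k+1, and on odd-numbered rows processor 2k+1 is the left partner of processor 2k+2. In this sketch an end processor with no partner on a given row simply skips that row's negotiation, which is our assumption rather than a detail stated above.

def handoff_pairs(num_procs, row):
    """Return the (left, right) pairs that run the protocol on the given row."""
    start = 0 if row % 2 == 0 else 1
    return [(p, p + 1) for p in range(start, num_procs - 1, 2)]

for row in range(3):
    print("row", row, ":", handoff_pairs(4, row))
# row 0 : [(0, 1), (2, 3)]
# row 1 : [(1, 2)]
# row 2 : [(0, 1), (2, 3)]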
Intuitively, processor R starts executing tiles in its current row in Listening mode, checking for either
a checkpoint message from L, or whether its checkpoint is reached. If a message is received, then R knows
that L is ahead in its current row, and moves to Rescheduling state to call its local rescheduler to make a
decision. It then sends a rescheduling message to L (which includes the data below the reallocated tiles), and
proceeds in Finish-Row state to nish the row's remaining work, which may have changed as a result of the
rescheduling. When that work is completed, a Left-Right Data Transfer is performed. On the other hand,
if R reaches its checkpoint before it has received a message from L, it sends L a checkpoint message, and proceeds in Fast-Forward state to execute its current row. After completing this row, R waits to receive a
message from L, which is either a rescheduling or checkpoint message. In the case of a rescheduling message,
R moves to Rescheduling state and processes the message. In both this case and the case of a checkpoint
message, R then performs a Left-Right Data Transfer before it returns to Listening state.
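The table below records our reading of the transitions just described (state and event names are paraphrased from Figure 1(b) and the text; this is a sketch, not the authors' implementation). Tracing the two scenarios shows that the right processor passes through the Left-Right Data Transfer state once per row before returning to Listening.

TRANSITIONS = {
    ("Listening",     "recv checkpoint msg"):        "Rescheduling",
    ("Listening",     "reached own checkpoint"):     "Fast-Forward",
    ("Rescheduling",  "sent rescheduling msg"):      "Finish-Row",
    ("Rescheduling",  "processed rescheduling msg"): "Data Transfer",
    ("Finish-Row",    "finished row"):               "Data Transfer",
    ("Fast-Forward",  "recv rescheduling msg"):      "Rescheduling",
    ("Fast-Forward",  "recv checkpoint msg"):        "Data Transfer",
    ("Data Transfer", "row data exchanged"):         "Listening",
}

def trace(events, state="Listening"):
    """Follow a sequence of events through the state table, starting in Listening."""
    states = [state]
    for event in events:
        state = TRANSITIONS[(state, event)]
        states.append(state)
    return states

# R hears L's checkpoint message first, so it reschedules and finishes the row:
print(trace(["recv checkpoint msg", "sent rescheduling msg",
             "finished row", "row data exchanged"]))
# R reaches its own checkpoint first and later gets only a checkpoint message from L:
print(trace(["reached own checkpoint", "recv checkpoint msg",
             "row data exchanged"]))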
In the full paper, we experiment with a variety of checkpoint locations and reschedulers. We compare the execution times of no load balancing, our Left-Right Handoff Protocol, and a version which performs global synchronization at the end of every row, on both varying and static but unequal loads. We also measure the time to convergence on static but non-uniform loads.

References

[Cy89] G. Cybenko. "Load balancing for distributed memory multiprocessors". Journal of Parallel and Distributed Computing, 7, 1989.
[SS94] B. S. Siegell and P. Steenkiste. "Automatic generation of parallel programs with dynamic load balancing". IEEE Symposium on High Performance Distributed Computing, Aug. 1994.
[W96] M. J. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, 1996.
[XL94] C.-Z. Xu and F. C. M. Lau. "Iterative dynamic load balancing in multicomputers". Journal of the Operational Research Society, 45(7):786-796, 1994.
[XL97] C.-Z. Xu and F. C. M. Lau. Load Balancing in Parallel Computers. Kluwer, 1997.
Acknowledgement: We gratefully acknowledge the guidance of Professor Fran Berman in her CSE 260 class.
Figure 1: (a) Two-dimensional Iteration Space with Dynamic Load Balancing (dependence pattern, processors P0-P3); (b) Finite State Machine diagram for the Left-Right Handoff Protocol (states: Listening, Fast-Forward, Rescheduling, Finish-Row, Left-Right Data Transfer).