
Asynchronous Dynamic Load Balancing of Tiles
Tung Nguyen Michelle Mills Strout Larry Carter Jeanne Ferrante
Computer Science and Engineering Department, UC San Diego
9500 Gilman Drive, San Diego CA 92093-0114
t21nguye@ucsd.edu, {mstrout, carter, ferrante}@cs.ucsd.edu

Extended Abstract
Many scientific computations have work-intensive kernels consisting of nested loops with a regular stencil of data dependences. For such loop nests, tiling [W96] is a well-known compiler optimization that can help achieve efficient parallelism. A loop nest of depth K can be represented as a K-dimensional Iteration Space Graph (ISG) [W96], where each point is an iteration, and edges between the points represent true, value-flow dependences between iterations. Tiling is a partitioning of the points of the ISG into units of regular size and shape. Tiles can be allocated to different processors to support parallelism. In this paper, we consider the simple but common case, shown in Figure 1(a), of a 2-dimensional ISG with a stencil of three dependences of distance 1. These point dependences give rise to tile dependences, which require synchronization and communication between tiles. We address the specific case where a tile needs data computed in the tile to its left and the tile below it in order to execute. Execution of tiles proceeds in a wavefront fashion: each processor will be at least one row ahead of its neighbor to the right. Rebalancing can prevent much larger imbalances.
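To make the dependence structure concrete, the following is a minimal sketch (our own illustration, not code from the paper) of how the tile-level dependences induce anti-diagonal wavefronts: tile (r, c) can execute only after the tile to its left, (r, c-1), and the tile below it, (r-1, c), have completed, so all tiles with the same value of r + c may execute concurrently.

def wavefronts(num_rows, num_cols):
    """Group tiles (row, col) into wavefronts that may execute concurrently,
    assuming tile (r, c) depends on (r, c-1) to its left and (r-1, c) below it."""
    fronts = []
    for k in range(num_rows + num_cols - 1):
        fronts.append([(r, k - r) for r in range(num_rows) if 0 <= k - r < num_cols])
    return fronts

# For a 3x4 grid of tiles, wavefront k holds the tiles on anti-diagonal r + c = k.
for k, front in enumerate(wavefronts(3, 4)):
    print("wavefront", k, ":", front)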
Our target architecture is a distributed-address-space machine such as an IBM SP2 or a network of workstations. Initially, each processor is allocated the same fixed number of columns of tiles in a block distribution. Each processor executes rows of tiles in its allocated columns a row at a time. If each processor has the same capacity and load, then this allocation should be efficient. However, if processor loads can vary over time, then dynamically adjusting the allocation of columns to achieve a load balance suited to the individual loads is desirable. Varying processor loads can arise because of an evolving irregular application or because of contention. We present the Left-Right Handoff Protocol, a simple dynamic load balancing algorithm which does not require centralized control and is independent of network configuration. Using our algorithm, a given processor communicates only with its left and right neighbors in successive steps to determine whether some columns of tiles should be reallocated, as illustrated in Figure 1(a). When the amount of rebalancing is below a certain threshold, no idle time is introduced; that is, the faster processors do not need to wait at barriers for the slower processors.
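As an illustration of the kind of local decision such a protocol enables, here is a hypothetical rescheduling rule (the function name, the rate-based formula, and the 10% threshold are our own assumptions, not taken from the paper): a left/right pair compares how quickly each side finished its last row and shifts whole columns toward the faster neighbor only when the imbalance exceeds a threshold.

def columns_to_shift(cols_left, cols_right, time_left, time_right, threshold=0.10):
    """Return n > 0 to move n columns from the left processor to the right one,
    n < 0 to move them the other way, and 0 when the pair is balanced enough."""
    rate_left = cols_left / time_left      # columns completed per unit time
    rate_right = cols_right / time_right
    total = cols_left + cols_right
    target_left = round(total * rate_left / (rate_left + rate_right))
    shift = cols_left - target_left        # positive: left side is overloaded
    if abs(shift) <= threshold * total:    # small imbalances cause no transfer
        return 0
    return shift

# The right processor ran twice as fast on the last row, so it takes on more columns.
print(columns_to_shift(8, 8, time_left=2.0, time_right=1.0))   # prints 3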
There is an enormous amount of literature on dynamic load balancing [XL97]. Our scheme uses a local rescheduler; this has an advantage over methods using a global scheduler, such as [SS94], which can require a dedicated processor and global synchronization, and may not scale well to a large number of processors. Our technique falls in the class of deterministic iterative load-balancing techniques, more specifically diffusion methods, in which work can "diffuse" over time from processors with high "concentrations" of work to a less busy neighbor [XL94]. Diffusion methods can be further classified with respect to communication synchronization. Optimal solutions for synchronous diffusion techniques in the context of non-changing loads have been determined [Cy89], and in the case of changing loads, the variance can be bounded. However, the required synchronization may be costly. Asynchronous schemes have the potential to be faster, making them preferable in many cases. In the asynchronous case, convergence with non-varying loads can be guaranteed if communication delays are bounded; however, the rate of convergence and the optimal solution are not theoretically known [XL94]. Experimental results are therefore useful in determining the practicality of asynchronous schemes. In this paper, we present experiments to determine the efficiency and time to convergence of our asynchronous load-balancing scheme on an application that fits our ISG assumptions.
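For contrast with our asynchronous scheme, the sketch below shows the classical synchronous diffusion update in the style of [Cy89] on a chain of processors (the loads, the diffusion parameter alpha, and the sweep count are illustrative choices of ours): in every sweep each processor exchanges a fixed fraction of its load difference with each neighbor, and all processors must advance in lockstep.

def diffusion_sweeps(load, alpha=0.25, sweeps=40):
    """Synchronous diffusion on a chain: every sweep, processor i exchanges
    alpha * (load[j] - load[i]) with each neighbor j; total load is conserved."""
    for _ in range(sweeps):
        new = list(load)
        for i in range(len(load)):
            for j in (i - 1, i + 1):
                if 0 <= j < len(load):
                    new[i] += alpha * (load[j] - load[i])
        load = new
    return load

# Starting from an unbalanced chain, the loads diffuse toward the average (8.0).
print(diffusion_sweeps([16.0, 4.0, 2.0, 10.0]))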
The Left-Right Handoff Protocol is given by the finite state machine diagram in Figure 1(b). Both a processor, L, and the processor to its right, R, use the same diagram, although they may be in different states and carry out different actions in each state. Processors are numbered successively starting with 0; initially the left processors are even-numbered and the right processors are odd, but on the next row the left processors are odd and the right processors are even, and so on. When the Left-Right Data Transfer state is reached, L sends the data needed by R for its next row, and R waits to receive the data. The processors then return to the Listening state. Each processor has a checkpoint in its current row. Prior to reaching the checkpoint, a processor can service a request to reschedule. However, if a processor reaches its checkpoint before a request is received, execution always continues to the last tile of that row. The states of the finite state machine in Figure 1(b) are partitioned according to whether the processor has neither sent nor received a message; has either sent or received a message, but not both; or, finally, has both sent and received.
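A small helper (our own, not from the paper) makes the alternating pairing explicit, assuming P processors numbered 0 through P-1: on even-numbered rows processor 2k is the left partner of processor 2k+1, and on odd-numbered rows processor 2k+1 is the left partner of processor 2k+2. In this sketch an end processor with no partner on a given row simply skips that row's negotiation, which is our assumption rather than a detail stated above.

def handoff_pairs(num_procs, row):
    """Return the (left, right) pairs that run the protocol on the given row."""
    start = 0 if row % 2 == 0 else 1
    return [(p, p + 1) for p in range(start, num_procs - 1, 2)]

for row in range(3):
    print("row", row, ":", handoff_pairs(4, row))
# row 0 : [(0, 1), (2, 3)]
# row 1 : [(1, 2)]
# row 2 : [(0, 1), (2, 3)]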
Intuitively, processor R starts executing tiles in its current row in Listening mode, checking for either
a checkpoint message from L, or whether its checkpoint is reached. If a message is received, then R knows
that L is ahead in its current row, and moves to Rescheduling state to call its local rescheduler to make a
decision. It then sends a rescheduling message to L (which includes the data below the reallocated tiles), and
proceeds in Finish-Row state to nish the row's remaining work, which may have changed as a result of the
rescheduling. When that work is completed, a Left-Right Data Transfer is performed. On the other hand,
if R reaches its checkpoint before it has received a message from L, it sends L a checkpoint message, and proceeds in Fast-Forward state to execute its current row. After completing this row, R waits to receive a
message from L, which is either a rescheduling or checkpoint message. In the case of a rescheduling message,
R moves to Rescheduling state and processes the message. In both this case and the case of a checkpoint
message, R then performs a Left-Right Data Transfer before it returns to Listening state.
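The table below records our reading of the transitions just described (state and event names are paraphrased from Figure 1(b) and the text; this is a sketch, not the authors' implementation). Tracing the two scenarios shows that the right processor passes through the Left-Right Data Transfer state once per row before returning to Listening.

TRANSITIONS = {
    ("Listening",     "recv checkpoint msg"):        "Rescheduling",
    ("Listening",     "reached own checkpoint"):     "Fast-Forward",
    ("Rescheduling",  "sent rescheduling msg"):      "Finish-Row",
    ("Rescheduling",  "processed rescheduling msg"): "Data Transfer",
    ("Finish-Row",    "finished row"):               "Data Transfer",
    ("Fast-Forward",  "recv rescheduling msg"):      "Rescheduling",
    ("Fast-Forward",  "recv checkpoint msg"):        "Data Transfer",
    ("Data Transfer", "row data exchanged"):         "Listening",
}

def trace(events, state="Listening"):
    """Follow a sequence of events through the state table, starting in Listening."""
    states = [state]
    for event in events:
        state = TRANSITIONS[(state, event)]
        states.append(state)
    return states

# R hears L's checkpoint message first, so it reschedules and finishes the row:
print(trace(["recv checkpoint msg", "sent rescheduling msg",
             "finished row", "row data exchanged"]))
# R reaches its own checkpoint first and later gets only a checkpoint message from L:
print(trace(["reached own checkpoint", "recv checkpoint msg",
             "row data exchanged"]))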
In the full paper, we experiment with a variety of checkpoint locations and reschedulers. We compare the execution times of no load balancing, our Left-Right Handoff Protocol, and a version which performs global synchronization at the end of every row, on both varying and static but unequal loads. We also measure the time to convergence on static but non-uniform loads.

References

[Cy89] G. Cybenko. "Load balancing for distributed memory multiprocessors". Journal of Parallel and Distributed Computing, 7, 1989.
[SS94] B. S. Siegell and P. Steenkiste. "Automatic generation of parallel programs with dynamic load balancing". IEEE Symposium on High Performance Distributed Computing, Aug. 1994.
[W96] M. J. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, 1996.
[XL94] C.-Z. Xu and F. C. M. Lau. "Iterative dynamic load balancing in multicomputers". Journal of the Operational Research Society, 45(7):786-796, 1994.
[XL97] C.-Z. Xu and F. C. M. Lau. Load Balancing in Parallel Computers. Kluwer, 1997.
Acknowledgement: We gratefully acknowledge the guidance of Professor Fran Berman in her CSE 260 class.
Figure 1: (a) Two-dimensional Iteration Space with Dynamic Load Balancing (dependence pattern, processors P0-P3); (b) Finite State Machine diagram for the Left-Right Handoff Protocol (states: Listening, Fast-Forward, Rescheduling, Finish-Row, Left-Right Data Transfer).