
Hypercube Dynamic Load Balancing


Edward J. Wegman
[email protected]

Duane King
[email protected]

Center for Computational Statistics
George Mason University
ABSTRACT
This paper reports on the results of a preliminary study in
dynamic load balancing on an Intel Hypercube. The purpose of
this research is to provide experimental data on how parallel
algorithms should be constructed to obtain maximal utilization
of a parallel architecture. This study is one aspect of an
ongoing research project into the construction of an automated
parallelization tool. This tool will take FORTRAN source as
input, and construct a parallel algorithm that will produce the
same results as the original serial input. The focus of this
paper is on the load balancing aspect of that project. The
basic idea is to reserve a certain percentage of the
computation task, subdivide that percentage into arbitrarily
fine tasks, and dole those small tasks out to nodes on request.
If the percentage is chosen correctly, then a minority of nodes
should be involved in consuming the filler tasks, and the
overall throughput of the job should increase as a result of
the individual node efficiencies having increased.

This paper will outline our approach to performing dynamic load
balancing on an Intel iPSC/2. We take the view that the problem of
load balancing is really a problem of dividing a “computational task”
into smaller components, each of roughly equal complexity, and each an
independent event. After this is done, the components of the task can
be sent to a node for execution. The key to an optimally balanced load
across all computational nodes is the ability to form a statistical
profile of the individual components of each computational task. This
statistical profile will determine an initial sequence of execution.
Our experience indicates that a speedup on the order of 80% is
achievable with the judicious use of profiled load balancing. During
the process of execution, the initial profile will be altered
according to the actual behavior exhibited by the nodes. The
difference between the actual and expected performance will be used to
determine how much additional time should be devoted to altering the
current execution schedule. Currently, our work involves statically
setting the load balancing parameters.

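As a concrete illustration of what profiled load balancing means in
practice, the sketch below (in C; the paper does not specify an
implementation language, so the language, the names, and the greedy
least-loaded heuristic are assumptions of ours rather than the
system's actual method) turns a set of profiled cost estimates into an
initial execution schedule for sixteen nodes.

    #include <stdio.h>
    #include <stdlib.h>

    #define NNODES 16

    /* One independent component of the overall computational task, carrying
     * the rough running-time estimate obtained from the timing profile.     */
    struct task {
        int    id;
        double est_cost;    /* profiled estimate of running time, in seconds */
    };

    /* Sort helper: most expensive components first. */
    static int by_cost_desc(const void *a, const void *b)
    {
        const struct task *x = a, *y = b;
        return (x->est_cost < y->est_cost) - (x->est_cost > y->est_cost);
    }

    /* Greedy initial schedule: hand the most expensive unassigned component
     * to the node with the least estimated work so far.  assign[i] receives
     * the node chosen for tasks[i] (after sorting).                          */
    static void initial_schedule(struct task *tasks, int ntasks, int *assign)
    {
        double load[NNODES] = { 0.0 };
        int i, n;

        qsort(tasks, ntasks, sizeof tasks[0], by_cost_desc);
        for (i = 0; i < ntasks; i++) {
            int target = 0;
            for (n = 1; n < NNODES; n++)
                if (load[n] < load[target])
                    target = n;
            assign[i] = target;
            load[target] += tasks[i].est_cost;
        }
    }

    int main(void)
    {
        struct task tasks[40];
        int assign[40], i;

        for (i = 0; i < 40; i++) {              /* synthetic profile entries */
            tasks[i].id = i;
            tasks[i].est_cost = 1.0 + (i * 7) % 13;
        }
        initial_schedule(tasks, 40, assign);
        for (i = 0; i < 40; i++)
            printf("component %2d (est %4.1f s) -> node %2d\n",
                   tasks[i].id, tasks[i].est_cost, assign[i]);
        return 0;
    }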

Our load balancing system determines the execution schedule from the
source code, with additional user input. After each component is
identified, it is classified into a category. The categories are
determined by timing algorithms and using this timing information as a
rough approximation of running time. The algorithms that have been
timed represent varying degrees of complexity. The greater the
complexity rating of a task, the greater the risk it runs of upsetting
the global load balance. In order to prevent this, the tasks with
higher complexity ratings are broken into smaller components. The
smaller components can then be spread among the nodes, thus reducing
the risk that any one node will impede the progress of all the other
nodes. This severing into smaller components will induce an increased
amount of overhead, but this will be coupled with an overall
performance gain by having all nodes operating continuously. The
translation of the source code into an appropriate mathematical
structure is currently a research project within the Center. We are
currently coding static load balancing parameters into the source code
in order to determine how load balancing should be done. The algorithm
that will be described in this paper generally consists of holding
back a fixed percentage of the computational task from the nodes, and
using that percentage to ensure that all the nodes are concurrently
engaged in useful work.
We have chosen a set of different algorithms to form an initial index
of profiles. The six algorithms are: Mandelbrot Set, Lanczos Nonlinear
Optimization, Multiple Linear Regression, Kernel Density Estimation,
Bootstrapping, and Parallel Coordinate Density. These algorithms
provide a base for doing load balancing on primarily numerical
algorithms. This set of six algorithms was chosen as a testbed since
they are representative of the type of work that we are currently
engaged in. The remainder of this paper will discuss the progress of
our research, and the results we have obtained trying to balance the
computation of the Mandelbrot set across sixteen nodes of a hypercube.
This algorithm was chosen since it is computationally complex, and
exhibits many of the computational problems that are likely to be
encountered. The algorithm has the potential for taking a very long
time to execute, and the source code provides no clues as to which
range of numbers will result in long calculations. The implementation
that we use involves calculating the Mandelbrot set in the range of
-2.00 to +2.00 in both the x and y directions. This is done on a 1024
by 1024 pixel canvas, with each pixel having 1024 possible colors. The
result of the calculation is sent from the nodes of the hypercube to
the manager, which transfers it to a Silicon Graphics Iris 4D-120GTX.
The time required to transfer the data from the hypercube manager to
the Iris is not reflected, since that is more dependent on the amount
of traffic on our local ethernet.
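For reference, the following minimal sketch (in C; the paper does not
give its source code, so the routine names are ours and the iteration
cap of 1024 is an assumption tied to the stated 1024 possible colors)
shows the per-pixel escape-time computation and the mapping of one
vertical strip onto the -2.00 to +2.00 region.

    #include <stdio.h>

    #define WIDTH    1024    /* 1024 by 1024 pixel canvas                 */
    #define HEIGHT   1024
    #define MAX_ITER 1024    /* assumed cap: one value per possible color */

    /* Escape-time count for the point c = (cx, cy).  This count is also the
     * pixel color, and it is what makes the cost vary wildly across the
     * canvas.                                                               */
    static int mandel_pixel(double cx, double cy)
    {
        double zx = 0.0, zy = 0.0;
        int iter = 0;

        while (zx * zx + zy * zy <= 4.0 && iter < MAX_ITER) {
            double tmp = zx * zx - zy * zy + cx;
            zy = 2.0 * zx * zy + cy;
            zx = tmp;
            iter++;
        }
        return iter;
    }

    /* Compute one vertical strip (one pixel column) of the -2.00..+2.00
     * region.                                                              */
    static void compute_strip(int strip, int color[HEIGHT])
    {
        double cx = -2.0 + 4.0 * strip / (WIDTH - 1);
        int row;

        for (row = 0; row < HEIGHT; row++) {
            double cy = -2.0 + 4.0 * row / (HEIGHT - 1);
            color[row] = mandel_pixel(cx, cy);
        }
    }

    int main(void)
    {
        static int column[HEIGHT];

        compute_strip(WIDTH / 2, column);   /* a strip through the centre */
        printf("iterations at the centre pixel: %d\n", column[HEIGHT / 2]);
        return 0;
    }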
The Mandelbrot set was calculated in two distinct fashions. In the
first case, the region is split into 1024 vertical strips, and the
strips are distributed to the nodes for calculation. Each node will
calculate every sixteenth strip, thus performing an approximate load
balancing, assuming that the areas of greatest computation are larger
than one strip wide, so that each node will receive a relatively large
number of high density strips. The second method used was to allow
each node to calculate a consecutive block of 64 strips. This is more
indicative of the way software could choose to break up the problem.
Timings were taken for each of the two methods. After this, the
software was altered so that each method would not calculate a certain
percentage of the vertical strips. The differences in the two timings
would then show approximately what percentage of the data should be
reserved for load balancing.
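The two distributions can be stated very compactly. The fragment below
(C, with our own illustrative naming) maps a strip index to its owning
node under each scheme and prints the owners of the strips near the
centre of the region, where the heavy computation lives; it makes
plain why the blocked scheme concentrates the expensive strips on a
couple of nodes while the interleaved scheme spreads them out.

    #include <stdio.h>

    #define NSTRIPS 1024
    #define NNODES    16
    #define BLOCK   (NSTRIPS / NNODES)   /* 64 consecutive strips per node */

    /* Method 1: interleaved -- node n calculates strips n, n+16, n+32, ... */
    static int owner_interleaved(int strip)
    {
        return strip % NNODES;
    }

    /* Method 2: blocked -- node n calculates the consecutive strips
     * 64*n through 64*n + 63.                                              */
    static int owner_blocked(int strip)
    {
        return strip / BLOCK;
    }

    int main(void)
    {
        int s;

        for (s = 508; s < 516; s++)    /* strips near the centre of the set */
            printf("strip %4d: interleaved -> node %2d, blocked -> node %2d\n",
                   s, owner_interleaved(s), owner_blocked(s));
        return 0;
    }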
The motivation behind this approach is to acquire some hard data
about how successful a load balancing approach can be. We have done
some theoretical work in the Center which indicates that a load
balancing parameter of 50% should provide the best results. Since the
Mandelbrot set was easy to code, and would present a good example of
an algorithm that doesn’t behave “correctly”, it was chosen as the
focus of this initial study. We also needed to find out what would
happen as the internode communications increased. Primarily, as a
system comes into balance, there will have to be an increasing amount
of communication amongst the nodes; will this communication become a
large enough factor to affect the balancing of the nodes?
Additionally, we needed to know just how much a simple attempt at
balancing would cost in terms of performance. Since this is all geared
towards being incorporated into an automated source code parallelizing
system, we also needed to know just what was important about when an
algorithm is done. In running the test data for this paper, we
discovered that the “best” algorithms were those that provided the
most visual information across the greatest portion of the screen the
fastest. This way the observer would be able to select a portion of
the screen for further exploration. While we can see how to code it
for such a purpose, we aren’t sure how to translate those concerns
into recognizable mathematical structures that a computer program can
use as constraints to design an optimal parallel algorithm from
existing serial code.
The data that we obtained from our experiments were the overall
throughput timings, up to the transfer of the result data back to the
hypercube manager. The data that will be presented here will consist
of the time it took for the entire task to complete, and how long each
node spent calculating during that time. It should be stressed that
the timings are given in seconds, and are meant to reflect wall clock
times, not the amount of work the CPU had to do to obtain the results.
Additionally, the node programs were written to use a fairly large
chunk of memory as a cache, to avoid the inherent problems in
asynchronous message passing on the iPSC/2 (namely, the requirement
that buffers that are in transit can’t be modified).
The experiments were run for both versions of the algorithm, the
vertical strips as well as the consecutive blocks of data. For each of
these, either 25, 50, or 75 percent of the vertical strips were held
in reserve for load balancing.
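To make the reserve mechanism concrete, here is a small, purely
sequential simulation (C; the real node programs exchange messages on
the iPSC/2, which is not reproduced here, and the per-strip cost
function below is synthetic rather than measured). The first
(1 - reserve) fraction of the strips is pre-assigned in consecutive
blocks; the reserved strips are then doled out one at a time to
whichever node would ask for work first, i.e. the node with the least
accumulated work.

    #include <stdio.h>

    #define NSTRIPS 1024
    #define NNODES    16

    /* Synthetic per-strip cost: expensive near the centre of the -2..+2
     * region, cheap at the edges -- a stand-in for the real column cost. */
    static double strip_cost(int s)
    {
        double d = (s - NSTRIPS / 2) / (double)(NSTRIPS / 2);  /* -1 .. +1 */
        return 1.0 + 99.0 * (1.0 - d * d);                     /*  1 .. 100 */
    }

    /* Return the simulated completion time (makespan) for a given reserve
     * fraction: statically assigned blocks first, then on-demand doling
     * of the reserved strips.                                             */
    static double run(double reserve)
    {
        double busy[NNODES] = { 0.0 };
        double makespan;
        int npre = (int)((1.0 - reserve) * NSTRIPS);
        int per_node = npre / NNODES;
        int s, n;

        for (n = 0; n < NNODES; n++)            /* static block assignment */
            for (s = n * per_node; s < (n + 1) * per_node; s++)
                busy[n] += strip_cost(s);

        for (s = npre; s < NSTRIPS; s++) {      /* dynamic doling of reserve */
            int idle = 0;
            for (n = 1; n < NNODES; n++)
                if (busy[n] < busy[idle])
                    idle = n;
            busy[idle] += strip_cost(s);
        }

        makespan = busy[0];
        for (n = 1; n < NNODES; n++)
            if (busy[n] > makespan)
                makespan = busy[n];
        return makespan;
    }

    int main(void)
    {
        double r;

        for (r = 0.0; r <= 0.76; r += 0.25)
            printf("reserve %3.0f%%: simulated completion time %8.1f\n",
                   100.0 * r, run(r));
        return 0;
    }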
Four experimental runs were made with the vertical strip
data. The base run, with all 1024 strips distributed to the
nodes, took 366 seconds to complete. With 25% held in
reserve, it took 386 seconds to complete. At 50% in reserve, it
took 385 seconds to complete, and at 75% it took 383 seconds
to complete. Since all three figures have just about equivalent
times, it would appear that trying to load balance this
algorithm is not going to prove worthwhile.
However, trying to load balance the alternate algorithm
would prove to be quite useful. Using blocks of consecutive
strips, it took 1405 seconds to complete the algorithm. The
load balancing would prove to be of considerable value to this
algorithm, as the time dropped to 383 seconds with 75% of the
data held in reserve. With 50% of the data held in reserve, it
took 724 seconds to complete. With 25% of the data held in reserve, it
took 1086 seconds to complete.
Another important factor in the data is the determination
of the load imbalance present using the different reserves.
Under the first method, there is very little imbalance present.
Without holding any data in reserve for balancing, the first
node and the last node to complete only had a ten second
disparity. With the data reserves, this dropped to within one or
two seconds. The second method, however, presented a very
large imbalance. Without holding any data in reserve, half the
nodes were complete within 30 seconds, while the algorithm
didn’t complete until 1405 seconds had elapsed. With 75% of
the data in reserve, there was only a one second disparity in
node completion times. With a 50% data reserve, there was a
308 second disparity, and at 25% reserve, there was a 900
second disparity.
Bearing these results in mind, we have come to the conclusion that it
is necessary to make a probabilistic estimate of completion times, and
when there is a possibility for a large imbalance, it is worthwhile to
take the performance hit, and force all the nodes to perform small
calculations and maintain close communications with the manager. The
alternative to this is to assign each node its large task, and have it
maintain contact with the manager about its progress. This would
enable the manager to reallocate large tasks to processors which have
completed and are sitting idle, waiting for work. The drawback to this
is, of course, that not many algorithms lend themselves to this much
of a fragmented approach.
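The paper does not give a formula for this probabilistic estimate;
purely as an illustration, the sketch below (C, with invented names)
models each node's predicted completion time by a mean and a standard
deviation taken from the profile, and recommends the fine-grained
approach whenever the plausible spread between the slowest and fastest
nodes exceeds a tolerance.

    #include <stdio.h>

    #define NNODES 16

    /* Decide whether to pay for fine-grained, manager-driven balancing:
     * if the predicted completion times of the slowest and fastest nodes
     * could plausibly differ by more than `tolerance` seconds (allowing
     * two standard deviations of slack in each estimate), recommend it.  */
    static int needs_balancing(const double mean[NNODES],
                               const double sd[NNODES],
                               double tolerance)
    {
        double hi = mean[0] + 2.0 * sd[0];
        double lo = mean[0] - 2.0 * sd[0];
        int n;

        for (n = 1; n < NNODES; n++) {
            if (mean[n] + 2.0 * sd[n] > hi) hi = mean[n] + 2.0 * sd[n];
            if (mean[n] - 2.0 * sd[n] < lo) lo = mean[n] - 2.0 * sd[n];
        }
        return (hi - lo) > tolerance;
    }

    int main(void)
    {
        /* Rough stand-ins for the two decompositions measured above:
         * interleaved strips are nearly uniform, consecutive blocks are
         * not.                                                            */
        double strip_mean[NNODES], strip_sd[NNODES];
        double block_mean[NNODES], block_sd[NNODES];
        int n;

        for (n = 0; n < NNODES; n++) {
            strip_mean[n] = 360.0;
            strip_sd[n]   = 10.0;
            block_mean[n] = (n >= 5 && n <= 9) ? 900.0 : 25.0;
            block_sd[n]   = 10.0;
        }
        printf("interleaved strips need balancing: %s\n",
               needs_balancing(strip_mean, strip_sd, 60.0) ? "yes" : "no");
        printf("consecutive blocks need balancing: %s\n",
               needs_balancing(block_mean, block_sd, 60.0) ? "yes" : "no");
        return 0;
    }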

Node completion times for each experimental run (all times are wall
clock seconds):

Vertical strip data (base run):
Node:                     0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
Seconds to Completion:  363  365  366  366  365  365  364  361  362  356  355  356  357  363  364  363

25% reserve vertical strip:
Node:                     0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
Seconds to Completion:  384  384  384  383  384  384  384  384  383  384  384  384  384  383  383  384

50% reserve vertical strip:
Node:                     0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
Seconds to Completion:  385  385  384  385  385  384  384  384  384  384  384  384  385  385  384  385

75% reserve vertical strip:
Node:                     0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
Seconds to Completion:  386  385  385  385  386  385  385  385  385  385  386  385  385  385  385  385

Consecutive block data (base run):
Node:                     0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
Seconds to Completion:   23   23   62  399  384  721 1095 1405 1082  419   23   23   22   22   21   20

25% reserve block data:
Node:                     0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
Seconds to Completion:  186  186  186  307  300  552  847 1086  838  330  186  186  186  186  186  186

50% reserve block data:
Node:                     0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
Seconds to Completion:  316  316  316  316  316  368  565  724  559  316  316  316  316  316  316  316

75% reserve block data:
Node:                     0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
Seconds to Completion:  382  383  382  382  382  382  382  382  382  382  382  382  382  382  382  383