
LOAD BALANCING PHYSICS ROUTINES
R. W. FORD
Centre for Novel Computing, The University of Manchester, U.K.
and
P. M. BURTON
Meteorological Office, Bracknell, U.K.
ABSTRACT
This paper presents three algorithms for load balancing physics routines with
dynamic load imbalance. Results are presented for the two most computationally demanding load imbalanced physics routines (short wave radiation
and convection) in the UKMO's forecast and climate models. Results show
between 30% and 40% performance improvement in these routines running
on the UKMO's Cray T3E.
1 Introduction

The physics phase of forecast and climate models is particularly amenable to large
scale parallelism as, typically, each grid column can be independently computed.
However, the existence of load imbalance in this phase is well known [6, 8] and has
been widely acknowledged as being one of the major factors in loss of performance,
particularly when using large numbers of processors, for example see [5, 9, 14, 15].
Interestingly, the same load imbalance problem is also observed in data assimilation
[9, 10].

A domain decomposition approach is usually employed during the physics phase of
parallel forecast and climate models, where equal-sized regions of the computational
grid are allocated to processors. In the U.K. Meteorological Office's (UKMO's) case,
these regions are rectangular in latitude and longitude and there are an equal number
of processors and regions. If no load balancing is performed, then all processors
will compute an equal number of grid columns (ignoring differences due to integer
division) and typically, for every time-step, a particular processor will compute the
same grid columns. Unfortunately, the computational cost of grid columns can vary
significantly, leading to load imbalance.
Improving load balance implies a re-distribution of grid columns. For a load
balancing algorithm to be effective, the performance benefit due to improved load
balance must outweigh any load balancing overheads. These overheads can be split
into the following:

- Communication. The total amount of communication used to re-distribute the grid columns should be minimised (assuming there is no benefit in performing the communication in steps to avoid network contention) and overlapped with computation where possible.

- Number of messages. The number of messages required to perform the communication should be minimised. If the amount of communication is fixed then this also maximises the size of messages. It also maximises the number of grid columns computed at a time (a chunk), if each chunk associated with a message is computed separately.

- Memory requirements. The increased memory requirements should be minimised.

- Chunk size. The chunk size should be maximised. Some of the benefits of doing so are noted in [3]. On vector architectures, small chunks result in small vector lengths, thus reducing performance. On message passing architectures small chunks potentially lead to short messages, which can also be inefficient. If the model implementation dynamically allocates arrays (allowing the same executable to run with varying resolutions and processor counts without re-compilation) then there is also an overhead per chunk; this is significant in the UKMO's unified forecast and climate model (UM) [2]. If smaller chunks prove beneficial, for example due to better cache use, then the larger chunk can subsequently be blocked.

The relative importance of the overheads described above will vary depending on
the characteristics of the model and machine that it is running on. Attempting to
minimise all of these overheads is useful, as it maximises the chance of obtaining good
performance across a number of current and next generation machines.
As the sophistication of the load balance algorithm increases, there is a corresponding improvement in load balance and decrease in the above overheads. However, more
sophisticated algorithms take longer to compute and may use global communication
to calculate the load imbalance. The philosophy of this paper is to develop more sophisticated algorithms, as their cost is small compared to the cost of load imbalance
and the load balance overheads described above.
For our purposes load imbalance can be usefully split into static and dynamic imbalance, then further split into accurately predictable and un-predictable imbalance.


- Static imbalance (predictable and un-predictable), where load may vary over grid columns but not over time-steps. Examples of this type of imbalance may be found in routines which perform computation based on orographic features.

- Predictable dynamic imbalance, where the load also varies over time-steps, but in a manner such that the load for each grid column can be accurately estimated before the computation takes place. An example of this type of imbalance is short wave radiation in the UKMO's UM.

- Un-predictable dynamic imbalance, where the load varies in an un-predictable manner such that the load for each grid column cannot be accurately estimated a priori. This is typically found when work depends on dynamic meteorological phenomena. Examples are convection and precipitation.

This paper presents solutions targeted at the latter two forms of load imbalance.
Clearly, however, dynamic algorithms can also cope with static distributions. These
algorithms are tested on the UKMO's two most computationally intensive load-imbalanced routines, short wave radiation and convection. Short wave radiation is
used to test our predictable dynamic imbalance algorithm. This re-distributes grid
columns with the minimum amount of communication and attempts to minimise the

number of messages. Convection is used to test two un-predictable dynamic imbalance algorithms. The first enables under-loaded processors to steal chunks from
over-loaded processors and the second uses the execution time from the previous
time-step as an estimate of the load for the current time-step.
In Section 2 we describe how our work complements other work in this area.
Sections 3 and 4 present the algorithms developed for predictable and unpredictable
dynamic imbalance, respectively. Section 5 gives the results of our experiments for
both the UKMO's operational forecast and climate models running on their Cray
T3E. Finally, Section 6 draws some conclusions and points the way to future work in
this area.
2 Related Work

A number of static grid column distributions have been suggested to solve (or reduce) the problem of (combined static and dynamic) load imbalance. Schattler and
Krenzien [11] suggest decomposing the grid into more regions than processors. These
are allocated such that processors perform computation from disparate parts of the
globe. The hope is that load imbalance will be smoothed out. This method is simple
to implement and is likely to improve load balance, especially if the imbalance is
dominated by radiation. However, it is likely to vary depending on the input data
and, in general, requires experiments with the model to find the optimum number of
regions (which may change depending on the input data). Dynamic load balancing
algorithms would be much more able to cope with changing loads and do not rely on a

statistical smoothing over the globe. Note, this static distribution could subsequently
use a dynamic algorithm to further improve load balance. In [20] we distribute grid
columns cyclically across processors (the extreme version of the above distribution) for
data assimilation, significantly improving the performance.

Foster and Toonen [12] show that the load balance of the PCCM2 model [16] can
be improved by swapping grid columns between processors allocated regions on the
opposite side of the globe in longitude. Three ways of achieving this are presented,
swapping every other grid column, swapping excess day-lit grid columns and swapping
grid columns based on a pre-computed estimate of the cost of both lit and non-lit grid
columns. The algorithms work well as (again) the radiation costs dominate. They
note that there are other variances, but suggest the partitioning of PCCM2, which
pairs symmetric north/south latitudes, compensates for this. This may not always be
the case in general (for example in local area models) and is not the case in the UM.
Our algorithm for short wave radiation can be considered as a global version of their
method. Our algorithm re-partitions data amongst all processors guaranteeing load
balance, whereas they load balance purely between pairs which does not. In their
case symmetry ensures a good solution. Their methods require communication only
between pairs of processors when pre-computing how many grid columns should be
sent (whereas ours requires global communication). However, their method does not

guarantee to minimise the number of grid columns sent (globally) and therefore data
transferred between these processors, which is the dominant communication cost.
Dent and Mozdzynski [13] also employ a global re-partitioning algorithm for an
irregular distribution of radiation grid columns in ECMWF's IFS [1]. As in our
algorithm, they send grid columns from the most over-loaded processor to the most
under-loaded processor until all processors are load balanced. They claim that this
minimises the number of messages; however, to our knowledge, they do not try to
match pairs, which can give a further reduction in the number of messages required.
If the (static or dynamic) computation costs of the grid columns can be modelled
accurately by a set of weighting factors, typically based on whether certain computation within the physics routine takes place at a grid column, then the grid columns
can be re-partitioned to give load balance. In [3] we used the top level of a set of
gather masks to predict the computational cost of each grid column and improve load
balance. Isaksen and Hamrud [10] have also suggested, and Saarinen, Vasiljevic and
Jarvinen [19] have used, weighting factors for a static imbalance (in data assimilation). Foster and Toonen [12] (as mentioned earlier) pre-compute the cost of lit and
non-lit grid columns. However, there may be many weighting factors, their accurate
calculation may be difficult and their values will vary across architectures and as the
physics are improved. Again, a dynamic load balance algorithm could be used in
conjunction with this technique.
In [3] and [4] we explored the idea of using the execution time from the previous
time-step to balance the load on the current time-step, presenting algorithms for

shared address space parallel architectures; this paper presents a distributed memory
variant of this algorithm. This technique is based on the observation that the load
distribution is slowly varying over time-steps, something that one might expect as slow
changes are usually required for model stability. It effectively turns an unpredictable
load imbalance problem into a predictable load imbalance problem. This method

could also be used for the first few time-steps to load balance a static load imbalance
problem.
Elbern [17] uses execution time from the previous time-step for a dynamically
varying workload, a two-dimensional partition and a local shuffling of work between
adjacent processors. Our algorithm moves grid columns from over-loaded processors to under-loaded processors irrespective of their location. We also treat the grid
columns as a single vector which allows us to load balance to the resolution of a single
grid column rather than a line of grid columns.
If the execution time from the previous time-step is not a good measure of the
current load, or the cost of the algorithm is prohibitive, then a fully dynamic algorithm
such as Dynamic Chunk or Guided Self Scheduling (GSS) [7] can be used. We present
results of a dynamic chunk implementation. This was chosen as it gives simple
control over the chunk size.

3 Predictable dynamic imbalance: Short Wave Radiation

Short wave radiation in the UM is a good example of predictable load imbalance. For
the load to be predictable, we must make some assumptions as to its characteristics.
Examining the code, we observe that it is only grid columns which are lit that perform
computation. In the UM these grid columns are gathered into smaller arrays. Non-lit
grid columns can therefore be safely assumed to have zero cost. The next observation
is that each lit grid column performs approximately the same amount of computation.
Our assumption therefore, is that each lit grid column takes the same amount of time
to compute. Earlier work has proven the accuracy of this assumption [3]. Given these
assumptions, an equal distribution of lit grid columns across processors should result
in an equal amount of time per processor and therefore load balance.
In a global model, half of the grid columns (G) will be lit and half will be
in darkness. If there are P processors, a standard domain decomposition would give
each processor G/P grid columns. As the number of lit grid columns is G/2, a
load balanced solution would allocate G/2P lit grid columns to each processor.
For a standard domain decomposition, where the grid columns are partitioned in
latitude and longitude, for any reasonable number of processors it is easy to see that
some processors will have all of their grid columns lit. This suggests that we should
observe up to 50% load imbalance in short wave radiation in a global model.
In a standard domain decomposition there is, therefore, a non-uniform distribution
of lit grid columns to processors. These must be re-distributed equally amongst
processors. The minimum amount of communication required to achieve load balance
is relatively simple to calculate. If a processor has more than the average number of
grid columns, it computes the average number of grid columns and sends the excess
to processor(s) with less than the average. If a processor has less than the average
number of grid columns, it computes all of its grid columns and receives the difference
between its number of grid columns and the average.

If the amount of communication required per grid column is $C$, $G_i$ is the number
of grid columns on processor $i$ and $G_{av}$ is the average number of grid columns per
processor, $G_{av} = \frac{1}{P}\sum_{i=1}^{P} G_i$ (which should be $G/2P$ in the global case), then the
minimum amount of communication is $C \sum_{i \in [1 \ldots P],\, G_i > G_{av}} (G_i - G_{av})$.
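
The following minimal sketch (in Python, with illustrative names such as lit_counts and comm_per_column that do not come from the UM) shows how this minimum could be computed from the per-processor column counts:

    def minimum_communication(lit_counts, comm_per_column):
        """Minimum data volume needed to equalise the lit grid columns.

        lit_counts[i] is the number of lit grid columns currently held by
        processor i; comm_per_column is the cost C of moving one column.
        Only the excess above the average ever needs to move, so the
        minimum is C times the summed excess of the over-loaded processors.
        """
        p = len(lit_counts)
        g_av = sum(lit_counts) / p           # average lit columns per processor
        excess = sum(g - g_av for g in lit_counts if g > g_av)
        return comm_per_column * excess

    # Example: 4 processors holding 72, 48, 24 and 0 lit columns (average 36).
    print(minimum_communication([72, 48, 24, 0], comm_per_column=1.0))   # 48.0
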
Achieving this re-distribution with the minimum number of messages (and therefore maximum message size) is much more difficult. We attempt to minimise the
number of messages by sending grid columns from the most over-loaded processor to
the most under-loaded processor until all processors are load balanced.
Our algorithm orders processors in terms of load (this ordering requires global communication), then sends grid columns from
the processor at the head of the list to the processor at the tail of the list. The number

of grid columns transferred is the amount required to balance the least imbalanced of
the two processors. At least one processor is always guaranteed to be load balanced
after this step. Both will be balanced if they are pairs, i.e. they are equally over- and under-loaded. If they are not pairs, the reduced imbalance on the remaining
processor is checked against the list to see if it has a pair. If so, grid columns are
transferred between them and they are removed from the list. If not, the processor
is re-inserted in the list. This process continues until all processors are balanced;
see [18] for a more detailed analysis of this algorithm.
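
As an illustration, a simplified sketch of this pairing scheme is given below (Python, illustrative names only). It always pairs the current most over-loaded and most under-loaded processors and assumes the average is a whole number of columns; the full algorithm additionally searches the list for an exact matching pair after a partial transfer, see [18].

    def plan_transfers(lit_counts):
        # Sketch of the pairing heuristic: returns (sender, receiver, n_columns)
        # messages that move excess lit columns to under-loaded processors.
        g_av = sum(lit_counts) // len(lit_counts)
        imbalance = {i: g - g_av for i, g in enumerate(lit_counts) if g != g_av}
        transfers = []
        while imbalance:
            head = max(imbalance, key=imbalance.get)    # most over-loaded
            tail = min(imbalance, key=imbalance.get)    # most under-loaded
            n = min(imbalance[head], -imbalance[tail])  # balance the less imbalanced one
            transfers.append((head, tail, n))
            imbalance[head] -= n
            imbalance[tail] += n
            for proc in (head, tail):
                if imbalance[proc] == 0:
                    del imbalance[proc]
        return transfers

    # Example: 4 processors holding 72, 48, 24 and 0 lit columns (average 36).
    print(plan_transfers([72, 48, 24, 0]))   # [(0, 3, 36), (1, 2, 12)]
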
The UM gathers the lit grid columns on each processor into contiguous dynamically allocated arrays. We increase the size of these dynamic arrays appropriately
so that there is space to store any grid columns sent from remote processors. This
technique allows remote grid columns to be computed in the same chunk as local grid
columns, thus maximising chunk sizes.

4 Unpredictable dynamic imbalance
Convection in the UM is a good example of unpredictable dynamic imbalance. The
complexity of the control paths in the code makes it difficult to obtain an accurate
estimate of the computational cost of a grid column (although in [3] we show that
some performance improvement is possible using this approach). In this section,
we present two possible approaches to unpredictable dynamic imbalance, namely,
dynamic chunk and dynamic feedback.
4.1 Dynamic Chunk
Dynamic chunk is one of a class of dynamic algorithms that splits the local grid
columns into a number of chunks. Chunks are then computed locally until all are
completed. When complete, the processor looks for chunks from remote processors
which have yet to be computed and steals them. Dynamic chunk uses a fixed chunk
size which allows easy control. Such algorithms do not require any application-specific
information and can be implemented efficiently on a Cray T3E. Note, this algorithm
uses the T3E's global address space, one-sided messaging and atomic operations to
implement a shared memory algorithm, allowing chunk stealing from a processor
while it is computing; this is not possible to implement efficiently in message passing
without a separate server thread. There is also a clear trade-off between the load
balance, which improves as the number of chunks increases (as it is not possible to
load balance to a finer resolution than a chunk), and the relative overheads associated
with a chunk, which increase as the chunk size decreases.
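
A shared-memory analogue of this chunk-stealing loop is sketched below using Python threads (all names are illustrative). On the T3E the per-processor chunk counter would live in globally addressable memory and be claimed with one-sided atomic operations rather than a lock:

    import threading

    class ChunkQueue:
        # Per-processor queue of chunk indices; the lock stands in for the
        # T3E's atomic fetch-and-increment on a globally addressable counter.
        def __init__(self, n_chunks):
            self.next_chunk = 0
            self.n_chunks = n_chunks
            self.lock = threading.Lock()

        def claim(self):
            # Atomically claim the next uncomputed chunk, or return None.
            with self.lock:
                if self.next_chunk < self.n_chunks:
                    chunk = self.next_chunk
                    self.next_chunk += 1
                    return chunk
            return None

    def worker(my_rank, queues, compute_chunk):
        # Drain the local queue first, then steal from remote processors.
        others = [r for r in range(len(queues)) if r != my_rank]
        for owner in [my_rank] + others:
            while (chunk := queues[owner].claim()) is not None:
                compute_chunk(owner, chunk)

    # Example: 4 "processors", 5 fixed-size chunks each, trivial work per chunk.
    queues = [ChunkQueue(5) for _ in range(4)]
    threads = [threading.Thread(target=worker, args=(r, queues, lambda o, c: None))
               for r in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
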
4.2 Dynamic Feedback
Our dynamic feedback algorithm uses the execution time from the previous time-step
to estimate the load per grid column for the current time-step. The first time-step
simply uses the time taken by a processor and the number of grid columns it computes
to estimate the load of each grid column. The work is then re-distributed based on
this estimation to load balance the code to give each processor an average amount of
time. The same algorithm discussed in Section 3 is used to minimise communication
and hence the number of messages. However, this version is a little more complicated
as we are balancing time, which can be split to any accuracy, and transferring grid
columns, which can obviously only be transferred in integer numbers. This means
that the pairs option discussed in Section 3 cannot be used and that under-loaded
processors cannot be removed from the list; see [18] for more details. Subsequent
time-steps use the time from locally and remotely computed chunks to estimate the
load.
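
A minimal sketch of the first feedback step is given below (Python, illustrative names; the real routine works on the UM's gathered arrays). It turns the measured times from the previous time-step into a per-column cost estimate and a time surplus per processor, which the re-distribution scheme of Section 3 then converts into whole grid columns:

    def feedback_estimate(times, columns):
        # times[i]: measured routine time on processor i at the previous time-step.
        # columns[i]: number of grid columns processor i computed.
        t_av = sum(times) / len(times)
        per_column_cost = [t / c for t, c in zip(times, columns)]
        time_surplus = [t - t_av for t in times]   # >0: over-loaded, <0: under-loaded
        return per_column_cost, time_surplus

    # Example: processor 0 is roughly twice as expensive per column as the others.
    cost, surplus = feedback_estimate(times=[2.0, 1.0, 1.0, 1.0],
                                      columns=[100, 100, 100, 100])
    print(cost)     # [0.02, 0.01, 0.01, 0.01]
    print(surplus)  # [0.75, -0.25, -0.25, -0.25]
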
Each local chunk and each remote chunk is computed and timed separately. This
allows accurate timing of all chunks, minimises extra memory use and maximises the
size of local chunks. However, although the algorithm also attempts to keep remote
chunks as large as possible, these can be arbitrarily small (down to a single grid
column). In the UM's convection routine, the cost of a grid column in a small chunk
is significantly greater than the cost of the same grid column in a large chunk; see
Section 5.2.
5 Results

5.1 Short Wave Radiation
Figure 1 shows the results of running the UM with its current static partition (labelled
"asis") and with the new predictable dynamic load balance algorithm (labelled "lb"),
described in Section 3. The example is a forecast model running on 144 processors
(the number of processors used operationally). For both methods the maximum and
average time is given. The maximum time shows how long the routine takes, whereas
the average time shows how long the routine would take if it were perfectly balanced
(the ideal). Our method load balances short wave radiation well, reducing the time to
solution by nearly 40%. The difference between the average load balanced time and
average original time gives a measure of the overhead of the algorithm. The difference
between the maximum load balanced time and the average load balanced time gives
a measure of the inaccuracy in the assumptions mentioned in Section 3.
[Figure 1: Predictable dynamic imbalance algorithm with 144 processors. Time (s) per time-step for asis max, asis av, lb max and lb av.]

5.2 Convection

Figure 2 shows the results of running the UM with its current static partition (labelled
"asis") and with the new dynamic chunk algorithm (labelled "lb"), described in Section
4.1, for three and five chunks. The example is the same forecast model run used in
the previous section, again running on 144 processors (the number of processors used
operationally). For both methods the maximum and average time is given. The
maximum time shows how long the routine takes, whereas the average time shows
how long the routine would take if it were perfectly balanced (the ideal). Our method
improves the load balance of convection, reducing the time to solution by over 40%.
The difference between the average load balanced times and the average original time is
negligible; however, there is some load imbalance remaining. Five chunks per processor
gives a better load balance solution than three chunks per processor.
The dynamic feedback algorithm (described in Section 4.2) and the dynamic chunk
algorithm (described in Section 4.1) were tested with a climate model. The climate
model was chosen as it presents a greater challenge to the load balancing algorithms

due to the small size of the data set.

[Figure 2: Dynamic Chunk algorithm with 144 processors. Time (s) per time-step for asis max/average and lb 3 segs and 5 segs max/av.]

The results of running the UM with the current
static partition (labelled "asis") and with both the new dynamic chunk
algorithm (labelled "lb chunk") and the dynamic feedback algorithm (labelled "lb
feedback") are given in Figure 3. The example is running on 36 processors (the
number of processors used operationally). The original code computes convection in
3 chunks on each processor; if only 1 chunk is used we see a corresponding increase in
performance. Maximum and average (the ideal) times are given for these two runs.
The difference between these runs shows the increased overhead of using smaller chunks. The dynamic
chunk algorithm provides further improvement and the feedback algorithm gives the
best results, reducing the time to solution by approximately 33%.
We have mentioned the overhead associated with decreasing the size of chunks.
In Figure 4 we quantify this overhead. In this example, a climate model was run
sequentially for two time-steps and for a number of different chunk sizes. The results
shown are for time-step 2; however, the time-step 1 results are nearly identical. The
results presented are for the total time taken by convection and therefore show the
relative cost of a grid column in different sized chunks. Small chunks are seen to
have a significantly greater cost than larger chunks. A grid column allocated its own
chunk will cost more than seven times that of a grid column in a large (>100 grid
columns) chunk. Chunks larger than those presented gave no further performance
improvement.

[Figure 3: Dynamic Chunk and Feedback algorithms with 36 processors. Time (s) per time-step for asis max/average, asis 1 seg max/average, lb chunk 3 segs max and lb feedback max.]

[Figure 4: Total convection time with varying chunk size. Summed convection segment time (s) against segment size.]

6 Conclusions and Future Work
This paper has presented an effective algorithm for load balancing predictable dynamic load imbalance, which minimises communication between processors and attempts to minimise the number of messages. This results in a nearly 40% performance
increase for short wave radiation in the UKMO's operational forecast model. This
routine will be integrated into the UM in the next release.
It has also presented two effective algorithms for load balancing un-predictable dynamic load imbalance. These improve the performance of convection in the UKMO's
operational forecast and climate models by between 30% and 40%. These algorithms
are also expected to be applicable to other physics routines in the UM. The dynamic
chunk algorithm is already being used operationally for convection.
For future work we intend to examine the overheads of the dynamic feedback algorithm in more detail, introducing modifications to increase the chunk size by merging
remote chunks, if appropriate, and possibly attempting to reduce the dynamic memory allocation overhead which increases the costs of small chunk sizes. We also intend
to implement the most appropriate algorithms in other physics routines and examine
the limits of scalability of the model.

7 Acknowledgements
The authors would like to thank the UKMO for funding this research.

References
[1] Barros, S. R. M., D. Dent, L. Isaksen and G. Robinson (1994) The IFS model - overview and parallel strategies, Proc. 6th Workshop on the Use of Parallel Processors in Meteorology (G.-R. Hoffman and N. Kreitz, eds.), World Scientific, pp. 303-318.

[2] Bell, R. S. and A. Dickinson (1990) The Meteorological Office Unified Model for data assimilation, climate modelling and NWP and its implementation on a Cray Y-MP, Proc. 4th Workshop on the Use of Parallel Processors in Meteorology, World Scientific.

[3] Ford, R. F., D. Snelling and A. Dickinson (1994) Controlling load balance, cache use and vector length in the UM, Proc. 6th Workshop on the Use of Parallel Processors in Meteorology (G.-R. Hoffman and N. Kreitz, eds.), World Scientific, pp. 195-205.

[4] Bull, J. M., R. W. Ford and A. Dickinson (1996) A feedback based load balance algorithm for physics routines in NWP, Proc. 7th Workshop on the Use of Parallel Processors in Meteorology (G.-R. Hoffman and N. Kreitz, eds.), World Scientific, pp. 239-249.

[5] Michalakes, J. G. and R. S. Nanjundiah (1994) Computational load in model physics of the parallel NCAR Community Climate Model, Tech. Rep. ANL/MCS-TM-186, Argonne National Laboratory, Argonne, Illinois.

[6] Michalakes, J. G. (1991) Analysis of workload and load balancing issues in the NCAR Community Climate Model, Tech. Rep. ANL/MCS-TM-144, Argonne National Laboratory, Argonne, Illinois.

[7] Polychronopoulos, C. P. and D. J. Kuck (1987) Guided self-scheduling: a practical scheduling scheme for parallel supercomputers, IEEE Trans. Comp., vol. C-36, no. 12, pp. 1425-1439.

[8] Dent, D., et al. (1994) The IFS model performance measurements, Proc. 6th Workshop on the Use of Parallel Processors in Meteorology (G.-R. Hoffman and N. Kreitz, eds.), World Scientific, pp. 352-369.

[9] Burton, P. and A. Dickinson (1996) Parallelising the Unified Model for the Cray T3E, Proc. 7th Workshop on the Use of Parallel Processors in Meteorology (G.-R. Hoffman and N. Kreitz, eds.), World Scientific, pp. 68-82.

[10] Isaksen, L. and M. Hamrud (1996) ECMWF operational forecasting on a distributed memory platform: analysis, Proc. 7th Workshop on the Use of Parallel Processors in Meteorology (G.-R. Hoffman and N. Kreitz, eds.), World Scientific, pp. 22-35.

[11] Schattler, U. and E. Krenzien (1996) Model development for parallel computers at DWD, Proc. 7th Workshop on the Use of Parallel Processors in Meteorology (G.-R. Hoffman and N. Kreitz, eds.), World Scientific, pp. 83-100.

[12] Foster, I. T. and B. R. Toonen (1994) Load-balancing algorithms for climate models, Tech. Rep. ANL/MCS-TM-190, Argonne National Laboratory, Argonne, Illinois.

[13] Dent, D. and G. Mozdzynski (1996) ECMWF operational forecasting on a distributed memory platform: forecast model, Proc. 7th Workshop on the Use of Parallel Processors in Meteorology (G.-R. Hoffman and N. Kreitz, eds.), World Scientific, pp. 36-51.

[14] Eerola, K., et al. (1996) A parallel version of the Hirlam forecast model: strategy and results, Proc. 7th Workshop on the Use of Parallel Processors in Meteorology (G.-R. Hoffman and N. Kreitz, eds.), World Scientific, pp. 134-143.

[15] Kucken, M. and U. Schattler (1996) REMO - implementation of a parallel version of a regional climate model, Proc. 7th Workshop on the Use of Parallel Processors in Meteorology (G.-R. Hoffman and N. Kreitz, eds.), World Scientific, pp. 144-154.

[16] Drake, J., et al. (1994) PCCM2: A GCM adapted for scalable parallel computers, Proc. AMS Annual Meeting, AMS.

[17] Elbern, H. (1996) Load balancing of a comprehensive air quality model, Proc. 7th Workshop on the Use of Parallel Processors in Meteorology (G.-R. Hoffman and N. Kreitz, eds.), World Scientific, pp. 429-444.

[18] Ford, R. W. (1998) A message minimisation algorithm, CNC Tech. Rep., Department of Computer Science, The University of Manchester, Manchester, U.K.

[19] Saarinen, S., D. Vasiljevic and H. Jarvinen (1996) Parallelization of observation processing, Proc. 7th Workshop on the Use of Parallel Processors in Meteorology (G.-R. Hoffman and N. Kreitz, eds.), World Scientific, pp. 52-67.

[20] Burton, P., R. Ford and D. Salmond (1998) Investigation and implementation of a number of different load balancing strategies in the UK Met. Office's Unified Model, Proc. 4th European SGI/Cray MPP Workshop, Institute for Plasma Physics, Garching/Munich, Germany, September.