
An Analysis of Diffusive Load-Balancing*

Raghu Subramanian                        Isaac D. Scherson
raghu@ics.uci.edu                        isaac@ics.uci.edu

Department of Information and Computer Science
University of California, Irvine
Irvine, CA 92717-3425, U.S.A.
Abstract


Diffusion is a well-known algorithm for load-balancing in which tasks move from heavily-loaded
processors to lightly-loaded neighbors. This paper presents a rigorous analysis of the performance of
the diffusion algorithm on arbitrary networks.
We derive both lower and upper bounds on the running time of the algorithm. These bounds are
stated in terms of the network's bandwidth.
For the case of the generalized mesh with wrap-around (which includes common networks like the
ring, 2D-torus, 3D-torus and hypercube), we derive tighter bounds and conclude that the diffusion
algorithm is inefficient for lower-dimensional meshes.

* This research was supported in part by the Air Force Office of Scientific Research under grant number F49620-92-J-0126, by NASA under grant number NAG-5-1897, and by the NSF under grant numbers MIP-9106949 and MIP-9205737.


1 Introduction


The load-balancing problem is as follows: Let $G$ be an undirected connected graph of $N$ nodes, and let $M$ tasks be scattered among the nodes. Redistribute the tasks such that each node ends up with either $\lfloor M/N \rfloor$ or $\lceil M/N \rceil$ tasks. The algorithm must run in a distributed fashion, i.e., each node's decisions must be based only on local knowledge. (This formulation is borrowed from [11]¹.)
To motivate the above formulation, let us consider two applications in which the need for load-balancing arises. In these examples, observe that all tasks have roughly the same execution time, and the load-balancing phases attempt to equalize the number of tasks at each node. Also note that the high degree of parallelism makes a centralized load-balancing algorithm unviable.


• Back-tracking: Consider a search space consisting of vectors where each component can assume a finite number of values. Back-tracking is an algorithm to find a vector in the search space that satisfies some feasibility condition. In back-tracking, the solution vector is constructed incrementally, component by component. A task corresponds to a partially constructed vector. Each partial vector (task) resides in the local memory of some processor.

  The back-tracking algorithm alternates between expansion phases and load-balancing phases. In an expansion phase, each processor in parallel takes a partial vector (task), if any, from its local memory. If the partial vector is in fact complete, and also satisfies the feasibility condition, then the back-tracking algorithm returns the completed vector and terminates. If it is clear that the partial vector cannot be completed feasibly, then the partial vector vanishes from the local memory, and the task is said to have terminated unsuccessfully. Otherwise, several new partial vectors appear in the local memory of the same processor, corresponding to each way of extending the original partial vector by one more component, and the task is said to have spawned children.

  In a load-balancing phase, partial vectors (tasks) are redistributed evenly among processors. If the load-balancing phase is omitted, then with each expansion phase the distribution of tasks among processors gets skewed: some processors get swamped with tasks, while others stay idle. This slows down the overall back-tracking algorithm.

  If the load-balancing phase proves to be too expensive, then its cost may be amortized by invoking several expansion phases for every invocation of the load-balancing phase.



• Solving PDEs: Consider a typical iterative algorithm to solve a partial differential equation (PDE). The problem is first discretized by partitioning space into regions, and the function is tentatively assumed to be constant within a region. During each iteration, each region in parallel gets the function values in the neighboring regions, and uses them to update its own function value. If, in the process, a region discovers that its function value differs drastically from its neighbors', then it realizes that the assumption that the function is constant within the region may not be valid, so it splits up into finer regions.

  Here, a task corresponds to a region. Each task resides in the local memory of a processor. During an expansion phase, each processor picks a task, if any, from its local memory and executes it. This may result in additional tasks appearing in the local memory of a processor, reflecting that a region split up. During a load-balancing phase, tasks are redistributed evenly.

Other applications where load-balancing arises are branch-and-bound optimizations, theorem proving, interpretation of PROLOG programs, and ray tracing [8].

¹ Incidentally, the paper describes a surprising "application" of load-balancing. It is well known that permutation routing can be reduced to sorting. The authors show that general (many-to-many) routing can be reduced to sorting plus load-balancing.


    Network       Lower Bound                      Upper Bound
    2D-torus      $\Omega(N \log\sigma)$           $O(N\sigma)$
    3D-torus      $\Omega(N^{2/3} \log\sigma)$     $O(N^{2/3}\sigma)$
    Ring          $\Omega(N^2 \log\sigma)$         $O(N^2\sigma)$
    Hypercube     $\Omega(\log N \log\sigma)$      $O(\sigma \log N)$

Figure 1. Bounds on the running time of the diffusion algorithm for certain common networks. $N$ denotes the number of nodes in the network, and $\sigma$ denotes the standard deviation ("imbalance") of the initial load distribution.

Overview of Results

In this paper, we consider the well-known diffusion algorithm for load balancing. The principle behind the diffusion algorithm is that if a processor has more tasks than a neighbor, then it sends a few tasks to the neighbor. The number of tasks sent is proportional to the differential between the two processors, the proportionality constant being a characteristic of the connecting edge.

This paper presents a rigorous analysis of the performance of the diffusion algorithm on an arbitrary network. We derive both lower and upper bounds on the running time of the algorithm. These bounds are stated in terms of the network's electrical conductance and fluid conductance (defined in Section 3), which are measures of the network's bandwidth.

• If $N$ is the number of nodes in the network, $\Gamma$ is the network's electrical conductance, and $\sigma$ is the standard deviation ("imbalance") of the initial load distribution, then the running time of the diffusion algorithm is

    $\Omega\!\left(\frac{\log\sigma}{\Gamma}\right)$ and $O\!\left(\frac{N\sigma}{\Gamma}\right)$.

• If $\Phi$ is the network's fluid conductance, and $\sigma$ is as above, then the running time of the diffusion algorithm is

    $\Omega\!\left(\frac{\log\sigma}{\Phi}\right)$ and $O\!\left(\frac{\sigma}{\Phi^2}\right)$.

For the special case of an $(n_1 \times n_2 \times \cdots \times n_d)$ mesh with wraparound, we provide the following tighter bound:

    $\Omega\!\left(\frac{d\,\log\sigma}{\sin^2(\pi/\max_{i=1\ldots d} n_i)}\right)$ and $O\!\left(\frac{d\,\sigma}{\sin^2(\pi/\max_{i=1\ldots d} n_i)}\right)$.    (1)

Figure 1 gives a feel for Bound 1 by showing the form it assumes in certain common cases. From the table, it is clear that the diffusion algorithm is inefficient for lower-dimensional meshes. For example, on a ring or 2D-torus, the diffusion algorithm takes at least linear time, indicating that it is no better than a centralized algorithm (in which one processor collects all information and directs the load balancing).



Comparison with Prior Work

Traditional formulations of the load-balancing problem allow the tasks' execution times to differ [3]. These formulations are more general than ours, but with the generality comes intractability: the simplest such formulations turn out to be NP-complete. As a result, most work has consisted of proposing ad hoc heuristics and comparing them by simulations.

Algorithm Diffuse:
(1)  for iteration := 1 to ∞ begin
(2)    All processors i parbegin
(3)      load[i] := number of tasks at i
(4)      Broadcast load[i] to all neighbors
(5)      for each j that is i's neighbor begin
(6)        if load[i] > load[j] then
(7)          Send P_ij (load[i] - load[j]) tasks to j
(8)      end
(9)    parend
(10) end

Figure 2. Algorithm for load-balancing (with divisible tasks)

We argue that the case of fixed-sized tasks is sufficiently important to merit study. First, as illustrated above, there are several highly parallel applications where the assumption of fixed-sized tasks is valid. Second, it has been argued that fixed-sized tasks adequately model variable-sized tasks that can be pre-empted when they exceed a certain time quantum [1]. Finally, the case of fixed-sized tasks is tractable and amenable to rigorous analysis.
Diffusion is only one of several load-balancing algorithms that have been studied in the past [14]. The diffusion algorithm is studied in detail in [2] and [5]. Our presentation of the algorithm in Section 2 closely follows [5]. Our analysis of the algorithm in Section 3 extends the analysis in [5]. For example, [5] provides explicit bounds on the running time of the diffusion algorithm only in the case of a hypercube. In contrast, we provide bounds for an arbitrary network.
Both [2] and [5] make the simplifying assumption that tasks can be divided into arbitrary fractions. In Section 4 we show that this assumption raises thorny problems that cannot be glossed over. Then we suggest how the diffusion algorithm can be modified to handle indivisible tasks.

2 Diffusive Load Balancing (with Divisible Tasks)

In this section, we review the diffusion algorithm for load balancing. For simplicity, we assume that tasks are divisible into arbitrary fractions. (For example, we allow half a task to move across an edge, blithely ignoring that such a thing is meaningless.) Recall that the original aim was to end up with $\lfloor M/N \rfloor$ or $\lceil M/N \rceil$ tasks at each node. Now that we allow fractional tasks, the revised aim is to end up with exactly $M/N$ tasks at each node.
In Section 4, we will reconsider the indivisibility of tasks, and show how to modify the diffusion algorithm accordingly.
The intuition behind the diffusion algorithm is that if a processor has more tasks than a neighbor, then a few tasks diffuse to the neighbor. The number of tasks that diffuse is proportional to the difference in the number of tasks at the two processors. The proportionality constant is a characteristic of the connecting edge, and is called its diffusivity.
Figure 2 shows the diffusion algorithm in detail. Algorithm Diffuse makes use of an $N \times N$ diffusivity matrix, $P$, which satisfies the following conditions:

• $P_{ii} \geq 1/2$. (The number $1/2$ is chosen only for simplicity: any positive constant will do. The import of the condition is that $P_{ii}$ should be bounded away from zero.)

• For $i \neq j$: $P_{ij} > 0$ if $ij$ is an edge in $G$, and $P_{ij} = 0$ if $ij$ is not an edge in $G$.

• $P$ is symmetric: $P_{ij} = P_{ji}$.

• $P$ is stochastic: $\sum_{j=1}^{N} P_{ij} = 1$.
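As a concrete illustration (a minimal Python sketch of our own, not part of the paper; the name build_diffusivity is ours), one standard way to obtain such a matrix is to give every edge the diffusivity $1/(2\Delta)$, where $\Delta$ is the maximum degree of $G$, and to put the leftover mass on the diagonal, so that $P_{ii} = 1 - \deg(i)/(2\Delta) \geq 1/2$:

    import numpy as np

    def build_diffusivity(adj):
        # adj: dict mapping each vertex 0..N-1 to the list of its neighbors.
        # Returns a symmetric, stochastic matrix P with P[i][j] > 0 exactly on
        # the edges of G and with every diagonal entry at least 1/2.
        n = len(adj)
        delta = max(len(nbrs) for nbrs in adj.values())  # maximum degree
        P = np.zeros((n, n))
        for i, nbrs in adj.items():
            for j in nbrs:
                P[i, j] = 1.0 / (2 * delta)              # same weight in both directions
            P[i, i] = 1.0 - len(nbrs) / (2 * delta)      # row sums to 1, and is >= 1/2
        return P

    # Example: a 4-node ring; every edge gets 1/4 and every diagonal entry is 1/2.
    ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
    P = build_diffusivity(ring)
    assert np.allclose(P, P.T) and np.allclose(P.sum(axis=1), 1.0)

For the $2d$-regular mesh with wraparound, this choice gives $P_{ij} = 1/(4d)$ on every edge, which is the matrix used for the mesh bound in Section 3.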
Each processor $i$ has a variable called load[i]. At the beginning of each iteration, load[i] is set to the number of tasks at vertex $i$ (line (3) of Figure 2). Normally load[i] would be an integer variable, but since we are assuming that tasks are divisible, it is a real variable.
Then, each processor sends its load to all its neighbors (line (4)). As a result, each processor knows the loads of all its neighbors.
If a processor's load is heavier than a neighbor's, then the processor sends some of its tasks to the neighbor (lines (6) and (7)). The number of tasks sent is proportional to the difference in load, the constant of proportionality being the appropriate entry of $P$. (Observe that this number may be non-integral.) On the other hand, if a processor's load is lower than a neighbor's, then it does not send any tasks; rather, it receives tasks from the neighbor.
The parend in line (9) tacitly implies a barrier synchronization. Thus, no processor may start the next iteration until all processors have completed the current iteration. [2, page 515] shows that Algorithm Diffuse works just as well without the barrier synchronization. We retain the barrier synchronization to simplify analysis; this yields slightly pessimistic results. In practice, the barrier synchronization is dispensed with.
We do not intend the ∞ in line (1) to mean that the number of iterations is infinite. The analysis in Section 3 will show that, even though the load distribution becomes increasingly balanced with each iteration of Algorithm Diffuse, it may never become exactly balanced. We symbolically denote this gradual convergence by an ∞ in line (1). In practice, the user decides on some tolerable imbalance, and runs enough iterations to reach within that tolerance.
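The following minimal Python sketch (our illustration; the name diffuse_step and the example values are not from the paper) mimics one barrier-synchronized iteration of Algorithm Diffuse with divisible tasks, and checks that the per-edge sends of lines (5)-(8) add up to the matrix product $\ell^{(t+1)} = P\,\ell^{(t)}$ used in the analysis of Section 3:

    import numpy as np

    def diffuse_step(P, load):
        # One iteration of Algorithm Diffuse with divisible tasks.  P is the
        # symmetric, stochastic diffusivity matrix; load[i] is the real-valued
        # number of tasks at processor i at the start of the iteration.
        n = len(load)
        new_load = load.copy()
        for i in range(n):
            for j in range(n):
                if i != j and P[i, j] > 0 and load[i] > load[j]:
                    sent = P[i, j] * (load[i] - load[j])   # line (7)
                    new_load[i] -= sent
                    new_load[j] += sent
        return new_load

    # Example: a 4-node ring with edge diffusivity 1/4 (diagonal entries 1/2).
    P = np.array([[0.50, 0.25, 0.00, 0.25],
                  [0.25, 0.50, 0.25, 0.00],
                  [0.00, 0.25, 0.50, 0.25],
                  [0.25, 0.00, 0.25, 0.50]])
    load = np.array([10.0, 2.0, 4.0, 0.0])
    assert np.allclose(diffuse_step(P, load), P @ load)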


3 Analysis of the Load Balancing Algorithm with Divisible Tasks

Now we analyze the performance of Algorithm Diffuse, still retaining the assumption of divisible tasks. We derive both lower and upper bounds on the running time of the algorithm. These bounds are stated in terms of the network's electrical conductance and fluid conductance, which are measures of the network's bandwidth. For the case of a generalized mesh (with wrap-around), we derive tighter bounds.
To state the main result of this section, let us first introduce some terminology.
For each $t \geq 0$, define the load distribution, $\ell^{(t)}$, as the column vector

    $\ell^{(t)} = (\ell_1^{(t)}, \ell_2^{(t)}, \ldots, \ell_N^{(t)})^T$,

where $\ell_i^{(t)}$ is the number of tasks at vertex $i$ after iteration $t$ of Algorithm Diffuse. (Thus, $\ell^{(0)}$ denotes the initial load distribution.) Let the total load, $M$, be $\sum_{i=1}^{N} \ell_i^{(0)}$, and define the balanced distribution, $b$, as

    $b = (M/N, M/N, \ldots, M/N)^T$.

Next, we define the electrical conductance and fluid conductance of $G$, which are two measures of its bandwidth.

• Imagine $G$ to be an electrical network, with edges representing resistors. Set the resistance of each edge to the reciprocal of the corresponding entry of the diffusivity matrix $P$. Let $u$ and $v$ be vertices of $G$. Define $\mathrm{Res}(u,v)$ as the effective electrical resistance between $u$ and $v$, that is, the voltage of $v$ with respect to $u$ if a unit of current were injected at $v$ and extracted at $u$. The electrical conductance of $G$ is defined as

    $\Gamma = \min_{u,v} \frac{1}{\mathrm{Res}(u,v)}$.

• Consider $G$ to be a gas-distribution network, with edges representing pipes. Set the capacity of each edge to the corresponding entry of the diffusivity matrix $P$. Let $S$ be a subset of the vertices of $G$, and $\bar{S}$ be its complement. Define $\mathrm{Cap}(S,\bar{S})$ as the effective capacity between $S$ and $\bar{S}$, that is, $\sum_{i \in S,\, j \in \bar{S}} P_{ij}$. The fluid conductance² of $G$ is defined as

    $\Phi = \min_{S} \frac{\mathrm{Cap}(S,\bar{S})}{\min(|S|, |\bar{S}|)}$.

² Fluid conductance is not a standard term in the interconnection-network literature, but the idea occurs in several guises, such as the load-factor defined in [10].
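As an illustration of this definition, the following Python sketch (our own; the brute-force enumeration is practical only for small graphs) computes $\Phi$ directly from a diffusivity matrix by minimizing $\mathrm{Cap}(S,\bar{S})/\min(|S|,|\bar{S}|)$ over all non-trivial vertex subsets $S$:

    import itertools
    import numpy as np

    def fluid_conductance(P):
        # Brute-force Phi: minimize Cap(S, S-bar) / min(|S|, |S-bar|) over all
        # non-empty proper subsets S, where Cap sums P[i][j] across the cut.
        n = P.shape[0]
        vertices = range(n)
        best = float("inf")
        for size in range(1, n // 2 + 1):        # by symmetry, |S| <= |S-bar| suffices
            for S in itertools.combinations(vertices, size):
                S_bar = [v for v in vertices if v not in S]
                cap = sum(P[i, j] for i in S for j in S_bar)
                best = min(best, cap / min(len(S), len(S_bar)))
        return best

    # Example: 4-node ring with edge diffusivity 1/4.
    P = np.array([[0.50, 0.25, 0.00, 0.25],
                  [0.25, 0.50, 0.25, 0.00],
                  [0.00, 0.25, 0.50, 0.25],
                  [0.25, 0.00, 0.25, 0.50]])
    print(fluid_conductance(P))   # 0.25: the worst cut splits the ring into two pairs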








We have introduced enough terminology to state the main result of this section:

Theorem 3.1 (Correctness and Complexity of Algorithm Diffuse). The load distribution converges to the balanced distribution, regardless of its initial value:

    $\lim_{t \to \infty} \ell^{(t)} = b$.

Moreover, the time for $\|\ell^{(t)} - b\|$ to fall below a prespecified constant tolerance satisfies the following bounds:

    $\Omega\!\left(\frac{\log\sigma}{\Gamma}\right)$, $O\!\left(\frac{N\sigma}{\Gamma}\right)$;  and  $\Omega\!\left(\frac{\log\sigma}{\Phi}\right)$, $O\!\left(\frac{\sigma}{\Phi^2}\right)$,

where $\sigma = \|\ell^{(0)} - b\|$ represents the imbalance in the initial load distribution.
For the case of an $(n_1 \times n_2 \times \cdots \times n_d)$ mesh with wraparound, the running time satisfies the following tighter bounds:

    $\Omega\!\left(\frac{d\,\log\sigma}{\sin^2(\pi/\max_{i=1\ldots d} n_i)}\right)$, $O\!\left(\frac{d\,\sigma}{\sin^2(\pi/\max_{i=1\ldots d} n_i)}\right)$.

For clarity, we present the proof of Theorem 3.1 in two subsections. The first subsection derives the Error Bound, and the second subsection completes the proof.
3.1 The Error Bound

For each $t \geq 0$, define the error distribution, $e^{(t)}$, as $\ell^{(t)} - b$.
Observe that from one iteration to the next, the load distribution changes according to the equation

    $\ell^{(t+1)} = P\,\ell^{(t)}$,    (2)

since, by inspection of Algorithm Diffuse, $\ell_i^{(t+1)} = \ell_i^{(t)} - \sum_{j=1}^{N} P_{ij}(\ell_i^{(t)} - \ell_j^{(t)}) = \sum_{j=1}^{N} P_{ij}\,\ell_j^{(t)} = (P\,\ell^{(t)})_i$.
Also note that

    $P\,b = b$,    (3)

since $(P\,b)_i = \sum_{j=1}^{N} P_{ij}\,b_j = \sum_{j=1}^{N} P_{ij}\,(M/N) = M/N = b_i$. Equation 3 has the nice significance that if we start with a load distribution of $b$, then after one iteration we end up with $b$ again.
From Equations 2 and 3, it follows that $e^{(t+1)} = \ell^{(t+1)} - b = P(\ell^{(t)} - b) = P\,e^{(t)}$. So the error distribution transforms in the same way as the load distribution.
Since $P$ is a real symmetric matrix, it is diagonalizable, and eigenvectors corresponding to different eigenvalues form an orthogonal basis [13, page 296]. Let $\lambda_1, \lambda_2, \ldots, \lambda_N$ be the eigenvalues of $P$, and without loss of generality, let them be ordered such that $|\lambda_1| \geq |\lambda_2| \geq \cdots \geq |\lambda_N|$. Let $v^{(1)}, v^{(2)}, \ldots, v^{(N)}$ be the corresponding eigenvectors. From the theory of Markov chains, it is known that $\lambda_1 = 1$, that $v^{(1)}$ is some scalar multiple of the all-ones vector $(1, 1, \ldots, 1)^T$, and that $|\lambda_2|$ is strictly less than 1 [12].
Since the components of $e^{(t)}$ sum to zero and the components of the first eigenvector $v^{(1)}$ are all equal, their inner product is zero. So $e^{(t)}$ has no component in the direction of the basis vector $v^{(1)}$, and hence can be expressed as a linear combination of $v^{(2)}, \ldots, v^{(N)}$. Observe that $P$ scales the lengths of $v^{(2)}, \ldots, v^{(N)}$ by $|\lambda_2|, \ldots, |\lambda_N|$ respectively, all of which are $\leq |\lambda_2| < 1$. Therefore $P$ scales the length of $e^{(t)}$ by a factor $\leq |\lambda_2| < 1$.





Therefore $\|e^{(t+1)}\| = \|P\,e^{(t)}\| \leq |\lambda_2|\,\|e^{(t)}\|$, which implies

    $\|e^{(t)}\| \leq |\lambda_2|^{t}\,\|e^{(0)}\|$.    (4)

We call Equation 4 the Error Bound. Informally, the Error Bound says that the length of the error vector shrinks geometrically, where the scale factor is $\leq |\lambda_2|$.
Note that the Error Bound is tight: if we choose $e^{(0)}$ to be $v^{(2)}$, then each application of $P$ scales the length of $e^{(t)}$ by a factor of exactly $|\lambda_2|$. Hence, for this choice of $e^{(0)}$, $\|e^{(t)}\| = |\lambda_2|^{t}\,\|e^{(0)}\|$.
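A quick numerical sanity check of the Error Bound (our own sketch, not part of the proof): simulate $\ell^{(t+1)} = P\,\ell^{(t)}$ on a small ring and verify that $\|e^{(t)}\|$ never exceeds $|\lambda_2|^{t}\,\|e^{(0)}\|$:

    import numpy as np

    n = 8
    # Diffusivity matrix of an n-node ring: 1/4 on each edge, 1/2 on the diagonal.
    P = 0.5 * np.eye(n)
    for i in range(n):
        P[i, (i + 1) % n] = P[i, (i - 1) % n] = 0.25

    lam2 = sorted(abs(np.linalg.eigvalsh(P)))[-2]   # second-largest |eigenvalue|
    load = np.array([40.0, 0, 0, 0, 0, 0, 0, 0])    # all tasks start at one node
    b = np.full(n, load.sum() / n)                  # balanced distribution
    e0 = np.linalg.norm(load - b)

    for t in range(1, 51):
        load = P @ load                                          # Equation 2
        assert np.linalg.norm(load - b) <= lam2**t * e0 + 1e-9   # Error Bound, Equation 4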
3.2 Conclusion of Proof

From the Error Bound, it follows that $\lim_{t\to\infty} e^{(t)} = 0$, which implies that $\lim_{t\to\infty} \ell^{(t)} = b$, proving the "correctness" part of the theorem.
Let us say that we desire a tolerance of $\epsilon$. Then the loop must run for $T$ iterations, where $T$ is such that $\|e^{(T)}\| < \epsilon$. By the Error Bound, it suffices to take $T = \log_{1/|\lambda_2|} \frac{\|e^{(0)}\|}{\epsilon}$.
The time for any processor to execute lines (3) and (4) is at most a constant, say $c$. The time for processor $i$ to execute the loop from line (5) to line (8) is proportional to the number of tasks it has to send, which is at most

    $\sum_{j=1}^{N} P_{ij}\,|\ell_i^{(t-1)} - \ell_j^{(t-1)}| = \sum_{j=1}^{N} P_{ij}\,|e_i^{(t-1)} - e_j^{(t-1)}| \leq \sum_{j=1}^{N} P_{ij}\left(|e_i^{(t-1)}| + |e_j^{(t-1)}|\right) \leq \sum_{j=1}^{N} 2\,P_{ij}\,\|e^{(t-1)}\| = 2\,\|e^{(t-1)}\|$.

So the total time for any processor to execute iteration $t$ is at most $c + 2\,\|e^{(t-1)}\|$.



Therefore the time for all $T$ iterations is $\sum_{t=1}^{T}\left(c + 2\,\|e^{(t-1)}\|\right) = c\,\log_{1/|\lambda_2|}\frac{\|e^{(0)}\|}{\epsilon} + \sum_{t=1}^{T} 2\,\|e^{(t-1)}\|$. Using the Error Bound, this expression can be bounded as

    $\Omega\!\left(\log_{1/|\lambda_2|}\frac{\|e^{(0)}\|}{\epsilon} - 1\right)$ and $O\!\left(\frac{\|e^{(0)}\|}{1 - |\lambda_2|}\right)$.    (5)

It only remains to bound $|\lambda_2|$. Here are several formulas for doing so, drawn from the theory of rapidly mixing Markov chains:

• The second eigenvalue of $P$ is bounded by the electrical conductance of $G$ as follows [4, Theorem 7]:

    $1 - 2\Gamma \leq |\lambda_2| \leq 1 - \frac{\Gamma}{N}$.

• The second eigenvalue of $P$ can be bounded by the fluid conductance of $G$ as follows [7]:

    $1 - 2\Phi(G) \leq |\lambda_2| \leq 1 - \frac{\Phi(G)^2}{2}$.

• Let $G$ be the $n_1 \times n_2 \times \cdots \times n_d$ mesh with wraparound. Define the matrix $P$ as follows: set all diagonal entries to $1/2$; if $ij$ is an edge of $G$, then set $P_{ij} = \frac{1}{4d}$; set all other non-diagonal entries to zero. The second eigenvalue of this matrix $P$ is given by the following equation [4, Theorem 10]:

    $\lambda_2 = 1 - \frac{1}{d}\sin^2\!\left(\frac{\pi}{\max_{i=1\ldots d} n_i}\right)$.

Plugging in these bounds for $|\lambda_2|$ in Equation 5 proves the complexity part of the theorem.
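The closed-form expression for $\lambda_2$ on the wraparound mesh is easy to verify numerically. The sketch below (our own illustration, not part of the proof) builds the prescribed $P$ for a small two-dimensional torus and compares the numerically computed second eigenvalue against $1 - \frac{1}{d}\sin^2(\pi/\max_i n_i)$:

    import itertools
    import numpy as np

    def torus_diffusivity(dims):
        # P for the n1 x ... x nd mesh with wraparound: 1/2 on the diagonal and
        # 1/(4d) on each of the 2d wraparound edges incident to every node.
        d = len(dims)
        nodes = list(itertools.product(*[range(n) for n in dims]))
        index = {v: k for k, v in enumerate(nodes)}
        P = 0.5 * np.eye(len(nodes))
        for v in nodes:
            for axis in range(d):
                for step in (-1, 1):
                    w = list(v)
                    w[axis] = (w[axis] + step) % dims[axis]
                    P[index[v], index[tuple(w)]] += 1.0 / (4 * d)
        return P

    dims = (6, 4)                                   # a small 6 x 4 torus (d = 2)
    P = torus_diffusivity(dims)
    lam2 = sorted(abs(np.linalg.eigvalsh(P)))[-2]
    formula = 1 - np.sin(np.pi / max(dims)) ** 2 / len(dims)
    assert abs(lam2 - formula) < 1e-9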

4 Handling the Indivisibility of Tasks

Algorithm Diffuse assumed that tasks are divisible. In this section we give examples to show that the indivisibility of tasks raises non-trivial problems that cannot be glossed over. Then we show how to modify Algorithm Diffuse to handle indivisible tasks.
Once we recognize that tasks are atomic, Algorithm Diffuse has an obvious problem in line (7):

    Send P_ij (load[i] - load[j]) tasks to j,

because this quantity may not be integral.
Let us try replacing line (7) with

    Send ⌊P_ij (load[i] - load[j])⌋ tasks.

The problem, as Figure 3 shows, is that the load distribution may converge to an unbalanced distribution.
If we try replacing line (7) with

    Send ⌈P_ij (load[i] - load[j])⌉ tasks,

then an immediate problem is that a processor may not have enough tasks for all its neighbors. Even if we are willing to ignore that, Figure 4 shows that the load distribution may keep oscillating between unbalanced distributions.
A more sophisticated approach combines floors and ceilings:
Figure 3. The number of tasks at adjacent vertices differs by one. Since the weight on each edge is 1/3, we would have liked to transfer 1/3 of a task across each edge, but in the "floor" scheme, no tasks move. Thus, the load distribution has converged wrongly.

    Flip a biased coin that lands heads with probability P_ij and tails with probability 1 - P_ij. If heads, then send ⌈P_ij (load[i] - load[j])⌉ tasks. If tails, then send ⌊P_ij (load[i] - load[j])⌋ tasks.

The intuition behind this approach is that sending 2/3 of a task (for example) is the same as sending a whole task with probability 2/3. However, the algorithm turns out to have a curious behaviour: the load distribution balances to an extent, but then fails to balance any further. More precisely, the "entropy" stabilizes at a non-zero value. (Of course, a fortuitous sequence of coin flips may balance the load, but that is very unlikely to happen.)
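For readers who wish to experiment with this behaviour, here is a minimal Python sketch of the coin-flip scheme described above (our own illustration; the cap that prevents a processor from sending more tasks than it currently holds is our own guard, not part of the scheme):

    import numpy as np

    rng = np.random.default_rng(0)

    def randomized_diffuse_step(P, load):
        # Coin-flip rounding: with probability P[i][j] send the ceiling of
        # P[i][j]*(load[i]-load[j]), otherwise the floor.
        n = len(load)
        new_load = load.copy()
        for i in range(n):
            for j in range(n):
                if i != j and P[i, j] > 0 and load[i] > load[j]:
                    amount = P[i, j] * (load[i] - load[j])
                    sent = int(np.ceil(amount)) if rng.random() < P[i, j] else int(np.floor(amount))
                    sent = min(sent, new_load[i])    # guard: never overdraw (our addition)
                    new_load[i] -= sent
                    new_load[j] += sent
        return new_load

    # Watch the residual imbalance on an 8-node ring (edge diffusivity 1/4).
    n = 8
    P = 0.5 * np.eye(n)
    for i in range(n):
        P[i, (i + 1) % n] = P[i, (i - 1) % n] = 0.25
    load = np.zeros(n, dtype=int)
    load[0] = 40
    b = np.full(n, load.sum() / n)
    for t in range(20):
        load = randomized_diffuse_step(P, load)
        print(t, load, round(float(np.linalg.norm(load - b)), 2))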
As the above examples show, the convergence of the algorithm greatly depends on how the fractions are rounded. Below, we state a rounding scheme that guarantees convergence to the balanced distribution. We omit the proof because it is straightforward and uninstructive.

Case 1: G is biconnected. Find an orientation of G (that is, assign a direction to each edge of G) such that

• there are no directed cycles,
• there is a unique maximal and a unique minimal vertex, and
• there is an edge joining the maximal to the minimal vertex.

Such an orientation may be found as follows: Find an open ear decomposition of G (one exists because G is biconnected [6]). Orient the first open ear E_1 arbitrarily. Assume, by induction, that the edges of the open ears E_1, E_2, ..., E_{i-1} have already been oriented, that there are no directed cycles yet, and that the endpoints of the open ear E_1 are the minimum and maximum vertices. Now we wish to orient the open ear E_i. Let the points of attachment of E_i be the vertices u and v. Because the current partial orientation is acyclic, directed paths cannot exist both from u to v and from v to u. Without loss of generality, assume that there is no directed path from u to v. Then orient all edges of the open ear E_i from v to u. Clearly, this creates no directed cycle, and the endpoints of the open ear E_1 remain the minimum and maximum vertices, thus maintaining the inductive assertion.

Modify line (7) of Algorithm Diffuse as follows:

    If the edge ij is directed from i to j, then i sends ⌈P_ij (load[i] - load[j])⌉ tasks to j. If the edge ij is directed from j to i, then i sends ⌊P_ij (load[i] - load[j])⌋ tasks to j.
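A minimal sketch of this modified line (7), assuming an acyclic orientation with the stated properties has already been computed (the representation of the orientation as a set of ordered pairs, and the function name, are our own; the ear-decomposition construction itself is not shown):

    import math

    def tasks_to_send(P, load, i, j, orientation):
        # Modified line (7) for indivisible tasks: round up on edges directed
        # i -> j, round down on edges directed j -> i.  `orientation` contains
        # one ordered pair per edge of G.
        if load[i] <= load[j]:
            return 0                                 # only the heavier endpoint sends
        amount = P[i][j] * (load[i] - load[j])
        if (i, j) in orientation:
            return math.ceil(amount)
        else:                                        # the edge is directed j -> i
            return math.floor(amount)

    # Example: a triangle oriented 0 -> 1, 1 -> 2, 0 -> 2, which is acyclic,
    # has a unique source (0) and a unique sink (2), and an edge joining them.
    orientation = {(0, 1), (1, 2), (0, 2)}
    P = [[0.50, 0.25, 0.25], [0.25, 0.50, 0.25], [0.25, 0.25, 0.50]]
    load = [7, 3, 1]
    print(tasks_to_send(P, load, 0, 1, orientation))  # ceil(0.25 * 4) = 1
    print(tasks_to_send(P, load, 1, 0, orientation))  # 0: node 1 is not heavier than node 0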
Figure 4. The number of tasks at adjacent vertices differs by two. Since the weight on each edge is 1/3, we would have liked to transfer 2/3 of a task across each edge, but in the "ceiling" scheme, a whole task moves. Thus, the load distribution oscillates.

Case 2: G is not biconnected. Find a biconnected supergraph H that can be simulated by G with constant delay. The graph H may be constructed by adding edges to G as follows: for each cut-vertex u of G, chain the neighbors of u into a cycle. It is easy to see that H can be embedded one-to-one in G with dilation 2 and edge-congestion 4. By [9, page 404], G can simulate H with a constant-factor delay.
Thus, even though the network at hand is G, the algorithm can load-balance as though the network were H, incurring only a constant-factor delay.

5 Conclusion

To summarize, we have presented a rigorous analysis of the performance of the diffusion algorithm on arbitrary networks. We derive both lower and upper bounds on the running time of the algorithm. These bounds are stated in terms of the network's bandwidth.
For the case of the generalized mesh with wrap-around, we derive tighter bounds and conclude that the diffusion algorithm is inefficient for lower-dimensional meshes.
As shown in the back-tracking and PDE examples of Section 1, load-balancing usually arises as part of another algorithm. This suggests that a load-balancing algorithm should not be judged in isolation, but by how it improves the overall algorithm. Thus, in lower-dimensional meshes, even though the diffusion algorithm may be inefficient, it may be "good enough" for certain algorithms. It would be interesting to see such an analysis.

References

[1] B. D. Alleyne. Personal communication, Department of Electrical Engineering, Princeton University, 1994.
[2] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice Hall, Englewood Cliffs, NJ, 1989.
[3] T. L. Casavant and J. G. Kuhl. A taxonomy of scheduling in general-purpose distributed computing systems. IEEE Transactions on Software Engineering, 14(2):141-154, 1988.
[4] A. K. Chandra et al. The electrical resistance of a graph captures its commute and cover times. In Symposium on Theory of Computing, pages 574-586, May 1989.
[5] G. Cybenko. Dynamic load balancing for distributed memory multiprocessors. Journal of Parallel and Distributed Computing, 7(2):279-301, 1989.
[6] H. Whitney. Non-separable and planar graphs. Transactions of the American Mathematical Society, 34:339-362, 1932.
[7] M. Jerrum and A. Sinclair. Conductance and the rapid mixing property for Markov chains: the approximation of the permanent resolved. In Symposium on Theory of Computing, pages 235-243, May 1988.
[8] R. M. Karp. Parallel combinatorial computing. In Jill P. Mesirov, editor, Very Large Scale Computation in the 21st Century, pages 221-238. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1991.
[9] T. Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann, San Mateo, CA, 1991.
[10] C. E. Leiserson and B. M. Maggs. Communication-efficient parallel algorithms for distributed random-access machines. Algorithmica, 3:53-77, 1988.
[11] D. Peleg and E. Upfal. The token distribution problem. In Symposium on Foundations of Computer Science, pages 418-427, 1986.
[12] E. Seneta. Non-negative Matrices and Markov Chains. Springer-Verlag, New York, NY, 1981.
[13] G. Strang. Linear Algebra and its Applications. Harcourt Brace Jovanovich, San Diego, CA, 1988.
[14] M. H. Willebeek-LeMair and A. P. Reeves. Strategies for dynamic load balancing on highly parallel computers. IEEE Transactions on Parallel and Distributed Systems, 4(9):979-993, 1993.
