BioSystems 54 (1999) 15–29
Fitness distributions in evolutionary computation: motivation and examples in the continuous domain
Kumar Chellapilla a, David B. Fogel b
a Department of Elect. Comp. Engg., UCSD, La Jolla, CA 92037, USA
b Natural Selection, Inc., 3333 N. Torrey Pines Ct., Ste. 200, La Jolla, CA 92037, USA
Received 25 January 1999; accepted 11 June 1999
Abstract
Evolutionary algorithms are, fundamentally, stochastic search procedures. Each next population is a probabilistic function of the current population. Various controls are available to adjust the probability mass function that is used
to sample the space of candidate solutions at each generation. For example, the step size of a single-parent variation operator can be adjusted with a corresponding effect on the probability of finding improved solutions and the
expected improvement that will be obtained. Examining these statistics as a function of the step size leads to a ‘fitness distribution’, a function that trades off the expected improvement at each iteration for the probability of that
improvement. This paper analyzes the effects of adjusting the step size of Gaussian and Cauchy mutations, as well as a mutation that is a convolution of these two distributions. The results indicate that fitness distributions can be
effective in identifying suitable parameter settings for these operators. Some comments on the utility of extending this protocol toward the general diagnosis of evolutionary algorithms are also offered. © 1999 Elsevier Science Ireland Ltd.
All rights reserved.
Keywords: Fitness distributions; Evolutionary computation; Continuous domain
1. Introduction
When used for function optimization, evolutionary computation relies on a population of
contending solutions to a problem at hand, where each individual is subject to random variation
mutation, recombination, etc. and placed in competition with other extant solutions. Random
variation provides a means for discovering novelty while selection serves to eliminate those trials
that do not appear worthwhile in the context of the given criterion. Thus evolutionary algorithms
can be seen as performing a search over a state space S of possible solutions.
In essence, most evolutionary algorithms can be described by the difference equation
x[t + 1] = s(v(x[t]))   (1)
Corresponding author. Tel.: +1-619-455-6449; fax: +1-619-455-1560.
E-mail addresses: kchellap@ece.ucsd.edu (K. Chellapilla), dfogel@natural-selection.com (D.B. Fogel)
where x[t] is the population at time t under representation x, v is the variation operator, and s is the selection operator. The stochastic elements of this difference equation include v, and
often s, and the initialization mechanism for choosing x[0]. The choices made in terms of representation, variation, selection, and initialization shape the probability mass function that describes
the likelihood of choosing solutions from S at the next iteration. Alternative choices can lead to
dramatically different rates and probabilities of improvement at each iteration.
The question of how to design evolutionary search algorithms to improve their optimization
performance has been given significant consideration, but few practical answers have been identified. Indeed, many supposed answers have in fact been false leads, dogmatically repeated over
many years such that they have almost become ‘conventional wisdom’. Only recently have several
of these generally accepted central tenets of evolutionary algorithms been questioned and exposed
as being misleading or simply incorrect. Four of these erroneous tenets are detailed here.
1.1. Binary representations and maximizing implicit parallelism
Holland (1975, pp. 70–71) speculated that binary representations would provide an advantage to an evolutionary algorithm. The rationale underlying this claim relied on the notion of maximizing implicit parallelism in schemata. To review the concept, a schema is a template from some
alphabet of symbols, A, and a wild card symbol, *, that matches any symbol in A. For example, given a binary alphabet A, the schema [10**] is a template for the strings [1000], [1001], [1010], and [1011]. Holland (1975, pp. 64–74) offered
that any evaluated string encoding a possible solution to a task at hand actually offers partial information about the expected fitness of all possible schemata in which that string resides. That
is, if string [0000] is evaluated to have some fitness, then partial information is also received
about the worth of sampling from variations in [****], [0***], [*0**], [*00*], [*0*0], and so forth. This characteristic is termed intrinsic parallelism or implicit parallelism, in that through a single sample, information is gained with respect to many schemata.
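Using '*' for the wild card, the full set of schemata in which a single string resides can be enumerated directly; a short sketch (our illustration):

```python
from itertools import product

def schemata_of(string):
    """All schemata over {0, 1, *} that the given bit string matches:
    at each position the schema shows either the string's bit or '*'."""
    choices = [(bit, '*') for bit in string]
    return {''.join(schema) for schema in product(*choices)}

s = schemata_of('0000')   # a length-L string resides in 2**L schemata
```

For a string of length 4 this yields 2^4 = 16 schemata, which is the sense in which one evaluation "samples" many schemata at once.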
If information were actually gained by this process of implicit sampling, it would seem reasonable to expect that maximizing the number of schemata that are processed in parallel would be
beneficial (Holland, 1975). For any representation that is a bijective mapping from the state space of possible solutions to the encoded individuals, implicit parallelism is maximized for the smallest cardinality of alphabet. That is, given a choice between using a binary encoding or any other cardinality, binary encodings should be preferred. The emphasis on binary representations, specifically within genetic algorithms (GAs), was so strong that Antonisse (1989) wrote ‘‘the bit string representation has been raised beyond a common feature to almost a necessary precondition for serious work in the GA’’.1
More careful inspection of this issue, however, indicates immediate problems with the claim that there should be an advantage to binary encodings. Of primary concern is the choice of binary encoding. Radcliffe (1992) noted that there are many binary representations of solutions in S. For example, if S = {1, 2, 3, 4}, then there are 4! different binary representations of these solutions. One would be {1, 2, 3, 4} → {00, 01, 10, 11}, while another would be {1, 2, 3, 4} → {11, 01, 00, 10}. It
is easy to show empirically that evolutionary algorithms which employ different binary representations of the same problem do not exhibit similar performance (see De Jong et al., 1995, described below), and for some binary representations, performance is worse than that of a completely random search. This provides a counterexample to the claim of optimality of binary encodings and, moreover, to the ‘principle of minimum encoding’ proposed in Goldberg (1989). Fortunately, binary encodings are often clumsy, and many researchers abandoned the practice of
1 The fundamental impact of binary representations on genetic algorithms can be observed in the subtle 1s and 0s on the covers of Schaffer (1989), Belew and Booker (1991), Forrest (1993) and Eshelman (1995). These were removed only as late as Bäck (1997).
using binary encodings in the late 1980s and early 1990s, finding more convenient representations
for their problems, and also better results (Antonisse, 1989; Koza, 1989; Davis, 1991, p. 63; Michalewicz, 1992; Bäck and Schwefel, 1993; Fogel and Stayton, 1994). More recently, Fogel and Ghozeil (1997a) proved that it is possible to create completely equivalent evolutionary algorithms on any problem regardless of the cardinality of a bijective representation. Thus there is provably no information gained or lost as a consequence of simply altering the cardinality of representations (cf. Holland, 1975).
1.2. Crossover and building blocks

In addition to advocating the use of binary representations, Holland (1975) also strongly advocated the use of one-point crossover as a mechanism to combine ‘building blocks’ from different solutions, and simultaneously minimized the importance of random mutation. More recently, Holland (1992) offered that early efforts to simulate evolution ‘‘fared poorly because they… relied on mutation rather than mating’’. The supposed
importance of crossover and relative unimportance of mutation have been echoed in Goldberg (1989), Davis (1991), Koza (1992), and many others. And yet, several problems with this approach have been discovered.
With regard to processing building blocks (i.e. schemata that are associated with above-average performance), suppose that one particular building block was [1*…*0]; that is, determining the first and last positions in a string to be 1 and 0, respectively, defines the building block. When one-point crossover is applied, it will always disrupt this building block. A natural solution to this
problem is to view individuals not as strings, but as rings, and use two-point crossover to cut and
splice segments of solutions. Once two-point crossover is invented, it is an easy step to move to
uniform crossover, where each component in an offspring is selected at random from either of two
parents.
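The disruption argument can be checked directly. The sketch below is our illustration; the parent strings are chosen so that the second parent carries neither defining bit of the block, in which case one-point crossover never reproduces [1*…*0] at any cut, while uniform crossover can reassemble it:

```python
import random

def one_point_all(a, b):
    """All offspring pairs from one-point crossover, over every cut position."""
    return [(a[:c] + b[c:], b[:c] + a[c:]) for c in range(1, len(a))]

def uniform(a, b, rng):
    """Uniform crossover: each position is swapped independently with prob. 0.5."""
    pairs = [(x, y) if rng.random() < 0.5 else (y, x) for x, y in zip(a, b)]
    return ''.join(p[0] for p in pairs), ''.join(p[1] for p in pairs)

def in_block(s):
    """Membership in the building block [1*...*0]."""
    return s[0] == '1' and s[-1] == '0'

p1, p2 = '1000', '0011'   # p1 carries the block; p2 has neither defining bit
survives_one_point = any(in_block(c)
                         for pair in one_point_all(p1, p2) for c in pair)

rng = random.Random(0)
kids = [c for _ in range(100) for c in uniform(p1, p2, rng)]
survives_uniform = any(in_block(c) for c in kids)
```

The cut of one-point crossover always falls between the two defining positions, so the block can survive only if the other parent supplies a matching bit; uniform crossover, by contrast, can draw the first bit from one parent and the last from the other.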
If evolution proceeds best when it combines building blocks, it would be easy to speculate that
uniform crossover would not perform as well as one- or two-point crossover because rather than
preserve such blocks of code, it tends to disrupt them. Yet, the empirical evidence offered in
Syswerda (1989) showed uniform crossover outperforming both one- and two-point crossover on several problems, including the traveling salesman (TSP) and the ‘onemax’ (counting ones) problems. Fogel and Angeline (1998) observed similar results comparing these operators in solving linear systems of equations. Moreover, there have been many studies generating empirical evidence that
evolutionary algorithms which do not rely on crossover can outperform or perform comparably
with those that do (Reed et al., 1967; Fogel and Atmar, 1990; Bäck and Schwefel, 1993; Angeline, 1997; Chellapilla, 1997, 1998a; Fuchs, 1998; Luke and Spector, 1998). Jones (1995) even demonstrated that crossing extant parents with completely random solutions (dubbed ‘‘headless chicken crossover’’) could outperform structured recombination on several problems (see Fogel and Angeline, 1998, for further supportive evidence). De Jong et al. (1995) showed unequivocally
that there is an important synergy between operator and representation. They first considered the
problem of assigning the four values {1, 2, 3, 4} to binary strings {00, 01, 10, 11}, respectively,
under the fitness function:
f(y) = integer(y) + 1   (2)
For a population of n = 5, De Jong et al. (1995) enumerated a Markov chain that completely describes the probabilistic behavior of an evolutionary algorithm on this problem. Fig. 1 shows the probability of having the evolving population contain the best possible solution as a function of the number of generations, both with and without crossover. The performance of random search is provided for comparison. Here, crossover is seen to be more effective than random mutation alone. As noted above, however, there are 4! different binary representations that could be used in this case. De Jong et al. (1995) found that these collapse into three equivalence classes. Figs. 2 and 3 indicate the relative performance on the other two classes. In these cases, mutation alone outperforms crossover, and in one class, random search can outperform either version of evolutionary algorithm for at least the first ten generations. In the first equivalence class, applying crossover to the second- and third-best solutions can generate the global optimum. In the other two classes, it cannot. Thus the chosen
Fig. 3. The exact probability of the population containing the best solution to the problem shown in Fig. 2 for the third equivalence class of representations (from De Jong et al., 1995). Here, the absence of crossover consistently outperforms the use of crossover by a wider margin across all generations. Note too that random search alone outperforms both the use and the absence of crossover for about the first ten generations. The results indicate the importance of matching the variation operator with the representation.
Fig. 1. The exact probability of containing the best solution as a function of the number of generations when using a simple genetic algorithm to optimize f(y) = integer(y) + 1 under the representation {00, 01, 10, 11} mapped to {1, 2, 3, 4} (from De Jong et al., 1995). Here crossover is seen to outperform the absence of crossover regardless of the number of generations. However, this mapping is only one of 4! different possible mappings. Of these possibilities, three equivalence classes emerge. The other two classes of behavior are shown in Figs. 2 and 3.
representation cannot be considered in isolation from the search operator.
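The criterion just described, whether one-point crossover of the second- and third-best solutions can generate the optimum, can be checked by brute force over all 4! encodings. The count below is our own computation, offered only to illustrate the enumeration; it is not a figure reported by De Jong et al. (1995) and does not give their equivalence-class sizes:

```python
from itertools import permutations

strings = ['00', '01', '10', '11']
values = [1, 2, 3, 4]                  # solution 4 is the global optimum

def one_point_kids(a, b):
    """On 2-bit strings, one-point crossover has a single interior cut."""
    return {a[0] + b[1], b[0] + a[1]}

reps = list(permutations(strings))     # all 4! = 24 bijective encodings
count = 0
for perm in reps:
    code = dict(zip(values, perm))     # solution value -> bit string
    if code[4] in one_point_kids(code[3], code[2]):
        count += 1
# In 8 of the 24 encodings, crossover of the second- and third-best
# solutions can create the optimum directly; in the other 16 it cannot.
```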
This has been more broadly proved in the ‘no free lunch’ theorems of Wolpert and Macready (1997). For algorithms that do not resample points in S, across all possible problems, all algorithms perform the same on average.2
When an algorithm is tailored to a particular problem, it
will of necessity perform worse than random search on some other problem. Thus there cannot be one best variation operator across all problems. Crossover, in all its forms, can be seen in this context simply as one possible choice that the human operator can make. Its effectiveness is problem and representation dependent.
1.3. The schema theorem: the fundamental theorem of genetic algorithms

Holland (1975) offered a theorem that describes the average propagation of schemata from one
generation to the next under the influence of
Fig. 2. The exact probability of the population containing the best solution to the problem shown in Fig. 1 for the second equivalence class of representations (from De Jong et al., 1995). Here, the absence of crossover consistently outperforms the use of crossover by a small margin across all generations.
2 Salomon (1996) studied the relevance of resampling in various evolutionary algorithms. In particular, genetic algorithms tend to have a high probability of resampling points because they rely strongly on crossover. Other versions of evolutionary computation that do not emphasize crossover do not have the same affliction.
proportional selection and variation operators such as one-point crossover and mutation. Omitting the effects of variation operators, the formula is:

E[P(H, t + 1)] = P(H, t) f(H, t)/f̄   (3)

where H is a particular schema (hyperplane), f(H, t) is the mean fitness of solutions that contain H at time t, f̄ is the mean fitness of all solutions in the population, and P(H, t) is the proportion of solutions that contain H at time t. Thus the
expected frequency of H in the next time step is proportional to its current frequency and its relative fitness. Extrapolating, Goldberg (1989, p. 33) concluded that above-average schemata receive exponentially increasing trials in subsequent generations and offered this result as being of such importance that it is named the ‘Fundamental Theorem of Genetic Algorithms’.
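Eq. (3) can be evaluated for a small example; in the sketch below the population, fitness values, and schema are assumed purely for illustration:

```python
# A hypothetical population of four strings with assumed fitness values.
pop = ['10', '11', '00', '01']
fit = {'00': 1.0, '01': 1.0, '10': 3.0, '11': 5.0}

def matches(H, s):
    """True if string s is an instance of schema H ('*' is the wild card)."""
    return all(h in ('*', c) for h, c in zip(H, s))

def expected_proportion(H, pop, fit):
    """Eq. (3): E[P(H, t+1)] = P(H, t) * f(H, t) / f_bar,
    i.e. proportional selection only, with variation omitted."""
    members = [s for s in pop if matches(H, s)]
    P = len(members) / len(pop)                        # P(H, t)
    fH = sum(fit[s] for s in members) / len(members)   # f(H, t)
    fbar = sum(fit[s] for s in pop) / len(pop)         # mean population fitness
    return P * fH / fbar

e = expected_proportion('1*', pop, fit)
# P(H, t) = 0.5, f(H, t) = 4.0, f_bar = 2.5, so E[P(H, t+1)] = 0.8
```

Note that the computation uses only the fitness values actually present in this population, which is precisely the limitation discussed below: the theorem says nothing beyond the specific population and fitness assignments in hand.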
Radcliffe (1992) noted that the theorem applies to all schemata in a population, even when the
schemata defined by a representation may not capture the properties that determine fitness. For
example, if the objective is to maximize the integer value of a binary string, then the strings [1000] and [0111] are as close as possible in terms of fitness (8 vs. 7), and yet they share no schemata. Thus the intuition that proportional selection will tend to emphasize those schemata that share important features related to fitness may not hold in practice.
Most work in genetic algorithms no longer uses proportional selection, so the ‘fundamental’ importance of the schema theorem can immediately be questioned. Moreover, the conclusion that above-average schemata will continue to receive exponentially increasing attention omits the consideration that the theorem only describes the expected behavior in a single generation. There is no reason to believe that the equation can be extrapolated over successive generations without giving explicit consideration to the variance of the process as well as its expectation. But more significantly, Fogel and Ghozeil (1997b) proved that the schema theorem does not apply when the fitness values of schemata are described by random variables, as is often the case in real-world applications. The theorem only applies to the specific population in question and the specific fitness values assigned to each individual in that population.
Even more importantly, the theorem cannot address the issue of how new solutions are discovered; it can only indicate the statistical expectation of reproducing already existing solutions in proportion to their relative fitness. It cannot estimate long-term proportions of schemata with reliability because this depends strongly on the likelihood of new solutions being generated by variation.
1.4. Proportional selection and the k-armed bandit

Holland (1975) made an analogy between the problem of how best to sample from competing schemata within a population and how best to sample from a k-armed bandit (i.e. a slot machine with k arms). The payoff from each arm of the bandit has a mean and variance, and the analysis centered on how best to sample the arms so as to minimize expected losses over those samples. The conclusion was essentially to sample in proportion to the observed payoff from each arm, which led to the use of proportional selection in genetic algorithms, and the resulting focus on the schema theorem.
Unfortunately, insufficient attention was given to this analysis in two regards. The first is the
choice of criterion: Minimizing expected losses does not correspond with the typical problem of
function optimization that demands discovering the single best solution. In order to minimize
expected losses between two choices, the proper sampling is to devote all trials to the choice with
the greater average payoff. But this choice may prohibit discovering the best possible solution.
Consider the case where there are four possible solutions to a problem with corresponding fitness
values as shown:
[00] = 19, [01] = 0, [10] = 11, [11] = 9

The mean worth of sampling uniformly at random in the schema [0*] is 9.5, whereas the mean worth of sampling in [1*] is 10. To minimize expected losses, trials should be allocated to [1*], but this would then preclude discovering the best solution, [00]. The second respect is more fundamental: The
claim that the analysis in Holland (1975) leads to an optimal sampling plan has been shown to be mathematically flawed both by counterexample (Rudolph, 1997) and by direct analysis (Macready and Wolpert, 1998). Proportional selection does not minimize expected losses, so even if this criterion is given preference, the development in Holland (1975) does not support the use of proportional selection in evolutionary algorithms. This form of selection is just one among many options, and the choice should be based on the dependencies posed by the particular problem.
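The four-solution example above can be verified numerically; the schema means follow the text, while the code itself is our minimal sketch:

```python
# Fitness values from the four-solution example above.
fit = {'00': 19, '01': 0, '10': 11, '11': 9}

def schema_mean(H):
    """Mean worth of sampling uniformly at random within schema H."""
    members = [s for s in fit if all(h in ('*', c) for h, c in zip(H, s))]
    return sum(fit[s] for s in members) / len(members)

m0 = schema_mean('0*')          # (19 + 0) / 2 = 9.5
m1 = schema_mean('1*')          # (11 + 9) / 2 = 10.0
best = max(fit, key=fit.get)    # '00': the optimum sits in the lower-mean schema
```

Allocating trials by schema mean would commit the search to [1*] and never sample [00], which is exactly the mismatch between minimizing expected losses and discovering the single best solution.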
1.5. A new direction

Certainly, the above list could be extended; e.g. inversion was offered to reorder schemata for effective processing as building blocks by one-point crossover (Holland, 1975, pp. 106–109), but this has had no general empirical support (Davis, 1991; Mitchell, 1996; Lobo et al., 1998). In light of these missteps, it would appear appropriate to
investigate new methods for assessing the fundamental nature of evolutionary search and optimization. The formulation in Eq. (1) leads directly to a Markov chain view of evolutionary algorithms in which a time-invariant, memoryless probability transition matrix describes the likelihood of transitioning to each possible population configuration given each possible configuration (Fogel, 1994; Rudolph, 1994; and others). Such a description immediately leads to answers regarding questions about the asymptotic behavior of various algorithms (e.g. typical instances of evolution strategies and evolutionary programming exhibit asymptotic global convergence (Fogel, 1995a), whereas the canonical genetic algorithm (Holland, 1975) is not convergent due to its reliance on proportional selection (Rudolph, 1994)). Further, as shown above, De Jong et al. (1995) used Markov chains and brute force computation to analyze the exact transient behavior of genetic algorithms under small populations (e.g. size five) and small chromosomes (e.g. two or three bits), concentrating on the expected waiting time until the global optimum is found for the first time. But this procedure appears at present to be too computationally intensive to be useful in designing more effective (in terms of quality of evolved solution) and efficient (in terms of rate of convergence) evolutionary algorithms for real problems.
The description offered by Eq. (1), however, suggests that some level of understanding of the behavior of an evolutionary algorithm can be garnered by examining the stochastic effects of the operators s and v on a population x at time t. Of interest is the probabilistic description of the fitness of the solutions contained in x[t + 1]. Recent efforts (Altenberg, 1995; Fogel, 1995a; Grefenstette, 1995; Fogel and Ghozeil, 1996) have been directed at generalized expressions describing the relationship between offspring and parent fitness under particular variation operators, or at the empirical determination of the fitness of offspring for a given random variation technique. This paper offers evidence that this approach to describing the behavior of an evolutionary algorithm can be used to design more efficient and effective optimization techniques.
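As a sketch of the kind of examination this suggests, the following estimates a fitness distribution empirically: the probability of improvement, and the mean improvement given that improvement occurs, as a function of the step size of a zero-mean Gaussian mutation. The sphere objective, parent, step sizes, and sample counts are assumed for illustration only and are not the experimental settings used later in this paper:

```python
import random

def sphere(x):
    """Objective to minimize: f(x) = sum of squares."""
    return sum(xi * xi for xi in x)

def fitness_distribution(parent, sigma, trials, rng):
    """Estimate P(improvement) and E[improvement | improvement] for a
    zero-mean Gaussian mutation of fixed step size sigma."""
    f0 = sphere(parent)
    gains = []
    for _ in range(trials):
        child = [xi + rng.gauss(0.0, sigma) for xi in parent]
        d = f0 - sphere(child)        # positive d is an improvement
        if d > 0.0:
            gains.append(d)
    p = len(gains) / trials
    e = sum(gains) / len(gains) if gains else 0.0
    return p, e

rng = random.Random(2)
parent = [1.0] * 10
stats = {s: fitness_distribution(parent, s, 2000, rng)
         for s in (0.01, 0.1, 1.0, 10.0)}
# Small steps improve often but by little; very large steps almost never improve.
```

The resulting curve of expected improvement against probability of improvement, swept over sigma, is the trade-off that the fitness distributions analyzed in the remainder of the paper make explicit.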
2. Background on methods to relate parent and offspring fitness