
BioSystems 54 (1999) 15–29

Fitness distributions in evolutionary computation: motivation and examples in the continuous domain

Kumar Chellapilla a, David B. Fogel b,*

a Department of Electrical and Computer Engineering, UCSD, La Jolla, CA 92037, USA
b Natural Selection, Inc., 3333 N. Torrey Pines Ct., Ste. 200, La Jolla, CA 92037, USA

Received 25 January 1999; accepted 11 June 1999

Abstract

Evolutionary algorithms are, fundamentally, stochastic search procedures. Each next population is a probabilistic function of the current population. Various controls are available to adjust the probability mass function that is used to sample the space of candidate solutions at each generation. For example, the step size of a single-parent variation operator can be adjusted, with a corresponding effect on the probability of finding improved solutions and the expected improvement that will be obtained. Examining these statistics as a function of the step size leads to a 'fitness distribution', a function that trades off the expected improvement at each iteration against the probability of that improvement. This paper analyzes the effects of adjusting the step size of Gaussian and Cauchy mutations, as well as a mutation that is a convolution of these two distributions. The results indicate that fitness distributions can be effective in identifying suitable parameter settings for these operators. Some comments on the utility of extending this protocol toward the general diagnosis of evolutionary algorithms are also offered. © 1999 Elsevier Science Ireland Ltd. All rights reserved.

Keywords: Fitness distributions; Evolutionary computation; Continuous domain

www.elsevier.com/locate/biosystems

1. Introduction

When used for function optimization, evolutionary computation relies on a population of contending solutions to a problem at hand, where each individual is subject to random variation (mutation, recombination, etc.) and placed in competition with other extant solutions. Random variation provides a means for discovering novelty, while selection serves to eliminate those trials that do not appear worthwhile in the context of the given criterion. Thus evolutionary algorithms can be seen as performing a search over a state space S of possible solutions. In essence, most evolutionary algorithms can be described by the difference equation

x[t + 1] = s(ν(x[t]))    (1)

where x[t] is the population at time t under representation x, ν is the variation operator, and s is the selection operator. The stochastic elements of this difference equation include ν, often s, and the initialization mechanism for choosing x[0]. The choices made in terms of representation, variation, selection, and initialization shape the probability mass function that describes the likelihood of choosing solutions from S at the next iteration. Alternative choices can lead to dramatically different rates and probabilities of improvement at each iteration. The question of how to design evolutionary search algorithms to improve their optimization performance has been given significant consideration, but few practical answers have been identified. Indeed, many supposed answers have in fact been false leads, dogmatically repeated over many years such that they have almost become 'conventional wisdom'.

* Corresponding author. Tel.: +1-619-455-6449; fax: +1-619-455-1560. E-mail addresses: kchellap@ece.ucsd.edu (K. Chellapilla), dfogel@natural-selection.com (D.B. Fogel). PII: S0303-2647(99)00057-X.
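To make Eq. (1) concrete, the following sketch implements the generational loop directly in that form; the specific selection scheme (truncation over parents plus offspring), population size, test function, and mutation scale are illustrative assumptions, not prescriptions from the text.

```python
import random

def evolve(fitness, init, mutate, mu=20, generations=50):
    """Minimal evolutionary algorithm in the form x[t+1] = s(nu(x[t])):
    mutate every parent (nu), then keep the best mu of parents plus
    offspring (s, truncation selection, minimization). All parameter
    choices here are illustrative."""
    population = [init() for _ in range(mu)]            # x[0]
    for _ in range(generations):
        offspring = [mutate(p) for p in population]     # nu(x[t])
        pool = sorted(population + offspring, key=fitness)
        population = pool[:mu]                          # s(...)
    return population[0]                                # best found

# Usage: minimize the one-dimensional sphere f(x) = x**2 with
# fixed-scale Gaussian mutation.
random.seed(1)
best = evolve(fitness=lambda x: x * x,
              init=lambda: random.uniform(-10.0, 10.0),
              mutate=lambda x: x + random.gauss(0.0, 0.5))
```

Swapping the `mutate` or selection step changes the probability mass function over S at each iteration, which is precisely the design question the paper takes up.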
Only recently have several of these generally accepted central tenets of evolutionary algorithms been questioned and exposed as being misleading or simply incorrect. Four of these erroneous tenets are detailed here.

1.1. Binary representations and maximizing implicit parallelism

Holland (1975, pp. 70–71) speculated that binary representations would provide an advantage to an evolutionary algorithm. The rationale underlying this claim relied on the notion of maximizing implicit parallelism in schemata. To review the concept, a schema is a template built from some alphabet of symbols, A, and a wild card symbol # that matches any symbol in A. For example, given a binary alphabet A, the schema [10##] is a template for the strings [1000], [1001], [1010], and [1011]. Holland (1975, pp. 64–74) offered that any evaluated string encoding a possible solution to a task at hand actually offers partial information about the expected fitness of all possible schemata in which that string resides. That is, if string [0000] is evaluated to have some fitness, then partial information is also received about the worth of sampling from variations in [####], [0###], [#0##], [#00#], [#0#0], and so forth. This characteristic is termed intrinsic parallelism or implicit parallelism, in that through a single sample, information is gained with respect to many schemata.

If information were actually gained by this process of implicit sampling, it would seem reasonable to expect that maximizing the number of schemata that are processed in parallel would be beneficial (Holland, 1975). For any representation that is a bijective mapping from the state space of possible solutions to the encoded individuals, implicit parallelism is maximized for the smallest cardinality of alphabet. That is, given a choice between a binary encoding and an encoding of any other cardinality, binary encodings should be preferred.
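The counting behind implicit parallelism is easy to verify: a string of length L resides in exactly 2^L schemata (each position either kept or replaced by the wild card). A short sketch, using '#' as the wild card as above:

```python
from itertools import product

def schemata_of(string):
    """Enumerate every schema (a template over the string's alphabet
    plus the wild card '#') that the given binary string matches: one
    evaluation yields information about 2**len(string) schemata."""
    schemata = []
    for mask in product([False, True], repeat=len(string)):
        schemata.append(''.join('#' if wild else ch
                                for wild, ch in zip(mask, string)))
    return schemata

s = schemata_of('0000')   # 16 schemata, from '0000' itself up to '####'
```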
The emphasis on binary representations, specifically within genetic algorithms (GAs), was so strong that Antonisse (1989) wrote ''the bit string representation has been raised beyond a common feature to almost a necessary precondition for serious work in the GA''.¹

More careful inspection of this issue, however, indicates immediate problems with the claim that there should be an advantage to binary encodings. Of primary concern is the choice of binary encoding. Radcliffe (1992) noted that there are many binary representations of solutions in S. For example, if S = {1, 2, 3, 4}, then there are 4! different binary representations of these solutions. One would be {1, 2, 3, 4} → {00, 01, 10, 11}, while another would be {1, 2, 3, 4} → {11, 01, 00, 10}. It is easy to show empirically that evolutionary algorithms which employ different binary representations of the same problem do not exhibit similar performance (see De Jong et al., 1995, described below), and for some binary representations, performance is worse than a completely random search. This provides a counterexample to the claim of optimality of binary encodings, and moreover to the 'principle of minimum encoding' proposed in Goldberg (1989). Fortunately, binary encodings are often clumsy, and many researchers abandoned the practice of using binary encodings in the late 1980s and early 1990s, finding more convenient representations for their problems, and also better results (Antonisse, 1989; Koza, 1989; Davis, 1991, p. 63; Michalewicz, 1992; Bäck and Schwefel, 1993; Fogel and Stayton, 1994).

¹ The fundamental impact of binary representations on genetic algorithms can be observed in the subtle 1s and 0s on the covers of Schaffer (1989), Belew and Booker (1991), Forrest (1993) and Eshelman (1995). These were removed only as late as Bäck (1997).
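The count of alternative representations is straightforward to check: each bijection from the solution set onto the available codes is a distinct binary representation. A sketch enumerating all of them for the example S = {1, 2, 3, 4} (the two named mappings below are the ones quoted in the text):

```python
from itertools import permutations

# The four solutions and the four available 2-bit codes.
values = [1, 2, 3, 4]
codes = ['00', '01', '10', '11']

# Each bijection from the solution set onto the codes is a distinct
# binary representation of the same search space: 4! = 24 in total.
representations = [dict(zip(values, perm)) for perm in permutations(codes)]

standard  = {1: '00', 2: '01', 3: '10', 4: '11'}   # one choice
scrambled = {1: '11', 2: '01', 3: '00', 4: '10'}   # another, equally valid
```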
More recently, Fogel and Ghozeil (1997a) proved that it is possible to create completely equivalent evolutionary algorithms on any problem regardless of the cardinality of a bijective representation. Thus there is provably no information gained or lost as a consequence of simply altering the cardinality of representations (cf. Holland, 1975).

1.2. Crossover and building blocks

In addition to advocating the use of binary representations, Holland (1975) also strongly advocated the use of one-point crossover as a mechanism to combine 'building blocks' from different solutions, and simultaneously minimized the importance of random mutation. More recently, Holland (1992) offered that early efforts to simulate evolution ''fared poorly because they… relied on mutation rather than mating''. The supposed importance of crossover and relative unimportance of mutation has been echoed in Goldberg (1989), Davis (1991), Koza (1992), and many others. And yet, several problems with this approach have been discovered.

With regard to processing building blocks (i.e. schemata that are associated with above-average performance), suppose that one particular building block were [1#…#0]; that is, setting the first and last positions in a string to 1 and 0, respectively, defines the building block. When one-point crossover is applied, it will always disrupt this building block. A natural solution to this problem is to view individuals not as strings but as rings, and use two-point crossover to cut and splice segments of solutions. Once two-point crossover is invented, it is an easy step to move to uniform crossover, where each component in an offspring is selected at random from either of two parents. If evolution proceeds best when it combines building blocks, it would be easy to speculate that uniform crossover would not perform as well as one- or two-point crossover, because rather than preserve such blocks of code, it tends to disrupt them.
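The disruption argument can be checked directly. In the sketch below (a minimal illustration; the four-bit strings are arbitrary), one-point crossover between a carrier of [1#…#0] and a mate lacking the block can never produce an offspring with the block, whereas uniform crossover can:

```python
import random
random.seed(0)

def one_point(a, b):
    """One-point crossover: head of a, tail of b (cut strictly inside)."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def uniform(a, b):
    """Uniform crossover: each position drawn from either parent."""
    return ''.join(random.choice(pair) for pair in zip(a, b))

# The building block [1#...#0] fixes only the first and last positions.
# Any interior cut point separates those two positions, so one-point
# crossover with a mate that lacks the block always loses it:
parent = '1110'   # carries the block (leading 1, trailing 0)
mate   = '0011'   # does not
one_pt_kids = [one_point(parent, mate) for _ in range(1000)]
disrupted = all(not (c[0] == '1' and c[-1] == '0') for c in one_pt_kids)

# Uniform crossover, by contrast, can keep both fixed positions:
uni_kids = [uniform(parent, mate) for _ in range(1000)]
preserved = any(c[0] == '1' and c[-1] == '0' for c in uni_kids)
```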
Yet, the empirical evidence offered in Syswerda (1989) showed uniform crossover outperforming both one- and two-point crossover on several problems, including the traveling salesman problem (TSP) and the onemax (counting ones) problem. Fogel and Angeline (1998) observed similar results comparing these operators in solving linear systems of equations. Moreover, there have been many studies generating empirical evidence that evolutionary algorithms which do not rely on crossover can outperform, or perform comparably with, those that do (Reed et al., 1967; Fogel and Atmar, 1990; Bäck and Schwefel, 1993; Angeline, 1997; Chellapilla, 1997, 1998a; Fuchs, 1998; Luke and Spector, 1998). Jones (1995) even demonstrated that crossing extant parents with completely random solutions (dubbed ''headless chicken crossover'') could outperform structured recombination on several problems (see Fogel and Angeline (1998) for further supportive evidence).

De Jong et al. (1995) showed unequivocally that there is an important synergy between operator and representation. They first considered the problem of assigning the four values {1, 2, 3, 4} to the binary strings {00, 01, 10, 11}, respectively, under the fitness function:

f(y) = integer(y) + 1    (2)

For a population of n = 5, De Jong et al. (1995) enumerated a Markov chain that completely describes the probabilistic behavior of an evolutionary algorithm on this problem. Fig. 1 shows the probability of having the evolving population contain the best possible solution as a function of the number of generations, both when using and when not using crossover. The performance of random search is provided for comparison. Here, crossover is seen to be more effective than random mutation alone. As noted above, however, there are 4! different binary representations that could be used in this case. De Jong et al. (1995) found that these fall into three equivalence classes. Figs. 2 and 3 indicate the relative performance on the other two classes.
In these cases, mutation alone outperforms crossover, and in one class, random search can outperform either version of evolutionary algorithm for at least the first ten generations. In the first equivalence class, applying crossover to the second- and third-best solutions can generate the global optimum. In the other two classes, it cannot. Thus the chosen representation cannot be considered in isolation from the search operator.

Fig. 1. The exact probability of the population containing the best solution as a function of the number of generations when using a simple genetic algorithm to optimize f(y) = integer(y) + 1 under the representation {00, 01, 10, 11} mapped to {1, 2, 3, 4} (from De Jong et al., 1995). Here crossover is seen to outperform the absence of crossover regardless of the number of generations. However, this mapping is only one of 4! different possible mappings. Among these possibilities, three equivalence classes emerge. The other two classes of behavior are shown in Figs. 2 and 3.

Fig. 3. The exact probability of the population containing the best solution to the problem shown in Fig. 2 for the third equivalence class of representations (from De Jong et al., 1995). Here, no crossover consistently outperforms the use of crossover, by a wider margin, across all generations. Note too that random search alone outperforms both crossover and the absence of crossover for about the first ten generations. The results indicate the importance of matching the variation operator with the representation.

This has been more broadly proved in the 'no free lunch' theorems of Wolpert and Macready (1997). For algorithms that do not resample points in S, across all possible problems all algorithms perform the same on average.² When an algorithm is tailored to a particular problem, it will of necessity perform worse than random search on some other problem. Thus there cannot be one best variation operator across all problems.
Crossover, in all its forms, can be seen in this context simply as one possible choice that the human operator can make. Its effectiveness is problem and representation dependent.

Fig. 2. The exact probability of the population containing the best solution to the problem shown in Fig. 1 for the second equivalence class of representations (from De Jong et al., 1995). Here, no crossover consistently outperforms the use of crossover by a small margin across all generations.

² Salomon (1996) studied the relevance of resampling in various evolutionary algorithms. In particular, genetic algorithms tend to have a high probability of resampling points because they rely strongly on crossover. Other versions of evolutionary computation that do not emphasize crossover do not have the same affliction.

1.3. The schema theorem: the fundamental theorem of genetic algorithms

Holland (1975) offered a theorem that describes the average propagation of schemata from one generation to the next under the influence of proportional selection and variation operators such as one-point crossover and mutation. Omitting the effects of variation operators, the formula is:

E[P(H, t + 1)] = P(H, t) f(H, t)/f̄    (3)

where H is a particular schema (hyperplane), f(H, t) is the mean fitness of solutions that contain H at time t, f̄ is the mean fitness of all solutions in the population, and P(H, t) is the proportion of solutions that contain H at time t. Thus the expected frequency of H in the next time step is proportional to its current frequency and its relative fitness. Extrapolating, Goldberg (1989, p. 33) concluded that above-average schemata receive exponentially increasing trials in subsequent generations, and offered this result as being of such importance that it is named the 'Fundamental Theorem of Genetic Algorithms'. Radcliffe (1992) noted that the theorem applies to all schemata in a population, even when the schemata defined by a representation may not capture the properties that determine fitness.
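The selection-only, one-generation expectation of the schema theorem can be verified numerically: the closed-form value P(H,t)·f(H,t)/f̄ should match a Monte Carlo estimate under fitness-proportional (roulette-wheel) sampling. The population below is an illustrative assumption; the fitness function is the f(y) = integer(y) + 1 example used above.

```python
import random
random.seed(0)

def expected_next_proportion(population, fitness, in_schema):
    """Selection-only schema theorem check: under proportional selection
    the expected next proportion of schema H is P(H,t) * f(H,t) / f_bar,
    where f_bar is the population's mean fitness."""
    fits = [fitness(x) for x in population]
    f_bar = sum(fits) / len(fits)
    members = [x for x in population if in_schema(x)]
    p_h = len(members) / len(population)
    f_h = sum(fitness(x) for x in members) / len(members)
    return p_h * f_h / f_bar

# Illustrative population; H = [1###] (first bit equal to 1),
# fitness f(y) = integer(y) + 1.
pop = ['1000', '1011', '0111', '0001', '1100', '0010']
f = lambda s: int(s, 2) + 1
pred = expected_next_proportion(pop, f, lambda s: s[0] == '1')

# Monte Carlo estimate of the same expectation using roulette-wheel
# (fitness-proportional) sampling:
draws = random.choices(pop, weights=[f(x) for x in pop], k=200_000)
est = sum(x[0] == '1' for x in draws) / len(draws)
```

Note this checks only a single generation for a fixed population, which is exactly the theorem's scope, per the discussion that follows.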
For example, if the objective is to maximize the integer value of a binary string, then the strings [1000] and [0111] are as close as possible in terms of fitness (8 vs. 7), and yet they share no schemata. Thus the intuition that proportional selection will tend to emphasize those schemata that share important features related to fitness may not hold in practice.

Most work in genetic algorithms no longer uses proportional selection, so the 'fundamental' importance of the schema theorem can immediately be questioned. Moreover, the conclusion that above-average schemata will continue to receive exponentially increasing attention omits the consideration that the theorem only describes the expected behavior in a single generation. There is no reason to believe that the equation can be extrapolated over successive generations without giving explicit consideration to the variance of the process as well as its expectation. But more significantly, Fogel and Ghozeil (1997b) proved that the schema theorem does not apply when the fitness of schemata is described by random variables, as is often the case in real-world applications. The theorem only applies to the specific population in question and the specific fitness values assigned to each individual in that population.

Even more importantly, the theorem cannot address the issue of how new solutions are discovered; it can only indicate the statistical expectation of reproducing already existing solutions in proportion to their relative fitness. It cannot estimate long-term proportions of schemata with reliability, because this depends strongly on the likelihood of new solutions being generated by variation.

1.4. Proportional selection and the k-armed bandit

Holland (1975) made an analogy between the problem of how best to sample from competing schemata within a population and how best to sample from a k-armed bandit (i.e. a slot machine with k arms).
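The [1000] vs. [0111] observation above is easy to confirm mechanically: the two strings differ in every position, so the only template they jointly match is the degenerate all-wildcard schema. A small sketch (reusing the '#' wildcard convention):

```python
from itertools import product

def templates_matching(s):
    """All schemata (templates over {0, 1, '#'}) that binary string s
    matches: wildcard any subset of positions."""
    return {''.join('#' if wild else ch for wild, ch in zip(mask, s))
            for mask in product([False, True], repeat=len(s))}

# Adjacent in fitness (8 vs. 7), yet disjoint in every defined position:
shared = templates_matching('1000') & templates_matching('0111')
# Only the trivial schema '####' survives the intersection.
```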
The payoff from each arm of the bandit has a mean and variance, and the analysis centered on how best to sample the arms so as to minimize expected losses over those samples. The conclusion was essentially to sample in proportion to the observed payoff from each arm, which led to the use of proportional selection in genetic algorithms and the resulting focus on the schema theorem.

Unfortunately, insufficient attention was given to this analysis in two regards. The first is the choice of criterion: minimizing expected losses does not correspond with the typical problem of function optimization, which demands discovering the single best solution. In order to minimize expected losses between two choices, the proper sampling is to devote all trials to the choice with the greater average payoff. But this choice may prohibit discovering the best possible solution. Consider the case where there are four possible solutions to a problem, with corresponding fitness values as shown:

[00] = 19, [01] = 0, [10] = 11, [11] = 9

The mean worth of sampling uniformly at random in the schema [0#] is 9.5, whereas the mean worth of sampling in [1#] is 10. To minimize expected losses, trials should be allocated to [1#], but this would then preclude discovering the best solution, [00].

The second respect is more fundamental: the claim that the analysis in Holland (1975) leads to an optimal sampling plan has been shown to be mathematically flawed, both by counterexample (Rudolph, 1997) and by direct analysis (Macready and Wolpert, 1998). Proportional selection does not minimize expected losses, so even if this criterion is given preference, the development in Holland (1975) does not support the use of proportional selection in evolutionary algorithms. This form of selection is just one among many options, and the choice should be based on the dependencies posed by the particular problem.

1.5. A new direction

Certainly, the above list could be extended; e.g.
inversion was offered to reorder schemata for effective processing as building blocks by one-point crossover (Holland, 1975, pp. 106–109), but this has had no general empirical support (Davis, 1991; Mitchell, 1996; Lobo et al., 1998). In light of these missteps, it would appear appropriate to investigate new methods for assessing the fundamental nature of evolutionary search and optimization.

The formulation in Eq. (1) leads directly to a Markov chain view of evolutionary algorithms, in which a time-invariant, memoryless probability transition matrix describes the likelihood of transitioning to each possible population configuration given each possible current configuration (Fogel, 1994; Rudolph, 1994; and others). Such a description immediately leads to answers to questions about the asymptotic behavior of various algorithms (e.g. typical instances of evolution strategies and evolutionary programming exhibit asymptotic global convergence (Fogel, 1995a), whereas the canonical genetic algorithm (Holland, 1975) is not convergent due to its reliance on proportional selection (Rudolph, 1994)). Further, as shown above, De Jong et al. (1995) used Markov chains and brute-force computation to analyze the exact transient behavior of genetic algorithms with small populations (e.g. size five) and small chromosomes (e.g. two or three bits), concentrating on the expected waiting time until the global optimum is found for the first time. But this procedure appears at present to be too computationally intensive to be useful in designing more effective (in terms of quality of evolved solution) and efficient (in terms of rate of convergence) evolutionary algorithms for real problems.

The description offered by Eq. (1), however, suggests that some level of understanding of the behavior of an evolutionary algorithm can be garnered by examining the stochastic effects of the operators s and ν on a population x at time t.
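For very small state spaces the Markov-chain description can be built exactly, which is what makes the brute-force analyses above possible. The sketch below assumes an elitist (1+1) EA with independent bit-flip mutation on a two-bit maximize-the-integer problem; the algorithm, problem, and mutation rate are illustrative choices, not the setup of De Jong et al. (1995).

```python
import itertools

def transition_matrix(fitness, L=2, pm=0.25):
    """Transition matrix of an elitist (1+1) EA over all 2**L bit strings:
    each bit of the parent flips independently with probability pm, and
    the offspring replaces the parent only if strictly fitter. The
    problem and pm are illustrative."""
    states = [''.join(bits) for bits in itertools.product('01', repeat=L)]
    def p_mut(a, b):
        d = sum(x != y for x, y in zip(a, b))          # Hamming distance
        return (pm ** d) * ((1 - pm) ** (L - d))
    P = []
    for s in states:
        row = dict.fromkeys(states, 0.0)
        for t in states:                                # possible offspring
            dest = t if fitness(t) > fitness(s) else s  # elitist selection
            row[dest] += p_mut(s, t)
        P.append([row[t] for t in states])
    return states, P

states, P = transition_matrix(lambda s: int(s, 2))

# Distribution over states after 10 generations, starting from '00';
# the chain is absorbing at the optimum '11'.
dist = [1.0, 0.0, 0.0, 0.0]
for _ in range(10):
    dist = [sum(dist[i] * P[i][j] for i in range(len(states)))
            for j in range(len(states))]
p_opt = dist[states.index('11')]
```

Even this four-state chain hints at the scaling problem: the matrix grows with the number of possible populations, which is why the approach becomes intractable for realistic problem sizes.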
Of interest is the probabilistic description of the fitness of the solutions contained in x[t + 1]. Recent efforts (Altenberg, 1995; Fogel, 1995a; Grefenstette, 1995; Fogel and Ghozeil, 1996) have been directed at generalized expressions describing the relationship between offspring and parent fitness under particular variation operators, or at the empirical determination of the fitness of offspring for a given random variation technique. This paper offers evidence that this approach to describing the behavior of an evolutionary algorithm can be used to design more efficient and effective optimization techniques.
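The kind of empirical determination mentioned above can be sketched for a single parent: estimate, as a function of the mutation scale, the probability that an offspring improves on the parent and the expected size of that improvement. The test function, parent point, and trial count below are illustrative assumptions, not the paper's experimental setup.

```python
import math
import random
random.seed(0)

def fitness_distribution(x, f, scale, trials=20_000, cauchy=False):
    """For a single parent x (minimization), empirically estimate the
    probability of improvement and the expected improvement under a
    zero-centered mutation of the given scale: Gaussian by default, or
    Cauchy (sampled via the inverse CDF tan(pi*(u - 1/2))) when
    cauchy=True."""
    fx = f(x)
    gains = []
    for _ in range(trials):
        if cauchy:
            step = scale * math.tan(math.pi * (random.random() - 0.5))
        else:
            step = random.gauss(0.0, scale)
        gains.append(max(0.0, fx - f(x + step)))    # improvement, if any
    p_improve = sum(g > 0 for g in gains) / trials
    e_improve = sum(gains) / trials
    return p_improve, e_improve

# Sphere function, parent at x = 1: a small Gaussian step improves
# often but by little; a large step improves more rarely but, when it
# does, by more -- the trade-off a fitness distribution captures.
f = lambda x: x * x
small = fitness_distribution(1.0, f, scale=0.1)
large = fitness_distribution(1.0, f, scale=2.0)
heavy = fitness_distribution(1.0, f, scale=0.1, cauchy=True)
```

Plotting these two statistics against the mutation scale yields the fitness distribution described in the abstract, for Gaussian, Cauchy, or convolved mutations alike.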

2. Background on methods to relate parent and offspring fitness