distinguish them. However, at the following stage, the threshold energy for minimal differentiation
decreases because of better enzymes. Therefore, the previous change is no longer the smallest.
The H-bonding of the second base, $H_2$, becomes the new differentiation threshold, and so on. The
key idea is that the inequalities (Eq. 1) determine a definite order in which new features are used to
form new codon classes, which code for the recently introduced amino acids. The new classes
are formed by splitting existing classes, with the split based on the order given by the basic
assumption. In this way, a huge number of developmental pathways are eliminated because of the
order in which a small number of parameters are set.
A split must be triggered by a detectable physico-chemical difference between at least one
of the members of an existing codon class and the rest of the members. This event could be, e.g., the
change of the middle base from a purine to a pyrimidine. In order to guarantee incremental
evolution, the next available unused distinctive feature must be used as a point of refinement, i.e.
the splitting occurs at the leading edge of the directed graph (Fig. 1) by successive refinement of
existing classes. This is a powerful constraint on the possible evolution of a coding system.
Suppose that this constraint did not exist. Then a non-gradual evolution of the code would imply
that codons differing in two or more features would be assigned to different amino acids. It
would then be possible to form a new class partition based on distinguishing both features without
first using one of them to form a codon class. In that case, there would be no class that would
correctly accommodate a codon that has one value of the first feature and the contrary value of
the second one. One way to remedy this problem would be to allow the system to go back and
rebuild the classes that have already been formed, but this would violate the developmental order,
with the corresponding perturbations at the protein level. Therefore, as a result of following a
minimum-change coding path, at each stage the narrowest code is formed, consistent with the
capability of the system to distinguish among different codon classes. If two codons cannot be
distinguished, they must code for the same amino acid, remaining synonymous (see footnote 3).
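The successive refinement of codon classes can be illustrated with a short Python sketch. It rests on stated assumptions: the feature order C2, H2, C1, H1, C3, H3 is inferred from the stages described in Section 5.1 (chemical type R/Y and H-bond character S/W of each base), and the names `feature`, `refine` and `ORDER` are hypothetical helpers introduced only for this illustration.

```python
from itertools import product

BASES = "UCAG"
PURINES = set("AG")   # chemical type: R (purine) vs Y (pyrimidine)
STRONG = set("CG")    # H-bond character: S (strong) vs W (weak)

# Assumed order in which features become distinguishable (inferred, not quoted).
ORDER = ["C2", "H2", "C1", "H1", "C3", "H3"]

def feature(codon, name):
    """Value of feature `name` (e.g. 'C2' or 'H1') for a codon."""
    kind, pos = name[0], int(name[1]) - 1
    base = codon[pos]
    if kind == "C":
        return "R" if base in PURINES else "Y"
    return "S" if base in STRONG else "W"

def refine(order=ORDER):
    """Split every existing class by the next feature; record class counts and sizes."""
    classes = [["".join(p) for p in product(BASES, repeat=3)]]
    history = []
    for name in order:
        new_classes = []
        for cls in classes:
            groups = {}
            for codon in cls:
                groups.setdefault(feature(codon, name), []).append(codon)
            new_classes.extend(groups.values())
        classes = new_classes
        history.append((name, len(classes), len(classes[0])))
    return history

for name, n_classes, size in refine():
    print(name, n_classes, size)
```

Each step only subdivides classes that already exist, so the number of classes doubles (2, 4, 8, 16, 32, 64) and no earlier class ever has to be rebuilt.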
5. Results
5.1. Developmental pathway for the genetic code
According to the model, it is necessary to consider three different aspects to specify a possible
developmental pathway for the genetic code:
1. Physico-chemical constraints.
2. Initial conditions, which require that simple amino acids enter the code before more complex ones.
3. An optimization principle to ensure ‘small’ changes in protein structure when new amino
acids are introduced.
Taking into account only the third requirement
leads to ‘impossible’ pathways, in the sense that they may violate physico-chemical facts or the
assumed initial conditions. Goldman (1993) gives examples showing that the natural code is far
from optimal and easily improved with respect to the optimization criterion proposed by Haig and
Hurst (1991). He remarked that ‘‘the assignment of amino acids to synonymous codon sets, and
the very existence of the observed synonymous codon sets, are being constrained by some
as-yet-unmodeled factors’’. Further, he noticed that these factors might have significant bearing on the
comment Haig and Hurst (1991) made that ‘‘the translation apparatus would be expected to evolve
an inverse relationship between the frequency and severity of an error’’. These ‘as-yet-unmodeled
factors’ are the physical constraints that we have taken into account in our model. The same concern
has been expressed by Amirnovin (1997) with respect to the theories that take into account only
biosynthetic relations between amino acids. He points out that ‘‘the code has two main
characteristics: first, it is degenerate with respect to the amino acids (i.e., there is more than one codon for all but
two amino acids) but at the same time, a difference in degeneracy exists such that the number of codons
per amino acid varies from one to six. Second, some of the biosynthetically related amino acids have
closely related codons. The first characteristic of the code has been left mostly unexplained while the
second has been discussed by Wong (1981)’’. He compared the correlations between codons
of biosynthetically related amino acids in the universal genetic code with those in randomly
generated codes. As a result, he found several randomly generated codes that have many more correlations
than are found in the universal code.
Footnote 3. It is interesting to notice that the formal procedure described above was developed in the fields of artificial learning
and linguistics, to describe incremental acquisition. In that context Berwick (1986) has pointed out that such a procedure
would seem unwise from the standpoint of error detection or error correction. The reason is the following: in order to be
able to correct errors of k bits, the code-words have to be separated by a Hamming distance of at least 2k + 1. Since to each codon
corresponds a binary code-word of length six (two bits are necessary to specify each base in the hypercube), in order to
correct one error the maximum number of code-words would be nine, as can easily be shown by standard arguments of
coding theory. This implies that a maximum of eight amino acids and one termination signal could be included in the code.
Therefore, the extant genetic code is not an error-correcting code in the technical sense of this term.
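The figure of nine code-words quoted in footnote 3 agrees with the sphere-packing (Hamming) bound of coding theory. The sketch below assumes that this bound is the ‘standard argument’ meant there; `hamming_bound` is a hypothetical helper name. The bound is an upper limit and need not be attained by an actual code.

```python
from math import comb

def hamming_bound(n: int, k: int) -> int:
    """Upper bound on the number of binary code-words of length n correcting k errors."""
    # Every code-word owns a ball of radius k; balls of distinct code-words must not overlap.
    ball_size = sum(comb(n, i) for i in range(k + 1))
    return 2**n // ball_size

# Codons as 6-bit words (two bits per base), single-error correction:
print(hamming_bound(6, 1))  # 64 // 7 = 9
```

With 64 words and balls of 1 + 6 = 7 words each, at most nine meanings fit, far fewer than the 21 meanings of the extant code.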
These unexpected results, of finding ‘superior’ codes in agreement with requirements (i) and (iii)
above, do not imply that these codes developed in agreement with requirement (ii). As explicitly stated
by Haig and Hurst (1991), their method of generating variant codes is not meant to mimic the
evolutionary process, but to test their null hypothesis. They remark that ‘‘Our null model places
strong constraints on the structure of the variant codes. All codes have the same level of degeneracy
and the same probability of synonymous substitutions as the natural code. Therefore, our results
detect error-minimizing features of the code that are additional to third and second base
redundancy’’. Freeland and Hurst (1998) extended this work, lifting the second constraint by
weighting transition errors differently from transversion errors and weighting each base differently.
However, in this later work no requirement on initial conditions is imposed either.
Next, I will discuss the results that follow from
the developmental pathway of the genetic code obtained from the model. Then, I compare this
pathway with the developmental pathways of two randomly generated codes. The first one was
obtained by maximizing the number of correlations between codons of biosynthetically related amino
acids (Amirnovin, 1997), and the second one, reported by Goldman (1993), by optimizing polar
requirement (Woese et al., 1966).
The developmental pathway of the genetic code, derived from the model, is displayed in Fig. 1. The
first ramification (Fig. 1a) separates the codons of the Pyrimidine Branch (NYN) from those of the
Purine Branch (NRN). From the first column of Fig. 1b it is clear that in the universal code the
codons of the class NYN code exclusively for nonpolar (hydrophobic) amino acids. This was first
noticed by Klump (1993), in connection with his hypercube representation of the genetic code.
The codons of the NRN class (Fig. 1c), in turn, code for polar (hydrophilic) amino acids, with the exception
of Tyr (Y) and Trp (W). These two amino acids are among the five most complex amino acids (actually
the most complex) according to Papentin’s structural complexity measure (Table 1). Therefore, they
are assumed to enter the code only in the last stage. Although the first observations are well known,
they acquire a new meaning in the present context: the differentiation of a single feature of the middle
base of a codon implies the polar/nonpolar distinction of amino acids. This is an important property
to consider for understanding the evolution of the genetic code.
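The first ramification can be reproduced from the standard codon table. The sketch below only lists which amino acids fall into the NYN and NRN branches of the universal code; it does not encode any polarity scale, since the classification used in Fig. 1 is not reproduced here. The names `CODE` and `branch` are hypothetical helpers.

```python
BASES = "UCAG"
# Standard genetic code in UCAG order (first, second, third base); '*' marks STOP.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def branch(second_bases):
    """Set of meanings whose codons have a middle base in `second_bases`."""
    return {aa for codon, aa in CODE.items() if codon[1] in second_bases}

pyrimidine_branch = branch("UC")   # NYN
purine_branch = branch("AG")       # NRN
print(sorted(pyrimidine_branch))
print(sorted(purine_branch))
```

Tyr (Y) and Trp (W) appear in the purine branch, as stated in the text; Ser appears in both branches because of its ‘extra’ AGY codons.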
Recent results suggest that protein folding kinetics are independent of the details of the
sequence (Riddle et al., 1997). This finding confirms theoretical views, which hold that it is possible to
construct globular proteins by specifying the sequence only with respect to a
hydrophobic/hydrophilic alphabet (Chan and Dill, 1998). Furthermore, Hecht and co-workers (Kamtekar et al.,
1993) have shown that four-helix bundles can be designed by using the same binary alphabet, with
the appropriate sequence periodicity of polar and nonpolar residues. This periodicity is the major
determinant of secondary structure in self-assembling oligomeric peptides (Xiong et al., 1995).
Surrogate codes can be constructed assuming different null hypotheses (see Amirnovin (1997)
for three methods to generate surrogates). If the ‘block-structure’ is not assumed, there are more
than $10^{65}$ possible codes (Goldman, 1993). However, assuming the block-structure reduces this
huge number to ‘only’ $21! \approx 5 \times 10^{19}$ possible codes. The most conservative code found by
Goldman (1993), without assuming the distribution of degeneracy present in the universal code,
clearly does not have the polar/nonpolar distinction property (see Fig. 2 of Goldman’s paper).
But even the more restricted, randomly generated codes, which do have the block-structure (see the
two last columns of Fig. 1), do not share the polar/nonpolar distinction property.
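The size of the block-structured code space can be checked directly. The sketch assumes that the count refers to the permutations of 21 meanings (20 amino acids plus the termination signal) over 21 synonymous blocks; that reading is an assumption, not a statement from the cited papers.

```python
from math import factorial

# Assumption: one meaning (20 amino acids + STOP) per synonymous block, 21 blocks in total.
n_block_codes = factorial(21)
print(n_block_codes)            # 51090942171709440000
print(f"{n_block_codes:.1e}")   # 5.1e+19
```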
The second stage in the development of the code is characterized by four codon classes,
consisting of 16 codons each. According to both complexity measures (Table 1), Ala and Gly are
the simplest amino acids, followed by Asp, Pro or Val, Leu. All their codons are of the SNN class.
(G, A, V and D) belong to the GNN subclass and (P, L) to the CNN subclass. These codon classes
are located in contiguous planes in the hypercube (see Figs. 2-4 in Jiménez-Montaño et al., 1996). I
assume that the first plane (GNN) is the primordial plane because it contains A and G. This
implies that the first amino acids incorporated into the genetic code were G, A, V and D. This result
is in agreement with proposals made by others, following very different criteria (Eigen and Schuster,
1979; Taylor and Coates, 1989 and references therein). However, it has only a partial overlap
with the result of Klump (1993), who assumes that the primordial amino acids were G, A, P and R
(see also Hartman, 1995). The last two belong to the CNN plane.
The hypercube representation of the genetic code proposed by Klump differs from the one
assumed in our model, because he employed only the RY categorization of the bases to define the
codon classes. To explore the relations between codons he introduces a scale to rank the codons
according to their Gibbs energy of codon-anticodon interaction (in kcal per mol triplet at 25°C).
Then he defines a ‘mutational pathway’ along the shallowest slope up the codon-anticodon
interaction Gibbs energy levels, from the least stable codon-anticodon pair to the most stable one (see
Fig. 1 in Klump, 1993). He says, ‘‘as it turns out, this pass will always follow the same route. It
starts with exchanging the nucleotide in the third position as the first step, which, with only two
exceptions (Stop/Trp and Ile/Met), preserves the amino acid encoded, it continues with exchanging
the nucleotide in the first position (5′) as the second step, and only at the last step is the
nucleotide in the middle position exchanged’’. Therefore his developmental pathway satisfies the
inequalities:
$C_2 > C_1 > C_3$   (2)
These are consistent with the inequalities we assumed in our model (Eq. 1). However, the
inequalities involving the SW categorization are not considered in Klump’s model.
The assumed primordial amino acids have codons
at the end of the mutational pathways; hence they have the highest codon-anticodon Gibbs energy
and, for this reason, the most stable pairing. This point will be further discussed later on.
The chemical type of the first base, $C_1$, dictates the next developmental stage.
The system now has the capacity to make an RY differentiation of the
first base. As can be seen from Fig. 1, all the amino acids introduced in the previous stage have
codons in the class RNN. Therefore, the new amino acids should have codons of the class
YNN. Following our ‘simple-first’ rule implies the introduction of one element of each of the
following groups, P, S L STOP, R and STOP. Therefore, we assume that besides A, G, V, D and
the STOP signal, R, L, and P or S entered the amino acid alphabet. Of course, arginine (R) does
not fit the rule (see Table 1). However, it is well known that arginine is an unusual amino acid
(Taylor and Coates, 1989). It has been called an ‘‘early intruder’’ by Jukes (1973), who proposed
that it replaced the simpler amino acid ornithine. Taylor and Coates (1989) made a similar
proposal, suggesting that the probable explanation of this anomaly is that the codons AGN belonged to
an amino acid no longer a member of the coded set, such as norvaline, norleucine, α-aminobutyric
acid or ornithine. An even simpler candidate, β-amino alanine, as well as guanidinoacetic acid, have
been proposed by Hartman (1995) in the place of arginine.
Between P and S it is not possible to decide which one entered earlier, because the two
measures of complexity give contradictory rankings (Table 1), essentially because each one measures
different molecular aspects. However, taking into
account Klump’s results above, it is proposed that the amino acids assigned at this stage were A, G,
V, D, P, L and ‘R’, where ‘R’ stands for the simpler basic amino acid which, it is assumed, was
later replaced by arginine. The corresponding primeval code is displayed, in the conventional
way, in Fig. 2. It leads to the important prediction that functional proteins can be coded with this
limited amino acid alphabet.
In a recent paper, Riddle et al. (1997) asked the following question: ‘‘what is the minimum
number of amino acids that would have been needed to encode complex protein folds similar to those
found in nature today?’’ They found experimentally that it is possible to build most of a src SH3
domain with a five-letter amino acid alphabet but not with a three-letter alphabet. SH3 is a
57-residue domain that has a complex β-barrel-like structure wherein residues spread throughout the
sequence come together to create the binding site for a proline-rich peptide. They showed that the
SH3 domain can be largely encoded by the following amino acid alphabet, (A, G, I, E, K),
which may be compared with our proposed primeval alphabet, (A, G, V, L, D, ‘R’, P). The
differences between the two alphabets correspond to closely related amino acids. They can be
accounted for by the empirically conserved substitution groups found by Wu and Brutlag (1996), the
most prominent of which is (I, V), found in 10 192 positions in the BLOCKS database
(Henikoff and Henikoff, 1991); (D, E) was found
in 5980, (K, R) in 6453 and (I, L, V) in 5328 positions in the same database. They also occur
in the theoretical groupings based on amino acid physical properties or structural comparisons
proposed by many authors (e.g. Jiménez-Montaño, 1984; Taylor, 1986; Bordo and Argos, 1991). As a
matter of fact, these substitutions occur in the phylogenetic variation of src SH3 domains (Table
1 in Riddle et al., 1997).
Similar results are obtained from the randomly generated codes (Fig. 1), after imposing the initial
condition (ii). For example, at the same level of resolution, from the first random code
(Amirnovin, 1997), a corresponding primeval
code could be (G, Q, S, C, P, L, D). However, a very different outcome is obtained with the
primeval code proposed by Jiménez-Sánchez (1995), which assumes radically different initial conditions.
According to Table 2 of his paper, the corresponding amino acid alphabet is (K, N, Y, M, I,
L, F). It is difficult to see how functional proteins could be constructed mainly with bulky,
hydrophobic amino acids without incorporating G or A, and D or E (Riddle et al., 1997). Besides, it
includes some of the most complex amino acids (Table 1).
In the next step, when the H-bond character of the first base, $H_1$, enters the game, the codons
become specified by the first two bases. Thus, all the codons are of the form $B_1 B_2 N$. A primeval
code of this type, with the third base, N, having
no specificity, was proposed long ago by Jukes (1965). He remarked that this code would have a
maximum of 15 amino acids and four stop codons. In the following, for the amino acids with
six codons (S, L, R), I adopt the convention of Taylor and Coates (1989) of calling the additional
pair of codons ‘extras’. The eight amino acids which are four-fold degenerate in the universal
code are (G, A, P, S, V, T, L, R), counting the ‘extras’ separately. They are ranked among the
first ten according to at least one complexity measure (Table 1), with the mentioned exception
of arginine. It is not possible to know for sure
which of the two amino acids that share codons with common doublets entered at this stage.
However, from the ranking in Table 1, one possibility would be the following group: (I, C, Q, K, S, D).
Thus, there are, altogether, 13 amino acids, {G(4), A(4), P(4), S(8), V(4), T(4), L(8), R(4),
I(4), C(4), Q(4), K(4), D(4)} and STOP(4), where the numbers in parentheses indicate the
number of codons. In other variants there could be N(4) instead of K(4), or R(8) S(4) instead of
R(4) S(8), etc.
Fig. 2. Proposed primeval code, displayed as a conventional codon catalogue.
These speculations are of little value, because N and K, like other amino acids that share codon
doublets, are degenerate with respect to the Gibbs free energy of codon-anticodon interaction
(Klump and Maeder, 1991). In contrast, the four-fold degenerate amino acids have doublets
with the following composition: G + C = 3/4, A + U = 1/4. Therefore, they form the more
stable codon-anticodon pairs. Long ago, Goldberg and Wittes (1966) pointed out that triplets with a
high GC content have greater pairing specificity and greater resistance to mutation. In their own
words, ‘‘the correlations described suggest that protective mechanisms may act at two levels, (i)
the nucleotide level, where a high GC content may reduce the rate of mutation errors in protein
synthesis, and (ii) the organizational level where the effects of a base change are minimized by
degeneracy and by connectedness of codons for similar amino acids. The best-protected amino
acids would be those with maximum GC content and degeneracy’’. Six of the seven amino acids in
the proposed primeval code are among the best-protected amino acids.
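The doublet composition of the four-fold degenerate amino acids can be verified from the standard codon table. This is a minimal sketch; the variable names (`CODE`, `fourfold`, `gc_fraction`) are hypothetical, and the table is the standard genetic code written in UCAG order.

```python
from fractions import Fraction

BASES = "UCAG"
# Standard genetic code in UCAG order (first, second, third base); '*' marks STOP.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

# Doublets B1B2 whose four codons B1B2N all carry the same meaning.
fourfold = {}
for b1 in BASES:
    for b2 in BASES:
        meanings = {CODE[b1 + b2 + n] for n in BASES}
        if len(meanings) == 1:
            fourfold[b1 + b2] = meanings.pop()

# Fraction of G or C among the bases of these doublets.
gc_count = sum(base in "GC" for doublet in fourfold for base in doublet)
gc_fraction = Fraction(gc_count, 2 * len(fourfold))

print(sorted(fourfold.items()))
print(gc_fraction)  # 3/4
```

The eight doublets found are those of G, A, P, S, V, T, L and R; 12 of their 16 bases are G or C.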
In the following stage, the distinction of the chemical type, $C_3$, leads to the RY degeneracy of
the third base. At this point there are already the
20 amino acids, {G(4), A(4), P(4), S(6), V(4), T(4), L(6), F(2), R(6), I(2), M(2), C(2), W(2),
Q(2), H(2), K(2), N(2), D(2), E(2), Y(2), STOP(2)}, where the ‘extra’ codons of serine (S),
leucine (L) and arginine (R) are assigned according to the universal code. However, it is only in
the final period that a distinction in the H-bond character of the third base, $H_3$, becomes
possible. This last step breaks the RY symmetry of the Trp
(W), Ile (I) and Met (M) codons, leading to the present assignments, W(1), I(3), M(1) and
STOP(3).
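The codon bookkeeping of the last two stages can be checked against the standard codon table. In this sketch the dictionary `stage_c3` transcribes the codon numbers listed in the text for the $C_3$ stage; the variable names are hypothetical.

```python
from collections import Counter

# Standard genetic code in UCAG order (first, second, third base); '*' marks STOP.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Codon numbers after the C3 stage, as listed in the text.
stage_c3 = {"G": 4, "A": 4, "P": 4, "S": 6, "V": 4, "T": 4, "L": 6, "F": 2, "R": 6,
            "I": 2, "M": 2, "C": 2, "W": 2, "Q": 2, "H": 2, "K": 2, "N": 2, "D": 2,
            "E": 2, "Y": 2, "*": 2}
assert sum(stage_c3.values()) == 64   # all 64 codons are accounted for

# Final H3 step: the RY symmetry of the Trp, Ile and Met codons is broken.
final = dict(stage_c3, W=1, I=3, M=1)
final["*"] = 3

universal = Counter(AMINO)            # degeneracy of each meaning in the universal code
print(final == dict(universal))       # True
```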
5.2. Deviations from the universal code
Schultz and Yarus (1996) have underlined that
the codon reassignments found in various organelles, several species of ciliates and other
organisms are ‘very nonrandom’. According to Table 1 of their paper, in 14 instances involving
six codons to be reassigned, reassignment apparently proceeded by single nucleotide changes, that
is, by one or two bit changes. As can be seen from Fig. 1, they are connected with features on the
right-hand side of the set of inequalities (Eq. 1). Therefore, they are associated with low values of
the codon-anticodon interaction energy, and with the later stages in the evolution of the code. This is in
agreement with the view that these are relatively ‘‘modern changes’’ (Andersson and Kurland,
1995; Schultz and Yarus, 1996). Some reassignments restore the local symmetry of a codon class, e.g.
the assignment of the UGA (STOP) codon to tryptophan (W) restores the symmetry of the
codon class UGR, which is broken in the universal code. According to the results obtained by
Inagaki et al. (1998), the ancestral mitochondrion bore the universal genetic code and
subsequently reassigned the codon UGA to Trp, independently, in various lineages. From the point of
view developed in this contribution, this means a return to a less differentiated code. This
interpretation agrees with a reductive mode of genomic evolution (Andersson and Kurland, 1995). In
contrast, the reassignment of AAA (K) to N breaks the symmetry of the codon class AAR,
producing a 1-3 degeneracy. This case is completely equivalent to the case of AUR, which in
the universal code is split into AUA (I) and AUG (M). The reassignment of AUA in mitochondria
brings back the symmetry. These variations clearly show that changes in the ‘least significant
feature’ of a codon do occur, without a major disruption of cell functioning.
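The symmetry argument can be made concrete with a small check on the third base. The sketch is illustrative only: `broken_purine_classes` is a hypothetical helper, and the reassignment of AUA to Met is taken here as the mitochondrial variant that ‘brings back the symmetry’ of the AUR class.

```python
BASES = "UCAG"
# Standard genetic code in UCAG order (first, second, third base); '*' marks STOP.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
UNIVERSAL = {a + b + c: AMINO[16 * i + 4 * j + k]
             for i, a in enumerate(BASES)
             for j, b in enumerate(BASES)
             for k, c in enumerate(BASES)}

def broken_purine_classes(code):
    """Classes B1B2R whose two codons (B1B2A, B1B2G) carry different meanings."""
    return {b1 + b2 + "R" for b1 in BASES for b2 in BASES
            if code[b1 + b2 + "A"] != code[b1 + b2 + "G"]}

print(sorted(broken_purine_classes(UNIVERSAL)))     # ['AUR', 'UGR']

# UGA -> Trp and AUA -> Met restore the symmetry of both classes.
restored = dict(UNIVERSAL, UGA="W", AUA="M")
print(sorted(broken_purine_classes(restored)))      # []

# AAA (Lys) -> Asn breaks the symmetry of the AAR class.
aaa_variant = dict(UNIVERSAL, AAA="N")
print(sorted(broken_purine_classes(aaa_variant)))   # ['AAR', 'AUR', 'UGR']
```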
5.3. Relationship between codon classes and aminoacyl tRNA synthetases
Although no microscopic questions were
treated in our model, I will recall some well-known facts about transfer-RNA molecules in order to
have a proper perspective on its relation to the underlying microscopic picture. In extant
organisms, the physical carriers of the genetic code are, of course, the tRNA adaptor molecules. They may
be considered as comprising two informational domains: the acceptor-TΨC minihelix,
encoding the operational mapping for amino acids (Schimmel et al., 1993), and the
anticodon-containing domain, with the three nucleotides of the codon-anticodon mapping. A model to relate the
evolution of these two parts of the genetic code, assuming a tRNA ancestor, has been proposed by
Rodin et al. (1996). The connection between amino acids and specific triplets is accomplished
through the aminoacylation reactions catalyzed by aminoacyl tRNA synthetases (see, e.g. Voet
and Voet, 1995). These enzymes select both an amino acid and a tRNA. As is well known, these
enzymes come in two groups, called Class I and Class II aaRSs (Eriani et al., 1990). To each class
correspond exactly ten amino acids.
The structure of the hypercube, which depends on the RY and WS categorizations of the bases,
was derived from the codon-anticodon mapping. Therefore, I assume that the only unused feature
of the bases, the MK categorization, was used to distinguish the two classes of aminoacyl
synthetases. Since the most important base of a codon (anticodon) is the second base, it was
natural to assume that this base was also responsible for the differentiation of the synthetases in the
early stages of the development of the code. This assumption is supported by two independent
observations:
1. The result shown in Fig. 3, of a remarkable correlation between the amino acids coded by
NMN codons and the amino acids recognized by Class II aaRSs and, correspondingly, between
those coded by NKN codons and the amino acids recognized by the synthetases of Class I.
Wentzel (1995) already pointed out a correlation between the middle base of a codon and
aaRS class, but he did not notice the relation to the MK categorization of the bases.
2. The implications of the tRNA ‘‘gene recruitment’’ model (Saks et al., 1997; see footnote 4)
for the evolution of the code. According to this model, a
single anticodon mutation may affect both mappings realized by the genetic code.
An interesting example of the connection between the two mappings realized by tRNA
adaptor molecules, when there exists an anticodon identifier, is E. coli tRNA$^{\mathrm{Ile}}$.
This tRNA, specific for the codon AUA, has the modified anticodon
LAU, where L is lysidine. This is a modification of cytosine (C), whose 2-keto group is replaced by
the amino acid lysine (K). The L in this context pairs with A rather than G, a unique case of base
modification altering base pairing specificity. The replacement of this L with unmodified C yields a
tRNA that recognizes the codon AUG (M). As noticed by Voet and Voet (1995), ‘‘Surprisingly,
however, this altered tRNA$^{\mathrm{Ile}}$ is also a much better substrate for MetRS than is for IleRS.
Thus, both the codon and the amino acid specificity of this tRNA are changed by a single
posttranscriptional modification.’’
Fig. 3. ‘Alignment’ of amino acids recognized by class I/II aminoacyl tRNA synthetases, with two amino acid groupings: (1) by
codon type, with respect to the KM categorization; (2) by end atom type (Davydov, 1998). S’ and R’ stand for the ‘extra’
serine and arginine codons, respectively.
Footnote 4. Notwithstanding the existence of an operational mapping devoid of the anticodon nucleotides (Schimmel et al., 1993), in
a recent paper Saks et al. (1997) wrote that ‘‘as a consequence of concurrently changing tRNA identity and mRNA coupling
capacity, a single anticodon mutation could potentially result in a tRNA that would be competent to correctly translate a
new set of codons in all the essential endogenous mRNA. Thus, an anticodon mutation might recruit a tRNA gene from
one isoaccepting group to another’’. Isoacceptors are different tRNAs that accept the same amino acid. Furthermore, they
notice that ‘‘an amino acid identity corresponding to the anticodon, rather than aminoacylation efficiency, is likely to
be the key prerequisite for the appearance of a tRNA variant in a population. Once such a tRNA appeared, its efficiency of
aminoacylation could be improved by the combined forces of mutation and natural selection’’. They obtained a confirmation
of the above ‘recruitment hypothesis’ in vivo, using E. coli as a model system.
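The correlation between the MK categorization of the middle base and the two synthetase classes, discussed in this section, can be tallied from the standard codon table. The sketch rests on assumptions that are not spelled out in the text: the class membership below is the commonly cited assignment of ten amino acids per class (with Lys in Class II), and counting is done per amino acid rather than per codon, which may differ from the layout of Fig. 3. The names `CLASS_I`, `CLASS_II` and `tally` are hypothetical.

```python
BASES = "UCAG"
# Standard genetic code in UCAG order (first, second, third base); '*' marks STOP.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

# Assumed class membership (commonly cited assignment, ten amino acids each).
CLASS_I = set("RCQEILMWYV")
CLASS_II = set("ANDGHKFPST")

def tally(second_bases):
    """Count Class I / Class II amino acids among codons with the given middle bases."""
    amino_acids = {aa for codon, aa in CODE.items()
                   if codon[1] in second_bases and aa != "*"}
    return {"I": len(amino_acids & CLASS_I), "II": len(amino_acids & CLASS_II)}

# M = amino bases (A, C); K = keto bases (G, U).
table = {"NMN": tally("AC"), "NKN": tally("GU")}
print(table)  # {'NMN': {'I': 3, 'II': 8}, 'NKN': {'I': 7, 'II': 3}}
```

Under these assumptions, most NMN amino acids are Class II and most NKN amino acids are Class I; serine is counted on both sides because of its ‘extra’ AGY codons.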
In this paper I have emphasized the codon-anticodon mapping, i.e. the assignment of codon
classes to amino acids. Recently, Davydov
(1998) has proposed a set of rules to associate the amino acid end atoms (O/N and
non-O/non-N) of 18 amino acids with codons containing weak bases (A/U). These rules correctly predict
all the codons in which the third base is non-redundant, that is, all the codons with doublets
$B_1 B_2$ in the set $M_2$ = {AU, UU, UA, AA, GA, CA, UG, AG}, plus AC, GU and CU. The
amino acids that, according to Davydov’s rules, allow
the correct association of codons are (I, M, L, F, Y, K, N, D, E, Q, H, C, W, T, V, R’, S’).
R’ and S’ correspond to the ‘extra’ codons of R and S, respectively. Davydov’s rules give incorrect
results for the codons of R, S and A. Glycine and proline were not considered in the analysis
performed by Davydov, for reasons explained in his article.
We have found an inverse of Davydov’s rules:
1. Codons of the form NAN + WCN have amino acids with O/N end atoms.
2. Codons of the form NUN + WGN have amino acids with non-O/non-N end atoms, the
exceptions being R and S.
3. Codons of the form SCN + SGN code for (P, A, R, G). These amino acids either were
left out by Davydov or did not obey his rules.
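The three codon forms in these rules can be checked to partition the 64 codons, and the third group can be read off the standard codon table. The sketch does not model end-atom types; `rule` is a hypothetical helper that only classifies codons by the forms named in the list.

```python
BASES = "UCAG"
# Standard genetic code in UCAG order (first, second, third base); '*' marks STOP.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}
WEAK = "AU"   # W bases; the strong (S) bases are C and G

def rule(codon):
    """Index of the rule whose codon form matches `codon`."""
    b1, b2 = codon[0], codon[1]
    if b2 == "A" or (b1 in WEAK and b2 == "C"):
        return 1   # NAN + WCN
    if b2 == "U" or (b1 in WEAK and b2 == "G"):
        return 2   # NUN + WGN
    return 3       # SCN + SGN (strong first base, middle base C or G)

sizes = {1: 0, 2: 0, 3: 0}
groups = {1: set(), 2: set(), 3: set()}
for codon, aa in CODE.items():
    r = rule(codon)
    sizes[r] += 1
    groups[r].add(aa)

print(sizes)               # {1: 24, 2: 24, 3: 16}
print(sorted(groups[3]))   # ['A', 'G', 'P', 'R']
```

The third group contains exactly the codons of P, A, R and G, in agreement with rule 3.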
Thus, it is seen that the amino acids beyond Davydov’s rules do not form a random set. All
of them have codons of the class SSN, which are the most stable codons (Klump, 1993). With
the exception of arginine once more, the other three have no proper side-chain (see Davydov’s
paper for further details). In Fig. 3 the amino acids recognized by the two classes of
aminoacyl-tRNA synthetases are ‘aligned’ with two amino acid groupings defined by (i) end atom type and
(ii) codon type. It is clear from this figure that the three ways of grouping amino acids are
strongly correlated.
6. Discussion