Biosystems, Vol. 54, Issues 1-2, 1999

distinguish them. However, at the following stage, the threshold energy for minimal differentiation decreases because of better enzymes. Therefore, the previous change is no longer the smallest one. The H-bonding of the second base, $H_2$, becomes the new differentiation threshold, and so on. The key idea is that the inequalities (Eq. 1) determine a definite order in which new features are used to form new codon classes, which code for the recently introduced amino acids. The new classes are formed by splitting existing classes, with the split based on the order given by the basic assumption. In this way, a huge number of developmental pathways are eliminated because of the order in which a small number of parameters are set. A split must be triggered by a detectable physico-chemical difference between at least one of the members of an existing codon class and the rest of the members. This event could be, e.g., the change of the middle base from a purine to a pyrimidine. In order to guarantee incremental evolution, the next available unused distinctive feature must be used as a point of refinement, i.e. the splitting occurs at the leading edge of the directed graph (Fig. 1) by successive refinement of existing classes. This is a powerful constraint on the possible evolution of a coding system. Suppose that this constraint did not exist; then a non-gradual evolution of the code would imply that codons differing in two or more features would be assigned to different amino acids. It would then be possible to form a new class partition based on distinguishing both features without first using one of them to form a codon class. Therefore, there would be no class that would correctly accommodate a codon that has one value of the first feature and the contrary value of the second one.
One way to remedy this problem would be to allow the system to go back and rebuild the classes that have already been formed, but this would violate the developmental order, with the corresponding perturbations at the protein level. Therefore, as a result of following a minimum-change coding path, at each stage the narrowest code is formed, consistent with the capability of the system to distinguish among different codon classes. If two codons cannot be distinguished they must code for the same amino acid, remaining synonymous (see footnote 3).
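The successive-refinement procedure described above can be sketched in a few lines of Python. This is only an illustration, not the author's implementation: the feature order (C2, H2, C1, H1, C3, H3) is an assumption inferred from the stages described in the text, since Eq. 1 is not reproduced in this section, and the function names are hypothetical.

```python
from itertools import product

BASES = "UCAG"


def chemical_type(base):
    """Purine (R = A, G) versus pyrimidine (Y = U, C)."""
    return "R" if base in "AG" else "Y"


def h_bond(base):
    """Strong (S = G, C) versus weak (W = A, U) hydrogen bonding."""
    return "S" if base in "GC" else "W"


# ASSUMPTION: feature order inferred from the stages described in the text;
# ("C", 2) means "chemical type of the second base", and so on.
FEATURE_ORDER = [("C", 2), ("H", 2), ("C", 1), ("H", 1), ("C", 3), ("H", 3)]


def feature_value(codon, feature):
    kind, position = feature
    base = codon[position - 1]
    return chemical_type(base) if kind == "C" else h_bond(base)


def refine(classes, feature):
    """Split every existing class by one feature; classes are never rebuilt."""
    refined = []
    for cls in classes:
        groups = {}
        for codon in cls:
            groups.setdefault(feature_value(codon, feature), []).append(codon)
        refined.extend(groups.values())
    return refined


def developmental_pathway():
    """Return the list of partitions, one per stage, starting from one class."""
    classes = [["".join(c) for c in product(BASES, repeat=3)]]
    stages = [classes]
    for feature in FEATURE_ORDER:
        classes = refine(classes, feature)
        stages.append(classes)
    return stages


if __name__ == "__main__":
    for stage, classes in enumerate(developmental_pathway()):
        print(stage, len(classes), "classes of", len(classes[0]), "codons")
```

Because each stage only splits existing classes, two codons that are synonymous at a given stage were necessarily synonymous at every earlier stage, which is the incremental property the argument relies on.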

5. Results

5.1. Developmental pathway for the genetic code

According to the model, it is necessary to consider three different aspects to specify a possible developmental pathway for the genetic code:

1. Physico-chemical constraints.
2. Initial conditions, which require that simple amino acids enter the code before more complex ones.
3. An optimization principle to ensure ‘small’ changes in protein structure when new amino acids are introduced.

Taking into account only the third requirement leads to ‘impossible’ pathways, in the sense that they may violate physico-chemical facts or the assumed initial conditions. Goldman (1993) gives examples showing that the natural code is far from optimal and easily improved, with respect to the optimization criterion proposed in Haig and Hurst (1991). He remarked that ‘‘the assignment of amino acids to synonymous codon sets, and the very existence of the observed synonymous codon sets, are being constrained by some as-yet-unmodeled factors’’. Further, he noticed that these factors might have significant bearing on the comment Haig and Hurst (1991) made that ‘‘the translation apparatus would be expected to evolve an inverse relationship between the frequency and severity of an error’’. These ‘as-yet-unmodeled factors’ are the physical constraints that we have taken into account in our model.

3 It is interesting to notice that the formal procedure described above was developed in the fields of artificial learning and linguistics, to describe incremental acquisition. In that context Berwick (1986) has pointed out that such a procedure would seem unwise from the standpoint of error detection or error correction. The reason is the following: in order to be able to correct errors of k bits, the code-words have to be separated by a ball of radius 2k + 1. Since to each codon corresponds a binary code-word of length six (two bits are necessary to specify each base in the hypercube), in order to correct one error the maximum number of code-words would be nine, as can easily be shown by standard arguments of coding theory. This implies that a maximum of eight amino acids and one termination signal could be included in the code. Therefore, the extant genetic code is not an error-correcting code in the technical sense of this term.

The same concern has been expressed by Amirnovin (1997), with respect to the theories that take into account only biosynthetic relations between amino acids. He points out that, ‘‘the code has two main characteristics: first, it is degenerate with respect to the amino acids (i.e., there is more than one codon for all but two amino acids) but at the same time, a difference in degeneracy exists such that the number of codons per amino acid varies from one to six. Second, some of the biosynthetically related amino acids have closely related codons. The first characteristic of the code has been left mostly unexplained while the second has been discussed by Wong (1981)’’. He compared the codon correlations between codons of biosynthetically related amino acids in the universal genetic code with those in randomly generated codes. As a result, he found several randomly generated codes that have many more correlations than those found in the universal code.

These unexpected results, of finding ‘superior’ codes, in agreement with requirements (i) and (iii) above, do not imply that these codes developed in agreement with requirement (ii). As explicitly stated by Haig and Hurst (1991), their method of generating variant codes is not meant to mimic the evolutionary process, but to test their null hypothesis. They remark that, ‘‘Our null model places strong constraints on the structure of the variant codes. All codes have the same level of degeneracy and the same probability of synonymous substitutions as the natural code.
Therefore, our results detect error-minimizing features of the code that are additional to third and second base redundancy’’. Freeland and Hurst (1998) extended this work, lifting the second constraint by weighting transition errors differently from transversion errors and weighting each base differently. However, this later work also imposes no requirement on initial conditions.

Next, I will discuss the results that follow from the developmental pathway of the genetic code obtained from the model. Then, I compare this pathway with the developmental pathways of two randomly generated codes. The first one was obtained by maximizing the number of correlations between codons of biosynthetically related amino acids (Amirnovin, 1997), and the second one, reported by Goldman (1993), by optimizing polar requirement (Woese et al., 1966).

The developmental pathway of the genetic code, derived from the model, is displayed in Fig. 1. The first ramification (Fig. 1a) separates the codons of the Pyrimidine Branch (NYN) from those of the Purine Branch (NRN). From the first column of Fig. 1b it is clear that in the universal code the codons of the class NYN code exclusively for nonpolar (hydrophobic) amino acids. This was first noticed by Klump (1993), in connection with his hypercube representation of the genetic code. The codons of the NRN class (Fig. 1c), in turn, code for polar (hydrophilic) amino acids, with the exception of Tyr (Y) and Trp (W). These two amino acids are among the five more complex amino acids, actually the most complex, according to Papentin’s structural complexity measure (Table 1). Therefore, they are assumed to enter the code only in the last stage. Although the first observations are well known, they acquire a new meaning in the present context: the differentiation of a single feature of the middle base of a codon implies the polar/nonpolar distinction of amino acids.
This is an important property to consider for understanding the evolution of the genetic code. Recent results suggest that protein folding kinetics are independent of the details of the sequence (Riddle et al., 1997). This finding confirms theoretical views, which hold that it is possible to construct globular proteins by specifying the sequence only with respect to a hydrophobic/hydrophilic alphabet (Chan and Dill, 1998). Furthermore, Hecht and co-workers (Kamtekar et al., 1993) have shown that four-helix bundles can be designed by using the same binary alphabet, with the appropriate sequence periodicity of polar and nonpolar residues. This periodicity is the major determinant of secondary structure in self-assembling oligomeric peptides (Xiong et al., 1995).

Surrogate codes can be constructed assuming different null hypotheses (see Amirnovin, 1997, for three methods to generate surrogates). If the ‘block-structure’ is not assumed, there are more than $10^{65}$ possible codes (Goldman, 1993). However, assuming the block-structure reduces this huge number to ‘only’ $21! \approx 5 \times 10^{19}$ possible codes. The most conservative code found by Goldman (1993), without assuming the distribution of degeneracy present in the universal code, clearly does not have the polar/nonpolar distinction property (see Fig. 2 of Goldman’s paper). But even the more restricted, randomly generated codes, which do have the block-structure (see the two last columns of Fig. 1), do not share the polar/nonpolar distinction property.

The second stage in the development of the code is characterized by four codon classes, consisting of 16 codons each. According to both complexity measures (Table 1), Ala and Gly are the simplest amino acids, followed by Asp, Pro or Val, Leu. All their codons are of the SNN class. G, A, V and D belong to the GNN subclass and P, L to the CNN subclass. These codon classes are located in contiguous planes in the hypercube (see Figs. 2–4 in Jiménez-Montaño et al., 1996).
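To make the hypercube picture concrete, the following sketch encodes each codon as a six-bit word (two bits per base) and checks that the GNN and CNN planes are contiguous, i.e. that matching codons differ in a single bit. The particular bit assignment is an assumption chosen for illustration (chemical type first, H-bond character second); the cited papers may label the vertices differently. The last function evaluates the standard sphere-packing (Hamming) bound, which for six-bit words and single-error correction gives the figure of nine code-words quoted in footnote 3.

```python
from itertools import product
from math import comb

# ASSUMPTION: illustrative bit assignment per base,
# (chemical type: R=1, Y=0; H-bond character: S=1, W=0).
BASE_BITS = {"G": (1, 1), "A": (1, 0), "C": (0, 1), "U": (0, 0)}


def encode(codon):
    """Map a codon to a 6-bit tuple, a vertex of the six-dimensional hypercube."""
    return tuple(bit for base in codon for bit in BASE_BITS[base])


def hamming(codon_a, codon_b):
    """Number of bit positions in which two encoded codons differ."""
    return sum(x != y for x, y in zip(encode(codon_a), encode(codon_b)))


def plane(first_base):
    """The 16 codons sharing a given first base, in a fixed order."""
    return [first_base + "".join(rest) for rest in product("UCAG", repeat=2)]


def planes_adjacent(first_a, first_b):
    """True if every codon of one plane is one bit flip from its partner."""
    return all(hamming(a, b) == 1 for a, b in zip(plane(first_a), plane(first_b)))


def hamming_bound(n, k):
    """Sphere-packing upper bound on binary code-words of length n correcting k errors."""
    ball_volume = sum(comb(n, i) for i in range(k + 1))
    return 2 ** n // ball_volume


if __name__ == "__main__":
    print(planes_adjacent("G", "C"))  # GNN and CNN planes
    print(hamming_bound(6, 1))
```

Under this labelling G and C differ only in chemical type, so the two planes are neighbours; G and U differ in both features, so their planes are not.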
I assume that the first plane (GNN) is the primordial plane because it contains A and G. This implies that the first amino acids incorporated into the genetic code were G, A, V and D. This result is in agreement with proposals made by others, following very different criteria (Eigen and Schuster, 1979; Taylor and Coates, 1989 and references therein). However, it has only a partial overlap with the result of Klump (1993), who assumes that the primordial amino acids were G, A, P and R (see also Hartman, 1995). The last two belong to the CNN plane.

The hypercube representation of the genetic code proposed by Klump differs from the one assumed in our model, because he employed only the R/Y categorization of the bases to define the codon classes. To explore the relations between codons he introduces a scale to rank the codons according to their Gibbs energy of codon–anticodon interaction (in kcal per mol triplet at 25°C). Then he defines a ‘mutational pathway’ along the shallowest slope up the codon–anticodon interaction Gibbs energy levels, from the least stable codon–anticodon pair to the most stable one (see Fig. 1 in Klump, 1993). He says, ‘‘as it turns out, this pass will always follow the same route. It starts with exchanging the nucleotide in the third position as the first step, which, with only two exceptions (Stop/Trp and Ile/Met), preserves the amino acid encoded, it continues with exchanging the nucleotide in the first position (5′) as the second step, and only at the last step is the nucleotide in the middle position exchanged’’. Therefore his developmental pathway satisfies the inequalities:

$C_2 > C_1 > C_3$ (2)

These are consistent with the inequalities we assumed in our model (Eq. 1). However, the inequalities involving the S/W categorization are not considered in Klump’s model.
The assumed primordial amino acids have codons at the end of the mutational pathways; hence they have the highest codon–anticodon Gibbs energy and, for this reason, the most stable pairing. This point will be further discussed later on.

The chemical type of the first base, $C_1$, dictates the next developmental stage. The system now has the capacity to make an R/Y differentiation of the first base. As can be seen from Fig. 1, all the amino acids introduced in the previous stage have codons in the class RNN. Therefore, the new amino acids should have codons of the class YNN. Following our ‘simple-first’ rule implies the introduction of one element of each of the following groups, P, S L STOP, R and STOP. Therefore, we assume that besides A, G, V, D and the STOP signal, R, L, and P or S entered the amino acid alphabet. Of course, arginine (R) does not fit the rule (see Table 1). However, it is well known that arginine is an unusual amino acid (Taylor and Coates, 1989). It has been called an ‘‘early intruder’’ by Jukes (1973), who proposed that it replaced the simpler amino acid ornithine. Taylor and Coates (1989) made a similar proposal, suggesting that the probable explanation of this anomaly is that the codons AGN belonged to an amino acid no longer a member of the coded set, such as norvaline, norleucine, α-aminobutyric acid or ornithine. An even simpler candidate, β-amino alanine, as well as guanidinoacetic acid, have been proposed by Hartman (1995) in the place of arginine.

Between P and S it is not possible to decide which one entered earlier, because the two measures of complexity give contradictory rankings (Table 1), essentially because each one measures different molecular aspects. However, taking into account Klump’s results above, it is proposed that the amino acids assigned at this stage were A, G, V, D, P, L and ‘R’, where ‘R’ stands for the simpler basic amino acid which, it is assumed, was later replaced by arginine.
The corresponding primeval code is displayed, in the conventional way, in Fig. 2. It leads to the important prediction that functional proteins can be coded with this limited amino acid alphabet. In a recent paper, Riddle et al. (1997) asked the following question, ‘‘what is the minimum number of amino acids that would have been needed to encode complex protein folds similar to those found in nature today?’’ They found experimentally that it is possible to build most of a src SH3 domain with a five-letter amino acid alphabet but not with a three-letter alphabet. SH3 is a 57-residue domain that has a complex β-barrel-like structure wherein residues spread throughout the sequence come together to create the binding site for a proline-rich peptide. They showed that the SH3 domain can be largely encoded by the amino acid alphabet (A, G, I, E, K), which may be compared with our proposed primeval alphabet (A, G, V, L, D, ‘R’, P). The differences between the two alphabets correspond to closely related amino acids. They can be accounted for by the empirically conserved substitution groups found by Wu and Brutlag (1996). The most prominent of these is (I, V), found in 10,192 positions in the BLOCKS database (Henikoff and Henikoff, 1991); (D, E) was found in 5980, (K, R) in 6453 and (I, L, V) in 5328 positions in the same database. They also occur in the theoretical groupings based on amino acid physical properties or structural comparisons proposed by many authors (e.g. Jiménez-Montaño, 1984; Taylor, 1986; Bordo and Argos, 1991). As a matter of fact, these substitutions occur in the phylogenetic variation of src SH3 domains (Table 1 in Riddle et al., 1997).

Similar results are obtained from the randomly generated codes (Fig. 1), after imposing the initial condition (ii). For example, at the same level of resolution, from the first random code (Amirnovin, 1997), a corresponding primeval code could be (G, Q, S, C, P, L, D).
However, a very different outcome is obtained with the primeval code proposed by Jiménez-Sánchez (1995), which assumes radically different initial conditions. According to Table 2 of his paper, the corresponding amino acid alphabet is (K, N, Y, M, I, L, F). It is difficult to see how functional proteins could be constructed mainly with bulky, hydrophobic amino acids without incorporating G or A, and D or E (Riddle et al., 1997). Besides, it includes some of the most complex amino acids (Table 1).

Fig. 2. Proposed primeval code, displayed as a conventional codon catalogue.

In the next step, when the H-bond character of the first base, $H_1$, enters the game, the codons become specified by the two first bases. Thus, all the codons are of the form $B_1 B_2 N$. A primeval code of this type, with the third base, N, having no specificity, was proposed long ago by Jukes (1965). He remarked that this code would have a maximum of 15 amino acids and four stop codons. In the following, for the amino acids with six codons (S, L, R), I adopt the convention of Taylor and Coates (1989) to call the additional pair of codons ‘extras’. The eight amino acids which are four-fold degenerate in the universal code are G, A, P, S, V, T, L, R, counting the ‘extras’ separately. They are ranked among the first ten according to at least one complexity measure (Table 1), with the mentioned exception of arginine. It is not possible to know for sure which of the two amino acids that share codons with common doublets entered at this stage. However, from the ranking in Table 1, one possibility would be the following group: I, C, Q, K, S, D. Thus, there are, altogether, 13 amino acids, {G(4), A(4), P(4), S(8), V(4), T(4), L(8), R(4), I(4), C(4), Q(4), K(4), D(4)} and STOP(4), where the numbers in parentheses indicate the number of codons. In other variants there could be N(4) instead of K(4), or R(8), S(4) instead of R(4), S(8), etc.
These speculations are of little value, because N and K, like other amino acids that share codon doublets, are degenerate with respect to the Gibbs free energy of codon–anticodon interaction (Klump and Maeder, 1991). On the contrary, the four-fold degenerate amino acids have doublets with the following composition: G + C = 3/4, A + U = 1/4. Therefore, they form the more stable codon–anticodon pairs. Long ago, Goldberg and Wittes (1966) pointed out that triplets with a high GC content have greater pairing specificity and greater resistance to mutation. In their own words, ‘‘the correlations described suggest that protective mechanisms may act at two levels, (i) the nucleotide level, where a high GC content may reduce the rate of mutation errors in protein synthesis, and (ii) the organizational level where the effects of a base change are minimized by degeneracy and by connectedness of codons for similar amino acids. The best-protected amino acids would be those with maximum GC content and degeneracy’’. Six of the seven amino acids in the proposed primeval code are among the best-protected amino acids.

In the following stage, the distinction of the chemical type $C_3$ leads to the R/Y degeneracy of the third base. At this point there are already the 20 amino acids, {G(4), A(4), P(4), S(6), V(4), T(4), L(6), F(2), R(6), I(2), M(2), C(2), W(2), Q(2), H(2), K(2), N(2), D(2), E(2), Y(2), STOP(2)}, where the ‘extra’ codons of serine (S), leucine (L) and arginine (R) are assigned according to the universal code. However, it is only in the final period that a distinction in the H-bond character of the third base, $H_3$, becomes possible. This last step breaks the R/Y symmetry of the Trp (W), Ile (I) and Met (M) codons, leading to the present assignments, W(1), I(3), M(1) and STOP(3).

5.2. Deviations from the universal code

Schultz and Yarus (1996) have underlined that the codon reassignments found in various organelles, several species of ciliates and other organisms, are ‘very nonrandom’.
According to Table 1 of their paper, in 14 instances involving six codons to be reassigned, reassignment apparently proceeded by single nucleotide changes, that is, by one or two bit changes. As can be seen from Fig. 1, they are connected with features in the right hand side of the set of inequalities (Eq. 1). Therefore, they are associated with low values of the codon–anticodon interaction energy, and with the later stages in the evolution of the code. This is in agreement with the view that these are relatively ‘‘modern changes’’ (Andersson and Kurland, 1995; Schultz and Yarus, 1996). Some reassignments restore the local symmetry of a codon class, e.g. the assignment of the UGA STOP codon to tryptophan (W) restores the symmetry of the codon class UGR, which is broken in the universal code. According to the results obtained by Inagaki et al. (1998), the ancestral mitochondrion was bearing the universal genetic code and subsequently reassigned the codon UGA to Trp, independently, in various lineages. From the point of view developed in this contribution, this means a return to a less differentiated code. This interpretation agrees with a reductive mode of genomic evolution (Andersson and Kurland, 1995). On the contrary, the reassignment of AAA (K) to N breaks the symmetry of the codon class AAR, producing a 1-3 degeneracy. This case is completely equivalent to the case of AUR, which in the universal code is split into AUA (I) and AUG (M). The reassignment of AUA in mitochondria brings back the symmetry. These variations clearly show that changes in the ‘least significant feature’ of a codon do occur, without a major disruption of the cell functioning.

5.3. Relationship between codon classes and aminoacyl tRNA synthetases

Although no microscopic questions were treated in our model, I will recall some well-known facts about transfer-RNA molecules in order to have a proper perspective of its relation with the underlying microscopic picture. In extant organisms, the physical carriers of the genetic code are, of course, tRNAs (adaptor molecules). They may be considered as comprising two informational domains, the acceptor-TΨC minihelix encoding the operational mapping for amino acids (Schimmel et al., 1993) and the anticodon-containing domain with the three nucleotides of the codon–anticodon mapping. A model to relate the evolution of these two parts of the genetic code, assuming a tRNA ancestor, has been proposed by Rodin et al. (1996). The connection between amino acids and specific triplets is accomplished through the aminoacylation reactions catalyzed by aminoacyl tRNA synthetases (see, e.g., Voet and Voet, 1995). These enzymes select both an amino acid and a tRNA. As is well known, these enzymes come in two groups, called Class I and Class II aaRSs (Eriani et al., 1990). To each class correspond exactly ten amino acids.

The structure of the hypercube, which depends on the R/Y and W/S categorizations of the bases, was derived from the codon–anticodon mapping. Therefore, I assume that the only unused feature of the bases, the M/K categorization, was used to distinguish the two classes of aminoacyl synthetases. Since the most important base of a codon (anticodon) is the second base, it was natural to assume that this base was also responsible for the differentiation of the synthetases in the early stages of the development of the code. This assumption is supported by two independent observations:

1. The result shown in Fig. 3, of a remarkable correlation between the amino acids coded by NMN codons and the amino acids recognized by Class II aaRSs and, correspondingly, between those coded by NKN codons and the amino acids recognized by the synthetases of Class I. Wentzel (1995) already pointed out a correlation between the middle base of a codon and aaRS class, but he did not notice the relation to the M/K categorization of the bases.
2. The implications of the tRNA ‘‘gene recruitment’’ model (Saks et al., 1997; see footnote 4) for the evolution of the code. According to this model, a single anticodon mutation may affect both mappings realized by the genetic code.

Fig. 3. ‘Alignment’ of amino acids recognized by class I/II aminoacyl tRNA synthetases, with two amino acid groupings: (1) by codon type, with respect to the K/M categorization; (2) by end atom type (Davydov, 1998). S’ and R’ stand for the ‘extra’ serine and arginine codons, respectively.

An interesting example of the connection between the two mappings realized by tRNA adaptor molecules, when there exists an anticodon identifier, is E. coli tRNA^Ile. This tRNA, specific for the codon AUA, has the modified anticodon LAU, where L is lysidine. This is a modification of cytosine (C), whose 2-keto group is replaced by the amino acid lysine (K). The L in this context pairs with A rather than G, a unique case of base modification altering base pairing specificity. The replacement of this L with unmodified C yields a tRNA that recognizes the codon AUG (M). As noticed by Voet and Voet (1995), ‘‘Surprisingly, however, this altered tRNA^Ile is also a much better substrate for MetRS than it is for IleRS. Thus, both the codon and the amino acid specificity of this tRNA are changed by a single posttranscriptional modification.’’

4 Notwithstanding the existence of an operational mapping devoid of the anticodon nucleotides (Schimmel et al., 1993), in a recent paper Saks et al. (1997) wrote that: ‘‘as a consequence of concurrently changing tRNA identity and mRNA coupling capacity, a single anticodon mutation could potentially result in a tRNA that would be competent to correctly translate a new set of codons in all the essential endogenous mRNA. Thus, an anticodon mutation might recruit a tRNA gene from one isoaccepting group to another’’. Isoacceptors are different tRNAs that accept the same amino acid. Furthermore, they notice that ‘‘an amino acid identity corresponding to the anticodon, rather than aminoacylation efficiency, is likely to be the key prerequisite for the appearance of a tRNA variant in a population. Once such a tRNA appeared, its efficiency of aminoacylation could be improved by the combined forces of mutation and natural selection’’. They obtained a confirmation of the above ‘recruitment hypothesis’ in vivo, using E. coli as a model system.

In this paper I have emphasized the codon–anticodon mapping, i.e. the assignment of codon classes to amino acids. Recently, Davydov (1998) has proposed a set of rules to associate the amino acid end atoms (O/N and non-O/non-N) of 18 amino acids with codons containing weak bases (A/U). These rules correctly predict all the codons in which the third base is non-redundant, that is, all the codons with doublets $B_1 B_2$ in the set $M_2$ = {AU, UU, UA, AA, GA, CA, UG, AG}, plus AC, GU and CU. The amino acids that, according to Davydov’s rules, allow the correct association of codons are I, M, L, F, Y, K, N, D, E, Q, H, C, W, T, V, R’, S’. R’ and S’ correspond to the ‘extra’ codons of R and S, respectively. Davydov’s rules give incorrect results for the codons of R, S and A. Glycine and proline were not considered in the analysis performed by Davydov, for reasons explained in his article. We have found an inverse of Davydov’s rules:

1. Codons of the form NAN + WCN have amino acids with O/N end atoms.
2. Codons of the form NUN + WGN have amino acids with non-O/non-N end atoms, the exceptions being R and S.
3. Codons of the form SCN + SGN code for P, A, R, G. These amino acids either were left out by Davydov or did not obey his rules.
Thus, it is seen that the amino acids beyond Davydov’s rules do not form a random set. All of them have codons of the class SSN, which are the most stable codons (Klump, 1993). With the exception of arginine once more, the other three have no proper side-chain (see Davydov’s paper for further details). In Fig. 3 the amino acids recognized by the two classes of aminoacyl-tRNA synthetases are ‘aligned’ with two amino acid groupings defined by (i) end atom type and (ii) codon type. It is clear from this figure that the three ways of grouping amino acids are strongly correlated.
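The three codon patterns used in the inverse rules can be checked mechanically against the standard codon table. The sketch below is only an illustration with hypothetical function names: it shows that the patterns NAN + WCN, NUN + WGN and SCN + SGN partition the 64 codons, and it lists which amino acids fall under each pattern. It does not encode end-atom types, which would have to be taken from Davydov’s paper.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... ("*" = STOP).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

WEAK = "AU"  # W bases; the strong (S) bases are G and C


def rule_of(codon):
    """Return 1, 2 or 3 according to the pattern the codon matches."""
    first, second, _ = codon
    if second == "A" or (first in WEAK and second == "C"):
        return 1  # NAN + WCN
    if second == "U" or (first in WEAK and second == "G"):
        return 2  # NUN + WGN
    return 3      # SCN + SGN (first base strong, middle base C or G)


def codons_by_rule():
    """Group the 64 codons by the pattern they match."""
    groups = {1: [], 2: [], 3: []}
    for codon in CODE:
        groups[rule_of(codon)].append(codon)
    return groups


def amino_acids_by_rule():
    """Set of amino acids (and STOP) coded under each pattern."""
    return {rule: {CODE[c] for c in codons}
            for rule, codons in codons_by_rule().items()}


if __name__ == "__main__":
    for rule, acids in amino_acids_by_rule().items():
        print(rule, "".join(sorted(acids)))
```

Serine and arginine appear under more than one pattern because of their ‘extra’ AGN codons, which is why the text treats S’ and R’ separately.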

6. Discussion