
BioSystems 54 (1999) 47-64

Protein evolution drives the evolution of the genetic code and vice versa

Miguel A. Jiménez-Montaño a,b

a Innovationskolleg Theoretische Biologie, Humboldt-Universität zu Berlin, Invalidenstrasse 43, D-10115 Berlin, Germany
b Departamento de Física y Matemáticas, Universidad de las Américas Puebla, Sta. Catarina Mártir, 72820 Puebla, Mexico

Received 30 March 1999; received in revised form 9 July 1999; accepted 12 July 1999

Abstract

A model for the developmental pathway of the genetic code, grounded on group theory and the thermodynamics of codon-anticodon interaction, is presented. At variance with previous models, it takes into account not only the optimization with respect to amino acid attributes but also physicochemical constraints and initial conditions. A 'simple-first' rule is introduced after ranking the amino acids with respect to two current measures of chemical complexity. It is shown that a primeval code of only seven amino acids is enough to build functional proteins. It is assumed that these proteins drive the further expansion of the code. The proposed primeval code is compared with randomly generated surrogate codes and with another proposal for a primeval code found in the literature. The departures from the 'universal' code, observed in many organisms and cellular compartments, fit naturally in the proposed evolutionary scheme. A strong correlation is found between, on one side, the two classes of aminoacyl-tRNA synthetases and, on the other, the amino acids grouped by end-atom type and by codon type. An inverse of Davydov's rules, which associate the amino acid end atoms (O/N and non-O/non-N) of 18 amino acids with codons containing a weak base (A/U), is derived and extended to the 20 amino acids. © 1999 Elsevier Science Ireland Ltd. All rights reserved.
Keywords: Protein evolution; Developmental pathway; Primeval code; Codon reassignments; Codon-anticodon interaction; Aminoacyl-tRNA synthetases

www.elsevier.com/locate/biosystems

"It is not the number of available signals but rather their distinguishability that matters in communication" (Schumacher, 1991).

1. Introduction

The problem of the origin and evolution of protein synthesis constitutes one of the major transitions in evolution, and it is far from being solved at the present time. In the sensible and acute statement of Smith and Szathmáry (1995), "The origin of the code is perhaps the most perplexing problem in evolutionary biology. The existing translational machinery is at the same time so complex, so universal, and so essential that it is hard to see how it could have come into existence, or how life could have existed without it". Several authors have approached this problem by building possible scenarios in which the genetic code could have originated. See the book by Smith and Szathmáry (1995) and the fast-growing literature on the RNA world (Gesteland and Atkins, 1993). Most of these models are concerned with the catalytic properties of ribozymes. These are RNA molecules assumed to play the role of enzymes in a primordial self-replicating system from which, it is presumed, the modern translational machinery originated. Another approach is the search for ancestors of transfer RNA (Eigen et al., 1989; Rodin et al., 1996).

Tel.: +52-28-29-2676; fax: +52-28-29-2045. E-mail address: jimm@mail.udlap.mx (M.A. Jiménez-Montaño). 0303-2647/99/$ - see front matter. PII: S0303-2647(99)00058-1

The regularities of the codon catalogue were recognized from the very beginning (Sonneborn, 1965; Epstein, 1966; Goldberg and Wittes, 1966; Woese, 1967; Alff-Steinberg, 1969). However, the lack of a fully appropriate mathematical framework for their description has been the main obstacle to understanding the code's possible evolution (Karasev and Sorokin, 1997). The customary expression 'organization of the code' really refers to two different problems: (i) the distribution of redundancy in the code, i.e. the 'block structure' of synonymous codons and the positions of the three stop codons (Goldman, 1993), and (ii) the amino acid assignments. The answer to the first question is independent of the answer to the second one. It depends on the codon-anticodon interaction energy (Jiménez-Montaño, 1994; Jiménez-Montaño et al., 1995; see footnote 1).

The standard approach to these problems employs a three-dimensional sequence space of codons, equipped with a Hamming distance (the number of positions where the nucleotides differ in a codon pair). This is because, implicitly or explicitly, 'minimum change' is assumed to be synonymous with a single-nucleotide change (Goldberg and Wittes, 1966). Further, in a recent paper (Xia and Li, 1998) the authors say that "it has long been proposed that the genetic code might have been arranged in such a way as to reduce the effect of non-synonymous mutations involving single nucleotide changes". This approach is inexact for two reasons. First, not all single-nucleotide changes are equivalent: mutational transitions and transversions (complementary and non-complementary) are not equal. Second, the three positions in a codon have different thermodynamic stability and mutation frequency.

An appropriate mathematical framework to represent the code, as a six-dimensional Boolean hypercube, was first proposed by Jiménez-Montaño and De la Mora-Basañez (1992). This result was obtained by building on pioneering algebraic approaches by Danckwerts and Neubert (1975), Bertman and Jungck (1979) and Swannson (1984). The related early work of Rumer (1968) was unknown to the authors. More complete presentations of the formalism appeared in later publications (Jiménez-Montaño et al., 1995, 1996). Independently, Klump (1993) and Karasev and Sorokin (1997) made similar proposals, which illuminate different aspects of the problem.
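The difference between the codon-level Hamming distance and a distance on a six-dimensional Boolean hypercube can be illustrated with a minimal Python sketch. The two-bit labels used here (first bit for pyrimidine/purine, second bit for weak/strong base) are an illustrative assumption and are not necessarily the encoding used in the cited papers; the function names are likewise hypothetical.

```python
from itertools import product

BASES = "UCAG"

# Hypothetical two-bit labels (an assumption made for illustration only):
# first bit:  0 = pyrimidine (U, C), 1 = purine (A, G)
# second bit: 0 = weak base (A, U),  1 = strong base (C, G)
BITS = {"U": (0, 0), "C": (0, 1), "A": (1, 0), "G": (1, 1)}


def codon_hamming(c1, c2):
    """Number of codon positions at which the nucleotides differ."""
    return sum(a != b for a, b in zip(c1, c2))


def encode(codon):
    """Map a codon to a 6-bit tuple, i.e. a vertex of the Boolean hypercube."""
    return tuple(bit for base in codon for bit in BITS[base])


def hypercube_distance(c1, c2):
    """Hamming distance between the 6-bit encodings of two codons."""
    return sum(a != b for a, b in zip(encode(c1), encode(c2)))


def substitution_kind(b1, b2):
    """Classify a single-nucleotide change between two bases."""
    if b1 == b2:
        return "identity"
    (type1, strength1), (type2, strength2) = BITS[b1], BITS[b2]
    if type1 == type2:
        return "transition"  # U<->C or A<->G
    if strength1 == strength2:
        return "complementary transversion"  # A<->U or G<->C
    return "non-complementary transversion"  # A<->C or G<->U


if __name__ == "__main__":
    # All three pairs differ at exactly one codon position, yet they are
    # not equivalent once the kind of substitution is taken into account.
    for other in ("GGC", "GGA", "GGG"):
        print(
            "GGU ->", other,
            "| codon distance:", codon_hamming("GGU", other),
            "| hypercube distance:", hypercube_distance("GGU", other),
            "|", substitution_kind("U", other[2]),
        )
```

Under this assumed labelling, every single-nucleotide change has codon distance 1, but a non-complementary transversion changes two bits, so the hypercube separates changes that the three-dimensional codon space treats as identical.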
A comparison of the three geometrical representations of the code will be discussed in a forthcoming publication (Jiménez-Montaño and Klump, in preparation). In the present contribution I further develop this approach, grounded on thermodynamics and group theory, and propose a model for the evolution of the code.

Besides the precursor-product relationships between amino acids or nucleotides in biosynthetic pathways (Wong, 1975, 1976; Dillon, 1978; Taylor and Coates, 1989; Jiménez-Sánchez, 1995), the only evidence we have from the time the code originated consists of 'molecular fossils'. Among these is the structure of the 'universal' code itself (Woese, 1965, 1967; Taylor and Coates, 1989; Jiménez-Montaño, 1994), now supplemented with well-documented deviations (Jukes, 1990; Wolstenholme, 1992), which are far from random (Schultz and Yarus, 1996). Therefore, any theory to infer prior codes should be consistent with this extant evidence. As we shall see, our model satisfies this requirement. Moreover, by its very nature, our phenomenological approach has the advantage of circumventing the difficulties that have plagued the proposals on the origin of the genetic code (Cedegren and Miramontes, 1997).

As Eigen (1971) first pointed out, the origin of the code should have been the result of a highly non-linear selection process. Therefore, it was strongly dependent on initial conditions, but initial conditions cannot be derived from dynamics of any kind. It is an essential characteristic of the evolutionary process to involve a certain degree of contingency, at one or more points.

Footnote 1. It is important to note that in the present contribution we refer to the codon-anticodon interaction only in the strictest sense of the concept (Langerkvist, 1978). In a wider perspective, it is well known that other factors, such as the conformation of the whole tRNA molecule, are of great importance for the specificity of codon-anticodon recognition (Kurland et al., 1975).
However, in our model the initial conditions are not assumed on the basis of the abundance of pre-biotically synthesized amino acids, nor on precursor-product relations in the biosynthetic pathways of pyrimidines (Jiménez-Sánchez, 1995). This is so because I do not appeal to prebiotic scenarios. From the point of view developed here, the question of the initial conditions is not the question of which ingredients appeared first, but the question of which were the first amino acids to be incorporated into a primordial code. I assume, as others did before, that 'simple' amino acids were introduced first. To have a comparison criterion, the amino acids are ranked according to two current measures of chemical complexity: (i) the shortest description of structural formulas (Papentin, 1982), and (ii) the size/complexity score (Dufton, 1997); see Table 1.

The structure of the code suggests that it evolved following a minimum-change coding pathway. The development of an already-working system should happen by changes in its least significant features, without disturbing the major lines of the system (Swannson, 1984). Apart from this general assumption, the only additional assumption I make is that the codon-anticodon Gibbs free energy of interaction induces a partial order (see Eq. (1) below) in the set of codon classes. This partial order defines a 'time arrow', in the sense that specific classes (e.g. NRN, CGY, etc.) correspond to the various stages of the progressive differentiation of the code. These codon categories define amino acid groups: the amino acids belonging to a given category are the leaves from the node of that category in the developmental tree (Fig. 1). This approach gives a formal expression to the pioneering ideas of Woese (1965, 1973) and coworkers (Woese et al., 1966), who envisioned a gradual development from a 'simplest' code.
According to these authors, the first code was imprecise because the ancestors of transfer RNAs were only able to recognize classes of similar codons (an extreme form of wobble) and classes of similar amino acids. As aptly summarized by Haig and Hurst (1991), "in this view, the modern version of the code evolved through a gradual increase in the discrimination of tRNA for specific amino acids and specific codons within these ancestral sets".

Table 1
Ranking of amino acids in isolation and in side-chains of proteins, according to two measures of chemical complexity
Columns: Shortest description of amino acid (a); Size/complexity score of amino acid (b); Shortest description of side-chain (a)
Entries (as extracted, column alignment not recoverable): G c G c G c A c A c A c D c V c S P L – C S I S V C T N K M E K P K T D c L V N D c Q T E I M N Q L F E I R Q H Y R H C F R F H M Y Y W W W
a Papentin (1982).
b Dufton (1997).
c The assumed four amino acids that first entered the code.

Fig. 1. Proposed developmental pathway of the genetic code. (a) Starting partition into Y/R branches. (b) Pyrimidine branch. (c) Purine branch. The first column shows amino acid reassignments in non-standard codes. The second and third columns show reassignments from randomly generated codes. The numbers of codons are in brackets. (1) Amirnovin (1997); (2) Goldman (1993).

By comparing a primeval code, coding for only seven amino acids, with the five-letter alphabet employed by Riddle et al. (1997) to build a functional protein, it is shown that our proposed primeval code is enough to enable a primitive cell to produce functional proteins. These early proteins, in turn, drive the further evolution of the code. Similar simplified alphabets are obtained from randomly generated codes, under different optimization criteria, only after imposing the initial conditions assumed in the model. However, a different outcome is attained with the primeval code proposed by Jiménez-Sánchez (1995), which assumes radically different initial conditions.
It is shown that the developmental pathway of the genetic code is compatible with proposals about the co-evolution of the code and amino acid synthetic pathways (Wong, 1975; Dillon, 1978; Taylor and Coates, 1989), without invoking non-testable assumptions about the temporal appearance of nucleotides or amino acids. The observed departures from the 'universal' code, found in many organisms and cellular compartments (Jukes, 1990; Wolstenholme, 1992), fit naturally in the proposed evolutionary scheme. They can be explained in a way consistent with the "ambiguous intermediate" theory of Schultz and Yarus (1996). According to our model, not only does genomic evolution drive the evolution of the translation system (Andersson and Kurland, 1995), but the converse is also true: the evolution of the translation system drives genomic evolution. From this point of view, both the non-random nature of codon reassignments (they mainly occur between codons differing in one or two features) and the differences in codon usage between prokaryotes and eukaryotes (Klump and Maeder, 1991) reveal the different strategies followed by ancestral cells and extant organisms (see Discussion).

In conclusion, the proposed model provides a simple and coherent scheme for the development of a coding system. It differs from previous models in that it emphasizes the importance of physicochemical constraints and initial conditions in delimiting the possible developmental pathways.

2. The model