Directory UMM :Data Elmu:jurnal:B:Biosystems:Vol56.Issue3.2000:
A non-statistical approach to protein mutational variability
Jacek Leluk *
Institute of Biochemistry and Molecular Biology,Uni6ersity of Wrocl*aw,Tamka2,50-137Wrocl*aw,Poland
Received 16 August 1999; received in revised form 12 February 2000; accepted 17 February 2000
Abstract
The non-statistical, non-Markovian model of protein mutational variability is described. There are presented the essential features of the algorithm of genetic semihomology and some examples of its application. The results of genetic semihomology approach are compared with the results obtained by using statistical algorithms and matrices which are assumed in widely used programs such as ClustalW, FASTA, MultAlin and BLAST. The aim of the new algorithm elaboration is to improve the accuracy of the results of protein sequence comparison, avoid the wrong assumptions and misinterpretation of the results, and increase the amount of information available from such study. © 2000 Elsevier Science Ireland Ltd. All rights reserved.
Keywords:Protein homology; Mutations; Variability; Genetic semihomology
www.elsevier.com/locate/biosystems
1. Introduction
There are over 400 amino acid indices and 42 mutation matrices described so far (Tomii and Kanehisa, 1996). Actually they are based on much fewer algorithms, most of which are subsequent modifications of the original ones. Some of them dominate in their usefulness and popularity, like the mutation data matrices derived from Dayhoff matrix (Dayhoff and Eck, 1968; Dayhoff et al., 1979) and PAM and BLOSUM indices.
The algorithm of genetic semihomology (Leluk, 1998) differs from the others by its non-statistical approach and lack of scoring scale. Widely used statistical scoring matrices are very helpful for
comparative studies of similarity and variability among different protein families. However, the values of such scoring may bring limited and confusing information. They are also dependent on the database used. For that reason the same algorithm can lead to different scoring matrix contingently upon the type and number of protein sequences used as database (Tomii and Kanehisa, 1996). The statistical approach is useful and nec-essary at certain steps of algorithm development and available knowledge. Such approach, pro-vided with scoring matrices, enables elaborations on the regularities occurring in protein variability, amino acid replacements and relationship to their structural and physico – chemical character. How-ever, they cannot support with explanation about the mechanism of variability or conservativity. It is hard to use them to predict new possible
se-* Tel.: +48-71-3402611; fax:+48-71-3402608. E-mail address:[email protected] (J. Leluk)
0303-2647/00/$ - see front matter © 2000 Elsevier Science Ireland Ltd. All rights reserved. PII: S0303-2647(00)00074-5
(2)
Table 1
Multiple alignment of the gap containing fragments of seven chicken ovoinhibitor domains obtained with Markovian and non-Markovian methods
ClustalW MultAlin Genetic semihomology
Blosum62 Blosum62
VNCSLYASGIGKDGTSWVAC I
I VNCSLYASGIGKDGTSWVAC I VNCSLYASGIGKDGTSWVAC
II IDCSPYLQVV-RDGNTMVAC II IDCSPY-LQVVRDGNTMVAC II IDCSPYLQ-VVRDGNTMVAC
VDCSKYPSTVSKDGRTLVAC
III VDCSKYPSTVSKDGRTLVAC III VDCSKYPSTVSKDGRTLVAC III
IDCDQYPTRKTTGGKLLVRC IV
IV IDCDQYPTRKTTGGKLLVRC IV IDCDQYPTRKTTGGKLLVRC
V LDCTQYLSNT-QNGEAITAC V LDCTQYLSN-TQNGEAITAC V LDCTQYLS-NTQNGEAITAC
VI LDCSKYKTSTLKDGRQVVAC VI LDCSKYKTSTLKDGRQVVAC VI LDCSKYKTSTLKDGRQVVAC
EHCREFQKVS---PIC VII
VII - - - - EHCREFQKVSPIC VII EHCREFQK-VS----P-I-C
.* : * . dcs.y . . dg . . . vaC AA SS*sSSSSSSSSS*SSSS *
NA sSSs S S sSSssss*Sss
Dayhoff(PAM) Dayhoff(PAM)
I VNCSLYASGIGKDGTSWVAC I VNCSLYASGIGKDGTSWVAC
IDCSPY-LQVVRDGNTMVAC II
IDCSPYLQVVR-DGNTMVAC II
VDCSKYPSTVSKDGRTLVAC III
III VDCSKYPSTVSKDGRTLVAC
IDCDQYPTRKTTGGKLLVRC IV
IV IDCDQYPTRKTTGGKLLVRC
V LDCTQYLSNTQ-NGEAITAC V LDCTQYLSNT-QNGEAITAC
LDCSKYKTSTLKDGRQVVAC VI
VI LDCSKYKTSTLKDGRQVVAC
EHCREF---QKVSPIC VII
VII EHCREFQKVSP---IC
.* : * .. C . . % . . . g . . . aC
Identity(unitary matrix) Identity(unitary matrix)
I VNCSLYASGIGKDGTSWVAC I VNCSLYASGIGKDGTSWVAC
II IDCSPYLQVVR-DGNTMVAC II IDCSPYLQVVR-DGNTMVAC
III VDCSKYPSTVSKDGRTLVAC III VDCSKYPSTVSKDGRTLVAC
IV
IDCDQYPTRKTTGGKLLVRC IDCDQYPTRKTTGGKLLVRC
IV
V LDCTQYLSNTQ-NGEAITAC V LDCTQYLSN-TQNGEAITAC
VI LDCSKYKTSTLKDGRQVVAC VI LDCSKYKTSTLKDGRQVVAC
EHCREFQ---KVSPIC VII
---EHCREFQKVSPIC VII
.dC..y . . . g . . . vaC *
quences in a certain protein family, even if the probability of amino acid replacement is well described.
The aim of the new algorithm elaboration is to overcome the basic disadvantages of protein se-quence analysis tools and to exclude some basic errors in the assumptions of the existing statistical methods. It is to be able to explain the mechanism and pathway of protein evolution and differentia-tion, not only limited to the description of the initial and final step of the observed changes. The algorithm of genetic semihomology departs from assuming that amino acid replacement is concor-dant with the Markovian model and explains why this model cannot be applied for these changes (Leluk, 1998). Another goal of this algorithm is to
make it applicable to any group of proteins of any nature, function and location. It can be achieved for two reasons, (1) minimization of basic as-sumptions limited to the general amino acid/
codon translation table and assuming that single point mutation is a principle, most common, mechanism of protein variability; (2) non-statisti-cal approach (a statistinon-statisti-cal mutation matrix is re-placed by three dimensional diagram representing all theoretically possible amino acid replace-ments). The approach presented in this article is to allow to study the multidomain structure of the proteins (at the primary structure level), predic-tion of the genetic code of the proteins with unknown gene sequence (useful for genetic probe construction), explaining the mutation mechanism
(3)
that took place at a certain fragment, precise location of insertion/deletion sites at non-ho-mologous fragments, confirmation that the low homology of compared proteins is actual or casual, and prediction of the future or not reported muta-tions for the related proteins.
2. Materials and methods
The amino acid and nucleotide sequences se-lected for the purpose of this study concerned proteinase inhibitors from squash seeds (Joubert, 1984; Wieczorek et al., 1985; Otlewski et al., 1987; Heitz et al., 1989; Hakateyama et al., 1991; Chen et al., 1992a,b; Matsuo et al., 1992; Nishino et al., 1992; Ling et al., 1993; Hayashi et al., 1994; Lee and Lin, 1995; Hamato et al., 1995; Haldar et al., 1996; Stachowiak et al., 1996), the Bowman – Birk inhibitors (Odani and Ikenaka, 1978; Chen et al., 1992a,b; Baek et al., 1994; McGurl et al., 1995; sequences obtained from NCBI database), 31 se-quences from the trypsin family (sese-quences ob-tained from SWISS-PROT (Bairoch and Apweiler, 1997; Bairoch and Apweiler, 1999; Apweiler et al., 1997; Bairoch, 1997), and PROSITE (Hofmann et al., 1999; Bucher and Bairoch, 1994) databases), chicken ovoinhibitor (Scott et al., 1987) and chicken ovomucoid (Catterall et al., 1980; Ger-linger et al., 1982; Lai et al., 1982). The Markovian matrices of amino acid replacement were applied by using programs ClustalW (Thompson et al., 1994), MultAlin (Corpet, 1988) and BLAST (Altschul et al., 1990; Gish and States, 1993). The algorithm of genetic semihomology was applied manually or with the program SEMIHOM (Leluk, 1998) (Table 1).
The dot matrix interpretation of the results was achieved with BLAST 2 SEQUENCES (Tatusova and Madden, 1999) and SEMIHOM (Leluk, 1998).
3. Results and discussion
3.1. General characteristics of non-statistical al
-gorithm of genetic semihomology
The important feature of the algorithm of
ge-netic semihomology concerns the minimal initial assumptions. It assumes the general codon-to-amino acid translation table, and the single point mutation as the basic and most common mecha-nism of protein differentiation (Fig. 1) (Leluk, 1998). This approach is not limited to single point mutation mechanism only. It may be used to localize those fragments, which undergo other kind of mutation.
The scoring scale can be assumed for genetic semihomology algorithm, but only for rough comparison studies. The identity of compared po-sitions has score of 3, the semihomology of type I (residues encoded by triplets which differ by one nucleotide of the same type, e.g. both are purines) is scored as 2, and semihomology of type II (residues encoded by triplets which differ by one nucleotide of different type) as 1. In the other words, type I is represented by any single transi-tion and type II by any single transversion in the nucleotide triplet. The non-semihomologous pairs are scored as 0.
The sense of this algorithm is not the scoring the comparison results but explaining the way of amino acid replacement and the mechanism of variability at both, amino acid and genetic code, levels. For that reasons the algorithm of genetic semihomology should be treated separately from other algorithms, especially statistical ones. This interpretation is often consistent with the amino acid replacement indices data (Fig. 2). Theoretical results obtained by algorithm of genetic semiho-mology and mutation data matrix scoring support each other. This coincidence is very important for correct interpretation of the mechanism of protein variability. The used mutation data matrix scoring is supported by PAM250 indices. The PAM250 values for each amino acid replacement were added. The result of such calculation is consistent with the result of theoretical genetic semihomol-ogy analysis. The higher score of the PAM in-dices’ sum is convergent with the genetic relationship of the studied set of amino acids at individual positions.
The goal of minimization of the initial assump-tions in the algorithm of genetic semihomology is to (1) make the algorithm universal for any group
(4)
of proteins and (2) avoid errors and misinterpreta-tion of the results. Basically, the algorithm as-sumes three general establishments. One of them is the codon-amino acid translation table, which is undoubtedly considered as true.
The second is the statement that single nucle-otide replacement is the most often and easiest mechanism of mutational variability. The last general establishment comprises the three-dimen-sional diagram of amino-acid replacements by single point mutation (Fig. 1).
A single nucleotide replacement concerns two types of replacements-within the same group (purine to purine or pyrimidine to pyrimidine) and between two groups (purine to pyrimidine or conversely). Thus, the first case is described here
as type I (transition) and the second as type II (transversion). It seems, that the replacement of type I is spatially more possible than the other. However, there are no distinct reports that such dependence exists. Another question is the posi-tion-dependence of replacement within the codon and its further consequence. The third nucleotide of the codon is known as the most tolerant and most flexible encoding unit. Most amino acids are encoded by first two nucleotides while the third position is optional. Therefore, in algorithm of genetic semihomology mutation at this position is considered separately and is defined as type III, which is independent of type I and II.
Amino acid residues as the products of genetic code translation are not equal units, when we consider both nucleotide and protein levels simul-taneously. They are encoded by different number of codons from 1 to 6. That implies the difference in probability of their replacement by single point mutation. Such differentiation causes, that amino acids themselves cannot be treated uniformly in terms of mutational variability. It is important which exactly codon encodes a certain amino acid for its further transformation to another one. For example arginine encoded by AGG can be re-placed by methionine (ATG) much easier than arginine encoded by CGY (Fig. 1). On the other hand, the probability of ArgCGY replacement by
Cys (TGY) is higher than it is for ArgAGG. This
leads to conclusion that the information of the amino acid residue evolutionary preceding the residue at a certain position is still present in its genetic code if the mechanism of replacement was based on point mutation. The probability of fur-ther replacement is also dependent on its codon. That reveals inconsistency of amino acid muta-tional substitution with the Markovian model which establishes all exchange probabilities as in-dependent on the previous history (George et al., 1990). Markovian model however is assumed in most statistical matrices of protein mutational variability. This could be true only if each amino acid was encoded by one codon.
The algorithms assuming statistical mutation data matrix and Markovian model omit the pro-cess of cryptic mutations. These are the muta-tions, which considerably change the codon
Fig. 1. Semihomologous relationships among proteinaceous amino acids. The codons of residues along each axis differ by only one nucleotide (Leluk, 1998). The diagram setting shows the codon changes at first (axis 1), second (axis 2) or third (axis 3) position in order AGCU. This diagram forms
the basis of the genetic semihomology algorithm instead of a matrix of statistical parameters or replacement indices, which support statistical algorithms.
(5)
Fig. 2. The semihomologous correlation between amino acids occurring at selected non-homologous positions of inhibitors from squash seeds. The solid lines indicate the transition type of semihomology; the dashed lines refer to transversion type of semihomology. The values in brackets indicate the additive Dayhoff mutation data matrix (MDM) score calculated for each set. Note the consistency between semihomologous relationship and MDM scores.
sequence without changing the amino acid. Such phenomenon refers especially to residues en-coded by up to six codons (Arg, Leu and Ser). Cryptic mutations enable code differentiation not restricted by conditions of environmental se-lection. They enlarge the spectrum of further possible replacements. Generally, most muta-tions of the third nucleotide of the codon are cryptic. For arginine, leucine and serine cryptic mutations at first or second positions are also allowed. Thus, they can be considered as inter-mediates between other two residues, exchange of which require more mutations. These three amino acids often occur at the positions, which are very variable and are occupied by six or more residues within homologous proteins (Figs. 3 and 4).
It is admitted that from evolutionary point of view the genetic code and amino acid ‘language’ have evolved simultaneously. They also act with strict coherence with each other. Therefore, in analysis of protein differentiation and variabil-ity, both of them should be considered as well. The approach of Siemion and Stefanowicz (1992) presents the periodical relation between chemical reactivity of amino acids and the pri-mary structure of their genetic code. It also refers to one point mutation mechanism which is the assumption for amino acid reactivity properties and Chou – Fasman conformational parameters relationship (Siemion, 1994, 1995). Although, those elaborations concern the reac-tivity and secondary structure parameters, the theoretical strategy of approach is consistent
(6)
with non-statistical approach of genetic semiho-mology.
3.2. Examples of application of genetic semihomology algorithm
3.2.1. Gap location
The commonly used alignment programs such as ClustalW, MultAlin or FASTA based on PAM and BLOSUM matrices generate gaps and locate them in the manner that aligns remaining frag-ments with each other. The accuracy of such strategy is good for the fragments of high homol-ogy. With the lower similarity degree of aligned sequences the accuracy also decreases. The strat-egy of these tools tends to combine the deleted positions into continuous fragments, which are set at most privileged site for entire alignment. The results are often inconsistent when provided with different comparison matrices and even when the same matrix is applied in different programs. An example of such discordant results are obtained when ClustalW and MultAlin are used for the alignment of chicken ovoinhibitor domains (Scott
Fig. 4. Frequency of six-codon amino acids as a function of position variability in homologous sequences. Results obtained from aligned trypsins (219 positions occupied by 1150 residues). Note that the serine occurrence at variable positions is more significant than arginine and especially leucine. See the text for the source and calculation method (Section 3.2.6).
Fig. 3. A simplified planar diagram of genetic relationship between the amino acids. The significance of the third position of the codon is omitted. The dashed curves show the cryptic mutational passages that do not cause change at the amino acid level.
et al., 1987) (Fig. 5). The gap-containing fragment was thoroughly analyzed for genetic semihomol-ogy relationship at both levels — genetic and amino acid. The obtained results suggest discon-tinuous alignment of deletions with genetically reasonable location of non-homologous residues (Fig. 5). Their location seems to be proper with respect to the probability of one point mutation exchange between aligned residues.
3.2.2. Alignment of non-homologous fragments
For statistical matrices if aligned fragments of related proteins are not homologous, the observed probability of amino acid replacement is taken under consideration. This strategy works properly for protein families for which the mechanism of differentiation is similar. It concerns their general physico – chemical character, reactivity and bio-logical function. But for very different proteins the matrices of replacement likeliness may have
(7)
not corresponding values. Genetic semihomology shows only the mutational probability of replace-ment, which is independent on the character of a protein and its biological function. It aligns the positions that are most similar with respect to their known or predicted genetic code. Taking the semihomologous positions into account completes the correct alignment initially obtained for identi-cal positions.
3.2.3. Low, but significant homologies
The multiple alignment of evolutionary distant sequences is sometimes confusing and uncertain. It is hard to estimate if such sequences show actual relationship, or their similarity is casual.
The problem is additionally increased by lack of unified methodology for actual homology estima-tion. The program SEMIHOM equipped with genetic semihomology analysis tool applicable also for proteins of unknown gene sequence is very useful for estimation whether the obtained low homology is actual or not. The double frame analysis (Leluk, 1998) allows to get rid of casual noise resulting from accidental single matches at different settings without erasing the essential ones (Fig. 5).
3.2.4. Internal homology of multidomain sequences
The similarity searching tools available in most protein databases and related servers are designed for fast database scanning for related sequences to the query sequence. The given results concern the optimal alignment and statistical parameters of comparison effect (FASTA, BLAST, ClustalW, MultAlin). The BLAST servers additionally provide with dot matrix interpretation of the re-sults. Although, these tools are powerful for all types of proteins (and nucleic acids), they do not analyze the internal homology of multidomain sequences properly (Fig. 6). The domain align-ment and analysis can be obtained by using pro-gram SEMIHOM with good dot matrix results, which can be additionally improved by semiho-mology analysis (Fig. 6). Even very distant do-mains (low similarity), which are still related to each other are visualized with this program. The results are achieved by comparison of a protein sequence with itself.
3.2.5. Location of fragments which are products of other mechanism than single point mutation
The single point mutation is considered as the most frequent and easiest mechanism of protein differentiation. However, there are other mecha-nisms such as deletion, insertion, duplication, in-version etc. The use of genetic semihomology algorithm for sequence multiple alignment and further analysis makes possible to identify the fragments which are the consequence of other type of mutation than single transition/ transver-sion. The identification consists in finding the fragments where aligned residues are not
semiho-Fig. 5. Sequence comparison of the fragments of two Bow-man – Birk inhibitors. Semihomology application and casual noise reduction for dot matrix results of low homology se-quence comparison. The double frame setting shows ho-mologous and semihoho-mologous positions only for at least four identities along 12-unit string.
(8)
Fig. 6. Internal domain homology of chicken ovoinhibitor (7 domains) and ovomucoid (3 domains) with BLAST 2 SEQUENCES and SEMIHOM.
Fig. 7. Application of genetic semihomology algorithm for identification the fragments of possible different mechanism than single point mutation. The listing shows the amino acid variability at corresponding positions within 7 domains of chicken ovoinhibitor. The non-semihomologous residues (differing by more than one nucleotide in their codon) are separated by square brackets. The accumulation of these positions (in shadows) suggests different mechanism of differentiation from single transition/transversion.
mologous. If there are such positions occurring consecutively one after another it might indicate that the mechanism of differentiation is different from single point mutation (Fig. 7).
3.2.6. Cryptic mutations
There are mutations that change composition of the gene without effect on amino acid se-quence. Such cryptic mutations concern the
(9)
codons for amino acids that can be encoded by two or more triplets. These mutations seemingly do not have any effect at the protein level. How-ever, they are important mechanism to the very variable sites in the manner that they increase the degree of variability available by single point mu-tation. For example a SerTCAcan be converted by
single point mutation to Leu, Pro, Ala and Thr (Fig. 1). A cryptic mutation of SerTCAto SerAGT
extends the spectrum of its conversion to Asn, Ile, Arg, Gly and Cys by single step mutation. The role of cryptic mutations is especially important for amino acids encoded by up to six codons (Ser, Arg and Leu). Among them serine is particularly interesting since its codons are the most differenti-ated (the genetic code matrix scores for them vary from 0 to +2). In fact serine participation in the most variable positions is outstanding as it is shown on the example of trypsins (Fig. 4). The sequences of 31 trypsins from different organisms were aligned. There were compared 219 positions (out of 247 corresponding to anionic precursor of bovine trypsinogen, SWISS-PROT accession number Q29463) occupied by 1150 residues. The study did not include the positions which were not represented in more than 70% of compared se-quences (positions that were the result of inser-tion/deletion). The percentage of occurrence of
the individual residues at a variable position was calculated in the following way:
pj=
nj
j×100
where p is the percentage of occurrence of an individual residue in positions occupied by j dif-ferent residues; nj the number of positions where an individual residue appears among j occurring residues; and jis the number of positions where j
different residues occur.
The planar diagram of genetic semihomology relationship between amino acids, reduced to the first two positions of the codon, shows the cryptic passages for six-codon amino acids (Fig. 3).
3.2.7. Genetic code prediction
Originally the algorithm of genetic semihomol-ogy has been designed for studies on proteins whose gene sequence is unknown. Similarly to PAM or BLOSUM methods the nucleotide se-quence of the gene is not required (although, it is very helpful if we know it). One of the possible options of the genetic semihomology analysis is the prediction of possible codons for particular amino acids aligned. The method consists in com-parison of all possible codons of the amino acids occurring at the corresponding position of the
Fig. 8. Genetic semihomology prediction of genetic code for chicken ovoinhibitor domains. The predicted gene sequence is compared with the known DNA sequence. Only those positions where prediction reduces possible codons are considered. The predicted codons that are consistent with those present in the ovoinhibitor gene are shadowed.
(10)
aligned sequences and choosing the codons which are most related to each other, as the most likely. Usually they can be converted to each other by single transition/transversion process. The predic-tion method gives promising results (Fig. 8). This option may be used to design sequence fragments for the unknown genes of the proteins. It may be helpful in searching the genes by using hybridiza-tion methods.
The simplified version of the program SEMI-HOM with dot matrix graphic interpretation for comparative and genetic semihomology studies of protein sequences is available to anyone upon request to the author ([email protected]).
References
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J., 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403 – 410.
Baek, J.M., Song, J.C., Choi, Y.D., Kim, S.I., 1994. Nucle-otide sequence homology of cDNAs encoding soybean Bowman – Birk type proteinase inhibitor and its isoin-hibitors. Biosci. Biotechnol. Biochem. 58, 843 – 846. Catterall, J.F., Stein, J.P., Kristo, P., Means, A.R., O’Malley,
B.W., 1980. Primary sequence of ovomucoid messenger RNA as determined from cloned complementary DNA. J. Cell Biol. 87, 480 – 487.
Chen, X., Qian, Y., Chi, C., Gan, K., Zhang, M., Chang-Qing, C., 1992a. Chemical synthesis, molecular cloning, and expression of the gene coding for the Trichosanthes trypsin inhibitor — a squash family inhibitor. J. Biochem. 112, 45 – 51.
Chen, P., Rose, J., Love, R., Wei, C.H., Wang, B.C., 1992b. Reactive sites of an anticarcinogenic Bowman – Birk proteinase inhibitor are similar to other trypsin inhibitors. J. Biol. Chem. 267, 1990 – 1994.
Corpet, F., 1988. Multiple sequence alignment with hierarchi-cal clustering. Nucleic Acids Res. 16, 10881 – 10890. Dayhoff, M.O., Eck, R.V., 1968. In: Dayhoff, M.O., Eck,
R.V. (Eds.), Atlas of Protein Sequence and Structure, vol. 3. National Biomedical Research Foundation, Silver Spring, MD, p. 33.
Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C., 1979. In: Day-hoff, M.O. (Ed.), Atlas of Protein Sequence and Structure, vol. 5. National Biomedical Research Foundation, Wash-ington, DC, p. 345.
George, D.G., Barker, W.C., Hunt, L.T., 1990. Mutation data matrix and its uses. In: Doolittle, R.F. (Ed.), Methods in Enzymology, vol. 183. Academic Press, San Diego, CA, p. 337.
Gerlinger, P., Krust, A., Lemeur, M., Perrin, F., Cochet, M., Gannon, F., Dupret, D., Chambon, P., 1982. Multiple
initiation and polyadenylation sites for the chicken ovomu-coid transcription unit. J. Mol. Biol. 162, 345 – 364. Gish, W., States, D.J., 1993. Identification of protein coding
regions by database similarity search. Nat. Genet. 3, 266 – 272.
Hakateyama, T., Hiraoka, M., Funatsu, G., 1991. Amino acid sequence of the two smallest trypsin inhibitors from sponge gourd seeds. Agric. Biol. Chem. 10, 2641 – 2642.
Haldar, U.C., Saha, S.K., Beavis, R.C., Sinha, N.K., 1996. Trypsin inhibitors from ridged gourd (Luffa acutangula Linn.) seeds: purification, properties, and amino acid se-quences. J. Protein Chem. 15, 177 – 184.
Hamato, N., Koshiba, T., Pham, T., Tatsumi, Y., Nakamura, D., Takano, R., Hayashi, K., Hong, Y., Hara, S., 1995. Trypsin and elastase inhibitors from bitter gourd (Mo -mordica charantia Linn.) seeds: purification, amino acid sequences, and inhibitory activities of four new inhibitors. J. Biochem. 117, 432 – 437.
Hayashi, K., Takehisa, T., Hamato, N., Takano, R., Hara, S., Miyata, T., Kato, H., 1994. Inhibition of serine proteinases of the blood coagulation system by squash family protease inhibitors. J. Biochem. 116, 1013 – 1018.
Heitz, A., Chiche, L., Le-Nguyen, D., Castro, B., 1989. 1H 2D NMR and distance geometry study of the folding ofEcbal -ium elater-ium trypsin inhibitor, a member of the squash inhibitor family. Biochemistry 28, 2392 – 2398.
Joubert, F.J., 1984. Trypsin inhibitors fromMomordica repens seeds. Phytochemistry 23, 1401 – 1406.
Lai, E.C., Roop, D.R., Tsai, M.-J., Woo, S.L.C., O’Malley, B.W., 1982. Heterogeneous initiation regions for transcrip-tion of the chicken ovomucoid gene. Nucleic Acids Res. 10, 5553 – 5567.
Lee, C., Lin, J., 1995. Amino acid sequences of trypsin in-hibitors from the melonCucumis melo. J. Biochem. 118, 18 – 22.
Leluk, J., 1998. A new algorithm for analysis of the homology in protein primary structure. Comput. Chem. 22, 123 – 131. Ling, M.-H., Qi, H., Chi, C., 1993. Protein, cDNA and genomic DNA sequences of the towel gourd trypsin in-hibitor. J. Biol. Chem. 268, 810 – 814.
Matsuo, M., Hamato, N., Takano, R., Kamei-Hayashi, K., Yasuda-Kamatani, Y., Nomoto, K., Hara, S., 1992. Trypsin inhibitors from bottle gourd (Lagenaria leucantha Rusby var.Depressa makino) seeds. Purification and amino acid sequences. Biochim. Biophys. Acta 1120, 187 – 192. McGurl, B., Mukherjee, S., Kahn, M., Ryan, C.A., 1995.
Characterization of two proteinase inhibitor (ATI) cDNAs from alfalfa leaves (Medicago sati6a var. Vernema): the
expression of ATI genes in response to wounding and soil microorganisms. Plant Mol. Biol. 27, 995 – 1001.
Nishino, J., Takano, R., Kamei-Hayashi, K., Minakata, H., Nomoto, K., Hara, S., 1992. Amino-acid sequencers of trypsin inhibitors from oriental pickling melon (Cucumis meloL. var. Conomon makino) seeds. Biosci. Biotechnol. Biochem. 56, 1241 – 1246.
(11)
Odani, S., Ikenaka, T., 1978. Studies on soybean trypsin inhibitors, XII. Linear sequences of two soybean double-headed trypsin inhibitors, D-II and E-I. J. Biochem. 83, 737 – 745.
Otlewski, J., Whatley, H., Polanowski, A., Wilusz, T., 1987. Amino-acid sequences of trypsin inhibitors from watermelon (Citrullus 6ulgaris) and red bryony (Bryonia dioica). Biol.
Chem. Hoppe-Seyler 368, 1505 – 1507.
Scott, M.J., Huckaby, C.S., Kato, I., Kohr, W., Laskowski, M. Jr., Tsai, M.-J., O’Malley, B.W., 1987. Ovoinhibitor introns specify functional domains as in the related and linked ovomucoid gene. J. Biol. Chem. 262, 5899 – 5907. Siemion, I.Z., 1994. Compositional frequencies of amino acids
in the proteins and the genetic code. Biosystems 32, 163 – 170. Siemion, I.Z., 1995. The regularities of the changes of amino acid physico – chemical properties within the genetic code. Amino Acids 8, 1 – 13.
Siemion, I.Z., Stefanowicz, P., 1992. Periodical changes of amino acid reactivity within genetic code. Biosystems 27, 77 – 84.
Stachowiak, D., Polanowski, A., Bieniarz, G., Wilusz, T., 1996. Isolation and amino-acid sequence of two inhibitors of serine
proteinases, members of the squash inhibitor family, from Echinocystis lobataseeds. Acta Biochim. Polon. 43, 507 – 514.
Tatusova, T.A., Madden, T.A., 1999. BLAST 2 SEQUENCES — a new tool for comparing protein and nucleotide se-quences. FEMS Microbiol. Lett. 174, 247 – 250.
Thompson, J.D., Higgins, D.G., Gibson, T.J., 1994. CLUSTAL W: improving the sensitivity of progressive multiple se-quence alignment through sese-quence weighting, position-spe-cific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673 – 4680.
Tomii, K., Kanehisa, M., 1996. Analysis of amino acid indices and mutation matrices for sequence comparison and struc-ture prediction of proteins. Protein Eng. 9, 27 – 36. Wieczorek, M., Otlewski, J., Cook, J., Parks, K., Leluk, J.,
Wilimowska-Pelc, A., Polanowski, A., Wilusz, T., Laskowski, M. Jr., 1985. The squash inhibitor family of serine proteinase inhibitors. Amino acid sequences and association equilibrium constants of inhibitors from squash, summer squash, zucchini, and cucumber seeds. Biochem. Biophys. Res. Commun. 126, 646 – 652.
(1)
with non-statistical approach of genetic
semiho-mology.
3
.
2
.
Examples of application of genetic
semihomology algorithm
3
.
2
.
1
.
Gap location
The commonly used alignment programs such
as ClustalW, MultAlin or FASTA based on PAM
and BLOSUM matrices generate gaps and locate
them in the manner that aligns remaining
frag-ments with each other. The accuracy of such
strategy is good for the fragments of high
homol-ogy. With the lower similarity degree of aligned
sequences the accuracy also decreases. The
strat-egy of these tools tends to combine the deleted
positions into continuous fragments, which are set
at most privileged site for entire alignment. The
results are often inconsistent when provided with
different comparison matrices and even when the
same matrix is applied in different programs. An
example of such discordant results are obtained
when ClustalW and MultAlin are used for the
alignment of chicken ovoinhibitor domains (Scott
Fig. 4. Frequency of six-codon amino acids as a function of position variability in homologous sequences. Results obtained from aligned trypsins (219 positions occupied by 1150 residues). Note that the serine occurrence at variable positions is more significant than arginine and especially leucine. See the text for the source and calculation method (Section 3.2.6).
Fig. 3. A simplified planar diagram of genetic relationship between the amino acids. The significance of the third position of the codon is omitted. The dashed curves show the cryptic mutational passages that do not cause change at the amino acid level.
et al., 1987) (Fig. 5). The gap-containing fragment
was thoroughly analyzed for genetic
semihomol-ogy relationship at both levels — genetic and
amino acid. The obtained results suggest
discon-tinuous alignment of deletions with genetically
reasonable location of non-homologous residues
(Fig. 5). Their location seems to be proper with
respect to the probability of one point mutation
exchange between aligned residues.
3
.
2
.
2
.
Alignment of non
-
homologous fragments
For statistical matrices if aligned fragments of
related proteins are not homologous, the observed
probability of amino acid replacement is taken
under consideration. This strategy works properly
for protein families for which the mechanism of
differentiation is similar. It concerns their general
physico – chemical character, reactivity and
bio-logical function. But for very different proteins
the matrices of replacement likeliness may have
(2)
their known or predicted genetic code. Taking the
semihomologous positions into account completes
the correct alignment initially obtained for
identi-cal positions.
3
.
2
.
3
.
Low
,
but significant homologies
The multiple alignment of evolutionary distant
sequences is sometimes confusing and uncertain.
It is hard to estimate if such sequences show
actual relationship, or their similarity is casual.
very useful for estimation whether the obtained
low homology is actual or not. The double frame
analysis (Leluk, 1998) allows to get rid of casual
noise resulting from accidental single matches at
different settings without erasing the essential
ones (Fig. 5).
3
.
2
.
4
.
Internal homology of multidomain
sequences
The similarity searching tools available in most
protein databases and related servers are designed
for fast database scanning for related sequences to
the query sequence. The given results concern the
optimal alignment and statistical parameters of
comparison effect (FASTA, BLAST, ClustalW,
MultAlin).
The
BLAST
servers
additionally
provide with dot matrix interpretation of the
re-sults. Although, these tools are powerful for all
types of proteins (and nucleic acids), they do not
analyze the internal homology of multidomain
sequences properly (Fig. 6). The domain
align-ment and analysis can be obtained by using
pro-gram SEMIHOM with good dot matrix results,
which can be additionally improved by
semiho-mology analysis (Fig. 6). Even very distant
do-mains (low similarity), which are still related to
each other are visualized with this program. The
results are achieved by comparison of a protein
sequence with itself.
3
.
2
.
5
.
Location of fragments which are products
of other mechanism than single point mutation
The single point mutation is considered as the
most frequent and easiest mechanism of protein
differentiation. However, there are other
mecha-nisms such as deletion, insertion, duplication,
in-version etc. The use of genetic semihomology
algorithm for sequence multiple alignment and
further analysis makes possible to identify the
fragments which are the consequence of other
type of mutation than single transition
/
transver-sion. The identification consists in finding the
fragments where aligned residues are not
semiho-Fig. 5. Sequence comparison of the fragments of twoBow-man – Birk inhibitors. Semihomology application and casual noise reduction for dot matrix results of low homology se-quence comparison. The double frame setting shows ho-mologous and semihoho-mologous positions only for at least four identities along 12-unit string.
(3)
Fig. 6. Internal domain homology of chicken ovoinhibitor (7 domains) and ovomucoid (3 domains) with BLAST 2 SEQUENCES and SEMIHOM.
Fig. 7. Application of genetic semihomology algorithm for identification the fragments of possible different mechanism than single point mutation. The listing shows the amino acid variability at corresponding positions within 7 domains of chicken ovoinhibitor. The non-semihomologous residues (differing by more than one nucleotide in their codon) are separated by square brackets. The accumulation of these positions (in shadows) suggests different mechanism of differentiation from single transition/transversion.
mologous. If there are such positions occurring
consecutively one after another it might indicate
that the mechanism of differentiation is different
from single point mutation (Fig. 7).
3
.
2
.
6
.
Cryptic mutations
There are mutations that change composition
of the gene without effect on amino acid
se-quence. Such cryptic mutations concern the
(4)
degree of variability available by single point
mu-tation. For example a Ser
TCAcan be converted by
single point mutation to Leu, Pro, Ala and Thr
(Fig. 1). A cryptic mutation of Ser
TCAto Ser
AGTextends the spectrum of its conversion to Asn, Ile,
Arg, Gly and Cys by single step mutation. The
role of cryptic mutations is especially important
for amino acids encoded by up to six codons (Ser,
Arg and Leu). Among them serine is particularly
interesting since its codons are the most
differenti-ated (the genetic code matrix scores for them vary
from 0 to
+
2). In fact serine participation in the
most variable positions is outstanding as it is
shown on the example of trypsins (Fig. 4). The
sequences of 31 trypsins from different organisms
were aligned. There were compared 219 positions
(out of 247 corresponding to anionic precursor of
bovine
trypsinogen,
SWISS-PROT
accession
number Q29463) occupied by 1150 residues. The
study did not include the positions which were not
represented in more than 70% of compared
se-quences (positions that were the result of
inser-tion
/
deletion). The percentage of occurrence of
where
p
is the percentage of occurrence of an
individual residue in positions occupied by
j
dif-ferent residues;
n
jthe number of positions where
an individual residue appears among
j
occurring
residues; and
j
is the number of positions where
j
different residues occur.
The planar diagram of genetic semihomology
relationship between amino acids, reduced to the
first two positions of the codon, shows the cryptic
passages for six-codon amino acids (Fig. 3).
3
.
2
.
7
.
Genetic code prediction
Originally the algorithm of genetic
semihomol-ogy has been designed for studies on proteins
whose gene sequence is unknown. Similarly to
PAM or BLOSUM methods the nucleotide
se-quence of the gene is not required (although, it is
very helpful if we know it). One of the possible
options of the genetic semihomology analysis is
the prediction of possible codons for particular
amino acids aligned. The method consists in
com-parison of all possible codons of the amino acids
occurring at the corresponding position of the
Fig. 8. Genetic semihomology prediction of genetic code for chicken ovoinhibitor domains. The predicted gene sequence is compared with the known DNA sequence. Only those positions where prediction reduces possible codons are considered. The predicted codons that are consistent with those present in the ovoinhibitor gene are shadowed.
(5)
aligned sequences and choosing the codons which
are most related to each other, as the most likely.
Usually they can be converted to each other by
single transition
/
transversion process. The
predic-tion method gives promising results (Fig. 8). This
option may be used to design sequence fragments
for the unknown genes of the proteins. It may be
helpful in searching the genes by using
hybridiza-tion methods.
The simplified version of the program
SEMI-HOM with dot matrix graphic interpretation for
comparative and genetic semihomology studies of
protein sequences is available to anyone upon
request to the author ([email protected]).
References
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J., 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403 – 410.
Baek, J.M., Song, J.C., Choi, Y.D., Kim, S.I., 1994. Nucle-otide sequence homology of cDNAs encoding soybean Bowman – Birk type proteinase inhibitor and its isoin-hibitors. Biosci. Biotechnol. Biochem. 58, 843 – 846. Catterall, J.F., Stein, J.P., Kristo, P., Means, A.R., O’Malley,
B.W., 1980. Primary sequence of ovomucoid messenger RNA as determined from cloned complementary DNA. J. Cell Biol. 87, 480 – 487.
Chen, X., Qian, Y., Chi, C., Gan, K., Zhang, M., Chang-Qing, C., 1992a. Chemical synthesis, molecular cloning, and expression of the gene coding for the Trichosanthes
trypsin inhibitor — a squash family inhibitor. J. Biochem. 112, 45 – 51.
Chen, P., Rose, J., Love, R., Wei, C.H., Wang, B.C., 1992b. Reactive sites of an anticarcinogenic Bowman – Birk proteinase inhibitor are similar to other trypsin inhibitors. J. Biol. Chem. 267, 1990 – 1994.
Corpet, F., 1988. Multiple sequence alignment with hierarchi-cal clustering. Nucleic Acids Res. 16, 10881 – 10890. Dayhoff, M.O., Eck, R.V., 1968. In: Dayhoff, M.O., Eck,
R.V. (Eds.), Atlas of Protein Sequence and Structure, vol. 3. National Biomedical Research Foundation, Silver Spring, MD, p. 33.
Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C., 1979. In: Day-hoff, M.O. (Ed.), Atlas of Protein Sequence and Structure, vol. 5. National Biomedical Research Foundation, Wash-ington, DC, p. 345.
George, D.G., Barker, W.C., Hunt, L.T., 1990. Mutation data matrix and its uses. In: Doolittle, R.F. (Ed.), Methods in Enzymology, vol. 183. Academic Press, San Diego, CA, p. 337.
Gerlinger, P., Krust, A., Lemeur, M., Perrin, F., Cochet, M., Gannon, F., Dupret, D., Chambon, P., 1982. Multiple
initiation and polyadenylation sites for the chicken ovomu-coid transcription unit. J. Mol. Biol. 162, 345 – 364. Gish, W., States, D.J., 1993. Identification of protein coding
regions by database similarity search. Nat. Genet. 3, 266 – 272.
Hakateyama, T., Hiraoka, M., Funatsu, G., 1991. Amino acid sequence of the two smallest trypsin inhibitors from sponge gourd seeds. Agric. Biol. Chem. 10, 2641 – 2642.
Haldar, U.C., Saha, S.K., Beavis, R.C., Sinha, N.K., 1996. Trypsin inhibitors from ridged gourd (Luffa acutangula
Linn.) seeds: purification, properties, and amino acid se-quences. J. Protein Chem. 15, 177 – 184.
Hamato, N., Koshiba, T., Pham, T., Tatsumi, Y., Nakamura, D., Takano, R., Hayashi, K., Hong, Y., Hara, S., 1995. Trypsin and elastase inhibitors from bitter gourd (Mo
-mordica charantia Linn.) seeds: purification, amino acid sequences, and inhibitory activities of four new inhibitors. J. Biochem. 117, 432 – 437.
Hayashi, K., Takehisa, T., Hamato, N., Takano, R., Hara, S., Miyata, T., Kato, H., 1994. Inhibition of serine proteinases of the blood coagulation system by squash family protease inhibitors. J. Biochem. 116, 1013 – 1018.
Heitz, A., Chiche, L., Le-Nguyen, D., Castro, B., 1989. 1H 2D NMR and distance geometry study of the folding ofEcbal
-ium elater-ium trypsin inhibitor, a member of the squash inhibitor family. Biochemistry 28, 2392 – 2398.
Joubert, F.J., 1984. Trypsin inhibitors fromMomordica repens
seeds. Phytochemistry 23, 1401 – 1406.
Lai, E.C., Roop, D.R., Tsai, M.-J., Woo, S.L.C., O’Malley, B.W., 1982. Heterogeneous initiation regions for transcrip-tion of the chicken ovomucoid gene. Nucleic Acids Res. 10, 5553 – 5567.
Lee, C., Lin, J., 1995. Amino acid sequences of trypsin in-hibitors from the melonCucumis melo. J. Biochem. 118, 18 – 22.
Leluk, J., 1998. A new algorithm for analysis of the homology in protein primary structure. Comput. Chem. 22, 123 – 131. Ling, M.-H., Qi, H., Chi, C., 1993. Protein, cDNA and genomic DNA sequences of the towel gourd trypsin in-hibitor. J. Biol. Chem. 268, 810 – 814.
Matsuo, M., Hamato, N., Takano, R., Kamei-Hayashi, K., Yasuda-Kamatani, Y., Nomoto, K., Hara, S., 1992. Trypsin inhibitors from bottle gourd (Lagenaria leucantha
Rusby var.Depressa makino) seeds. Purification and amino acid sequences. Biochim. Biophys. Acta 1120, 187 – 192. McGurl, B., Mukherjee, S., Kahn, M., Ryan, C.A., 1995.
Characterization of two proteinase inhibitor (ATI) cDNAs from alfalfa leaves (Medicago sati6a var. Vernema): the
expression of ATI genes in response to wounding and soil microorganisms. Plant Mol. Biol. 27, 995 – 1001.
Nishino, J., Takano, R., Kamei-Hayashi, K., Minakata, H., Nomoto, K., Hara, S., 1992. Amino-acid sequencers of trypsin inhibitors from oriental pickling melon (Cucumis meloL. var. Conomon makino) seeds. Biosci. Biotechnol. Biochem. 56, 1241 – 1246.
(6)
(Citrullus ulgaris) and red bryony (Bryonia dioica). Biol. Chem. Hoppe-Seyler 368, 1505 – 1507.
Scott, M.J., Huckaby, C.S., Kato, I., Kohr, W., Laskowski, M. Jr., Tsai, M.-J., O’Malley, B.W., 1987. Ovoinhibitor introns specify functional domains as in the related and linked ovomucoid gene. J. Biol. Chem. 262, 5899 – 5907. Siemion, I.Z., 1994. Compositional frequencies of amino acids
in the proteins and the genetic code. Biosystems 32, 163 – 170. Siemion, I.Z., 1995. The regularities of the changes of amino acid physico – chemical properties within the genetic code. Amino Acids 8, 1 – 13.
Siemion, I.Z., Stefanowicz, P., 1992. Periodical changes of amino acid reactivity within genetic code. Biosystems 27, 77 – 84.
Stachowiak, D., Polanowski, A., Bieniarz, G., Wilusz, T., 1996. Isolation and amino-acid sequence of two inhibitors of serine
Thompson, J.D., Higgins, D.G., Gibson, T.J., 1994. CLUSTAL W: improving the sensitivity of progressive multiple se-quence alignment through sese-quence weighting, position-spe-cific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673 – 4680.
Tomii, K., Kanehisa, M., 1996. Analysis of amino acid indices and mutation matrices for sequence comparison and struc-ture prediction of proteins. Protein Eng. 9, 27 – 36. Wieczorek, M., Otlewski, J., Cook, J., Parks, K., Leluk, J.,
Wilimowska-Pelc, A., Polanowski, A., Wilusz, T., Laskowski, M. Jr., 1985. The squash inhibitor family of serine proteinase inhibitors. Amino acid sequences and association equilibrium constants of inhibitors from squash, summer squash, zucchini, and cucumber seeds. Biochem. Biophys. Res. Commun. 126, 646 – 652.