Directory UMM :Data Elmu:jurnal:B:Biosystems:Vol56.Issue3.2000:

(1)

A non-statistical approach to protein mutational variability

Jacek Leluk *

Institute of Biochemistry and Molecular Biology,Uni6ersity of Wrocl*aw,Tamka2,50-137Wrocl*aw,Poland

Received 16 August 1999; received in revised form 12 February 2000; accepted 17 February 2000

Abstract

The non-statistical, non-Markovian model of protein mutational variability is described. There are presented the essential features of the algorithm of genetic semihomology and some examples of its application. The results of genetic semihomology approach are compared with the results obtained by using statistical algorithms and matrices which are assumed in widely used programs such as ClustalW, FASTA, MultAlin and BLAST. The aim of the new algorithm elaboration is to improve the accuracy of the results of protein sequence comparison, avoid the wrong assumptions and misinterpretation of the results, and increase the amount of information available from such study. © 2000 Elsevier Science Ireland Ltd. All rights reserved.

Keywords:Protein homology; Mutations; Variability; Genetic semihomology

www.elsevier.com/locate/biosystems

1. Introduction

There are over 400 amino acid indices and 42 mutation matrices described so far (Tomii and Kanehisa, 1996). Actually they are based on much fewer algorithms, most of which are subsequent modifications of the original ones. Some of them dominate in their usefulness and popularity, like the mutation data matrices derived from Dayhoff matrix (Dayhoff and Eck, 1968; Dayhoff et al., 1979) and PAM and BLOSUM indices.

The algorithm of genetic semihomology (Leluk, 1998) differs from the others by its non-statistical approach and lack of scoring scale. Widely used statistical scoring matrices are very helpful for

comparative studies of similarity and variability among different protein families. However, the values of such scoring may bring limited and confusing information. They are also dependent on the database used. For that reason the same algorithm can lead to different scoring matrix contingently upon the type and number of protein sequences used as database (Tomii and Kanehisa, 1996). The statistical approach is useful and nec-essary at certain steps of algorithm development and available knowledge. Such approach, pro-vided with scoring matrices, enables elaborations on the regularities occurring in protein variability, amino acid replacements and relationship to their structural and physico – chemical character. How-ever, they cannot support with explanation about the mechanism of variability or conservativity. It is hard to use them to predict new possible

se-* Tel.: +48-71-3402611; fax:+48-71-3402608. E-mail address:[email protected] (J. Leluk)

(2)

Table 1

Multiple alignment of the gap containing fragments of seven chicken ovoinhibitor domains obtained with Markovian and non-Markovian methods

ClustalW MultAlin Genetic semihomology

Blosum62 Blosum62

VNCSLYASGIGKDGTSWVAC I

I VNCSLYASGIGKDGTSWVAC I VNCSLYASGIGKDGTSWVAC

II IDCSPYLQVV-RDGNTMVAC II IDCSPY-LQVVRDGNTMVAC II IDCSPYLQ-VVRDGNTMVAC

VDCSKYPSTVSKDGRTLVAC

III VDCSKYPSTVSKDGRTLVAC III VDCSKYPSTVSKDGRTLVAC III

IDCDQYPTRKTTGGKLLVRC IV

IV IDCDQYPTRKTTGGKLLVRC IV IDCDQYPTRKTTGGKLLVRC

V LDCTQYLSNT-QNGEAITAC V LDCTQYLSN-TQNGEAITAC V LDCTQYLS-NTQNGEAITAC

VI LDCSKYKTSTLKDGRQVVAC VI LDCSKYKTSTLKDGRQVVAC VI LDCSKYKTSTLKDGRQVVAC

EHCREFQKVS---PIC VII

VII - - - - EHCREFQKVSPIC VII EHCREFQK-VS----P-I-C

.* : * . dcs.y . . dg . . . vaC AA SS*sSSSSSSSSS*SSSS *

NA sSSs S S sSSssss*Sss

Dayhoff(PAM) Dayhoff(PAM)

I VNCSLYASGIGKDGTSWVAC I VNCSLYASGIGKDGTSWVAC

IDCSPY-LQVVRDGNTMVAC II

IDCSPYLQVVR-DGNTMVAC II

VDCSKYPSTVSKDGRTLVAC III

III VDCSKYPSTVSKDGRTLVAC

IDCDQYPTRKTTGGKLLVRC IV

IV IDCDQYPTRKTTGGKLLVRC

V LDCTQYLSNTQ-NGEAITAC V LDCTQYLSNT-QNGEAITAC

LDCSKYKTSTLKDGRQVVAC VI

VI LDCSKYKTSTLKDGRQVVAC

EHCREF---QKVSPIC VII

VII EHCREFQKVSP---IC

.* : * .. C . . % . . . g . . . aC

Identity(unitary matrix) Identity(unitary matrix)

I VNCSLYASGIGKDGTSWVAC I VNCSLYASGIGKDGTSWVAC

II IDCSPYLQVVR-DGNTMVAC II IDCSPYLQVVR-DGNTMVAC

III VDCSKYPSTVSKDGRTLVAC III VDCSKYPSTVSKDGRTLVAC

IDCDQYPTRKTTGGKLLVRC IDCDQYPTRKTTGGKLLVRC

V LDCTQYLSNTQ-NGEAITAC V LDCTQYLSN-TQNGEAITAC

VI LDCSKYKTSTLKDGRQVVAC VI LDCSKYKTSTLKDGRQVVAC

EHCREFQ---KVSPIC VII

---EHCREFQKVSPIC VII

.dC..y . . . g . . . vaC *

quences in a certain protein family, even if the probability of amino acid replacement is well described.

The aim of the new algorithm elaboration is to overcome the basic disadvantages of protein se-quence analysis tools and to exclude some basic errors in the assumptions of the existing statistical methods. It is to be able to explain the mechanism and pathway of protein evolution and differentia-tion, not only limited to the description of the initial and final step of the observed changes. The algorithm of genetic semihomology departs from assuming that amino acid replacement is concor-dant with the Markovian model and explains why this model cannot be applied for these changes (Leluk, 1998). Another goal of this algorithm is to

make it applicable to any group of proteins of any nature, function and location. It can be achieved for two reasons, (1) minimization of basic as-sumptions limited to the general amino acid/

codon translation table and assuming that single point mutation is a principle, most common, mechanism of protein variability; (2) non-statisti-cal approach (a statistinon-statisti-cal mutation matrix is re-placed by three dimensional diagram representing all theoretically possible amino acid replace-ments). The approach presented in this article is to allow to study the multidomain structure of the proteins (at the primary structure level), predic-tion of the genetic code of the proteins with unknown gene sequence (useful for genetic probe construction), explaining the mutation mechanism

(3)

that took place at a certain fragment, precise location of insertion/deletion sites at non-ho-mologous fragments, confirmation that the low homology of compared proteins is actual or casual, and prediction of the future or not reported muta-tions for the related proteins.

2. Materials and methods

The amino acid and nucleotide sequences se-lected for the purpose of this study concerned proteinase inhibitors from squash seeds (Joubert, 1984; Wieczorek et al., 1985; Otlewski et al., 1987; Heitz et al., 1989; Hakateyama et al., 1991; Chen et al., 1992a,b; Matsuo et al., 1992; Nishino et al., 1992; Ling et al., 1993; Hayashi et al., 1994; Lee and Lin, 1995; Hamato et al., 1995; Haldar et al., 1996; Stachowiak et al., 1996), the Bowman – Birk inhibitors (Odani and Ikenaka, 1978; Chen et al., 1992a,b; Baek et al., 1994; McGurl et al., 1995; sequences obtained from NCBI database), 31 se-quences from the trypsin family (sese-quences ob-tained from SWISS-PROT (Bairoch and Apweiler, 1997; Bairoch and Apweiler, 1999; Apweiler et al., 1997; Bairoch, 1997), and PROSITE (Hofmann et al., 1999; Bucher and Bairoch, 1994) databases), chicken ovoinhibitor (Scott et al., 1987) and chicken ovomucoid (Catterall et al., 1980; Ger-linger et al., 1982; Lai et al., 1982). The Markovian matrices of amino acid replacement were applied by using programs ClustalW (Thompson et al., 1994), MultAlin (Corpet, 1988) and BLAST (Altschul et al., 1990; Gish and States, 1993). The algorithm of genetic semihomology was applied manually or with the program SEMIHOM (Leluk, 1998) (Table 1).

The dot matrix interpretation of the results was achieved with BLAST 2 SEQUENCES (Tatusova and Madden, 1999) and SEMIHOM (Leluk, 1998).

3. Results and discussion

3.1. General characteristics of non-statistical al

-gorithm of genetic semihomology

The important feature of the algorithm of

ge-netic semihomology concerns the minimal initial assumptions. It assumes the general codon-to-amino acid translation table, and the single point mutation as the basic and most common mecha-nism of protein differentiation (Fig. 1) (Leluk, 1998). This approach is not limited to single point mutation mechanism only. It may be used to localize those fragments, which undergo other kind of mutation.

The scoring scale can be assumed for genetic semihomology algorithm, but only for rough comparison studies. The identity of compared po-sitions has score of 3, the semihomology of type I (residues encoded by triplets which differ by one nucleotide of the same type, e.g. both are purines) is scored as 2, and semihomology of type II (residues encoded by triplets which differ by one nucleotide of different type) as 1. In the other words, type I is represented by any single transi-tion and type II by any single transversion in the nucleotide triplet. The non-semihomologous pairs are scored as 0.

The sense of this algorithm is not the scoring the comparison results but explaining the way of amino acid replacement and the mechanism of variability at both, amino acid and genetic code, levels. For that reasons the algorithm of genetic semihomology should be treated separately from other algorithms, especially statistical ones. This interpretation is often consistent with the amino acid replacement indices data (Fig. 2). Theoretical results obtained by algorithm of genetic semiho-mology and mutation data matrix scoring support each other. This coincidence is very important for correct interpretation of the mechanism of protein variability. The used mutation data matrix scoring is supported by PAM250 indices. The PAM250 values for each amino acid replacement were added. The result of such calculation is consistent with the result of theoretical genetic semihomol-ogy analysis. The higher score of the PAM in-dices’ sum is convergent with the genetic relationship of the studied set of amino acids at individual positions.

The goal of minimization of the initial assump-tions in the algorithm of genetic semihomology is to (1) make the algorithm universal for any group

(4)

of proteins and (2) avoid errors and misinterpreta-tion of the results. Basically, the algorithm as-sumes three general establishments. One of them is the codon-amino acid translation table, which is undoubtedly considered as true.

The second is the statement that single nucle-otide replacement is the most often and easiest mechanism of mutational variability. The last general establishment comprises the three-dimen-sional diagram of amino-acid replacements by single point mutation (Fig. 1).

A single nucleotide replacement concerns two types of replacements-within the same group (purine to purine or pyrimidine to pyrimidine) and between two groups (purine to pyrimidine or conversely). Thus, the first case is described here

as type I (transition) and the second as type II (transversion). It seems, that the replacement of type I is spatially more possible than the other. However, there are no distinct reports that such dependence exists. Another question is the posi-tion-dependence of replacement within the codon and its further consequence. The third nucleotide of the codon is known as the most tolerant and most flexible encoding unit. Most amino acids are encoded by first two nucleotides while the third position is optional. Therefore, in algorithm of genetic semihomology mutation at this position is considered separately and is defined as type III, which is independent of type I and II.

Amino acid residues as the products of genetic code translation are not equal units, when we consider both nucleotide and protein levels simul-taneously. They are encoded by different number of codons from 1 to 6. That implies the difference in probability of their replacement by single point mutation. Such differentiation causes, that amino acids themselves cannot be treated uniformly in terms of mutational variability. It is important which exactly codon encodes a certain amino acid for its further transformation to another one. For example arginine encoded by AGG can be re-placed by methionine (ATG) much easier than arginine encoded by CGY (Fig. 1). On the other hand, the probability of ArgCGY replacement by

Cys (TGY) is higher than it is for ArgAGG. This

leads to conclusion that the information of the amino acid residue evolutionary preceding the residue at a certain position is still present in its genetic code if the mechanism of replacement was based on point mutation. The probability of fur-ther replacement is also dependent on its codon. That reveals inconsistency of amino acid muta-tional substitution with the Markovian model which establishes all exchange probabilities as in-dependent on the previous history (George et al., 1990). Markovian model however is assumed in most statistical matrices of protein mutational variability. This could be true only if each amino acid was encoded by one codon.

The algorithms assuming statistical mutation data matrix and Markovian model omit the pro-cess of cryptic mutations. These are the muta-tions, which considerably change the codon

Fig. 1. Semihomologous relationships among proteinaceous amino acids. The codons of residues along each axis differ by only one nucleotide (Leluk, 1998). The diagram setting shows the codon changes at first (axis 1), second (axis 2) or third (axis 3) position in order AGCU. This diagram forms

the basis of the genetic semihomology algorithm instead of a matrix of statistical parameters or replacement indices, which support statistical algorithms.

(5)

Fig. 2. The semihomologous correlation between amino acids occurring at selected non-homologous positions of inhibitors from squash seeds. The solid lines indicate the transition type of semihomology; the dashed lines refer to transversion type of semihomology. The values in brackets indicate the additive Dayhoff mutation data matrix (MDM) score calculated for each set. Note the consistency between semihomologous relationship and MDM scores.

sequence without changing the amino acid. Such phenomenon refers especially to residues en-coded by up to six codons (Arg, Leu and Ser). Cryptic mutations enable code differentiation not restricted by conditions of environmental se-lection. They enlarge the spectrum of further possible replacements. Generally, most muta-tions of the third nucleotide of the codon are cryptic. For arginine, leucine and serine cryptic mutations at first or second positions are also allowed. Thus, they can be considered as inter-mediates between other two residues, exchange of which require more mutations. These three amino acids often occur at the positions, which are very variable and are occupied by six or more residues within homologous proteins (Figs. 3 and 4).

It is admitted that from evolutionary point of view the genetic code and amino acid ‘language’ have evolved simultaneously. They also act with strict coherence with each other. Therefore, in analysis of protein differentiation and variabil-ity, both of them should be considered as well. The approach of Siemion and Stefanowicz (1992) presents the periodical relation between chemical reactivity of amino acids and the pri-mary structure of their genetic code. It also refers to one point mutation mechanism which is the assumption for amino acid reactivity properties and Chou – Fasman conformational parameters relationship (Siemion, 1994, 1995). Although, those elaborations concern the reac-tivity and secondary structure parameters, the theoretical strategy of approach is consistent

(6)

with non-statistical approach of genetic semiho-mology.

3.2. Examples of application of genetic semihomology algorithm

3.2.1. Gap location

The commonly used alignment programs such as ClustalW, MultAlin or FASTA based on PAM and BLOSUM matrices generate gaps and locate them in the manner that aligns remaining frag-ments with each other. The accuracy of such strategy is good for the fragments of high homol-ogy. With the lower similarity degree of aligned sequences the accuracy also decreases. The strat-egy of these tools tends to combine the deleted positions into continuous fragments, which are set at most privileged site for entire alignment. The results are often inconsistent when provided with different comparison matrices and even when the same matrix is applied in different programs. An example of such discordant results are obtained when ClustalW and MultAlin are used for the alignment of chicken ovoinhibitor domains (Scott

Fig. 4. Frequency of six-codon amino acids as a function of position variability in homologous sequences. Results obtained from aligned trypsins (219 positions occupied by 1150 residues). Note that the serine occurrence at variable positions is more significant than arginine and especially leucine. See the text for the source and calculation method (Section 3.2.6).

Fig. 3. A simplified planar diagram of genetic relationship between the amino acids. The significance of the third position of the codon is omitted. The dashed curves show the cryptic mutational passages that do not cause change at the amino acid level.

et al., 1987) (Fig. 5). The gap-containing fragment was thoroughly analyzed for genetic semihomol-ogy relationship at both levels — genetic and amino acid. The obtained results suggest discon-tinuous alignment of deletions with genetically reasonable location of non-homologous residues (Fig. 5). Their location seems to be proper with respect to the probability of one point mutation exchange between aligned residues.

3.2.2. Alignment of non-homologous fragments

For statistical matrices if aligned fragments of related proteins are not homologous, the observed probability of amino acid replacement is taken under consideration. This strategy works properly for protein families for which the mechanism of differentiation is similar. It concerns their general physico – chemical character, reactivity and bio-logical function. But for very different proteins the matrices of replacement likeliness may have

(7)

not corresponding values. Genetic semihomology shows only the mutational probability of replace-ment, which is independent on the character of a protein and its biological function. It aligns the positions that are most similar with respect to their known or predicted genetic code. Taking the semihomologous positions into account completes the correct alignment initially obtained for identi-cal positions.

3.2.3. Low, but significant homologies

The multiple alignment of evolutionary distant sequences is sometimes confusing and uncertain. It is hard to estimate if such sequences show actual relationship, or their similarity is casual.

The problem is additionally increased by lack of unified methodology for actual homology estima-tion. The program SEMIHOM equipped with genetic semihomology analysis tool applicable also for proteins of unknown gene sequence is very useful for estimation whether the obtained low homology is actual or not. The double frame analysis (Leluk, 1998) allows to get rid of casual noise resulting from accidental single matches at different settings without erasing the essential ones (Fig. 5).

3.2.4. Internal homology of multidomain sequences

The similarity searching tools available in most protein databases and related servers are designed for fast database scanning for related sequences to the query sequence. The given results concern the optimal alignment and statistical parameters of comparison effect (FASTA, BLAST, ClustalW, MultAlin). The BLAST servers additionally provide with dot matrix interpretation of the re-sults. Although, these tools are powerful for all types of proteins (and nucleic acids), they do not analyze the internal homology of multidomain sequences properly (Fig. 6). The domain align-ment and analysis can be obtained by using pro-gram SEMIHOM with good dot matrix results, which can be additionally improved by semiho-mology analysis (Fig. 6). Even very distant do-mains (low similarity), which are still related to each other are visualized with this program. The results are achieved by comparison of a protein sequence with itself.

3.2.5. Location of fragments which are products of other mechanism than single point mutation

The single point mutation is considered as the most frequent and easiest mechanism of protein differentiation. However, there are other mecha-nisms such as deletion, insertion, duplication, in-version etc. The use of genetic semihomology algorithm for sequence multiple alignment and further analysis makes possible to identify the fragments which are the consequence of other type of mutation than single transition/ transver-sion. The identification consists in finding the fragments where aligned residues are not

semiho-Fig. 5. Sequence comparison of the fragments of two Bow-man – Birk inhibitors. Semihomology application and casual noise reduction for dot matrix results of low homology se-quence comparison. The double frame setting shows ho-mologous and semihoho-mologous positions only for at least four identities along 12-unit string.

(8)

Fig. 6. Internal domain homology of chicken ovoinhibitor (7 domains) and ovomucoid (3 domains) with BLAST 2 SEQUENCES and SEMIHOM.

Fig. 7. Application of genetic semihomology algorithm for identification the fragments of possible different mechanism than single point mutation. The listing shows the amino acid variability at corresponding positions within 7 domains of chicken ovoinhibitor. The non-semihomologous residues (differing by more than one nucleotide in their codon) are separated by square brackets. The accumulation of these positions (in shadows) suggests different mechanism of differentiation from single transition/transversion.

mologous. If there are such positions occurring consecutively one after another it might indicate that the mechanism of differentiation is different from single point mutation (Fig. 7).

3.2.6. Cryptic mutations

There are mutations that change composition of the gene without effect on amino acid se-quence. Such cryptic mutations concern the

(9)

codons for amino acids that can be encoded by two or more triplets. These mutations seemingly do not have any effect at the protein level. How-ever, they are important mechanism to the very variable sites in the manner that they increase the degree of variability available by single point mu-tation. For example a SerTCAcan be converted by

single point mutation to Leu, Pro, Ala and Thr (Fig. 1). A cryptic mutation of SerTCAto SerAGT

extends the spectrum of its conversion to Asn, Ile, Arg, Gly and Cys by single step mutation. The role of cryptic mutations is especially important for amino acids encoded by up to six codons (Ser, Arg and Leu). Among them serine is particularly interesting since its codons are the most differenti-ated (the genetic code matrix scores for them vary from 0 to +2). In fact serine participation in the most variable positions is outstanding as it is shown on the example of trypsins (Fig. 4). The sequences of 31 trypsins from different organisms were aligned. There were compared 219 positions (out of 247 corresponding to anionic precursor of bovine trypsinogen, SWISS-PROT accession number Q29463) occupied by 1150 residues. The study did not include the positions which were not represented in more than 70% of compared se-quences (positions that were the result of inser-tion/deletion). The percentage of occurrence of

the individual residues at a variable position was calculated in the following way:

pj=

j×100

where p is the percentage of occurrence of an individual residue in positions occupied by j dif-ferent residues; nj the number of positions where an individual residue appears among j occurring residues; and jis the number of positions where j

different residues occur.

The planar diagram of genetic semihomology relationship between amino acids, reduced to the first two positions of the codon, shows the cryptic passages for six-codon amino acids (Fig. 3).

3.2.7. Genetic code prediction

Originally the algorithm of genetic semihomol-ogy has been designed for studies on proteins whose gene sequence is unknown. Similarly to PAM or BLOSUM methods the nucleotide se-quence of the gene is not required (although, it is very helpful if we know it). One of the possible options of the genetic semihomology analysis is the prediction of possible codons for particular amino acids aligned. The method consists in com-parison of all possible codons of the amino acids occurring at the corresponding position of the

Fig. 8. Genetic semihomology prediction of genetic code for chicken ovoinhibitor domains. The predicted gene sequence is compared with the known DNA sequence. Only those positions where prediction reduces possible codons are considered. The predicted codons that are consistent with those present in the ovoinhibitor gene are shadowed.

(10)

aligned sequences and choosing the codons which are most related to each other, as the most likely. Usually they can be converted to each other by single transition/transversion process. The predic-tion method gives promising results (Fig. 8). This option may be used to design sequence fragments for the unknown genes of the proteins. It may be helpful in searching the genes by using hybridiza-tion methods.

The simplified version of the program SEMI-HOM with dot matrix graphic interpretation for comparative and genetic semihomology studies of protein sequences is available to anyone upon request to the author ([email protected]).

References

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J., 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403 – 410.

Baek, J.M., Song, J.C., Choi, Y.D., Kim, S.I., 1994. Nucle-otide sequence homology of cDNAs encoding soybean Bowman – Birk type proteinase inhibitor and its isoin-hibitors. Biosci. Biotechnol. Biochem. 58, 843 – 846. Catterall, J.F., Stein, J.P., Kristo, P., Means, A.R., O’Malley,

B.W., 1980. Primary sequence of ovomucoid messenger RNA as determined from cloned complementary DNA. J. Cell Biol. 87, 480 – 487.

Chen, X., Qian, Y., Chi, C., Gan, K., Zhang, M., Chang-Qing, C., 1992a. Chemical synthesis, molecular cloning, and expression of the gene coding for the Trichosanthes trypsin inhibitor — a squash family inhibitor. J. Biochem. 112, 45 – 51.

Chen, P., Rose, J., Love, R., Wei, C.H., Wang, B.C., 1992b. Reactive sites of an anticarcinogenic Bowman – Birk proteinase inhibitor are similar to other trypsin inhibitors. J. Biol. Chem. 267, 1990 – 1994.

Corpet, F., 1988. Multiple sequence alignment with hierarchi-cal clustering. Nucleic Acids Res. 16, 10881 – 10890. Dayhoff, M.O., Eck, R.V., 1968. In: Dayhoff, M.O., Eck,

R.V. (Eds.), Atlas of Protein Sequence and Structure, vol. 3. National Biomedical Research Foundation, Silver Spring, MD, p. 33.

Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C., 1979. In: Day-hoff, M.O. (Ed.), Atlas of Protein Sequence and Structure, vol. 5. National Biomedical Research Foundation, Wash-ington, DC, p. 345.

George, D.G., Barker, W.C., Hunt, L.T., 1990. Mutation data matrix and its uses. In: Doolittle, R.F. (Ed.), Methods in Enzymology, vol. 183. Academic Press, San Diego, CA, p. 337.

Gerlinger, P., Krust, A., Lemeur, M., Perrin, F., Cochet, M., Gannon, F., Dupret, D., Chambon, P., 1982. Multiple

initiation and polyadenylation sites for the chicken ovomu-coid transcription unit. J. Mol. Biol. 162, 345 – 364. Gish, W., States, D.J., 1993. Identification of protein coding

regions by database similarity search. Nat. Genet. 3, 266 – 272.

Hakateyama, T., Hiraoka, M., Funatsu, G., 1991. Amino acid sequence of the two smallest trypsin inhibitors from sponge gourd seeds. Agric. Biol. Chem. 10, 2641 – 2642.

Haldar, U.C., Saha, S.K., Beavis, R.C., Sinha, N.K., 1996. Trypsin inhibitors from ridged gourd (Luffa acutangula Linn.) seeds: purification, properties, and amino acid se-quences. J. Protein Chem. 15, 177 – 184.

Hamato, N., Koshiba, T., Pham, T., Tatsumi, Y., Nakamura, D., Takano, R., Hayashi, K., Hong, Y., Hara, S., 1995. Trypsin and elastase inhibitors from bitter gourd (Mo -mordica charantia Linn.) seeds: purification, amino acid sequences, and inhibitory activities of four new inhibitors. J. Biochem. 117, 432 – 437.

Hayashi, K., Takehisa, T., Hamato, N., Takano, R., Hara, S., Miyata, T., Kato, H., 1994. Inhibition of serine proteinases of the blood coagulation system by squash family protease inhibitors. J. Biochem. 116, 1013 – 1018.

Heitz, A., Chiche, L., Le-Nguyen, D., Castro, B., 1989. 1H 2D NMR and distance geometry study of the folding ofEcbal -ium elater-ium trypsin inhibitor, a member of the squash inhibitor family. Biochemistry 28, 2392 – 2398.

Joubert, F.J., 1984. Trypsin inhibitors fromMomordica repens seeds. Phytochemistry 23, 1401 – 1406.

Lai, E.C., Roop, D.R., Tsai, M.-J., Woo, S.L.C., O’Malley, B.W., 1982. Heterogeneous initiation regions for transcrip-tion of the chicken ovomucoid gene. Nucleic Acids Res. 10, 5553 – 5567.

Lee, C., Lin, J., 1995. Amino acid sequences of trypsin in-hibitors from the melonCucumis melo. J. Biochem. 118, 18 – 22.

Leluk, J., 1998. A new algorithm for analysis of the homology in protein primary structure. Comput. Chem. 22, 123 – 131. Ling, M.-H., Qi, H., Chi, C., 1993. Protein, cDNA and genomic DNA sequences of the towel gourd trypsin in-hibitor. J. Biol. Chem. 268, 810 – 814.

Matsuo, M., Hamato, N., Takano, R., Kamei-Hayashi, K., Yasuda-Kamatani, Y., Nomoto, K., Hara, S., 1992. Trypsin inhibitors from bottle gourd (Lagenaria leucantha Rusby var.Depressa makino) seeds. Purification and amino acid sequences. Biochim. Biophys. Acta 1120, 187 – 192. McGurl, B., Mukherjee, S., Kahn, M., Ryan, C.A., 1995.

Characterization of two proteinase inhibitor (ATI) cDNAs from alfalfa leaves (Medicago sati6a var. Vernema): the

expression of ATI genes in response to wounding and soil microorganisms. Plant Mol. Biol. 27, 995 – 1001.

Nishino, J., Takano, R., Kamei-Hayashi, K., Minakata, H., Nomoto, K., Hara, S., 1992. Amino-acid sequencers of trypsin inhibitors from oriental pickling melon (Cucumis meloL. var. Conomon makino) seeds. Biosci. Biotechnol. Biochem. 56, 1241 – 1246.

(11)

Odani, S., Ikenaka, T., 1978. Studies on soybean trypsin inhibitors, XII. Linear sequences of two soybean double-headed trypsin inhibitors, D-II and E-I. J. Biochem. 83, 737 – 745.

Otlewski, J., Whatley, H., Polanowski, A., Wilusz, T., 1987. Amino-acid sequences of trypsin inhibitors from watermelon (Citrullus 6ulgaris) and red bryony (Bryonia dioica). Biol.

Chem. Hoppe-Seyler 368, 1505 – 1507.

Scott, M.J., Huckaby, C.S., Kato, I., Kohr, W., Laskowski, M. Jr., Tsai, M.-J., O’Malley, B.W., 1987. Ovoinhibitor introns specify functional domains as in the related and linked ovomucoid gene. J. Biol. Chem. 262, 5899 – 5907. Siemion, I.Z., 1994. Compositional frequencies of amino acids

in the proteins and the genetic code. Biosystems 32, 163 – 170. Siemion, I.Z., 1995. The regularities of the changes of amino acid physico – chemical properties within the genetic code. Amino Acids 8, 1 – 13.

Siemion, I.Z., Stefanowicz, P., 1992. Periodical changes of amino acid reactivity within genetic code. Biosystems 27, 77 – 84.

Stachowiak, D., Polanowski, A., Bieniarz, G., Wilusz, T., 1996. Isolation and amino-acid sequence of two inhibitors of serine

proteinases, members of the squash inhibitor family, from Echinocystis lobataseeds. Acta Biochim. Polon. 43, 507 – 514.

Tatusova, T.A., Madden, T.A., 1999. BLAST 2 SEQUENCES — a new tool for comparing protein and nucleotide se-quences. FEMS Microbiol. Lett. 174, 247 – 250.

Thompson, J.D., Higgins, D.G., Gibson, T.J., 1994. CLUSTAL W: improving the sensitivity of progressive multiple se-quence alignment through sese-quence weighting, position-spe-cific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673 – 4680.

Tomii, K., Kanehisa, M., 1996. Analysis of amino acid indices and mutation matrices for sequence comparison and struc-ture prediction of proteins. Protein Eng. 9, 27 – 36. Wieczorek, M., Otlewski, J., Cook, J., Parks, K., Leluk, J.,

Wilimowska-Pelc, A., Polanowski, A., Wilusz, T., Laskowski, M. Jr., 1985. The squash inhibitor family of serine proteinase inhibitors. Amino acid sequences and association equilibrium constants of inhibitors from squash, summer squash, zucchini, and cucumber seeds. Biochem. Biophys. Res. Commun. 126, 646 – 652.

(1)

with non-statistical approach of genetic

semiho-mology.

3 .

2 .

Examples of application of genetic

semihomology algorithm

3 .

2 .

1 .

Gap location

The commonly used alignment programs such

as ClustalW, MultAlin or FASTA based on PAM

and BLOSUM matrices generate gaps and locate

them in the manner that aligns remaining

frag-ments with each other. The accuracy of such

strategy is good for the fragments of high

homol-ogy. With the lower similarity degree of aligned

sequences the accuracy also decreases. The

strat-egy of these tools tends to combine the deleted

positions into continuous fragments, which are set

at most privileged site for entire alignment. The

results are often inconsistent when provided with

different comparison matrices and even when the

same matrix is applied in different programs. An

example of such discordant results are obtained

when ClustalW and MultAlin are used for the

alignment of chicken ovoinhibitor domains (Scott

et al., 1987) (Fig. 5). The gap-containing fragment

was thoroughly analyzed for genetic

semihomol-ogy relationship at both levels — genetic and

amino acid. The obtained results suggest

discon-tinuous alignment of deletions with genetically

reasonable location of non-homologous residues

(Fig. 5). Their location seems to be proper with

respect to the probability of one point mutation

exchange between aligned residues.

3 .

2 .

Alignment of non

-

homologous fragments

For statistical matrices if aligned fragments of

related proteins are not homologous, the observed

probability of amino acid replacement is taken

under consideration. This strategy works properly

for protein families for which the mechanism of

differentiation is similar. It concerns their general

physico – chemical character, reactivity and

bio-logical function. But for very different proteins

the matrices of replacement likeliness may have

(2)

their known or predicted genetic code. Taking the

semihomologous positions into account completes

the correct alignment initially obtained for

identi-cal positions.

3 .

2 .

3 .

Low

,

but significant homologies

The multiple alignment of evolutionary distant

sequences is sometimes confusing and uncertain.

It is hard to estimate if such sequences show

actual relationship, or their similarity is casual.

very useful for estimation whether the obtained

low homology is actual or not. The double frame

analysis (Leluk, 1998) allows to get rid of casual

noise resulting from accidental single matches at

different settings without erasing the essential

ones (Fig. 5).

3 .

2 .

4 .

Internal homology of multidomain

sequences

The similarity searching tools available in most

protein databases and related servers are designed

for fast database scanning for related sequences to

the query sequence. The given results concern the

optimal alignment and statistical parameters of

comparison effect (FASTA, BLAST, ClustalW,

MultAlin).

The

BLAST

servers

additionally

provide with dot matrix interpretation of the

re-sults. Although, these tools are powerful for all

types of proteins (and nucleic acids), they do not

analyze the internal homology of multidomain

sequences properly (Fig. 6). The domain

align-ment and analysis can be obtained by using

pro-gram SEMIHOM with good dot matrix results,

which can be additionally improved by

semiho-mology analysis (Fig. 6). Even very distant

do-mains (low similarity), which are still related to

each other are visualized with this program. The

results are achieved by comparison of a protein

sequence with itself.

3 .

2 .

5 .

Location of fragments which are products

of other mechanism than single point mutation

The single point mutation is considered as the

most frequent and easiest mechanism of protein

differentiation. However, there are other

mecha-nisms such as deletion, insertion, duplication,

in-version etc. The use of genetic semihomology

algorithm for sequence multiple alignment and

further analysis makes possible to identify the

fragments which are the consequence of other

type of mutation than single transition

/

transver-sion. The identification consists in finding the

fragments where aligned residues are not

semiho-Fig. 5. Sequence comparison of the fragments of two

Bow-man – Birk inhibitors. Semihomology application and casual noise reduction for dot matrix results of low homology se-quence comparison. The double frame setting shows ho-mologous and semihoho-mologous positions only for at least four identities along 12-unit string.

(3)

Fig. 6. Internal domain homology of chicken ovoinhibitor (7 domains) and ovomucoid (3 domains) with BLAST 2 SEQUENCES and SEMIHOM.

mologous. If there are such positions occurring

consecutively one after another it might indicate

that the mechanism of differentiation is different

from single point mutation (Fig. 7).

3 .

2 .

6 .

Cryptic mutations

There are mutations that change composition

of the gene without effect on amino acid

se-quence. Such cryptic mutations concern the

(4)

degree of variability available by single point

mu-tation. For example a Ser

TCA

can be converted by

single point mutation to Leu, Pro, Ala and Thr

(Fig. 1). A cryptic mutation of Ser

TCA

to Ser

AGT

extends the spectrum of its conversion to Asn, Ile,

Arg, Gly and Cys by single step mutation. The

role of cryptic mutations is especially important

for amino acids encoded by up to six codons (Ser,

Arg and Leu). Among them serine is particularly

interesting since its codons are the most

differenti-ated (the genetic code matrix scores for them vary

from 0 to

+

2). In fact serine participation in the

most variable positions is outstanding as it is

shown on the example of trypsins (Fig. 4). The

sequences of 31 trypsins from different organisms

were aligned. There were compared 219 positions

(out of 247 corresponding to anionic precursor of

bovine

trypsinogen,

SWISS-PROT

accession

number Q29463) occupied by 1150 residues. The

study did not include the positions which were not

represented in more than 70% of compared

se-quences (positions that were the result of

inser-tion

/

deletion). The percentage of occurrence of

where

p

is the percentage of occurrence of an

individual residue in positions occupied by

j

dif-ferent residues;

n

the number of positions where

an individual residue appears among

j

occurring

residues; and

j

is the number of positions where

j

different residues occur.

The planar diagram of genetic semihomology

relationship between amino acids, reduced to the

first two positions of the codon, shows the cryptic

passages for six-codon amino acids (Fig. 3).

3 .

2 .

7 .

Genetic code prediction

Originally the algorithm of genetic

semihomol-ogy has been designed for studies on proteins

whose gene sequence is unknown. Similarly to

PAM or BLOSUM methods the nucleotide

se-quence of the gene is not required (although, it is

very helpful if we know it). One of the possible

options of the genetic semihomology analysis is

the prediction of possible codons for particular

amino acids aligned. The method consists in

com-parison of all possible codons of the amino acids

occurring at the corresponding position of the

(5)

aligned sequences and choosing the codons which

are most related to each other, as the most likely.

Usually they can be converted to each other by

single transition

/

transversion process. The

predic-tion method gives promising results (Fig. 8). This

option may be used to design sequence fragments

for the unknown genes of the proteins. It may be

helpful in searching the genes by using

hybridiza-tion methods.

The simplified version of the program

SEMI-HOM with dot matrix graphic interpretation for

comparative and genetic semihomology studies of

protein sequences is available to anyone upon

request to the author ([email protected]).

References

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J., 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403 – 410.

B.W., 1980. Primary sequence of ovomucoid messenger RNA as determined from cloned complementary DNA. J. Cell Biol. 87, 480 – 487.

Chen, X., Qian, Y., Chi, C., Gan, K., Zhang, M., Chang-Qing, C., 1992a. Chemical synthesis, molecular cloning, and expression of the gene coding for the Trichosanthes

trypsin inhibitor — a squash family inhibitor. J. Biochem. 112, 45 – 51.