4.1. Local Sequence Alignment Tools

Local sequence alignment tools help one to identify the exact location of a query sequence* (or input sequence) amongst the sequences available in a database. Hence, they aid in establishing the ‘identity’ of a query sequence. Besides, they also help in the discovery of : (a) structural genes ; and (b) regulatory sequences ; even when the query sequence represents merely a segment of the parent gene or of the polypeptide encoded by that gene.

Advantages. The main advantages of the local sequence alignment tools are as follows :

1. Conserved repetitive sequences, such as short-sequence repeats (SSRs) or retrotransposons, may be searched conveniently in the databases.

2. They support a wide range of work, from basic studies, e.g., ‘taxonomy and evolution’, to applied uses in ‘crop breeding’ and ‘human health care’.

A few typical examples of the ‘local sequence alignment tools’ are given below :

(a) The FASTA Family. FASTA is a package for local sequence alignment that predates BLAST (i.e., Basic Local Alignment Search Tool). It essentially comprises search programmes that are analogous to the main BLAST modes. The FASTA programmes mentioned here may be compiled conveniently on a Linux system, but can also run on Windows ; they may be freely downloaded from the following site :

ftp://ftp.virginia.edu/pub/fasta

The FASTA 3.3 package includes all these programmes.

Applications. An important application of FASTA is as given below :

(1) FASTA is at present employed more commonly as a format for data exploration and exchange than as a sequence alignment tool, i.e., as software.
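The use of FASTA as a data format may be illustrated with a minimal sketch in Python. It assumes the common convention in which each record begins with a header line starting with ‘>’ and is followed by one or more lines of sequence ; the function name parse_fasta and the example records are hypothetical and do not form part of the FASTA package.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical two-record (multi-FASTA) input.
example = """>seq1 hypothetical example record
ATGGCC
TTAG
>seq2
MKV
"""
records = parse_fasta(example)
```

Each record is returned as a header and a single joined sequence string, which is the form in which a search programme would typically consume it.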

Examples :

(1) [fasta] and [ssearch]. Each of these programmes compares a protein sequence vs a protein database, or a DNA sequence vs a DNA database.

FASTA Algorithm. The fasta programme makes use of the FASTA algorithm, whereas ssearch uses the Smith-Waterman algorithm.**
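The Smith-Waterman approach to local alignment may be sketched as follows. This is only an illustration under stated assumptions : the scoring values (match = 2, mismatch = -1, gap = -2) and the simple linear gap penalty are chosen for the example, and the code is not the implementation used by ssearch. It returns only the best local score, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of a and b.

    Assumed scoring scheme (illustrative only): fixed match/mismatch
    scores and a linear gap penalty.
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The floor of 0 lets an alignment start afresh anywhere,
            # which is what makes the alignment local rather than global.
            curr[j] = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

For instance, smith_waterman_score("TTACGTAA", "GGACGTCC") picks out the shared segment ACGT and ignores the dissimilar flanks.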

* The specific sequence that is actually fed to the computer, and about which information is sought e.g., DNA, protein.

** This package programme is still being used profusely at the University of Virginia, and its services are duly

PHARMACEUTICAL BIOTECHNOLOGY

(2) [fastx/fasty]. These programmes translate a DNA sequence into the amino acid sequence of a protein and compare that protein against a protein database, thus permitting the search of an unknown nucleic acid sequence against a protein database.

(3) [tfastx/tfasty]. These are regarded as the reverse of [fastx/fasty], i.e., (2) above, since they compare a protein sequence against a DNA sequence translated in 3 forward and 3 reverse frames.
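Translation in 3 forward and 3 reverse frames may be sketched as below. The sketch assumes the standard genetic code and plain A/C/G/T input ; the function names are hypothetical, and the real programmes additionally handle details such as frameshifts that are ignored here.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from position 0; a trailing partial codon is dropped
    and any unrecognised codon is reported as 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])  # forward strand
        frames[f"-{offset + 1}"] = translate(rc[offset:])   # reverse strand
    return frames
```

A protein query can then be compared against each of the six translated strings in turn ; stop codons appear as ‘*’ in the output.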

(b) PatMatch. PatMatch enables its users to analyse a given sequence against either all or a few selected ‘sequence datasets’ available in the underlying databases. The search may be carried out conveniently via the Web.
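Since PatMatch accepts a short pattern and searches it against a dataset, the idea may be illustrated with a minimal sketch. It uses Python's own regular expression syntax as a stand-in, which is an assumption and need not coincide with the pattern syntax of PatMatch ; the function find_motif and the two-sequence dataset are hypothetical.

```python
import re


def find_motif(pattern, sequences):
    """Return (name, start, matched_text) for every non-overlapping match.

    `sequences` maps a sequence name to its string; `start` is 0-based.
    """
    regex = re.compile(pattern)
    hits = []
    for name, seq in sequences.items():
        for m in regex.finditer(seq):
            hits.append((name, m.start(), m.group()))
    return hits


# Hypothetical dataset and motif: TGACGT followed by C or G, then A.
dataset = {"seqA": "TTGACGTCAATGACGTGA", "seqB": "CCCCGGGG"}
hits = find_motif("TGACGT[CG]A", dataset)
```

Here the pattern is found twice in seqA and not at all in seqB.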

The datasets commonly used for this purpose include the following :

(1) Protein sequences and their respective structures.

(2) Genomic DNA sequences.

(3) Bacterial artificial chromosomes (BACs).

(4) EST sequences.

PatMatch was initially developed in 1999 for an Arabidopsis database ; it enables its users to find motifs by entering a ‘regular expression pattern’ or simply a ‘string of less than 20 characters’.

(c) BLAST [Basic Local Alignment Search Tool]. BLAST is encountered in two versions, namely : (a) earlier versions ; and (b) later versions. These two versions of BLAST are treated individually in the sections that follow :

(1) BLAST-Earlier Versions. BLAST is the most widely used and reliable package for searching ‘sequence databases’ and identifying sequences in them. The earlier versions are based on an algorithm that lies at the core of a plethora of ‘on-line sequence search servers’. It carries out pair-wise comparisons of sequences with specific emphasis upon seeking regions of ‘local similarity’, in lieu of a global alignment spanning almost the entire sequences. Importantly, BLAST permits pair-wise DNA-protein, DNA-DNA, protein-DNA, and protein-protein alignments employing the corresponding databases. In actual practice, BLAST users may submit up to almost 20,000 nucleotides in multi-FASTA format, and the programme searches the database for matching sequences. The following ‘earlier versions of BLAST’ may be used :

(i) BLASTp. It compares an amino acid query sequence against a protein sequence database,

(ii) BLASTn. It compares a nucleotide query sequence against a nucleotide sequence database,

(iii) BLASTx. It compares a nucleotide query sequence translated in all reading frames against a protein sequence database,

(iv) tBLASTn. It compares a protein query sequence against a nucleotide database that has been translated in all reading frames,

(v) tBLASTx. It compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database,

BIOSENSOR TECHNOLOGY

(vi) BLASTz. It compares long stretches of nucleotides, usually more than 2 kb (i.e., kilobase pairs),

(vii) BLASTpgp. It permits the usage of two relatively new BLAST modes viz., (a) PHI-BLAST, which makes use of protein motifs, e.g., those found in PROSITE and other motif databases, so as to enhance the possibility of locating ‘biologically significant matches’ ; and (b) PSI-BLAST, which utilizes an iterative alignment technique to detect and identify weak pattern matches, and

(viii) bl2seq. It permits a direct comparison of two known sequences making use of the BLASTp and BLASTn programmes (viz., ‘i’ and ‘ii’ above).
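The choice among the first five programmes above depends only on the type of the query and the type of the database. A minimal sketch of that lookup is given below ; the function choose_blast_program and its parameters are hypothetical helpers written for illustration, and the programme names are given in lower case as they are commonly written.

```python
# (query type, database type) -> programme, as described in the list above.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",    # query translated in all frames
    ("protein", "nucleotide"): "tblastn",   # database translated in all frames
}


def choose_blast_program(query_type, database_type, translate_both=False):
    """Pick a programme name from the query and database types.

    `translate_both=True` selects tblastx, which compares six-frame
    translations of a nucleotide query and a nucleotide database.
    """
    if translate_both:
        if (query_type, database_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, database_type)]
```

Thus an unknown DNA sequence searched against a protein database maps to blastx, whereas a protein searched against a nucleotide database maps to tblastn.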

(2) BLAST-Later Versions. A constant and concerted effort is under way to develop newer versions of BLAST that meet a variety of definite aims and objectives, so as to enhance progressively the ‘penetration strength’ of the programme.

The later versions of BLAST were introduced in 2001-2002 with certain new sequence alignment tools, which are extremely beneficial for extracting useful new information from the query (input) sequence.

The various tools introduced under the BLAST-later versions are described briefly as under :

(i) VecScreen. It provides an output that lists all segments of the query which match any of the sequences present in the UniVec database, including those of plasmids, phages, cosmids, BACs, PACs, and YACs,

(ii) IgBLAST. It facilitates the analysis of immunoglobulin sequences in GenBank. It reports the germline genes, viz., (a) three V-genes ; (b) two D-genes ; and (c) two J-genes, that exhibit the nearest possible match to the query sequence,

(iii) MegaBLAST. It is a ‘multiple alignment tool’ with the aid of which a set of ESTs is conveniently compared with a set of genes ; nearly identical EST sequences are grouped together in the form of ‘clusters’ for further investigations associated with EST mapping,

(iv) SNP BLAST. It provides an output that lists all high-confidence SNPs available in the SNP database (dbSNP) in the sequences that correspond to the query sequence, and

(v) PowerBLAST. It represents a new network BLAST application for the automatic analysis of ‘genomic sequences’. It combines BLAST searching with additional screening for ‘low complexity regions’ and ‘repeats’. Importantly, PowerBLAST gives one-to-several alignment outputs displaying the alignment of the query sequence along with all matching sequences, contrary to the earlier versions that might yield only a one-to-one alignment output.

(d) ParAlign [http://dna.uio.no/search/]. The year 2001 witnessed the introduction of the ParAlign programme, which is characterized by the exploitation of ‘parallelism’ to perform a very rapid computation of the exact optimal ungapped alignment of all diagonals in the alignment matrix. As a result, ParAlign has proved to be faster than the programmes of the so-called BLAST family. However, a good number of facilities that are available in BLAST are not available in ParAlign, since its design rests on parallelism, i.e., the division of a ‘major task’ into relatively ‘small tasks’ that are carried out in parallel to enhance the speed considerably.
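The notion of an optimal ungapped alignment on every diagonal of the alignment matrix may be sketched as follows. The sketch is sequential and uses an assumed scoring of +1 for a match and -1 for a mismatch ; it is not the ParAlign implementation, but it shows why the task divides naturally into small independent tasks, since each diagonal is scored without reference to any other.

```python
def best_ungapped_scores(a, b, match=1, mismatch=-1):
    """For each diagonal d = j - i of the alignment matrix of a and b,
    return the score of its best-scoring ungapped local segment.

    Illustrative scoring only; diagonals are independent of one another,
    so each could be handed to a separate worker.
    """
    scores = {}
    for d in range(-(len(a) - 1), len(b)):
        best = running = 0
        i = max(0, -d)  # starting row on this diagonal
        j = i + d       # matching starting column
        while i < len(a) and j < len(b):
            s = match if a[i] == b[j] else mismatch
            # Reset to 0 when the running score turns negative, so that
            # only the best contiguous stretch of the diagonal is kept.
            running = max(0, running + s)
            best = max(best, running)
            i += 1
            j += 1
        scores[d] = best
    return scores
```

For a = "ACGT" and b = "TTACGT" the highest score lies on the diagonal d = 2, where the four bases of a line up with the last four bases of b.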

(e) Protein Engine and Transeq [http://www.ebi.ac.uk/]. Protein Engine and Transeq are two programmes readily available as important tools on the EBI (European Bioinformatics Institute) homepage. These two programmes are employed first and foremost for translating DNA sequences into the corresponding protein sequences. The protein sequences thus obtained may then be used for carrying out further sequence similarity searches, and for the detailed investigation of the 3D-structures encoded by the query DNA sequence. Interestingly, tBLASTx also permits the usage of a DNA query sequence for looking at the corresponding protein segments encoded by it ; however, it has been observed that tBLASTx fails to utilize the entire amino acid sequence encoded by the query sequence.

(f) INTERPRO. INTERPRO helps to analyse protein sequences that have been obtained either directly from proteins or by prediction on the basis of genomic sequences from several organisms.

Example. Predicted gene products from the genomes of four organisms, namely : nematode (C. elegans), yeast, Arabidopsis, and fruit fly, were analysed to determine the important differences present in conserved protein domains.