available if you want to run BLAST on your own computer. Pennsylvania State University also develops some BLAST programs, available at http://bio.cse.psu.edu. In addition to NCBI and WU-BLAST, many other BLAST server web sites are available. A Google search (http://www.google.com) on "BLAST server" will bring up many hits.
A big question facing researchers who use BLAST is whether to use a public BLAST server or to run it locally. There are significant advantages to using a public server, the largest being that the databases (such as GenBank) used by the BLAST server are always up to date. Keeping your own up-to-date copy of these databases requires a significant amount of hard-disk space, a computer with a fairly high-end processor and a lot of memory to run the BLAST engine, a high-capacity network link, and a lot of time spent setting up and overseeing the software that updates the databases. On the other hand, perhaps you have your own library of sequences that you want to use in BLAST searches, you do frequent or large searches, or you have other reasons to run your own in-house BLAST engine. If that's the case, it makes sense to invest in the hardware and run it locally.
The online documentation for BLAST is fairly extensive and includes details on the statistical methods the program uses to calculate similarity. In the next section, I touch briefly on some of those points, but you should refer to the BLAST home page and to the excellent material at the NCBI web site for the whole story and detailed references. Our interest here is not in the theory but in parsing the output of the program.
12.2 String Matching and Homology
String matching is the computer-science term for algorithms that find one string embedded in another. It has a fairly long and fruitful history, and many string-matching algorithms have been developed using a variety of techniques and for different cases. See the Gusfield book in Appendix A for an excellent treatment with a biological emphasis. We've already done a fair amount of string matching, using the binding operator =~ to search for motifs and other text with regular expressions.
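As a reminder of what that kind of string matching looks like, here is a minimal sketch. The sequence and motif are invented for the illustration, and the variable names are arbitrary. It finds the first occurrence of a motif with index, then collects every occurrence with the binding operator and a global match.

```perl
use strict;
use warnings;

# Hypothetical DNA sequence and motif, made up for this illustration
my $dna   = 'ACGTTTGAATTCGGCTAGAATTCAA';
my $motif = 'GAATTC';

# Exact string matching with index: returns the 0-based offset of the
# first occurrence, or -1 if the motif does not occur
my $first = index($dna, $motif);

# Pattern matching with the binding operator: each successful /g match
# in scalar context resumes where the previous one ended, and $-[0]
# holds the offset at which the current match starts
my @starts;
while ($dna =~ /\Q$motif\E/g) {
    push @starts, $-[0];
}

print "First occurrence at offset $first\n";
print "All occurrences at offsets @starts\n";
```

This finds only exact occurrences. BLAST, by contrast, looks for approximate matches and scores them, which is why its output needs the statistics discussed in this section.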
BLAST is basically a string-matching program. Details of the string-matching algorithms, and of the algorithms used in BLAST in particular, are beyond the scope of this book. First, however, I want to define some terms that are frequently confused or used interchangeably. I also briefly introduce the BLAST statistics.
Biological string matching looks for similarity as an indication of homology. Similarity between the query and the sequences in the database may be measured by the percent identity, that is, the percentage of bases in the query that exactly match a corresponding region of a sequence from the database. It may also be measured by the degree of conservation, which counts matches between equivalent redundant codons or between amino acid residues with similar properties that don't alter the function of a protein (see Chapter 8).
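To make percent identity concrete, here is a minimal sketch that computes it for two strings that have already been aligned, with - marking gaps. The subroutine name percent_identity and the sample strings are hypothetical. The sketch also assumes the denominator is the number of alignment columns; conventions differ, and some tools divide by the query length instead.

```perl
use strict;
use warnings;

# percent_identity: hypothetical helper, not part of BLAST or any library.
# Takes two aligned strings of equal length ('-' marks a gap) and returns
# the percentage of alignment columns in which both strings carry the
# same residue.
sub percent_identity {
    my ($query, $subject) = @_;
    die "Aligned strings must have equal length\n"
        unless length($query) == length($subject);

    my ($matches, $columns) = (0, 0);
    for my $i (0 .. length($query) - 1) {
        my $q = uc substr($query,   $i, 1);
        my $s = uc substr($subject, $i, 1);
        next if $q eq '-' and $s eq '-';        # ignore columns that are all gap
        $columns++;
        $matches++ if $q eq $s and $q ne '-';   # a gap never counts as a match
    }
    return $columns ? 100 * $matches / $columns : 0;
}

# Made-up alignment: 10 columns, one mismatch and one gap
my $pid = percent_identity('ACGTACGT-A', 'ACGAACGTTA');
printf "Percent identity: %.1f%%\n", $pid;
```

Measuring conservation would need more than a character comparison: the test for equality would be replaced by a lookup in a substitution table that says which residues count as similar.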
Homology between sequences means the sequences are related evolutionarily. Two sequences either are or are not homologous; there's no degree of homology.
At the risk of oversimplifying a complex topic, I'll summarize a few facts about BLAST statistics. See the BLAST documentation for a complete picture. The output of a BLAST search reports a set of scores and statistics on the matches it has found, based on the raw score S, various parameters of the scoring algorithm, and properties of the query and database. The raw score S is a measure of similarity and the size of the match. The BLAST output lists the hits ranked by their E value. The E (expect) value of a match measures, roughly, the chances that the string matching (allowing for gaps) occurs in a randomly generated database of the same size and composition; more precisely, it is the number of matches scoring at least as well that you would expect to see by chance. The closer to 0 the E value is, the less likely it is that the match occurred by chance. In other words, the lower the E value, the better the match. As a general rule of thumb for BLASTN, an E value of less than 1 may be a solid hit, and an E value of less than 10 may be worth looking at, but this is not a hard and fast rule. Of course, proteins can be homologous with even a very small percent identity; the percent similarity is typically higher for homologous DNA.
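The relationship between the raw score and the E value can be sketched with the Karlin-Altschul formula, E = K * m * n * exp(-lambda * S), where m is the query length, n is the database length, and K and lambda are parameters that depend on the scoring system. The subroutine name expect_value and all the numbers below are assumptions chosen only for illustration. A real BLAST run derives K and lambda itself and applies corrections (such as effective lengths) that this sketch ignores.

```perl
use strict;
use warnings;

# expect_value: hypothetical helper that evaluates the simplified formula
#   E = K * m * n * exp(-lambda * S)
# It is an illustration, not a reimplementation of BLAST's statistics.
sub expect_value {
    my ($raw_score, $query_len, $db_len, $lambda, $k) = @_;
    return $k * $query_len * $db_len * exp(-$lambda * $raw_score);
}

# Illustrative parameter values only (assumed, not taken from a real search)
my ($lambda, $k) = (1.37, 0.711);
my ($m, $n)      = (500, 1_000_000_000);   # query length, database length

# E falls off exponentially as the raw score S rises
for my $s (20, 30, 40) {
    printf "S = %2d  E = %.3g\n", $s, expect_value($s, $m, $n, $lambda, $k);
}
```

Two things are worth noticing: E shrinks exponentially as S grows, and it scales linearly with the size of the database, so the same raw score is less significant against a larger database.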
Now that you have the basics, let's write code to parse BLAST output. First, you separate the hits, then you extract the sequence, and finally, you find the annotation showing the E value statistic.
12.3 BLAST Output Files