String Matching and Homology

available if you want to run BLAST on your own computer. Pennsylvania State University also develops some BLAST programs, available at http://bio.cse.psu.edu. In addition to NCBI and WU-BLAST, many other BLAST server web sites are available. A Google search (http://www.google.com) on "BLAST server" will bring up many hits.

A big question that faces researchers when they use BLAST is whether to use a public BLAST server or to run it locally. There are significant advantages to using a public server, the largest being that the databases (such as GenBank) used by the BLAST server are always up to date. Keeping your own up-to-date copy of these databases requires a significant amount of hard-disk space, a computer with a fairly high-end processor and a lot of memory to run the BLAST engine, a high-capacity network link, and a lot of time setting up and overseeing the software that updates the databases.

On the other hand, perhaps you have your own library of sequences that you want to use in BLAST searches, you do frequent or large searches, or you have other reasons to run your own in-house BLAST engine. If that's the case, it makes sense to invest in the hardware and run it locally.

The online documentation for BLAST is fairly extensive and includes details on the statistical methods the program uses to calculate similarity. In the next section, I touch briefly on some of those points, but you should refer to the BLAST home page and to the excellent material at the NCBI web site for the whole story and detailed references. Our interest here is not the theory, but rather parsing the output of the program.

12.2 String Matching and Homology

String matching is the computer-science term for algorithms that find one string embedded in another. It has a fairly long and fruitful history, and many string-matching algorithms have been developed using a variety of techniques and for different cases. (See the Gusfield book in Appendix A for an excellent treatment with a biological emphasis.) We've already done a fair amount of string matching, using the binding operator =~ to search for motifs and other text with regular expressions.

BLAST is basically a string-matching program. Details of the string-matching algorithms, and of the algorithms used in BLAST in particular, are beyond the scope of this book. But first I want to define some terms that are frequently confused or used interchangeably. I also briefly introduce the BLAST statistics.

Biological string matching looks for similarity as an indication of homology. Similarity between the query and the sequences in the database may be measured by the percent identity, that is, the percentage of bases in the query that exactly match a corresponding region of a sequence from the database. It may also be measured by the degree of conservation, which finds matches between equivalent (redundant) codons or between amino acid residues with similar properties that don't alter the function of a protein (see Chapter 8).

Homology between sequences means the sequences are related evolutionarily. Two sequences either are or are not homologous; there's no degree of homology.

At the risk of oversimplifying a complex topic, I'll summarize a few facts about BLAST statistics. See the BLAST documentation for a complete picture. The output of a BLAST search reports a set of scores and statistics on the matches it has found, based on the raw score S, various parameters of the scoring algorithm, and properties of the query and database. The raw score S is a measure of similarity and the size of the match. The BLAST output lists the hits ranked by their E value.
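To make the similarity measures above concrete, here is a minimal sketch in Python. It assumes the two sequences have already been aligned to the same length, with "-" marking a gap. The function names and the match/mismatch values (+1 and -3) are illustrative assumptions, not the parameters or the algorithm that BLAST itself uses.

```python
def percent_identity(aligned_query, aligned_subject):
    """Percentage of alignment columns in which both sequences carry the same residue.

    Assumes the inputs are already aligned (equal length, '-' for gaps).
    A gap never counts as a match.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    matches = sum(
        1
        for q, s in zip(aligned_query, aligned_subject)
        if q == s and q != "-"
    )
    return 100.0 * matches / len(aligned_query)


def toy_raw_score(aligned_query, aligned_subject, match=1, mismatch=-3):
    """Toy ungapped score: add `match` for identical columns, `mismatch` otherwise.

    The default values are illustrative only. Gap characters are rejected
    because this sketch does not model gap penalties.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if "-" in aligned_query or "-" in aligned_subject:
        raise ValueError("this toy score handles ungapped alignments only")
    return sum(
        match if q == s else mismatch
        for q, s in zip(aligned_query, aligned_subject)
    )


if __name__ == "__main__":
    query = "ACGTACGTAC"
    subject = "ACGTTCGTAC"  # differs from the query at one position
    print(percent_identity(query, subject))  # 90.0
    print(toy_raw_score(query, subject))     # 6
```

Nine of the ten columns match, so the percent identity is 90.0 and the toy score is 9 x 1 + 1 x (-3) = 6. Note that a longer match with the same percent identity earns a higher score, which is why the raw score reflects both similarity and the size of the match.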
The E (expect) value of a match measures, roughly, the chances that the string matching (allowing for gaps) occurs in a randomly generated database of the same size and composition. The closer to 0 the E value is, the less likely it is that the match occurred by chance. In other words, the lower the E value, the better the match. As a general rule of thumb for BLASTN, an E value less than 1 may be a solid hit, and an E value of less than 10 may be worth looking at, but this is not a hard and fast rule. (Of course, proteins can be homologous with even a very small percent identity; the percent similarity is typically higher for homologous DNA.)

Now that you have the basics, let's write code to parse BLAST output. First, you separate the hits, then extract the sequence, and finally, you find the annotation showing the E value statistic.
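Before turning to the output files, the way the E value depends on the score and the database size can be sketched in a few lines of Python. The BLAST documentation describes the expected number of chance hits scoring at least S as E = K * m * n * exp(-lambda * S), where m is the query length and n is the database length. The sketch below assumes that form; the values of K and lambda, the hit names, and the scores are placeholders chosen only for illustration (real values of K and lambda depend on the scoring scheme), and BLAST applies further corrections that are not modeled here.

```python
import math


def expect_value(raw_score, query_len, db_len, K=0.7, lam=1.3):
    """Expected number of chance hits with score >= raw_score.

    Uses E = K * m * n * exp(-lambda * S). K and lam are placeholder
    values for illustration, not parameters taken from a real search.
    """
    return K * query_len * db_len * math.exp(-lam * raw_score)


def rank_hits(hits, query_len, db_len):
    """Return (name, E value) pairs sorted from lowest (best) E value to highest.

    `hits` is a list of (name, raw_score) pairs; the names are hypothetical.
    """
    scored = [(name, expect_value(score, query_len, db_len)) for name, score in hits]
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    hypothetical_hits = [("hit_A", 40), ("hit_B", 18), ("hit_C", 25)]
    for name, e_value in rank_hits(hypothetical_hits, query_len=400, db_len=1_000_000):
        print(f"{name}\t{e_value:.2e}")
```

Two properties follow directly from the formula: a higher raw score gives a lower E value, and doubling the size of the database doubles the E value for the same score. That is why the same alignment can look more or less convincing depending on the database it was found in.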

12.3 BLAST Output Files