IT-SC 309
Next, you want to save just those positions (or columns) of these lines that have the sequence or structure information; you don't need the keywords, position numbers, or the
PDB entry name at the end of the lines.
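The column extraction just described can be sketched with Perl's built-in substr function. This is only an illustration: the sample line, the offset of 10, and the width of 50 are assumptions made for the sketch, not values taken from the stride documentation.

```perl
use strict;
use warnings;

# Hypothetical fixed-width line: a keyword and position number (10 columns),
# then 50 columns of data, then a trailing position number and entry name.
# The layout is an assumption made for illustration.
my $line = 'SEQ  1    '
         . 'GGLQVKNFDFTVGKFLTVGGFINNSPQRFSVNVGESMNSLSLHLDHRFNY'
         . '   50          1A6M';

# substr(STRING, OFFSET, LENGTH) counts the offset from zero,
# so offset 10 is the eleventh column of the line.
my $data = substr($line, 10, 50);

print "$data\n";
```

Because substr works on character positions rather than on patterns, it keeps any spaces that fall inside the chosen columns, which matters for the structure lines.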
Finally, join the arrays into single strings. Here, there's one detail to handle: you need to remove any unneeded spaces from the ends of the strings. Notice that stride sometimes
leaves spaces in the structure prediction, and in this example, it has left some at the end of the structure prediction. So you shouldn't throw away all the spaces at the ends of the
strings. Instead, throw away all the spaces at the end of the sequence string, because they are just superfluous spaces on the line. Then count how many spaces that was, and throw
the same number away at the end of the structure prediction string, thus preserving the spaces that correspond to undetermined secondary structure.
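The trimming step can be sketched as follows. This is a minimal illustration, not necessarily the code used in the example program; the subroutine name trim_pair and the sample strings are hypothetical.

```perl
use strict;
use warnings;

# trim_pair is a hypothetical helper name used only for this sketch.
# It removes the trailing spaces from the sequence string, then removes
# the same number of characters from the end of the structure string.
sub trim_pair {
    my ($seq, $str) = @_;

    my $before = length $seq;
    $seq =~ s/ +$//;                      # drop trailing spaces only
    my $removed = $before - length $seq;  # how many spaces were dropped

    # Never try to remove more characters than the structure string holds
    $removed = length $str if $removed > length $str;

    # Four-argument substr replaces the last $removed characters with ''
    substr($str, length($str) - $removed, $removed, '');

    return ($seq, $str);
}

# Made-up data: two padding spaces follow the sequence, while the
# structure string ends in five spaces, three of which are meaningful.
my ($seq, $str) = trim_pair('ACDEFG  ', 'HHH     ');
print "[$seq]\n";   # prints [ACDEFG]
print "[$str]\n";   # prints [HHH   ]
```

After trimming, the two strings have the same length, so each character of the structure string still lines up with its residue.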
Example 11-7 contains a main program that calls two subroutines. Because they are short, they are all included in the program, so there's no need here for the BeginPerlBioinfo module. Here's the output of
Example 11-7:
GGLQVKNFDFTVGKFLTVGGFINNSPQRFSVNVGESMNSLSLHLDHRFNYGADQNTIVM NSTLKGDNGWETEQRSTNFTL
TTTTTTBTTT EEEEEEETTTT EEEEEEEEETTEEEEEEEEEEEETTEEEEEEEEEETTGGG B EEE
The first line shows the amino acids, and the second line shows the prediction of the secondary structure. Check the next section for a subroutine that will improve that output.
11.6 Exercises
Exercise 11.1 Use File::Find and the file test operators to find the oldest and largest files on
the hard drive of your computer. You can delete them or store them elsewhere if you're running short on disk space.
Exercise 11.2 Find all the Perl programs on your computer.
Hint: Use File::Find. What do all Perl programs have in common?
Exercise 11.3 Parse the HEADER, TITLE, and KEYWORDS record types of all PDB files on your computer. Make a hash in which each key is a word from those record types and
each value is a list of the filenames that contain that word. Save it as a DBM file and build a query program for it. In the end, you should be able to ask for, say, "sugar,"
and get a list of all PDB files that contain that word in the HEADER, TITLE, or KEYWORDS records.
Exercise 11.4 Parse out the record types of a PDB file using regular expressions as used in
Chapter 10 instead of iterating through an array of input lines as in this
chapter.
Exercise 11.5 Write a program that extracts the secondary structure information contained in the
HELIX, SHEET, and TURN record types of PDB files. Print out the secondary structure and the primary sequence together, so that it's easy to see which
secondary structure a given residue belongs to. Consider using a special alphabet for secondary structure, so that every residue in a helix is represented by H, for
example.
Exercise 11.6 Write a program that finds all PDB files under a given folder and runs a program,
such as stride or the program you wrote in Exercise 11.5, that reports on the secondary structure of each PDB file. Store the results in a DBM file keyed on the
filename.
Exercise 11.7 Write a subroutine that, given two strings, prints them out one over the other, but
with line breaks similar to the stride program output. Use this subroutine to print out the strings from Example 11-7.
Exercise 11.8 Write a recursive subroutine to determine the size of an array. You may want to
use the pop or unshift functions. Ignore the fact that scalar(@array) returns the size of @array.
Exercise 11.9 Write a recursive subroutine that extracts the primary amino acid sequence from
the SEQRES record type of a PDB file.
Exercise 11.10 (Extra credit) Given an atom and a distance, find all other atoms in a PDB file that are within that distance of the atom.
Exercise 11.11 (Extra credit) Write a program to find some correlation between the primary
amino acid sequence and the location of alpha helices.
Chapter 12. BLAST
In biological research, the search for sequence similarity is very important. For instance, a researcher who has discovered a potentially important DNA or protein sequence wants
to know if it's already been identified and characterized by another researcher. If it hasn't, the researcher wants to know if it resembles any known sequence from any organism.
This information can provide vital clues as to the role of the sequence in the organism.
The Basic Local Alignment Search Tool (BLAST) is one of the most popular software tools in biological research. It tests a query sequence against a library of known
sequences in order to find similarity. BLAST is actually a collection of programs, with versions for query-to-database pairs such as nucleotide-nucleotide, protein-nucleotide,
protein-protein, nucleotide-protein, and more.
This chapter examines the output from the nucleotide-nucleotide version of the program, BLASTN. For simplicity's sake, I'll simply refer to it here as BLAST. The main goal of
this chapter is to show how to write code to parse a BLAST output file using regular expressions. The code is simple and basic, but it does the job. Once you understand the
basics, you can build more features into your parser or obtain one of the fancier BLAST output parsers that are available via the Web. In either case, you'll know enough about
output parsers to use or extend them.
This chapter also gives you a brief introduction to Bioperl, which is a collection of Perl bioinformatics modules. The Bioperl project is an example of an open source project that
you, the Perl bioinformatics programmer, can put to good use. The Perl programming language is itself an open source project. The program and its source code are available
for use and modification with only very reasonable restrictions and at no cost.
12.1 Obtaining BLAST
There are several implementations of BLAST. The most popular is probably the one offered free of charge by the National Center for Biotechnology Information (NCBI):
http://www.ncbi.nlm.nih.gov/BLAST/. The NCBI web site features a publicly
available BLAST server, a comprehensive set of databases, and a well-organized collection of documents and tutorials, in addition to the BLAST software available for
downloading.
Also popular is the WU-BLAST implementation from Washington University. The main web site, including a list of other WU-BLAST servers, can be found at
http://blast.wustl.edu. Older versions of WU-BLAST are available at no charge.
Newer versions are free if you qualify as a research or nonprofit organization and agree to the licensing arrangements from Washington University, where the program is
developed and maintained. If you work at a major research organization, you may already have a site license for the WU-BLAST program. If you are a for-profit company, there is
a rather hefty charge for the newer WU-BLAST program older versions are freely