IT-SC 243
        $line =~ s/^\s*ORGANISM\s*//;
        $organism = $line;
    }
}

print "LOCUS\n";
print $locus;
print "DEFINITION\n";
print $definition;
print "ACCESSION\n";
print $accession;
print "ORGANISM\n";
print $organism;

exit;
Example 10-4 outputs:
    LOCUS
    AB031069 2487 bp mRNA PRI 27-MAY-2000
    DEFINITION
    Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1, complete cds.
    ACCESSION
    AB031069
    ORGANISM
    Homo sapiens
This use of flags to remember which part of the file you're in, from one iteration of a loop to the next, is a common technique when extracting information from files that have
multiline sections. As the files and their fields get more complex, the code must keep track of many flags at a time to remember which part of the file it's in and what
information needs to be extracted. It works, but as the files become more complex, so does the code. It becomes hard to read and hard to modify. So let's look at regular
expressions as a vehicle for parsing annotations.
10.4.2 When to Use Regular Expressions
We've used two methods to parse GenBank files: regular expressions, and looping through arrays of lines and setting flags. We used both methods to separate the annotation
from the sequence in a previous section of this chapter. Both methods were equally well suited, since in GenBank files, the annotation is followed by the sequence, clearly
delimited by an ORIGIN line: a simple structure. However, parsing the annotations
seems a bit more complicated; therefore, let's try to use regular expressions to accomplish the task.
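To make the "simple structure" concrete, here is a minimal sketch of the regular-expression approach to splitting a record at its ORIGIN line. The record text is made up for illustration, and the variable names are assumptions rather than the names used elsewhere in this chapter.

```perl
use strict;
use warnings;

# A tiny, made-up GenBank-style record (hypothetical data, for illustration only).
my $record = <<'END';
LOCUS       TEST0001       12 bp    DNA
DEFINITION  Made-up record.
ORIGIN
        1 acgtacgtac gt
//
END

# Capture everything up to and including the ORIGIN line as the annotation,
# and everything between the ORIGIN line and the closing // as the sequence.
# The /s modifier lets "." match newlines as well.
my ($annotation, $dna) = ($record =~ /^(LOCUS.*?\nORIGIN[^\n]*\n)(.*)\/\/\n/s);

# Strip position numbers and whitespace so the sequence is one plain string.
$dna =~ s/[\s0-9]//g;

print "Annotation:\n$annotation";
print "Sequence: $dna\n";
```

A single pattern match does the work that the flag-based loop needs several lines for, which is why this part of the job suits either method.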
To begin, let's wrap the code we've been working on into some convenient subroutines to focus on parsing the annotations. You'll want to fetch GenBank records one at a time
from a library (a file containing one or more GenBank records), extract the annotations and the sequence, and then, if desired, parse the annotations. This would be useful if, say,
you were looking for some motif in a GenBank library. Then you can search for the motif, and, if found, you can parse the annotations to look for additional information about the
sequence.
As mentioned previously, we'll use the file library.gb, which you can download from this book's web site.
Since dealing with annotation data is somewhat complex, let's take a minute to break our tasks into convenient subroutines. Here's the pseudocode:
    sub open_file
        given the filename, return the filehandle

    sub get_next_record
        given the filehandle, get the record
        (we can get the offset by first calling "tell")

    sub get_annotation_and_dna
        given a record, split it into annotation and cleaned-up sequence

    sub search_sequence
        given a sequence and a regular expression,
        return array of locations of hits

    sub search_annotation
        given a GenBank annotation and a regular expression,
        return array of locations of hits

    sub parse_annotation
        separate out the fields of the annotation in a convenient form

    sub parse_features
        given the features field, separate out the components
The idea is to make a subroutine for each important task you want to accomplish and then combine them into useful programs. Some of these can be combined into other
subroutines: for instance, perhaps you want to open a file and get the record from it, all in one subroutine call.
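As a rough sketch of how the first two subroutines in the pseudocode might look, the following assumes that every record ends with a line containing only //, so that Perl's input record separator can be set to read a whole record at once. The demonstration data and the in-memory "file" are made up for illustration; this is one possible shape, not necessarily the final code.

```perl
use strict;
use warnings;

# Open a file for reading and return the filehandle.
sub open_file {
    my ($filename) = @_;
    open(my $fh, '<', $filename) or die "Cannot open $filename: $!";
    return $fh;
}

# Read one record from an open filehandle. Assumes each record ends with
# a line containing only "//", so the input record separator is set to
# "//\n" for the duration of the read.
sub get_next_record {
    my ($fh) = @_;
    local $/ = "//\n";
    my $record = <$fh>;
    return $record;    # undef at end of file
}

# Demonstration on an in-memory "file" holding two made-up records.
# (Passing a reference to a string makes Perl read from that string.)
my $library = "LOCUS A\nORIGIN\n 1 acgt\n//\nLOCUS B\nORIGIN\n 1 ttga\n//\n";
my $fh = open_file(\$library);
my @records;
while (defined(my $record = get_next_record($fh))) {
    push @records, $record;
}
close $fh;
print scalar(@records), " records read\n";
```

Because get_next_record takes a filehandle rather than a filename, the same subroutine can be called repeatedly in a loop on one open library file.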
You're designing these subroutines to work with library files, that is, files with multiple GenBank records. You pass the filehandle into the subroutines as an argument, so that
your subroutines can access open library files as represented by the filehandles. Doing so enables you to have a get_next_record function, which is handy in a loop. Using
the Perl function tell also allows you to save the byte offset of any record of interest, and then return later and extract the record at that byte offset very quickly. A byte offset is
just the number of characters into the file where the information of interest lies. The operating system supports Perl in letting you go immediately to any byte offset location
in even huge files, thus bypassing the usual way of opening the file and reading from the beginning until you get where you want to be.
Using a byte offset is important when you're dealing with large files. Perl gives you built-in functions such as seek that allow you, on an open file, to go immediately to any
location in the file. The idea is that when you find something in a file, you can save the byte offset using the Perl function tell. Then, when you want to return to that point in the
file, you can just call the Perl function seek with the byte offset as an argument. You'll see this later in this chapter when you build a DBM file to look up records based on their
accession numbers. But the main point is that with a 250-MB file, it takes too long to find something by searching from the beginning, and there are ways of getting around it.
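The tell and seek idea can be sketched in a few lines. This example writes a small, made-up library to a temporary file, records where each record starts, and then jumps straight back to one of them. The record contents, the %offset_of hash, and the use of a temporary file are all assumptions made for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small made-up library file with three records.
my ($out, $filename) = tempfile(UNLINK => 1);
print $out "LOCUS A\n//\n", "LOCUS B\n//\n", "LOCUS C\n//\n";
close $out;

open(my $fh, '<', $filename) or die "Cannot open $filename: $!";

# First pass: note the byte offset at which each record starts.
my %offset_of;
{
    local $/ = "//\n";              # read one record at a time
    my $offset = tell($fh);         # position before reading the record
    while (defined(my $record = <$fh>)) {
        my ($name) = $record =~ /^LOCUS\s+(\S+)/;
        $offset_of{$name} = $offset;
        $offset = tell($fh);        # position where the next record starts
    }
}

# Later: jump straight to record B without rereading the file from the top.
# The third argument, 0, means the offset is measured from the start of the file.
seek($fh, $offset_of{B}, 0) or die "seek failed: $!";
my $found;
{
    local $/ = "//\n";
    $found = <$fh>;
}
close $fh;

print "Record B starts at byte $offset_of{B}:\n$found";
```

With a three-record file the saving is invisible, but the same pattern applies unchanged to a library of hundreds of megabytes, where saved offsets avoid a scan from the beginning.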
The parsing of the data is done in three steps, according to the design:
1. You'll separate out the annotation and the sequence (which you'll clean up by removing whitespace, etc., and making it a simple string of sequence). Even at this step, you can search for motifs in the sequence, as well as look for text in the annotation.
2. Extract out the fields.
3. Parse the features table.
These steps seem natural, and, depending on what you want to do, allow you to parse to whatever depth is needed.
Here's a main program in pseudocode that shows how to use those subroutines:

    open_file

    while ( get_next_record ) {

        get_annotation_and_dna

        if ( search_sequence for a motif AND
             search_annotation for chromosome 22 ) {

            parse_annotation

            parse_features to get sizes of exons, look for small sizes
        }
    }

    return accession numbers of records meeting the criteria
This example shows how to use subroutines to answer a question such as: what are the genes on chromosome 22 that contain a given motif and have small exons?
10.4.3 Main Program