IT-SC 240
LOCUS.ORIGIN\s\n This whole expression is anchored at the beginning of the string by preceding it with a
metacharacter. s
doesnt change the meaning of the character in a regular expression.
Inside the parentheses, you match from where the string LOCUS
appears at the beginning of the GenBank record, followed by any number of characters including newlines
with .
, followed by the string ORIGIN
, followed by possibly some whitespace with \s
, followed by a newline \n
. This matches the annotation part of the GenBank record. Now, look at the second parentheses and the remainder:
.\\\n This is easier. The
. matches any character, including newlines because of the
s pattern modifier at the end of the pattern match. The parentheses are followed by the end-
of-record line, , including the newline at the end, with the slashes preceded by
backslashes to show that you want to match them exactly. Theyre not delimiters of the pattern matching operator. The end result is the GenBank record with the annotation and
the sequence separated into the variables annotation
and sequence
. Although the regular expression I used requires a bit of explanation, the attractive thing about this
approach is that it took only one line of Perl code to extract both annotation and sequence.
10.4 Parsing Annotations
Now that youve successfully extracted the sequence, lets look at parsing the annotations of a GenBank file.
Looking at a GenBank record, its interesting to think about how to extract the useful information. The FEATURES table is certainly a key part of the story. It has considerable
structure: what should be preserved, and what is unnecessary? For instance, sometimes you just want to see if a word such as endonuclease appears anywhere in the record.
For this, you just need a subroutine that searches for any regular expression in the annotation. Sometimes this is enough, but when detailed surgery is necessary, Perl has
the necessary tools to make the operation successful.
10.4.1 Using Arrays
Example 10-3 parses a few pieces of information from the annotations in a GenBank
file. It does this using the data in the form of an array.
Example 10-3. Parsing GenBank annotations using arrays
usrbinperl Parsing GenBank annotations using arrays
use strict;
IT-SC 241
use warnings; use BeginPerlBioinfo; see Chapter 6 about this module
Declare and initialize variables my genbank = ;
my locus = ; my accession = ;
my organism = ; Get GenBank file data
genbank = get_file_datarecord.gb; Lets start with something simple. Lets get some of the
identifying information, lets say the locus and accession number
here the same thing and the definition and the organism.
for my line genbank { ifline =~ LOCUS {
line =~ sLOCUS\s; locus = line;
}elsifline =~ ACCESSION { line =~ sACCESSION\s;
accession = line; }elsifline =~ ORGANISM {
line =~ s\sORGANISM\s; organism = line;
} }
print LOCUS \n; print locus;
print ACCESSION \n; print accession;
print ORGANISM \n; print organism;
exit;
Heres the output from Example 10-3
: LOCUS
AB031069 2487 bp mRNA PRI 27-MAY- 2000
ACCESSION AB031069
ORGANISM Homo sapiens
IT-SC 242
Now lets slightly extend that program to handle the DEFINITION field. Notice that the DEFINITION field can extend over more than one line. To collect that field, use a trick
youve already seen in Example 10-1
: set a flag when youre in the state of collecting a definition. The flag variable is called, unsurprisingly,
flag .
Example 10-4. Parsing GenBank annotations using arrays, take 2
usrbinperl Parsing GenBank annotations using arrays, take 2
use strict; use warnings;
use BeginPerlBioinfo; see Chapter 6 about this module Declare and initialize variables
my genbank = ; my locus = ;
my accession = ; my organism = ;
my definition = ; my flag = 0;
Get GenBank file data genbank = get_file_datarecord.gb;
Lets start with something simple. Lets get some of the identifying
information, lets say the locus and accession number here the same
thing and the definition and the organism. for my line genbank {
ifline =~ LOCUS { line =~ sLOCUS\s;
locus = line; }elsifline =~ DEFINITION {
line =~ sDEFINITION\s; definition = line;
flag = 1; }elsifline =~ ACCESSION {
line =~ sACCESSION\s; accession = line;
flag = 0; }elsifflag {
chompdefinition; definition .= line;
}elsifline =~ ORGANISM {
IT-SC 243
line =~ s\sORGANISM\s; organism = line;
} }
print LOCUS \n; print locus;
print DEFINITION \n; print definition;
print ACCESSION \n; print accession;
print ORGANISM \n; print organism;
exit;
Example 10-4 outputs:
LOCUS AB031069 2487 bp mRNA PRI 27-MAY-
2000 DEFINITION
Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1, complete cds.
ACCESSION AB031069
ORGANISM Homo sapiens
This use of flags to remember which part of the file youre in, from one iteration of a loop to the next, is a common technique when extracting information from files that have
multiline sections. As the files and their fields get more complex, the code must keep track of many flags at a time to remember which part of the file its in and what
information needs to be extracted. It works, but as the files become more complex, so does the code. It becomes hard to read and hard to modify. So lets look at regular
expressions as a vehicle for parsing annotations.
10.4.2 When to Use Regular Expressions