IT-SC 238
10.3.2.3 Separating annotations from sequence
Now that youve met the pattern-matching modifiers and regular expressions that will be your main tools for parsing a GenBank file as a scalar, lets try separating the annotations
from the sequence.
The first step is to get the GenBank record stored as a scalar variable. Recall that a GenBank record starts with a line beginning with the word LOCUS and ends with the
end-of-record separator: a line containing two forward slashes.
First you want to read a GenBank record and store it in a scalar variable. Theres a device called an input record separator denoted by the special variable
that lets you do exactly that. The input record separator is usually set to a newline, so each call to read a
scalar from a filehandle gets one line. Set it to the GenBank end-of-record separator like so:
= \n; A call to read a scalar from a filehandle takes all the data up to the GenBank end-of-
record separator. So the line record
= GBFILE
in Example 10-2
stores the multiline GenBank record into the scalar variable
record . Later, youll see that you
can keep repeating this call in order to read in successive GenBank records from a GenBank library file.
After reading in the record, youll parse it into the annotation and sequence parts making use of
s and
m pattern modifiers. Extracting the annotation and sequence is the easy
part; parsing the annotation will occupy most of the remainder of the chapter.
Example 10-2. Extract annotation and sequence from Genbank record
usrbinperl Extract the annotation and sequence sections from the
first record of a GenBank library
use strict; use warnings;
use BeginPerlBioinfo; see Chapter 6 about this module Declare and initialize variables
my annotation = ; my dna = ;
my record = ; my filename = record.gb;
my save_input_separator = ; Open GenBank library file
unless openGBFILE, filename {
IT-SC 239
print Cannot open GenBank file \filename\\n\n; exit;
} Set input separator to \n and read in a record to a
scalar = \n;
record = GBFILE; reset input separator
= save_input_separator; Now separate the annotation from the sequence data
annotation, dna = record =~ LOCUS.ORIGIN\s\n.\\\ns;
Print the two pieces, which should give us the same as the
original GenBank file, minus the at the end print annotation, dna;
exit;
The output from this program is the same as the GenBank file listed previously, minus the last line, which is the end-of-record separator
. Lets focus on the regular expression that parses the annotation and sequence out of the
record variable. This is the most complicated regular expression so far:
record = LOCUS.ORIGIN\s\n.\\\ns. There are two pairs of parentheses in the regular expression:
LOCUS.ORIGIN\s\n and
. . The parentheses are metacharacters whose purpose is to remember the parts of
the data that match the pattern within the parentheses, namely, the annotation and the sequence. Also note that the pattern match returns an array whose elements are the
matched parenthetical patterns. After you match the annotation and the sequence within the pairs of parentheses in the regular expression, you simply assign the matched patterns
to the two variables
annotation and
dna , like so:
annotation, dna = record =~ LOCUS.ORIGIN\s\n.\\\ns;
Notice that at the end of the pattern, weve added the s
pattern matching modifier, which, as youve seen earlier, allows a dot to match any character including an embedded
newline. Of course, since weve got a whole GenBank record in the record
scalar, there are a lot of embedded newlines.
Next, look at the first pair of parentheses:
IT-SC 240
LOCUS.ORIGIN\s\n This whole expression is anchored at the beginning of the string by preceding it with a
metacharacter. s
doesnt change the meaning of the character in a regular expression.
Inside the parentheses, you match from where the string LOCUS
appears at the beginning of the GenBank record, followed by any number of characters including newlines
with .
, followed by the string ORIGIN
, followed by possibly some whitespace with \s
, followed by a newline \n
. This matches the annotation part of the GenBank record. Now, look at the second parentheses and the remainder:
.\\\n This is easier. The
. matches any character, including newlines because of the
s pattern modifier at the end of the pattern match. The parentheses are followed by the end-
of-record line, , including the newline at the end, with the slashes preceded by
backslashes to show that you want to match them exactly. Theyre not delimiters of the pattern matching operator. The end result is the GenBank record with the annotation and
the sequence separated into the variables annotation
and sequence
. Although the regular expression I used requires a bit of explanation, the attractive thing about this
approach is that it took only one line of Perl code to extract both annotation and sequence.
10.4 Parsing Annotations