IT-SC 259
5 untranslated region leader These feature keys can have their own additional features, which youll see here and in
the exercises.
10.4.5.2 Parsing
Example 10-7 finds whatever features are present and returns an array populated with
them. It doesnt look for the complete list of features as presented in the last section; it finds just the features that are actually present in the GenBank record and returns them
for further use.
Its often the case that there are multiple instances of the same feature in a record. For instance, there may be several exons specified in the FEATURES table of a GenBank
record. For this reason well store the features as elements in an array, rather than in a hash keyed on the feature name as this allows you to store, for instance, only one
instance of an exon.
Example 10-7. Testing subroutine parse_features
usrbinperl - main program to test parse_features
use strict; use warnings;
use BeginPerlBioinfo; see Chapter 6 about this module Declare and initialize variables
my fh; my record;
my dna; my annotation;
my fields; my features;
my library = library.gb; Get the fields from the first GenBank record in a library
fh = open_filelibrary; record = get_next_recordfh;
annotation, dna = get_annotation_and_dnarecord; fields = parse_annotationannotation;
Extract the features from the FEATURES table features = parse_featuresfields{FEATURES};
IT-SC 260
Print out the features foreach my feature features {
extract the name of the feature or feature key myfeaturename = feature =~ {5}\S+;
print featurename \n; print feature;
} exit;
Subroutine parse_features
extract the features from the FEATURES field of a GenBank record
sub parse_features { myfeatures = _; entire FEATURES field in a
scalar variable Declare and initialize variables
myfeatures = ; used to store the individual features
Extract the features while features =~ {5}\S.\n {21}\S.\ngm {
my feature = ; pushfeatures, feature;
} return features;
}
Example 10-7 gives the output:
source source 1..2487
organism=Homo sapiens db_xref=taxon:9606
sex=male
IT-SC 261
cell_line=HuS-L12 cell_type=lung fibroblast
dev_stage=embryo gene
gene 229..2199 gene=PCCX1
CDS CDS 229..2199
gene=PCCX1 note=a nuclear protein carrying a
PHD finger and a CXXC domain
codon_start=1 product=protein containing CXXC
domain 1 protein_id=BAA96307.1
db_xref=GI:8100075 translation=MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD
NCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEP RDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSPQPLVATPSQHHQQQQQ
QIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCRLRQCQLRARESYKY FPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATATP
EPLSDEDLPLDPDLYQDFCAGAFDDHGLPWMSDTEESPFLDPALRKRAVKVKHVKRRE KKSEKKKEERYKRHRQKQKHKDKWKHPERADAKDPASLPQCLGPGCVRPAQPSSKYCS
DDCGMKLAANRIYEILPQRIQQWQQSPCIAEEHGKKLLERIRREQQSARTRLQEMERR FHELEAIILRAKQQAVREDEESNEGDSDDTDLQIFCVSCGHPINPRVALRHMERCYAK
YESQTSFGSMYPTRIEGATRLFCDVYNPQSKTYCKRLQVLCPEHSRDPKVPADEVCGC PLVRDVFELTGDFCRLPKRQCNRHYCWEKLRRAEVDLERVRVWYKLDELFEQERNVRT
AMTNRAGLLALMLHQTIQHDPLTTDLRSSADR
In subroutine parse_features
of Example 10-7
, the regular expression that extracts the features is much like the regular expression used in
Example 10-6 to parse
the top level of the annotations. Lets look at the essential parsing code of Example 10-
7 :
while features =~ {5}\S.\n {21}\S.\ngm { On the whole, and in brief, this regular expression finds features formatted with the first
IT-SC 262
lines beginning with 5 spaces, and optional continuation lines beginning with 21 spaces. First, note that the pattern modifier
m enables the
metacharacter to match after embedded newlines. Also, the
{5} and
{21} are quantifiers that specify there should be
exactly 5, or 21, of the preceding item, which in both cases is a space. The regular expression is in two parts, corresponding to the first line and optional
continuation lines of the feature. The first part {5}\S.\n
means that the beginning of a line
has 5 spaces {5}
, followed by a non-whitespace character \S
followed by any number of non-newlines
. followed by a newline
\n . The second part of the
regular expression, {21}\S.\n
means the beginning of a line has 21 spaces
{21} followed by a non-whitespace character
\S followed by any number of non-
newlines .
followed by a newline \n
; and there may be or more such lines,
indicated by the around the whole expression.
The main program has a short regular expression along similar lines to extract the feature name also called the feature key from the feature.
So, again, success. The FEATURES table is now decomposed or parsed in some detail, down to the level of separating the individual features. The next stage in parsing the
FEATURES table is to extract the detailed information for each feature. This includes the location given on the same line as the feature name, and possibly on additional lines;
and the qualifiers indicated by a slash, a qualifier name, and if applicable, an equals sign and additional information of various kinds, possibly continued on additional lines.
Ill leave this final step for the exercises. Its a fairly straightforward extension of the approach weve been using to parse the features. You will want to consult the
documentation from the NCBI web site for complete details about the structure of the FEATURES table before trying to parse the location and qualifiers from a feature.
The method Ive used to parse the FEATURES table maintains the structure of the information. However, sometimes you just want to see if some word such as
endonulease appears anywhere in the record. For this, recall that you created a
search_annotation subroutine in
Example 10-5 that searches for any regular
expression in the entire annotation; very often, this is all you really need. As youve now seen, however, when you really need to dig into the FEATURES table, Perl has its own
features that make the job possible and even fairly easy.
10.5 Indexing GenBank with DBM