Parsing Parsing the FEATURES Table

IT-SC 259 5 untranslated region leader These feature keys can have their own additional features, which youll see here and in the exercises.

10.4.5.2 Parsing

Example 10-7 finds whatever features are present and returns an array populated with them. It doesnt look for the complete list of features as presented in the last section; it finds just the features that are actually present in the GenBank record and returns them for further use. Its often the case that there are multiple instances of the same feature in a record. For instance, there may be several exons specified in the FEATURES table of a GenBank record. For this reason well store the features as elements in an array, rather than in a hash keyed on the feature name as this allows you to store, for instance, only one instance of an exon. Example 10-7. Testing subroutine parse_features usrbinperl - main program to test parse_features use strict; use warnings; use BeginPerlBioinfo; see Chapter 6 about this module Declare and initialize variables my fh; my record; my dna; my annotation; my fields; my features; my library = library.gb; Get the fields from the first GenBank record in a library fh = open_filelibrary; record = get_next_recordfh; annotation, dna = get_annotation_and_dnarecord; fields = parse_annotationannotation; Extract the features from the FEATURES table features = parse_featuresfields{FEATURES}; IT-SC 260 Print out the features foreach my feature features { extract the name of the feature or feature key myfeaturename = feature =~ {5}\S+; print featurename \n; print feature; } exit; Subroutine parse_features extract the features from the FEATURES field of a GenBank record sub parse_features { myfeatures = _; entire FEATURES field in a scalar variable Declare and initialize variables myfeatures = ; used to store the individual features Extract the features while features =~ {5}\S.\n {21}\S.\ngm { my feature = ; pushfeatures, feature; } return features; } Example 10-7 gives the output: source source 1..2487 organism=Homo sapiens db_xref=taxon:9606 sex=male IT-SC 261 cell_line=HuS-L12 cell_type=lung fibroblast dev_stage=embryo gene gene 229..2199 gene=PCCX1 CDS CDS 229..2199 gene=PCCX1 note=a nuclear protein carrying a PHD finger and a CXXC domain codon_start=1 product=protein containing CXXC domain 1 protein_id=BAA96307.1 db_xref=GI:8100075 translation=MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD NCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEP RDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSPQPLVATPSQHHQQQQQ QIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCRLRQCQLRARESYKY FPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATATP EPLSDEDLPLDPDLYQDFCAGAFDDHGLPWMSDTEESPFLDPALRKRAVKVKHVKRRE KKSEKKKEERYKRHRQKQKHKDKWKHPERADAKDPASLPQCLGPGCVRPAQPSSKYCS DDCGMKLAANRIYEILPQRIQQWQQSPCIAEEHGKKLLERIRREQQSARTRLQEMERR FHELEAIILRAKQQAVREDEESNEGDSDDTDLQIFCVSCGHPINPRVALRHMERCYAK YESQTSFGSMYPTRIEGATRLFCDVYNPQSKTYCKRLQVLCPEHSRDPKVPADEVCGC PLVRDVFELTGDFCRLPKRQCNRHYCWEKLRRAEVDLERVRVWYKLDELFEQERNVRT AMTNRAGLLALMLHQTIQHDPLTTDLRSSADR In subroutine parse_features of Example 10-7 , the regular expression that extracts the features is much like the regular expression used in Example 10-6 to parse the top level of the annotations. Lets look at the essential parsing code of Example 10- 7 : while features =~ {5}\S.\n {21}\S.\ngm { On the whole, and in brief, this regular expression finds features formatted with the first IT-SC 262 lines beginning with 5 spaces, and optional continuation lines beginning with 21 spaces. First, note that the pattern modifier m enables the metacharacter to match after embedded newlines. Also, the {5} and {21} are quantifiers that specify there should be exactly 5, or 21, of the preceding item, which in both cases is a space. The regular expression is in two parts, corresponding to the first line and optional continuation lines of the feature. The first part {5}\S.\n means that the beginning of a line has 5 spaces {5} , followed by a non-whitespace character \S followed by any number of non-newlines . followed by a newline \n . The second part of the regular expression, {21}\S.\n means the beginning of a line has 21 spaces {21} followed by a non-whitespace character \S followed by any number of non- newlines . followed by a newline \n ; and there may be or more such lines, indicated by the around the whole expression. The main program has a short regular expression along similar lines to extract the feature name also called the feature key from the feature. So, again, success. The FEATURES table is now decomposed or parsed in some detail, down to the level of separating the individual features. The next stage in parsing the FEATURES table is to extract the detailed information for each feature. This includes the location given on the same line as the feature name, and possibly on additional lines; and the qualifiers indicated by a slash, a qualifier name, and if applicable, an equals sign and additional information of various kinds, possibly continued on additional lines. Ill leave this final step for the exercises. Its a fairly straightforward extension of the approach weve been using to parse the features. You will want to consult the documentation from the NCBI web site for complete details about the structure of the FEATURES table before trying to parse the location and qualifiers from a feature. The method Ive used to parse the FEATURES table maintains the structure of the information. However, sometimes you just want to see if some word such as endonulease appears anywhere in the record. For this, recall that you created a search_annotation subroutine in Example 10-5 that searches for any regular expression in the entire annotation; very often, this is all you really need. As youve now seen, however, when you really need to dig into the FEATURES table, Perl has its own features that make the job possible and even fairly easy.

10.5 Indexing GenBank with DBM