Separating annotations from sequence

IT-SC 238

10.3.2.3 Separating annotations from sequence

Now that youve met the pattern-matching modifiers and regular expressions that will be your main tools for parsing a GenBank file as a scalar, lets try separating the annotations from the sequence. The first step is to get the GenBank record stored as a scalar variable. Recall that a GenBank record starts with a line beginning with the word LOCUS and ends with the end-of-record separator: a line containing two forward slashes. First you want to read a GenBank record and store it in a scalar variable. Theres a device called an input record separator denoted by the special variable that lets you do exactly that. The input record separator is usually set to a newline, so each call to read a scalar from a filehandle gets one line. Set it to the GenBank end-of-record separator like so: = \n; A call to read a scalar from a filehandle takes all the data up to the GenBank end-of- record separator. So the line record = GBFILE in Example 10-2 stores the multiline GenBank record into the scalar variable record . Later, youll see that you can keep repeating this call in order to read in successive GenBank records from a GenBank library file. After reading in the record, youll parse it into the annotation and sequence parts making use of s and m pattern modifiers. Extracting the annotation and sequence is the easy part; parsing the annotation will occupy most of the remainder of the chapter. Example 10-2. Extract annotation and sequence from Genbank record usrbinperl Extract the annotation and sequence sections from the first record of a GenBank library use strict; use warnings; use BeginPerlBioinfo; see Chapter 6 about this module Declare and initialize variables my annotation = ; my dna = ; my record = ; my filename = record.gb; my save_input_separator = ; Open GenBank library file unless openGBFILE, filename { IT-SC 239 print Cannot open GenBank file \filename\\n\n; exit; } Set input separator to \n and read in a record to a scalar = \n; record = GBFILE; reset input separator = save_input_separator; Now separate the annotation from the sequence data annotation, dna = record =~ LOCUS.ORIGIN\s\n.\\\ns; Print the two pieces, which should give us the same as the original GenBank file, minus the at the end print annotation, dna; exit; The output from this program is the same as the GenBank file listed previously, minus the last line, which is the end-of-record separator . Lets focus on the regular expression that parses the annotation and sequence out of the record variable. This is the most complicated regular expression so far: record = LOCUS.ORIGIN\s\n.\\\ns. There are two pairs of parentheses in the regular expression: LOCUS.ORIGIN\s\n and . . The parentheses are metacharacters whose purpose is to remember the parts of the data that match the pattern within the parentheses, namely, the annotation and the sequence. Also note that the pattern match returns an array whose elements are the matched parenthetical patterns. After you match the annotation and the sequence within the pairs of parentheses in the regular expression, you simply assign the matched patterns to the two variables annotation and dna , like so: annotation, dna = record =~ LOCUS.ORIGIN\s\n.\\\ns; Notice that at the end of the pattern, weve added the s pattern matching modifier, which, as youve seen earlier, allows a dot to match any character including an embedded newline. Of course, since weve got a whole GenBank record in the record scalar, there are a lot of embedded newlines. Next, look at the first pair of parentheses: IT-SC 240 LOCUS.ORIGIN\s\n This whole expression is anchored at the beginning of the string by preceding it with a metacharacter. s doesnt change the meaning of the character in a regular expression. Inside the parentheses, you match from where the string LOCUS appears at the beginning of the GenBank record, followed by any number of characters including newlines with . , followed by the string ORIGIN , followed by possibly some whitespace with \s , followed by a newline \n . This matches the annotation part of the GenBank record. Now, look at the second parentheses and the remainder: .\\\n This is easier. The . matches any character, including newlines because of the s pattern modifier at the end of the pattern match. The parentheses are followed by the end- of-record line, , including the newline at the end, with the slashes preceded by backslashes to show that you want to match them exactly. Theyre not delimiters of the pattern matching operator. The end result is the GenBank record with the annotation and the sequence separated into the variables annotation and sequence . Although the regular expression I used requires a bit of explanation, the attractive thing about this approach is that it took only one line of Perl code to extract both annotation and sequence.

10.4 Parsing Annotations