Using Arrays Parsing Annotations

    ^(LOCUS.*ORIGIN\s*\n)

This whole expression is anchored at the beginning of the string by preceding it with a ^ metacharacter. (/s doesn't change the meaning of the ^ character in a regular expression.) Inside the parentheses, you match from where the string LOCUS appears at the beginning of the GenBank record, followed by any number of characters including newlines with .*, followed by the string ORIGIN, followed by possibly some whitespace with \s*, followed by a newline \n. This matches the annotation part of the GenBank record.

Now, look at the second parentheses and the remainder:

    (.*)\/\/\n

This is easier. The .* matches any character, including newlines because of the /s pattern modifier at the end of the pattern match. The parentheses are followed by the end-of-record line, //, including the newline at the end, with the slashes preceded by backslashes to show that you want to match them exactly. They're not delimiters of the pattern matching operator. The end result is the GenBank record with the annotation and the sequence separated into the variables $annotation and $sequence.

Although the regular expression I used requires a bit of explanation, the attractive thing about this approach is that it took only one line of Perl code to extract both annotation and sequence.
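As a minimal sketch of the pattern just described, the following applies it to a tiny, made-up record held in a string. The record contents and the subroutine name split_record are assumptions made for illustration; only the regular expression and the variable names $annotation and $sequence come from the discussion above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A tiny, made-up GenBank-style record (illustrative only).
my $record = join '',
    "LOCUS       TEST0001       12 bp    DNA\n",
    "DEFINITION  A made-up record.\n",
    "ORIGIN      \n",
    "        1 acgtacgtac gt\n",
    "//\n";

# Split a record into its annotation and sequence parts.
# Returns an empty list if the record does not match the pattern.
# (split_record is a hypothetical helper name.)
sub split_record {
    my ($rec) = @_;
    if ($rec =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s) {
        return ($1, $2);
    }
    return;
}

my ($annotation, $sequence) = split_record($record);

print "Annotation:\n$annotation";
print "Sequence:\n$sequence";
```

Returning an empty list on failure lets the caller test whether the record was well formed before using either part.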

10.4 Parsing Annotations

Now that you've successfully extracted the sequence, let's look at parsing the annotations of a GenBank file. Looking at a GenBank record, it's interesting to think about how to extract the useful information. The FEATURES table is certainly a key part of the story. It has considerable structure: what should be preserved, and what is unnecessary? For instance, sometimes you just want to see if a word such as "endonuclease" appears anywhere in the record. For this, you just need a subroutine that searches for any regular expression in the annotation. Sometimes this is enough, but when detailed surgery is necessary, Perl has the necessary tools to make the operation successful.
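A subroutine of the kind just mentioned might look like the following sketch. The name search_annotation, its argument order, and the sample annotation text are assumptions for illustration, not a definitive implementation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return 1 if the pattern matches anywhere in the annotation, else 0.
# $pattern is expected to be a precompiled qr// regular expression.
# (search_annotation is a hypothetical helper name.)
sub search_annotation {
    my ($annotation, $pattern) = @_;
    return $annotation =~ /$pattern/ ? 1 : 0;
}

# A made-up annotation fragment (illustrative only).
my $annotation = join '',
    "LOCUS       TEST0001       12 bp    DNA\n",
    "DEFINITION  Made-up restriction endonuclease gene.\n";

if (search_annotation($annotation, qr/endonuclease/i)) {
    print "Found a match for endonuclease\n";
} else {
    print "No match\n";
}
```

Passing the pattern as a qr// object keeps modifiers such as /i attached to the pattern itself, so the subroutine doesn't need to know about them.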

10.4.1 Using Arrays

Example 10-3 parses a few pieces of information from the annotations in a GenBank file. It does this using the data in the form of an array.

Example 10-3. Parsing GenBank annotations using arrays

    #!/usr/bin/perl
    # Parsing GenBank annotations using arrays

    use strict;
    use warnings;
    use BeginPerlBioinfo;     # see Chapter 6 about this module

    # Declare and initialize variables
    my @genbank = ();
    my $locus = '';
    my $accession = '';
    my $organism = '';

    # Get GenBank file data
    @genbank = get_file_data('record.gb');

    # Let's start with something simple. Let's get some of the identifying
    # information, let's say the locus and accession number (here the same
    # thing) and the definition and the organism.

    for my $line (@genbank) {
        if ($line =~ /^LOCUS/) {
            $line =~ s/^LOCUS\s*//;
            $locus = $line;
        } elsif ($line =~ /^ACCESSION/) {
            $line =~ s/^ACCESSION\s*//;
            $accession = $line;
        } elsif ($line =~ /^  ORGANISM/) {
            $line =~ s/^\s*ORGANISM\s*//;
            $organism = $line;
        }
    }

    print "*** LOCUS ***\n";
    print $locus;
    print "*** ACCESSION ***\n";
    print $accession;
    print "*** ORGANISM ***\n";
    print $organism;

    exit;

Here's the output from Example 10-3:

    *** LOCUS ***
    AB031069 2487 bp mRNA PRI 27-MAY-2000
    *** ACCESSION ***
    AB031069
    *** ORGANISM ***
    Homo sapiens

Now let's slightly extend that program to handle the DEFINITION field. Notice that the DEFINITION field can extend over more than one line. To collect that field, use a trick you've already seen in Example 10-1: set a flag when you're in the "state" of collecting a definition. The flag variable is called, unsurprisingly, $flag.

Example 10-4. Parsing GenBank annotations using arrays, take 2

    #!/usr/bin/perl
    # Parsing GenBank annotations using arrays, take 2

    use strict;
    use warnings;
    use BeginPerlBioinfo;     # see Chapter 6 about this module

    # Declare and initialize variables
    my @genbank = ();
    my $locus = '';
    my $accession = '';
    my $organism = '';
    my $definition = '';
    my $flag = 0;

    # Get GenBank file data
    @genbank = get_file_data('record.gb');

    # Let's start with something simple. Let's get some of the identifying
    # information, let's say the locus and accession number (here the same
    # thing) and the definition and the organism.

    for my $line (@genbank) {
        if ($line =~ /^LOCUS/) {
            $line =~ s/^LOCUS\s*//;
            $locus = $line;
        } elsif ($line =~ /^DEFINITION/) {
            $line =~ s/^DEFINITION\s*//;
            $definition = $line;
            $flag = 1;              # start collecting the definition
        } elsif ($line =~ /^ACCESSION/) {
            $line =~ s/^ACCESSION\s*//;
            $accession = $line;
            $flag = 0;              # the definition has ended
        } elsif ($flag) {
            chomp($definition);     # join the continuation line
            $definition .= $line;
        } elsif ($line =~ /^  ORGANISM/) {
            $line =~ s/^\s*ORGANISM\s*//;
            $organism = $line;
        }
    }

    print "*** LOCUS ***\n";
    print $locus;
    print "*** DEFINITION ***\n";
    print $definition;
    print "*** ACCESSION ***\n";
    print $accession;
    print "*** ORGANISM ***\n";
    print $organism;

    exit;

Example 10-4 outputs:

    *** LOCUS ***
    AB031069 2487 bp mRNA PRI 27-MAY-2000
    *** DEFINITION ***
    Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1, complete cds.
    *** ACCESSION ***
    AB031069
    *** ORGANISM ***
    Homo sapiens

This use of flags to remember which part of the file you're in, from one iteration of a loop to the next, is a common technique when extracting information from files that have multiline sections. As the files and their fields get more complex, the code must keep track of many flags at a time to remember which part of the file it's in and what information needs to be extracted. It works, but as the files become more complex, so does the code. It becomes hard to read and hard to modify. So let's look at regular expressions as a vehicle for parsing annotations.
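To make the bookkeeping concrete, here is a hedged sketch in which two multiline fields are collected in the same loop. The subroutine name collect_fields, the choice of DEFINITION and KEYWORDS, and the sample lines are all assumptions made for illustration. Instead of one flag per field, a single $current variable records which field is being collected, but the loop still has to carry that state from one iteration to the next.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect selected top-level fields, joining continuation lines.
# Takes a list of record lines; returns a hash of field name => text.
# (collect_fields is a hypothetical helper name.)
sub collect_fields {
    my (@lines) = @_;
    my %wanted  = map { $_ => 1 } qw(DEFINITION KEYWORDS);
    my %field;
    my $current = '';    # name of the field being collected, if any

    for my $line (@lines) {
        chomp(my $text = $line);
        if ($text =~ /^([A-Z]+)\s*(.*)/) {
            # A keyword in column one starts a new top-level field
            $current = $wanted{$1} ? $1 : '';
            $field{$current} = $2 if $current;
        } elsif ($current) {
            # An indented line continues the field being collected
            $text =~ s/^\s+//;
            $field{$current} .= " $text";
        }
    }
    return %field;
}

# Made-up lines in GenBank layout (illustrative only)
my @lines = (
    "LOCUS       TEST0001       12 bp    DNA\n",
    "DEFINITION  A made-up record whose definition\n",
    "            runs over two lines.\n",
    "ACCESSION   TEST0001\n",
    "KEYWORDS    example;\n",
    "            sketch.\n",
    "ORIGIN\n",
);

my %field = collect_fields(@lines);
print "$_: $field{$_}\n" for sort keys %field;
```

This sketch treats every indented line as a continuation of the current field, which is only safe for fields without indented subfields; handling something like SOURCE with its ORGANISM subfield would need more state, which is the growth in complexity described above.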
