Restriction Enzyme Data Restriction Maps and Restriction Enzymes

of restriction enzymes. The appropriate regular expression will be retrieved from the hash, and we'll search for all instances of that regular expression, plus their locations. Finally, the list of locations found will be returned.
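As a rough sketch of that search step, the following shows one way to collect every position at which a regular expression matches a sequence. The subroutine name find_sites and the sample sequence are illustrative assumptions, not code from this chapter. It uses only the /g modifier in a while loop and Perl's built-in @- array of match offsets.

```perl
use strict;
use warnings;

# find_sites (hypothetical name): return the 1-based start positions
# of every non-overlapping match of $regexp in $sequence
sub find_sites {
    my ($regexp, $sequence) = @_;
    my @positions = ();

    # In scalar context, /g resumes searching where the last match ended
    while ($sequence =~ /$regexp/ig) {
        # $-[0] is the 0-based offset at which the whole match starts
        push @positions, $-[0] + 1;
    }
    return @positions;
}

# Made-up sequence containing two GGATCC sites
my @found = find_sites('GGATCC', 'AAGGATCCTTTTGGATCCAA');
print "@found\n";    # prints "3 13"
```

Because the search resumes after the end of each match, overlapping occurrences of a site would not be reported by this sketch.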

9.2.3 Restriction Enzyme Data

The restriction enzyme data is available in a variety of formats, as a visit to the REBASE web site will show you. After looking around, you decide to get the information from the bionet file, which has a fairly simple layout. Here's the header and a few restriction enzymes from that file:

    REBASE version 104                                    bionet.104

        =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
        REBASE, The Restriction Enzyme Database   http://rebase.neb.com
        Copyright (c)  Dr. Richard J. Roberts, 2001.   All rights reserved.
        =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

    Rich Roberts                                          Mar 30 2001

    AaaI (XmaIII)                     CGGCCG
    AacI (BamHI)                      GGATCC
    AaeI (BamHI)                      GGATCC
    AagI (ClaI)                       ATCGAT
    AaqI (ApaLI)                      GTGCAC
    AarI                              CACCTGCNNNN
    AarI                              NNNNNNNNGCAGGTG
    AatI (StuI)                       AGGCCT
    AatII                             GACGTC
    AauI (Bsp1407I)                   TGTACA
    AbaI (BclI)                       TGATCA
    AbeI (BbvCI)                      CCTCAGC
    AbeI (BbvCI)                      GCTGAGG
    AbrI (XhoI)                       CTCGAG
    AcaI (AsuII)                      TTCGAA
    AcaII (BamHI)                     GGATCC
    AcaIII (MstI)                     TGCGCA
    AcaIV (HaeIII)                    GGCC
    AccI                              GTMKAC
    AccII (FnuDII)                    CGCG
    AccIII (BspMII)                   TCCGGA
    Acc16I (MstI)                     TGCGCA
    Acc36I (BspMI)                    ACCTGCNNNN
    Acc36I (BspMI)                    NNNNNNNNGCAGGT
    Acc38I (EcoRII)                   CCWGG
    Acc65I (KpnI)                     GGTACC
    Acc113I (ScaI)                    AGTACT
    AccB1I (HgiCI)                    GGYRCC
    AccB2I (HaeII)                    RGCGCY
    AccB7I (PflMI)                    CCANNNNNTGG
    AccBSI (BsrBI)                    CCGCTC
    AccBSI (BsrBI)                    GAGCGG
    AccEBI (BamHI)                    GGATCC
    AceI (TseI)                       GCWGC
    AceII (NheI)                      GCTAGC
    AceIII                            CAGCTCNNNNNNN
    AceIII                            NNNNNNNNNNNGAGCTG
    AciI                              CCGC
    AciI                              GCGG
    AclI                              AACGTT
    AclNI (SpeI)                      ACTAGT
    AclWI (BinI)                      GGATCNNNN

Your first task is to read this file and get the name and the recognition site (or restriction site) for each enzyme. To simplify matters for now, simply discard the parenthesized enzyme names. How can this data be read?
    # Discard header lines
    # For each data line:
    #    remove parenthesized names, for simplicity's sake
    #    get and store the name and the recognition site
    #    Translate the recognition sites to regular expressions
    #      --but keep the recognition site, for printing out results
    # }
    # return the names, recognition sites, and the regular expressions

This is high-level, undetailed pseudocode, so let's refine and expand it. (Notice that the curly brace isn't properly matched. That's okay, because there are no syntax rules for pseudocode; do whatever works for you.) Here's some pseudocode that discards the header lines:

    foreach line
        if /Rich Roberts/
            break out of the foreach loop
    }

This is based on the format of the file, in which the string you're looking for is the last text before the data lines start. (Of course, if the format of the file should change, this might no longer work.)

Now let's further expand the pseudocode, thinking about how to do the tasks involved:

    # Discard header lines
    # This keeps reading lines, up to a line containing "Rich Roberts"
    foreach line
        if /Rich Roberts/
            break out of the foreach loop
    }

    For each data line:

        # Split the two or three (if there's a parenthesized name) fields
        @fields = split(" ", $_);

        # Get and store the name and the recognition site
        $name = shift @fields;
        $site = pop @fields;

        # Translate the recognition sites to regular expressions
        #   --but keep the recognition site, for printing out results
    }

    return the names, recognition sites, and the regular expressions

This doesn't do the translation yet, but let's look at what you've done so far.

First, you want to extract the name and recognition site data from a string. The most common way to separate the words in a line of text in Perl, especially if the string is nicely formatted, is with the Perl built-in function split.
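The expanded pseudocode can be tried out as a small runnable sketch. Everything here is illustrative rather than the chapter's own code: the lines are a hand-typed stand-in for the file, the parentheses around the alternate name follow the text's description of the format, and the $in_header flag is just one simple way to skip everything up to the "Rich Roberts" line.

```perl
use strict;
use warnings;

# Hand-typed stand-in for a few lines of the bionet file
my @lines = (
    "REBASE version 104                          bionet.104\n",
    "\n",
    "Rich Roberts                                Mar 30 2001\n",
    "\n",
    "AacI (BamHI)                      GGATCC\n",
    "AatII                             GACGTC\n",
);

my $in_header = 1;     # true until the "Rich Roberts" line has been seen
my @parsed    = ();

foreach my $line (@lines) {

    # Discard header lines, up to and including the "Rich Roberts" line
    if ($in_header) {
        $in_header = 0 if $line =~ /Rich Roberts/;
        next;
    }

    # Discard blank lines
    next if $line =~ /^\s*$/;

    # A pattern of " " splits on runs of whitespace, so the
    # trailing newline never ends up in the last field
    my @fields = split(" ", $line);

    # Keep the first and last fields; any middle field is dropped
    my $name = shift @fields;
    my $site = pop @fields;

    push @parsed, "$name $site";
}

print "$_\n" foreach @parsed;
# prints:
# AacI GGATCC
# AatII GACGTC
```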
If you have two or three fields per line that contain no whitespace themselves and are separated from each other by whitespace, you can get them into an array with the following simple call to split (which acts on the line as stored in the special variable $_):

    @fields = split(" ", $_);

The @fields array may have two or three elements, depending on whether a parenthesized alternate enzyme name was given. But you always want the first and the last elements:

    $name = shift @fields;
    $site = pop @fields;

You now have the problem of translating the recognition site to a regular expression. Looking over the recognition sites, and having read the REBASE documentation you found on its web site, you know that the cut site is represented by the caret (^). This doesn't help make a regular expression that finds the site in a sequence, so you should remove it (see Exercise 9.6 in Section 9.4).

Also notice that the bases given in the recognition sites are not just A, C, G, and T; they also use the extended alphabet presented in Table 4-1. These additional letters include a letter for every possible group of two, three, or four bases. REBASE uses them because a given restriction enzyme might well match a few different recognition sites. In that respect, they're really like abbreviations for character classes. Aha! Let's write a subroutine that substitutes character classes for these codes, and then we'll have our regular expression.

Example 9-1 is a subroutine that, given a string, translates these codes into character classes.

Example 9-1. Translate IUB ambiguity codes to regular expressions

    # IUB_to_regexp
    #
    # A subroutine that, given a sequence with IUB ambiguity codes,
    # outputs a translation with IUB codes changed to regular expressions
    #
    # These are the IUB ambiguity codes
    # (Eur. J. Biochem. 150: 1-5, 1985):
    # R = G or A
    # Y = C or T
    # M = A or C
    # K = G or T
    # S = G or C
    # W = A or T
    # B = not A (C or G or T)
    # D = not C (A or G or T)
    # H = not G (A or C or T)
    # V = not T (A or C or G)
    # N = A or C or G or T

    sub IUB_to_regexp {

        my($iub) = @_;

        my $regular_expression = '';

        my %iub2character_class = (
            A => 'A',
            C => 'C',
            G => 'G',
            T => 'T',
            R => '[GA]',
            Y => '[CT]',
            M => '[AC]',
            K => '[GT]',
            S => '[GC]',
            W => '[AT]',
            B => '[CGT]',
            D => '[AGT]',
            H => '[ACT]',
            V => '[ACG]',
            N => '[ACGT]',
        );

        # Remove the ^ signs from the recognition sites
        $iub =~ s/\^//g;

        # Translate each character in the iub sequence
        for ( my $i = 0 ; $i < length($iub) ; ++$i ) {
            $regular_expression
              .= $iub2character_class{substr($iub, $i, 1)};
        }

        return $regular_expression;
    }

It seems you're almost ready to write a subroutine to get the data from the REBASE datafile. But there's one important item you haven't addressed: what exactly is the data you want to return?

You plan to return three data items per line of the original REBASE file: the enzyme name, the recognition site, and the regular expression. This doesn't fit easily into a hash. You could return an array that stores these three data items in three consecutive slots. This can work, but to read the data, you'd have to read groups of three items from the array. It's doable but might make lookup a little difficult. As you get into more advanced Perl, you'll find that you can create your own complex data structures.

Since you've learned about split, maybe you can have a hash in which the key is the enzyme name, and the value is a string with the recognition site and the regular expression separated by whitespace. Then you can look up the data fast and just extract the desired values using split. Example 9-2 shows this method.

Example 9-2. Subroutine to parse a REBASE datafile

    # parseREBASE--Parse REBASE bionet file
    #
    # A subroutine to return a hash where
    #    key   = restriction enzyme name
    #    value = whitespace-separated recognition site and regular expression

    sub parseREBASE {

        my($rebasefile) = @_;

        use strict;
        use warnings;
        use BeginPerlBioinfo;     # see Chapter 6 about this module

        # Declare variables
        my @rebasefile = (  );
        my %rebase_hash = (  );
        my $name;
        my $site;
        my $regexp;

        # Read in the REBASE file
        @rebasefile = get_file_data($rebasefile);

        foreach ( @rebasefile ) {

            # Discard header lines
            ( 1 .. /Rich Roberts/ ) and next;

            # Discard blank lines
            /^\s*$/ and next;

            # Split the two (or three if includes parenthesized name) fields
            my @fields = split( " ", $_);

            # Get and store the name and the recognition site

            # Remove parenthesized names, for simplicity's sake,
            # by not saving the middle field, if any,
            # just the first and last
            $name = shift @fields;

            $site = pop @fields;

            # Translate the recognition sites to regular expressions
            $regexp = IUB_to_regexp($site);

            # Store the data into the hash
            $rebase_hash{$name} = "$site $regexp";
        }

        # Return the hash containing the reformatted REBASE data
        return %rebase_hash;
    }

This parseREBASE subroutine does quite a lot. Is there too much in one subroutine? Should it be rewritten? These are good questions to ask yourself as you're writing code. In this case, let's leave it as it is. However, in addition to doing a lot, it also does it in a few new ways, which we'll look at now.
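Before moving on, here is a minimal sketch of how a caller might use a hash laid out this way. The hash below is built by hand as a stand-in for what such a parser would return, and the test sequence is made up; only the "site, then regular expression, separated by a space" layout comes from the text.

```perl
use strict;
use warnings;

# Hand-built stand-in for the hash described in the text:
# key = enzyme name, value = "site regexp" separated by a space
my %rebase_hash = (
    'AccB1I' => 'GGYRCC GG[CT][GA]CC',
    'AceI'   => 'GCWGC GC[AT]GC',
);

# Look up one enzyme and pull the two values back apart with split
my ($site, $regexp) = split(" ", $rebase_hash{'AccB1I'});

# Made-up test sequence; GGTACC is one of the sites GGYRCC stands for
my $dna = 'TTGGTACCAA';
my $hit = ($dna =~ /$regexp/) ? 'yes' : 'no';

print "site=$site regexp=$regexp match=$hit\n";
# prints "site=GGYRCC regexp=GG[CT][GA]CC match=yes"
```

Packing both values into one string is safe here because neither a recognition site nor the character classes generated from it contain whitespace, so split recovers exactly two fields.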
