IT-SC 210
of restriction enzymes. The appropriate regular expression will be retrieved from the hash, and we'll search for all instances of that regular expression, plus their locations. Finally,
the list of locations found will be returned.
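As a rough illustration of the "search for all instances, plus their locations" step, here is a minimal sketch. The subroutine name match_positions and the sample sequence are made up for this example and are not taken from the chapter's code; it relies only on Perl's built-in global matching and the @- array of match offsets.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): return the
# 1-based start position of every match of $regexp in $sequence.
sub match_positions {
    my ($regexp, $sequence) = @_;
    my @positions = ();

    # In scalar context, each /g match resumes where the last one ended
    while ( $sequence =~ /$regexp/ig ) {
        # $-[0] is the 0-based offset where the current match starts
        push @positions, $-[0] + 1;
    }
    return @positions;
}

# Made-up sequence containing the site GGATCC twice
my @locations = match_positions('GGATCC', 'AAGGATCCTTGGATCCA');
print "@locations\n";    # prints "3 11"
```

Because each search resumes after the end of the previous match, overlapping occurrences of a site would not be reported by this sketch.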
9.2.3 Restriction Enzyme Data
The restriction enzyme data is available in a variety of formats, as a visit to the REBASE web site will show you. After looking around, you decide to get the information from the
bionet file, which has a fairly simple layout. Here's the header and a few restriction enzymes from that file:
REBASE version 104                                              bionet.104

    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    REBASE, The Restriction Enzyme Database   http://rebase.neb.com
    Copyright (c)  Dr. Richard J. Roberts, 2001.   All rights reserved.
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Rich Roberts                                                    Mar 30 2001

AaaI (XmaIII)                     CGGCCG
AacI (BamHI)                      GGATCC
AaeI (BamHI)                      GGATCC
AagI (ClaI)                       ATCGAT
AaqI (ApaLI)                      GTGCAC
AarI                              CACCTGCNNNN
AarI                              NNNNNNNNGCAGGTG
AatI (StuI)                       AGGCCT
AatII                             GACGTC
AauI (Bsp1407I)                   TGTACA
AbaI (BclI)                       TGATCA
AbeI (BbvCI)                      CCTCAGC
AbeI (BbvCI)                      GCTGAGG
AbrI (XhoI)                       CTCGAG
AcaI (AsuII)                      TTCGAA
AcaII (BamHI)                     GGATCC
AcaIII (MstI)                     TGCGCA
AcaIV (HaeIII)                    GGCC
AccI                              GTMKAC
AccII (FnuDII)                    CGCG
AccIII (BspMII)                   TCCGGA
Acc16I (MstI)                     TGCGCA
Acc36I (BspMI)                    ACCTGCNNNN
Acc36I (BspMI)                    NNNNNNNNGCAGGT
Acc38I (EcoRII)                   CCWGG
Acc65I (KpnI)                     GGTACC
Acc113I (ScaI)                    AGTACT
AccB1I (HgiCI)                    GGYRCC
AccB2I (HaeII)                    RGCGCY
AccB7I (PflMI)                    CCANNNNNTGG
AccBSI (BsrBI)                    CCGCTC
AccBSI (BsrBI)                    GAGCGG
AccEBI (BamHI)                    GGATCC
AceI (TseI)                       GCWGC
AceII (NheI)                      GCTAGC
AceIII                            CAGCTCNNNNNNN
AceIII                            NNNNNNNNNNNGAGCTG
AciI                              CCGC
AciI                              GCGG
AclI                              AACGTT
AclNI (SpeI)                      ACTAGT
AclWI (BinI)                      GGATCNNNN
Your first task is to read this file and get the name and the recognition site (or restriction site) for each enzyme. To simplify matters for now, simply discard the parenthesized
enzyme names.
How can this data be read?

    Discard header lines

    For each data line:
        remove parenthesized names, for simplicity's sake
        get and store the name and the recognition site
        Translate the recognition sites to regular expressions
            --but keep the recognition site, for printing out results
    }

    return the names, recognition sites, and the regular expressions

This is high-level, undetailed pseudocode, so let's refine and expand it. (Notice that the curly brace isn't properly matched. That's okay, because there are no syntax rules for
pseudocode; do whatever works for you!) Here's some pseudocode that discards the header lines:

    foreach line
        if /Rich Roberts/
            break out of the foreach loop
    }

This is based on the format of the file, in which the string you're looking for is the last text before the data lines start. Of course, if the format of the file should change, this
might no longer work.
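One way the header-discarding pseudocode might look in Perl is sketched below. The input lines are a made-up, shortened stand-in for the bionet file, held in an array so the example runs without any external file.

```perl
use strict;
use warnings;

# Made-up sample in the bionet layout: header, marker line, then data
my @lines = (
    "REBASE version 104                bionet.104\n",
    "Rich Roberts                      Mar 30 2001\n",
    "\n",
    "AacI (BamHI)                      GGATCC\n",
    "AatII                             GACGTC\n",
);

# Throw away lines up to and including the one containing "Rich Roberts"
while ( @lines ) {
    my $line = shift @lines;
    last if $line =~ /Rich Roberts/;
}

# Whatever is left is data; also drop blank lines
my @data = grep { /\S/ } @lines;
print @data;
```

If the marker line is missing, the loop consumes every line and no data remains, which is the fragility the text warns about.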
Now let's further expand the pseudocode, thinking how to do the tasks involved:

    # Discard header lines
    # This keeps reading lines, up to a line containing "Rich Roberts"
    foreach line
        if /Rich Roberts/
            break out of the foreach loop
    }

    For each data line:

        # Split the two or three (if there's a parenthesized name) fields
        @fields = split( " ", $_);

        # Get and store the name and the recognition site
        $name = shift @fields;
        $site = pop @fields;

        # Translate the recognition sites to regular expressions
        # --but keep the recognition site, for printing out results
    }

    return the names, recognition sites, and the regular expressions

This doesn't yet handle the translation, but let's look at what you've done. First, you want to extract the name and recognition site data from a string. The most
common way to separate words in a string, especially if the string is nicely formatted, is with the Perl built-in function split.
If you have two or three fields per line that contain no whitespace and are separated from each other by whitespace, you can get them into an array with the following simple call to split
(which acts on the line as stored in the special variable $_):

    @fields = split;

The @fields array may have two or three elements, depending on whether there was a
parenthesized alternate enzyme name. But you always want the first and the last elements:

    $name = shift @fields;
    $site = pop @fields;

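To see the first-and-last-field idea on both kinds of data line, here is a small sketch. The two input lines are copied from the listing above; the @parsed array and the "name=site" output format are just for illustration.

```perl
use strict;
use warnings;

my @parsed = ();

# One line with a parenthesized name (three fields), one without (two)
foreach my $line ( "AacI (BamHI)    GGATCC\n", "AatII           GACGTC\n" ) {

    # The special pattern " " splits on runs of whitespace and skips
    # leading whitespace, so the trailing newline produces no extra field
    my @fields = split " ", $line;

    my $name = shift @fields;    # first field: enzyme name
    my $site = pop @fields;      # last field: recognition site

    push @parsed, "$name=$site";
}

print "@parsed\n";    # prints "AacI=GGATCC AatII=GACGTC"
```

Taking the first and last elements means the code never needs to test whether the middle, parenthesized field is present.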
You now have the problem of translating the recognition site to a regular expression. Looking over the recognition sites and having read the documentation on REBASE you
found on its web site, you know that the cut site is represented by the caret (^). This doesn't help make a regular expression that finds the site in sequence, so you should
remove it (see Exercise 9.6 in Section 9.4).
Also notice that the bases given in the recognition sites are not just the bases A, C, G, and T; they also use the more extended alphabet presented in Table 4-1. These
additional letters include a letter for every possible group of two, three, or four bases. They're really like abbreviations for character classes in that respect. Aha! Let's write a
subroutine that substitutes character classes for these codes, and then we'll have our regular expression.
Of course, REBASE uses them, because a given restriction enzyme might well match a few different recognition sites.
Example 9-1 is a subroutine that, given a string, translates these codes into character
classes.
Example 9-1. Translate IUB ambiguity codes to regular expressions
    # IUB_to_regexp
    #
    # A subroutine that, given a sequence with IUB ambiguity codes,
    # outputs a translation with IUB codes changed to regular expressions
    #
    # These are the IUB ambiguity codes
    # (Eur. J. Biochem. 150: 1-5, 1985):
    # R = G or A
    # Y = C or T
    # M = A or C
    # K = G or T
    # S = G or C
    # W = A or T
    # B = not A (C or G or T)
    # D = not C (A or G or T)
    # H = not G (A or C or T)
    # V = not T (A or C or G)
    # N = A or C or G or T

    sub IUB_to_regexp {

        my($iub) = @_;

        my $regular_expression = '';

        my %iub2character_class = (
            A => 'A',      C => 'C',
            G => 'G',      T => 'T',
            R => '[GA]',   Y => '[CT]',
            M => '[AC]',   K => '[GT]',
            S => '[GC]',   W => '[AT]',
            B => '[CGT]',  D => '[AGT]',
            H => '[ACT]',  V => '[ACG]',
            N => '[ACGT]',
        );

        # Remove the ^ signs from the recognition sites
        $iub =~ s/\^//g;

        # Translate each character in the iub sequence
        for ( my $i = 0 ; $i < length($iub) ; ++$i ) {
            $regular_expression
              .= $iub2character_class{substr($iub, $i, 1)};
        }

        return $regular_expression;
    }

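For comparison, the same translation can be sketched more compactly with split and map. This is an alternative illustration, not the book's code: the name iub_to_regexp_compact is hypothetical, the caret position in the sample site is only for demonstration, and, unlike Example 9-1, this version dies on a character that isn't in the table.

```perl
use strict;
use warnings;

my %class_for = (
    A => 'A',     C => 'C',     G => 'G',     T => 'T',
    R => '[GA]',  Y => '[CT]',  M => '[AC]',  K => '[GT]',
    S => '[GC]',  W => '[AT]',  B => '[CGT]', D => '[AGT]',
    H => '[ACT]', V => '[ACG]', N => '[ACGT]',
);

# Hypothetical compact variant of IUB_to_regexp
sub iub_to_regexp_compact {
    my ($iub) = @_;

    $iub =~ s/\^//g;                   # drop cut-site markers
    my @codes = split //, uc $iub;     # one element per character

    # Assumption of this sketch: unknown codes are treated as fatal
    foreach my $code (@codes) {
        die "Unknown IUB code '$code'\n" unless exists $class_for{$code};
    }

    return join '', map { $class_for{$_} } @codes;
}

# Sample site with an illustrative caret
my $regexp = iub_to_regexp_compact('G^GYRCC');
print "$regexp\n";                                     # prints "GG[CT][GA]CC"
print "site found\n" if 'TTGGCACCAA' =~ /$regexp/;     # prints "site found"
```

The resulting string can be interpolated directly into a match, since each ambiguity code has become an ordinary character class.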
It seems you're almost ready to write a subroutine to get the data from the REBASE datafile. But there's one important item you haven't addressed: what exactly is the data
you want to return?
You plan to return three data items per line of the original REBASE file: the enzyme name, the recognition site, and the regular expression. This doesn't fit easily into a hash.
You can return an array that stores these three data items in three consecutive slots. This can work: to read the data, you'd have to read groups of three items from the array. It's
doable but might make lookup a little difficult. As you get into more advanced Perl, you'll find that you can create your own complex data structures.
Since you've learned about split, maybe you can have a hash in which the key is the enzyme name, and the value is a string with the recognition site and the regular
expression separated by whitespace. Then you can look up the data fast and just extract the desired values using split.
Example 9-2 shows this method.
Example 9-2. Subroutine to parse a REBASE datafile
    # parseREBASE--Parse REBASE bionet file
    #
    # A subroutine to return a hash where
    #    key   = restriction enzyme name
    #    value = whitespace-separated recognition site and regular expression

    sub parseREBASE {

        my($rebasefile) = @_;

        use strict;
        use warnings;
        use BeginPerlBioinfo;     # see Chapter 6 about this module

        # Declare variables
        my @rebasefile = (  );
        my %rebase_hash = (  );
        my $name;
        my $site;
        my $regexp;

        # Read in the REBASE file
        @rebasefile = get_file_data($rebasefile);

        foreach ( @rebasefile ) {
            # Discard header lines
            ( 1 .. /Rich Roberts/ ) and next;
            # Discard blank lines
            /^\s*$/ and next;
            # Split the two (or three if includes parenthesized name) fields
            my @fields = split( " ", $_);
            # Get and store the name and the recognition site
            # Remove parenthesized names, for simplicity's sake,
            # by not saving the middle field, if any,
            # just the first and last
            $name = shift @fields;
            $site = pop @fields;
            # Translate the recognition sites to regular expressions
            $regexp = IUB_to_regexp($site);
            # Store the data into the hash
            $rebase_hash{$name} = "$site $regexp";
        }

        # Return the hash containing the reformatted REBASE data
        return %rebase_hash;
    }

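A lookup against a hash of this shape might go as follows. The hash here is built by hand as a stand-in for the return value of parseREBASE, so the sketch runs without the REBASE file or the BeginPerlBioinfo module; the two entries use sites from the listing above.

```perl
use strict;
use warnings;

# Hand-built stand-in for the hash parseREBASE returns:
# key = enzyme name, value = "recognition-site regular-expression"
my %rebase_hash = (
    AccI   => 'GTMKAC GT[AC][GT]AC',
    Acc38I => 'CCWGG CC[AT]GG',
);

my $enzyme = 'AccI';

if ( exists $rebase_hash{$enzyme} ) {
    # Split the stored string back into its two parts
    my ($site, $regexp) = split " ", $rebase_hash{$enzyme};
    print "$enzyme recognizes $site, searched for as /$regexp/\n";
} else {
    print "$enzyme is not in the hash\n";
}
```

One consequence of keying on the enzyme name is worth keeping in mind: an enzyme that appears on more than one data line, such as AarI in the listing above, ends up with only the last of its entries, because each assignment to the same key overwrites the previous value.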
This parseREBASE subroutine does quite a lot. Is there, however, too much in one subroutine; should it be rewritten? It's a good question to ask yourself as you're writing
code. In this case, let's leave it as it is. However, in addition to doing a lot, it also does it in a few new ways, which we'll look at now.
9.2.4 Logical Operators and the Range Operator