IT-SC 208
fundamental ideas we've just seen: repetition, alternation, and concatenation. For instance, the character class shown earlier can be written using alternation as C|G|T.
Another common feature is the period, which can stand for any character except a newline. So ACG.*GCA stands for any DNA that starts with ACG and ends with GCA. In English, this reads as: ACG followed by 0 or more characters followed by GCA.
In Perl, regular expressions are usually enclosed within forward slashes and are used as pattern-matching specifiers. Check the documentation (or Appendix B) for m//, which includes some options that affect the behavior of the regular expressions. Regular expressions are also used in many of Perl's built-in commands, as you will see.
The Perl documentation is essential: start with the perlre section of the Perl manual at http://www.perldoc.com/perl5.6/pod/perlre.html#top.
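The points above about alternation and the period can be tried out with a short sketch. The sample sequence is made up for illustration, and [CGT] is assumed to be the character class that corresponds to the alternation C|G|T. The last two tests contrast a bare period, which matches exactly one character, with a period followed by an asterisk, which matches zero or more.

```perl
use strict;
use warnings;

# Hypothetical sample sequence, chosen only for illustration.
my $dna = 'TTACGTTAGGCAAT';

# A character class and the equivalent alternation match the same single bases.
# [CGT] is an assumed class corresponding to the alternation C|G|T.
my $class_hits = () = $dna =~ /[CGT]/g;
my $alt_hits   = () = $dna =~ /C|G|T/g;

# A bare period matches exactly one character (other than a newline);
# following it with * allows zero or more such characters.
my $one_char  = ($dna =~ /ACG.GCA/)  ? 1 : 0;
my $any_chars = ($dna =~ /ACG.*GCA/) ? 1 : 0;

print "character class hits: $class_hits, alternation hits: $alt_hits\n";
print "ACG.GCA matches: $one_char\n";
print "ACG.*GCA matches: $any_chars\n";
```

In this sample, four bases separate ACG from GCA, so only the pattern with the asterisk succeeds.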
9.2 Restriction Maps and Restriction Enzymes
One of the great discoveries in molecular biology, which paved the way for the current golden age in biological research, was the discovery of restriction enzymes. For the nonbiologist, and to help set up the programming material that follows, here's a short overview.
9.2.1 Background
Restriction enzymes are proteins that cut DNA at short, specific sequences; for example, the popular restriction enzymes EcoRI and HindIII are widely used in the lab. EcoRI cuts where it finds GAATTC, between the G and A. Actually, it cuts both complementary strands, leaving an overhang on each end. These sticky ends of a few bases in single strands make it possible for the fragments to re-form, which allows the insertion of DNA into vectors for cloning and sequencing, for instance. HindIII cuts at AAGCTT, between the A's. Some restriction enzymes cut in the middle and result in blunt ends with no overhang. About 1,000 restriction enzymes are known.
If you look at the reverse complement of the EcoRI recognition site, you see it's GAATTC, the same sequence. This is a biological version of a palindrome, a word that reads the same in reverse. Many restriction sites are palindromes.
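The palindrome property is easy to check in code. The following is a minimal sketch; the subroutine names revcom and is_palindromic_site are chosen here for illustration, and GATTACA is included only as a made-up counterexample.

```perl
use strict;
use warnings;

# Return the reverse complement of a DNA string (A, C, G, T in either case).
sub revcom {
    my ($dna) = @_;
    my $rc = reverse $dna;            # scalar context: reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;     # complement each base
    return $rc;
}

# A site is a biological palindrome when it equals its own reverse complement.
sub is_palindromic_site {
    my ($site) = @_;
    return uc($site) eq uc(revcom($site)) ? 1 : 0;
}

for my $site ('GAATTC', 'AAGCTT', 'GATTACA') {
    printf "%s -> %s (%s)\n", $site, revcom($site),
        is_palindromic_site($site) ? 'palindrome' : 'not a palindrome';
}
```

Both recognition sites mentioned in the text, GAATTC and AAGCTT, come back unchanged from revcom.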
Computing restriction maps is a common and practical bioinformatics calculation in the laboratory. Restriction maps are computed to plan experiments, to find the best way to cut DNA to insert a gene, to make a site-specific mutation, or for several other applications of recombinant DNA techniques. By computing first, the laboratory scientist saves considerably on the necessary trial-and-error at the laboratory bench. Look for more about restriction enzymes at http://www.neb.com/rebase/rebase.html.
We'll now write a program that does something useful in the lab: it will look for restriction enzyme sites in a sequence of DNA and report back with a restriction map of exactly where in the DNA those sites appear.
9.2.2 Planning the Program
Back in Chapter 5, you saw how to look for regular expressions in text. So you've an idea of how to find motifs in sequences with Perl. Now let's think about how to use those techniques to create restriction maps. Here are some questions to ask:
Where do I find restriction enzyme data?
Restriction enzyme data can be found at the Restriction Enzyme Database, REBASE, which is on the Web at http://www.neb.com/rebase/rebase.html.
How do I represent restriction enzymes in regular expressions?
Exploring that site, you'll see that restriction enzymes are represented in their own language. We'll try to translate that language into the language of regular expressions.
How do I store restriction enzyme data?
There are about 1,000 restriction enzymes with names and definitions. This makes them candidates for the fast key-value type of lookup hashes provide. When you write a real application, say for the Web, it's a good idea to create a DBM file to store the information, ready to use when a program needs a lookup. I will cover DBM files in Chapter 10; here, I'll just demonstrate the principle. We'll keep only a few restriction enzyme definitions in the program.
How do I accept queries from the user?
You can ask for a restriction enzyme name, or you can allow the user to type in a regular expression directly. We'll do the first. Also, you want to let the user specify which sequence to use. Again, to simplify matters, you'll just read in the data from a sample DNA file.
How do I report back the restriction map to the user?
This is an important question. The simplest way is to generate a list of positions with the names of the restriction enzymes found there. This is useful for further processing, as it presents the information very simply. But what if you don't want to do further processing and just want to communicate the restriction map to the user? Then, perhaps it'd be more useful to present a graphical display, perhaps print out the sequence with a line above it that flags the presence of the enzymes. There are lots of fancy bells and whistles you can use, but let's do it the simple way for now and output a list.
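The representation and storage questions above can be sketched together. This is a minimal illustration, not the REBASE format itself: it assumes recognition sites are written with the standard IUPAC nucleotide ambiguity codes and an optional ^ marking the cut position. The names %iupac, site_to_regex, and %rebase_hash are hypothetical, and GTYRAC is used only as an example of a site containing ambiguity codes.

```perl
use strict;
use warnings;

# IUPAC nucleotide ambiguity codes mapped to regular expression character classes.
my %iupac = (
    A => 'A', C => 'C', G => 'G', T => 'T',
    R => '[GA]', Y => '[CT]', M => '[AC]', K => '[GT]',
    S => '[GC]', W => '[AT]',
    B => '[CGT]', D => '[AGT]', H => '[ACT]', V => '[ACG]',
    N => '[ACGT]',
);

# Translate a recognition site such as G^AATTC into a regular expression string.
# The ^ cut-site marker (an assumed notation) is dropped; unknown symbols are rejected.
sub site_to_regex {
    my ($site) = @_;
    my $regex = '';
    for my $char (split //, uc $site) {
        next if $char eq '^';
        die "Unknown symbol '$char' in site '$site'\n" unless exists $iupac{$char};
        $regex .= $iupac{$char};
    }
    return $regex;
}

# Enzyme names are the hash keys; translated regular expressions are the values.
my %rebase_hash = (
    EcoRI   => site_to_regex('G^AATTC'),
    HindIII => site_to_regex('A^AGCTT'),
);

print "$_ => $rebase_hash{$_}\n" for sort keys %rebase_hash;
print 'GTYRAC => ', site_to_regex('GTYRAC'), "\n";
```

Keeping the translation in one subroutine means the hash can later be filled from a data file rather than from definitions typed into the program.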
So, the plan is to write a program that includes restriction enzyme data translated into regular expressions and stored in a hash as values keyed by restriction enzyme name. DNA sequence data will be read from a file, and the user will be prompted for names of restriction enzymes. The appropriate regular expression will be retrieved from the hash, and we'll search for all instances of that regular expression, plus their locations. Finally, the list of locations found will be returned.
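The plan can be sketched in a few lines before the details are worked out. This is an illustration under stated assumptions: the DNA string and the list of queries are hard-coded stand-ins for the file and the user prompt, the two hash entries are the recognition sites given in the background section, and the subroutine name match_positions is hypothetical.

```perl
use strict;
use warnings;

# Return the 1-based start position of every match of $regex in $sequence.
sub match_positions {
    my ($regex, $sequence) = @_;
    my @positions;
    while ($sequence =~ /$regex/ig) {
        # $-[0] is the 0-based offset where the last match started.
        push @positions, $-[0] + 1;
    }
    return @positions;
}

# Enzyme name => regular expression for its recognition site.
my %rebase_hash = (
    EcoRI   => 'GAATTC',
    HindIII => 'AAGCTT',
);

# Hypothetical stand-ins for the sample DNA file and the user's queries.
my $dna     = 'CCGAATTCAAAGCTTGGGAATTCA';
my @queries = ('EcoRI', 'HindIII', 'NoSuchEnzyme');

for my $name (@queries) {
    if (exists $rebase_hash{$name}) {
        my @locations = match_positions($rebase_hash{$name}, $dna);
        print "$name: ", (@locations ? join(' ', @locations) : 'no sites found'), "\n";
    } else {
        print "$name: not in the hash\n";
    }
}
```

The /g modifier makes each pass through the loop resume after the previous match, so matches that overlap one another are not reported by this sketch.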
9.2.3 Restriction Enzyme Data