
fundamental ideas we've just seen: repetition, alternation, and concatenation. For instance, the character class shown earlier can be written using alternation as C|G|T. Another common feature is the period, which can stand for any character except a newline. So ACG.*GCA stands for any DNA that starts with ACG and ends with GCA. In English, this reads as: ACG followed by zero or more characters followed by GCA. In Perl, regular expressions are usually enclosed within forward slashes and are used as pattern-matching specifiers. Check the documentation, or Appendix B, for m//, which includes some options that affect the behavior of regular expressions. Regular expressions are also used in many of Perl's built-in commands, as you will see. The Perl documentation is essential: start with the perlre section of the Perl manual at http://www.perldoc.com/perl5.6/pod/perlre.html#top.
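To make the period and alternation concrete, here is a minimal sketch. The sample sequences and the subroutine names are made up for illustration, and the ^ and $ anchors are an added assumption so that the pattern must cover the whole string rather than match somewhere inside it.

```perl
use strict;
use warnings;

# Hypothetical sample sequences, chosen only for illustration.
my @samples = ('ACGTTTGCA', 'ACGGCA', 'ACGTTT', 'AAAA');

# Return 1 if the string starts with ACG and ends with GCA, with zero
# or more characters (anything except a newline) in between; else 0.
sub starts_acg_ends_gca {
    my ($dna) = @_;
    return $dna =~ /^ACG.*GCA$/ ? 1 : 0;
}

# Alternation: return 1 if the string contains a C, a G, or a T.
# This is equivalent to matching the character class [CGT].
sub has_c_g_or_t {
    my ($dna) = @_;
    return $dna =~ /C|G|T/ ? 1 : 0;
}

foreach my $dna (@samples) {
    printf "%-10s ACG.*GCA: %d   C|G|T: %d\n",
        $dna, starts_acg_ends_gca($dna), has_c_g_or_t($dna);
}
```

Note that ACGGCA matches, because .* is allowed to match zero characters.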

9.2 Restriction Maps and Restriction Enzymes

One of the great discoveries in molecular biology, which paved the way for the current golden age in biological research, was the discovery of restriction enzymes. For the nonbiologist, and to help set up the programming material that follows, here's a short overview.

9.2.1 Background

Restriction enzymes are proteins that cut DNA at short, specific sequences; for example, the popular restriction enzymes EcoRI and HindIII are widely used in the lab. EcoRI cuts where it finds GAATTC, between the G and the A. Actually, it cuts both complementary strands, leaving an overhang on each end. These sticky ends, a few single-stranded bases long, make it possible for the fragments to re-form, which in turn makes it possible to insert DNA into vectors for cloning and sequencing, for instance. HindIII cuts at AAGCTT, between the two As at the start of the site. Some restriction enzymes cut in the middle of their site and leave blunt ends with no overhang. About 1,000 restriction enzymes are known.

If you look at the reverse complement of the EcoRI recognition site, you see it's GAATTC, the same sequence. This is a biological version of a palindrome, a word that reads the same in reverse. Many restriction sites are palindromes.

Computing restriction maps is a common and practical bioinformatics calculation in the laboratory. Restriction maps are computed to plan experiments, to find the best way to cut DNA to insert a gene, to make a site-specific mutation, or for several other applications of recombinant DNA techniques. By computing first, the laboratory scientist saves considerably on the necessary trial-and-error at the laboratory bench. Look for more about restriction enzymes at http://www.neb.com/rebase/rebase.html.

We'll now write a program that does something useful in the lab: it will look for restriction enzyme sites in a sequence of DNA and report back with a restriction map of exactly where in the DNA those sites appear.
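The palindrome property of restriction sites is easy to check in code. The following is a minimal sketch, not part of the program developed in this chapter; the subroutine names revcom and is_palindromic_site are chosen here for illustration, and GATTACA is included only as a sequence that is not palindromic.

```perl
use strict;
use warnings;

# Return the reverse complement of a DNA string (A, C, G, T in
# either case; any other character is left unchanged).
sub revcom {
    my ($dna) = @_;
    my $rc = reverse $dna;          # scalar context: reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;   # complement each base
    return $rc;
}

# Return 1 if a site equals its own reverse complement, else 0.
sub is_palindromic_site {
    my ($site) = @_;
    return uc($site) eq uc(revcom($site)) ? 1 : 0;
}

foreach my $site ('GAATTC', 'AAGCTT', 'GATTACA') {
    printf "%-8s reverse complement %-8s palindrome: %s\n",
        $site, revcom($site), is_palindromic_site($site) ? 'yes' : 'no';
}
```

Both the EcoRI site GAATTC and the HindIII site AAGCTT come back unchanged from revcom, which is why a search on one strand also finds the site on the other.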

9.2.2 Planning the Program

Back in Chapter 5, you saw how to look for regular expressions in text. So you've an idea of how to find motifs in sequences with Perl. Now let's think about how to use those techniques to create restriction maps. Here are some questions to ask:

Where do I find restriction enzyme data?
Restriction enzyme data can be found at the Restriction Enzyme Database, REBASE, which is on the Web at http://www.neb.com/rebase/rebase.html.

How do I represent restriction enzymes in regular expressions?
Exploring that site, you'll see that restriction enzymes are represented in their own language. We'll try to translate that language into the language of regular expressions.

How do I store restriction enzyme data?
There are about 1,000 restriction enzymes with names and definitions. This makes them candidates for the fast key-value lookup that hashes provide. When you write a real application, say for the Web, it's a good idea to create a DBM file to store the information, ready to use when a program needs a lookup. I will cover DBM files in Chapter 10; here, I'll just demonstrate the principle. We'll keep only a few restriction enzyme definitions in the program.

How do I accept queries from the user?
You can ask for a restriction enzyme name, or you can allow the user to type in a regular expression directly. We'll do the first. Also, you want to let the user specify which sequence to use. Again, to simplify matters, you'll just read in the data from a sample DNA file.

How do I report back the restriction map to the user?
This is an important question. The simplest way is to generate a list of positions with the names of the restriction enzymes found there. This is useful for further processing, as it presents the information very simply. But what if you don't want to do further processing, and you just want to communicate the restriction map to the user? Then perhaps it'd be more useful to present a graphical display, such as printing out the sequence with a line above it that flags the presence of the enzymes. There are lots of fancy bells and whistles you can use, but let's do it the simple way for now and output a list.

So, the plan is to write a program that includes restriction enzyme data translated into regular expressions and stored in a hash as values keyed by restriction enzyme name. DNA sequence data will be read from a file, and the user will be prompted for names of restriction enzymes. The appropriate regular expression will be retrieved from the hash, and we'll search for all instances of that regular expression, plus their locations. Finally, the list of locations found will be returned.
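Before filling in the details, the plan can be sketched in a few lines. This is only an outline under stated assumptions, not the program developed in the following sections: the hash holds just the two sites mentioned earlier, the DNA string is a made-up sample rather than a file, the enzyme names are hard-coded instead of prompted for, and the names %rebase_hash, match_positions, and restriction_map are chosen here for illustration.

```perl
use strict;
use warnings;

# Assumption: a tiny hand-written table of enzyme name => regular
# expression, holding only the two sites mentioned in the text.
my %rebase_hash = (
    EcoRI   => 'GAATTC',
    HindIII => 'AAGCTT',
);

# Return the 1-based start position of every non-overlapping match
# of $regexp in $dna, ignoring case.
sub match_positions {
    my ($regexp, $dna) = @_;
    my @positions;
    while ($dna =~ /$regexp/ig) {
        # $-[0] is the 0-based offset where the last match started.
        push @positions, $-[0] + 1;
    }
    return @positions;
}

# Look up an enzyme by name and return the positions of its site.
# An unknown name yields an empty list.
sub restriction_map {
    my ($enzyme, $dna) = @_;
    return () unless exists $rebase_hash{$enzyme};
    return match_positions($rebase_hash{$enzyme}, $dna);
}

# Hypothetical sample DNA standing in for the sequence file.
my $sample_dna = 'CCGAATTCAAGCTTGGGAATTC';

foreach my $enzyme (sort keys %rebase_hash) {
    my @found = restriction_map($enzyme, $sample_dna);
    print "$enzyme: @found\n";
}
```

Returning a plain list of positions matches the simple reporting style chosen above; a graphical display could be built later from the same list.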

9.2.3 Restriction Enzyme Data