IT-SC 204
GPHLRVAQVWL_PQEAP_LLMTPLPPHLHGCALTSVMAVAAVGWAVAGGALAGTLRASL VRARKGSTCTI
PGPAAGTGAAGTSAGSCWGPRTSSCPDRNHSDHSPQCADMPHTHHTCGLTV_SAAAAAG DAGWVWPPRAA
ERICGAKQSPEQAWPQPLSLTLPGAAGLDQGQASCALHPHPGAHCCPAHCHPAPVTSCA DSESLAWGLSL
CTPDSTTPGWPWPSSQ_SGCSPHGTTHCSCHTRS_SS_CPVCGRCSRWAHSPHSRTCCP PRHLEALGLNH
LPPYLRSRRTPSRPPPATREPAQTTRRRPRRLQRRPQLLPLLVTPPRQRTPIAQAPCVV SAANQ_VAGLE
PPRPLSAAI
8.7 Exercises
Exercise 8.1 Write a subroutine that checks a string and returns
true if it's a DNA sequence. Write another that checks for protein sequence data.
Exercise 8.2 Write a program that can search by name for a gene in an unsorted array.
Exercise 8.3 Write a program that can search by name for a gene in a sorted array; use the Perl sort function to sort an array. For extra credit: write a binary search subroutine to
do the searching.
Exercise 8.4 Write a subroutine that inserts an element into a sorted array. Hint: use the splice
Perl function to insert the element, as shown in Chapter 4.
Exercise 8.5 Write a program that searches by name for a gene in a hash. Get the genes from your own work or try downloading a list of all genes for a given organism from
www.ncbi.nlm.nih.gov or one of the web sites given in Appendix A. Make
a hash of all the genes: key=name, value=gene ID or sequence. Hint: you may have to write a short Perl program to reformat the list of genes you start with to
make it easy to populate the Perl hash.
Exercise 8.6 Write a subroutine that checks an array of data and returns
true if it's in FASTA format.
Note that FASTA expects the standard IUB/IUPAC amino acid and nucleic acid codes, plus the dash (-) that represents a gap of unknown length. Also, the asterisk (*) represents
a stop codon for amino acids. Be careful using an asterisk in regular expressions; use a backslash to escape it (\*) to match an actual asterisk.
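The point about escaping the asterisk can be illustrated with a minimal sketch. The subroutine name ends_with_stop and the sample strings are hypothetical, chosen only for illustration; this is not a solution to the exercise.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the text): returns 1 if the string ends
# with a literal asterisk, the stop-codon symbol in a translated protein.
sub ends_with_stop {
    my ($protein) = @_;
    # \* matches an actual asterisk; an unescaped * would be a quantifier
    return $protein =~ /\*$/ ? 1 : 0;
}

print ends_with_stop('MKVLA*'), "\n";
print ends_with_stop('MKVLA'),  "\n";
```

Inside a bracketed character class, such as [A-Z*-], the asterisk is already literal and needs no backslash; the dash is literal there because it comes last in the class.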
The remaining problems deal with the effect of mutations in DNA on the proteins they
encode. They combine the subject of randomization and mutations from Chapter 7
plus the subject of the genetic code from this chapter.
Exercise 8.7 For each codon, make note of what effect single nucleotide mutations have on the
codon: does the same amino acid result, or does the codon now encode a different amino acid? Which one? Write a subroutine that, given a codon, returns a list of
all the amino acids that may result from any single mutation in the codon.
Exercise 8.8 Write a subroutine that, given an amino acid, randomly changes it to one of the
amino acids calculated in Exercise 8.7.
Exercise 8.9 Write a program that randomly mutates the amino acids in a protein but restricts the possibilities to those that can occur due to a single mutation in the original
codons, as in Exercises 8.7 and 8.8.
Exercise 8.10 Some amino acids are more likely than others to be encoded by random DNA. For instance,
6 of the 64 possible codons code for the amino acid serine, but only 2 of the 64 code for phenylalanine. Write a subroutine that, given an amino acid,
returns the probability that it's coded by a randomly generated codon (see Chapter 7).
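The arithmetic behind the serine and phenylalanine figures can be sketched as follows. This is only a worked illustration for those two amino acids, assuming each of the four bases is equally likely at every codon position; the hash %codons_for and the subroutine codon_probability are hypothetical names, and the codon lists come from the standard genetic code.

```perl
use strict;
use warnings;

# Codons for the two amino acids named above (standard genetic code).
# Assumption: every one of the 64 codons is equally likely.
my %codons_for = (
    S => [qw(TCA TCC TCG TCT AGC AGT)],   # serine: 6 codons
    F => [qw(TTC TTT)],                   # phenylalanine: 2 codons
);

# Hypothetical helper: fraction of the 64 codons that encode $aa,
# or undef if $aa is not in the small table above.
sub codon_probability {
    my ($aa) = @_;
    my $codons = $codons_for{$aa} or return;
    return scalar(@$codons) / 64;
}

printf "%s %.5f\n", $_, codon_probability($_) for qw(S F);
# prints "S 0.09375" and "F 0.03125"
```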
Exercise 8.11 Write a subroutine that takes as arguments an amino acid; a position (1, 2, or 3);
and a nucleotide. It then takes each codon that encodes the specified amino acid (there may be from one to six such codons), and mutates it at the specified
position to the specified nucleotide. Finally, it returns the set of amino acids that are encoded by the mutated codons.
Exercise 8.12 Write a program that, given two amino acids, returns the probability that a single
mutation in their underlying (but unspecified) codons results in the codon of one amino acid mutating to the codon of the other amino acid.
Chapter 9. Restriction Maps and Regular Expressions
In this chapter, I'll give an overview of Perl regular expressions and Perl operators, two essential features of the language we've been using all along. We'll also investigate the
programming of a standard, fundamental molecular-biology technique: the discovery of a restriction map for a sequence. Restriction digests were one of the original ways to
fingerprint DNA; this can now be simulated on the computer.
Restriction maps and their associated restriction digests are common calculations in the laboratory and are provided by several software packages. They are essential tools in the
planning of cloning experiments; they can be used to insert a desired stretch of DNA into a cloning vector, for instance. Restriction maps also find application in sequencing
projects, for instance in shotgun or directed sequencing.
9.1 Regular Expressions
We've been dealing with regular expressions for a while now. This section fills in some background and ties together the somewhat scattered discussions of regular expressions
from earlier parts of the book.
Regular expressions are interesting, important, and rich in capabilities. Jeffrey Friedl's book Mastering Regular Expressions (O'Reilly) is entirely devoted to them. Perl
makes particularly good use of regular expressions, and the Perl documentation explains them well. Regular expressions are useful when programming with biological data such
as sequence, or with GenBank, PDB, and BLAST files.
Regular expressions are ways of representing, and searching for, many strings with one string. Although they are not strictly the same thing, it's useful to think of regular
expressions as a kind of highly developed set of wildcards. The special characters in regular expressions are more properly known as metacharacters.
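As a minimal sketch of one string standing for many, the pattern below uses a character class to match two different six-letter strings. The pattern and the candidate strings are illustrative assumptions, not taken from the text.

```perl
use strict;
use warnings;

# One pattern, many strings: the character class [AG] matches either
# A or G at the third position. (Illustrative pattern only.)
my $pattern = qr/^GA[AG]TTC$/;

my @candidates = qw(GAATTC GAGTTC GACTTC);
my @matches    = grep { $_ =~ $pattern } @candidates;

print "@matches\n";   # prints "GAATTC GAGTTC"
```

The brackets are metacharacters here: they are not matched literally but tell Perl to accept any one of the enclosed characters.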
Most people are familiar with wildcards, which are found in search engines or in the game of poker. You might find the reference to every word that starts with
biolog by typing biolog*, for instance. Or you may find yourself holding five aces. (Different situations may use different wildcards. Perl regular expressions use * to mean "0 or more
of the preceding item," not "followed by anything" as in the wildcard example just given.) In computer science, these kinds of wildcards or metacharacters have an important
history, both practically and theoretically. The asterisk character in particular is called the Kleene closure after the eminent logician who invented it. As a nod to the theory, I'll
mention there is a simple model of a computer, less powerful than a Turing machine, that can deal with exactly the same kinds of languages that can be described by regular
expressions. This machine model is called a finite state automaton. But enough theory for