Exercises Mutations and Randomization

common to have to select all combinations of two or more elements from a set. The nested loop looks at n(n-1)/2 pairs of sequences, which is a quadratic function of the size of the data set. This can get very big! Try gradually increasing the size of the data set and rerunning the program, and you'll see your compute time increase, and more than gradually.

See how the looping works? First, sequence 0 (indexed by $k) is paired with sequences 1,2,3,...,9 in turn (indexed by $i). Then sequence 1 is paired with 2,3,...,9, etc. Finally, 8 is paired with 9. (Recall that array elements are numbered starting at 0, so the last element of an array with 10 elements is numbered 9. Also recall that scalar @random_DNA returns the number of elements in the array.)

You might find it a worthwhile exercise to let the number of sequences be some small value, say 3 or 4, and think through (paper and pencil in hand) how the nested loops and the variables $k and $i evolve during the running of the program. Or you can use the Perl debugger to watch how it happens.
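The pairing pattern described above can be sketched in a few lines. This is only an illustration, not the chapter's program: the subroutine name index_pairs and the four hard-coded sequences are assumptions made for the example, standing in for the randomly generated data.

```perl
use strict;
use warnings;

# Hypothetical stand-in data; the chapter's program generates random DNA instead.
my @random_DNA = ('ACGT', 'ACGA', 'TCGA', 'TTTT');

# index_pairs is an illustrative helper (not from the chapter).
# It returns a list of [k, i] index pairs with k < i, so each pair appears once.
sub index_pairs {
    my ($count) = @_;
    my @pairs;
    # $k runs over every sequence except the last one
    for (my $k = 0; $k < $count - 1; ++$k) {
        # $i runs over the sequences that come after $k
        for (my $i = $k + 1; $i < $count; ++$i) {
            push @pairs, [$k, $i];
        }
    }
    return @pairs;
}

my @pairs = index_pairs(scalar @random_DNA);
foreach my $pair (@pairs) {
    my ($k, $i) = @$pair;
    print "sequence $k ($random_DNA[$k]) paired with sequence $i ($random_DNA[$i])\n";
}
print scalar(@pairs), " pairs in total\n";
```

With four sequences the loops visit 4 x 3 / 2 = 6 pairs; with ten they visit 45, which is why the running time grows so much faster than the data set does.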

7.6 Exercises

Exercise 7.1
Write a program that asks you to pick an amino acid and then keeps randomly guessing which amino acid you picked.

Exercise 7.2
Write a program that picks one of the four nucleotides and then keeps prompting until you correctly guess the nucleotide it picked.

Exercise 7.3
Write a subroutine to randomly shuffle the elements of an array. The subroutine should take an array as an argument and return an array with the same elements but shuffled in a random order. Each element of the original array should appear exactly once in the output array, just like shuffling a deck of cards.

Exercise 7.4
Write a program to mutate protein sequence, similar to the code in Example 7-2 that mutates DNA.

Exercise 7.5
Write a subroutine that, given a codon (a fragment of DNA of length 3), returns a random mutation in the codon.

Exercise 7.6
Some versions of Perl automatically seed the random number generator, making it superfluous to call srand for that purpose before using rand to generate random numbers. Experiment to see if your implementation of rand calls srand automatically, or if you have to explicitly call srand yourself, as you have seen done in the code in this chapter.

Exercise 7.7
Sometimes not all choices are equally likely to be picked in a random selection. Write a subroutine that randomly returns a nucleotide, in which the probability of each nucleotide can be specified. Pass the subroutine four numbers as arguments, representing the probabilities of each nucleotide; if each probability is 0.25, the subroutine is equally likely to pick each nucleotide. As error checking, have the subroutine ensure that the sum of the four probabilities is 1.
Hint: one way to accomplish this is to divide the range between 0 and 1 into four intervals with lengths corresponding to the probability of the respective nucleotides. Then, simply pick a random number between 0 and 1, see in which interval it falls, and return the corresponding nucleotide.

Exercise 7.8
This is a more difficult exercise. The study function in Perl may speed up searches for motifs in DNA or protein. Read the Perl documentation on this function. Its use is simple: given some sequence data in a variable $sequence, type:

    study $sequence;

before doing the searches. Do you think study will speed up searches in DNA or protein, based on what you've read about it in the documentation?
For lots of extra credit! Now read the Perl documentation on the standard module Benchmark. (Type perldoc Benchmark, or visit the Perl home page at http://www.perl.com.) See if your guess is right by writing a program that benchmarks motif searches of DNA and of protein, with and without study.

Chapter 8. The Genetic Code

Up to this point we've used Perl to search for motifs, simulate DNA mutations, generate random sequences, and transcribe DNA to RNA. These are all important activities, and they serve as a good introduction to the computational techniques you can use to study biological systems. In this chapter, we'll write Perl programs to simulate how the genetic code directs the translation of DNA into protein. I will start by introducing the hash datatype. Then, after a brief discussion of how different data structures (hashes, arrays, and databases) can store and access experimental information, we will write a program to translate DNA to protein. We'll also continue exploring regular expressions and write code to handle FASTA files.

8.1 Hashes

There are three main datatypes in Perl. You've already seen two: scalar variables and arrays. Now we'll start to use the third: hashes (also called associative arrays).

A hash provides very fast lookup of the value associated with a key. As an example, say you have a hash called %english_dictionary. (Yes, hashes start with the percent sign.) If you want to look up the definition of the word "recreant," you say:

    $definition = $english_dictionary{'recreant'};

The scalar 'recreant' is the key, and the scalar $definition that's returned is the value. As you see from this example, hashes (like arrays) change their leading character to a dollar sign when you access a single element, because the value returned from a hash lookup is a scalar value. You can tell a hash lookup from an array element by the type of braces they use: arrays use square brackets [ ]; hashes use curly braces { }.

If you want to assign a value to a key, it's similarly an easy, single statement:

    $english_dictionary{'recreant'} = "One who calls out in surrender.";

Also, if you want to initialize a hash with some key-value pairs, it's done much like initializing arrays, but every pair becomes a key-value:

    %classification = (
        'dog',   'mammal',
        'robin', 'bird',
        'asp',   'reptile',
    );

which initializes the key 'dog' with the value 'mammal', and so on. There's another way of writing this, which shows what's happening a little more clearly. The following