Translating DNA into Proteins

of datatype will be added to Perl in the future, or perhaps you want to do lookups from a database or a DBM file. Then all you have to do is change the internals of this one subroutine. As long as the interface to the subroutine remains the same (that is, as long as it still takes one codon as an argument and returns a one-character amino acid), the rest of your programs don't need to know how it accomplishes the translation. Our subroutine has become a black box. This is one significant benefit of modularizing and organizing programs with subroutines.

There's another good, and biological, reason why you should use a subroutine for the genetic code. There is actually more than one genetic code, because there are differences in how DNA encodes amino acids among mammals, plants, insects, and yeast, especially in the mitochondria. So if you have modularized the genetic code, you can easily modify your program to work with a range of organisms.

One of the benefits of hashes is that they are fast. Unfortunately, our subroutine declares the whole hash each time the subroutine is called, even for one lookup. This isn't efficient; in fact, it's kind of slow. There are other, much faster ways that involve declaring the genetic code hash only once as a global variable, but they would take us a little far afield at this point. Our current version has the advantage of being easy to read.

So, let's be officially happy with the hash version of codon2aa and put it into our module in the file BeginPerlBioinfo.pm (see Chapter 6). Now that we've got a satisfactory way to translate codons to amino acids, we'll start to use it in the next section and in the examples.
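For readers curious about the faster approach mentioned above, here is a minimal sketch of one way it might look. It is not the version used in the rest of this chapter. It assumes the standard genetic code, the underscore as the stop-codon symbol, and a hash declared once at file scope (a lexical variable playing the role of the "global variable"). Calling die on a bad codon is also a choice made for this sketch.

```perl
use strict;
use warnings;

# Built once, when this file's top-level code runs, and then shared by
# every call to codon2aa. (Assumption: standard genetic code, '_' = stop.)
my %genetic_code = (
    TTT => 'F', TTC => 'F', TTA => 'L', TTG => 'L',
    TCT => 'S', TCC => 'S', TCA => 'S', TCG => 'S',
    TAT => 'Y', TAC => 'Y', TAA => '_', TAG => '_',
    TGT => 'C', TGC => 'C', TGA => '_', TGG => 'W',
    CTT => 'L', CTC => 'L', CTA => 'L', CTG => 'L',
    CCT => 'P', CCC => 'P', CCA => 'P', CCG => 'P',
    CAT => 'H', CAC => 'H', CAA => 'Q', CAG => 'Q',
    CGT => 'R', CGC => 'R', CGA => 'R', CGG => 'R',
    ATT => 'I', ATC => 'I', ATA => 'I', ATG => 'M',
    ACT => 'T', ACC => 'T', ACA => 'T', ACG => 'T',
    AAT => 'N', AAC => 'N', AAA => 'K', AAG => 'K',
    AGT => 'S', AGC => 'S', AGA => 'R', AGG => 'R',
    GTT => 'V', GTC => 'V', GTA => 'V', GTG => 'V',
    GCT => 'A', GCC => 'A', GCA => 'A', GCG => 'A',
    GAT => 'D', GAC => 'D', GAA => 'E', GAG => 'E',
    GGT => 'G', GGC => 'G', GGA => 'G', GGG => 'G',
);

# Same interface as before: one codon in, one amino acid character out.
sub codon2aa {
    my ($codon) = @_;
    $codon = uc $codon;    # accept lowercase input as well

    return $genetic_code{$codon} if exists $genetic_code{$codon};

    die "Bad codon \"$codon\"!!\n";
}

print codon2aa('ATG'), "\n";    # prints M
print codon2aa('cga'), "\n";    # prints R
```

Because the interface is unchanged, code that calls codon2aa cannot tell the difference; only the cost of each call changes.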

8.4 Translating DNA into Proteins

Example 8-1 shows how the new codon2aa subroutine translates a whole DNA sequence into protein.

Example 8-1. Translate DNA into protein

    #!/usr/bin/perl
    # Translate DNA into protein

    use strict;
    use warnings;
    use BeginPerlBioinfo;     # see Chapter 6 about this module

    # Initialize variables
    my $dna = 'CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC';
    my $protein = '';
    my $codon;

    # Translate each three-base codon into an amino acid, and append to a protein
    for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
        $codon = substr($dna,$i,3);
        $protein .= codon2aa($codon);
    }

    print "I translated the DNA\n\n$dna\n\n into the protein\n\n$protein\n\n";

    exit;

To make this work, you'll need the BeginPerlBioinfo.pm module for your subroutines in a separate file the program can find, as discussed in Chapter 6. You also have to add the codon2aa subroutine to it. Alternatively, you can add the code for the subroutine codon2aa directly to the program in Example 8-1 and remove the reference to the BeginPerlBioinfo.pm module.

Here's the output from Example 8-1:

    I translated the DNA

    CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC

     into the protein

    RRLRTGLARVGR

You've seen all the elements in Example 8-1 before, except for the way it loops through the DNA with this statement:

    for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {

Recall that a for loop has three parts, delimited by the two semicolons. The first part initializes a counter:

    my $i=0

This statically scopes the $i variable so it's visible only inside this block, and any other $i elsewhere in the code (well, in this case, there aren't any, but it can happen) is now invisible inside the block.

The third part of the for loop increments the counter after all the statements in the block are executed and before returning to the beginning of the loop:

    $i += 3

Since you're trying to march through the DNA three bases at a shot, you increment by three.
The second, middle part of the for loop tests whether the loop should continue:

    $i < (length($dna) - 2)

The point is that if there are none, one, or two bases left, you should quit, because there's not enough to make a codon. Now, the positions in a string of DNA of a certain length are numbered from 0 to length-1. So if the position counter $i has reached length-2, there are only two more bases, at positions length-2 and length-1, and you should quit. Only if the position counter $i is less than length-2 will you still have at least three bases left, enough for a codon. So the test succeeds only if:

    $i < (length($dna) - 2)

Notice also how the whole expression to the right of the less-than sign is enclosed in parentheses; we'll discuss this in Chapter 9 in Section 9.3.1.

The line of code:

    $codon = substr($dna,$i,3);

actually extracts the 3-base codon from the DNA. The call to the substr function specifies a substring of $dna at position $i of length 3, and saves it in the variable $codon.

If you know you'll need to do this DNA-to-protein translation a lot, you can turn Example 8-1 into a subroutine. Whenever you write a subroutine, you have to think about which arguments you may want to give the subroutine. So you realize there may come a time when you'll have some large DNA sequence but only want to translate a given part of it. Should you add two arguments to the subroutine as beginning and end points? You could, but decide not to. It's a judgment call, part of the art of decomposing a collection of code into useful fragments. But it might be better to have a subroutine that just translates; then you can make it part of a larger subroutine that picks endpoints in the sequence, if needed. The thinking is that you'll usually just translate the whole thing, and always typing in 0 for the start and length($dna)-1 for the end would be an annoyance. Of course, this depends on what you're doing, so this particular choice just illustrates your thinking when you write the code.
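Returning to the loop test for a moment, a small experiment may make the boundary condition easier to see. The sketch below uses a helper named codons_in, which is made up for this illustration and is not part of BeginPerlBioinfo.pm. It runs the same for loop on strings whose lengths are not all multiples of three and collects the codons the loop extracts.

```perl
use strict;
use warnings;

# Hypothetical helper for this illustration only: return the list of
# codons that the for loop from Example 8-1 would extract from $dna.
sub codons_in {
    my ($dna) = @_;
    my @codons;

    # Same three-part loop as in Example 8-1
    for (my $i = 0; $i < (length($dna) - 2); $i += 3) {
        push @codons, substr($dna, $i, 3);
    }
    return @codons;
}

# Try lengths 6, 7, 8, and 2: leftover bases at the end are never used
for my $dna ('CGACGT', 'CGACGTC', 'CGACGTCT', 'CG') {
    my @codons = codons_in($dna);
    print length($dna), " bases, ", scalar(@codons), " codons: @codons\n";
}
```

The 6-, 7-, and 8-base strings each yield the same two codons, CGA and CGT, and the 2-base string yields none, because the test fails before the first pass through the loop.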
You should also remove the informative print statement at the end, because it's more suited to a main program than a subroutine. Anyway, you've now thought through the design and just want a subroutine that takes one argument containing DNA and returns a peptide translation:

    # dna2peptide
    #
    # A subroutine to translate DNA sequence into a peptide

    sub dna2peptide {

        my($dna) = @_;

        use strict;
        use warnings;
        use BeginPerlBioinfo;     # see Chapter 6 about this module

        # Initialize variables
        my $protein = '';

        # Translate each three-base codon to an amino acid, and append to a protein
        for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
            $protein .= codon2aa( substr($dna,$i,3) );
        }

        return $protein;
    }

Now add subroutine dna2peptide to the BeginPerlBioinfo.pm module.

Notice that you've eliminated one of the variables in making the subroutine out of Example 8-1: the variable $codon. Why? Well, one reason is because you can. In Example 8-1, you were using substr to extract the codon from $dna, saving it in the variable $codon, and then passing it into the subroutine codon2aa. This new way eliminates the middleman. Put the call to substr that extracts the codon as the argument to the subroutine codon2aa, so that the value is passed in just as before, but without having to copy it to the variable $codon first.

This has somewhat improved efficiency and speed. Since copying strings is one of the slower things computer programs do, eliminating a bunch of string copies is an easy and effective way to speed up a program. But has it made the program less readable? You be the judge. I think it has, a little, but the comment right before the loop seems to make everything clear enough, for me, anyway. It's important to have readable code, so if you really need to boost the speed of a subroutine, but find it makes the code harder to read, be sure to include enough comments for the reader to be able to understand what's going on.
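The earlier design decision, keeping dna2peptide simple and leaving endpoints to a larger subroutine, can be sketched as follows. The wrapper name translate_region and its zero-based, inclusive $start and $end arguments are assumptions made for this illustration. So that the sketch runs on its own, it also carries a cut-down stand-in for codon2aa that knows only the codons in the DNA of Example 8-1; the real subroutine lives in BeginPerlBioinfo.pm and covers every codon.

```perl
use strict;
use warnings;

# Cut-down stand-in for codon2aa (illustration only): just the codons
# that occur in the DNA of Example 8-1.
my %example_codons = (
    CGA => 'R', CGT => 'R', CGC => 'R', CTT => 'L', CTA => 'L',
    ACG => 'T', GGA => 'G', GGT => 'G', GCT => 'A', GTC => 'V',
);

sub codon2aa {
    my ($codon) = @_;
    $codon = uc $codon;
    return $example_codons{$codon} if exists $example_codons{$codon};
    die "Codon \"$codon\" is not in this cut-down table\n";
}

# dna2peptide as in the text: translate the whole string, nothing else
sub dna2peptide {
    my ($dna) = @_;
    my $protein = '';
    for (my $i = 0; $i < (length($dna) - 2); $i += 3) {
        $protein .= codon2aa( substr($dna, $i, 3) );
    }
    return $protein;
}

# Hypothetical wrapper: pick out positions $start..$end (zero-based,
# inclusive), then hand that piece to dna2peptide.
sub translate_region {
    my ($dna, $start, $end) = @_;
    return dna2peptide( substr($dna, $start, $end - $start + 1) );
}

my $dna = 'CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC';

print translate_region($dna, 3, 11), "\n";               # prints RLR
print translate_region($dna, 0, length($dna) - 1), "\n"; # prints RRLRTGLARVGR
```

The second call shows the annoyance the text describes: translating everything through the wrapper means spelling out 0 and length($dna) - 1 every time, whereas dna2peptide($dna) says the same thing directly.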
For the first time, use function calls are being included in a subroutine instead of the main program:

    use strict;
    use warnings;
    use BeginPerlBioinfo;

This may be redundant with the calls in the main program, but it doesn't do any harm (Perl checks and loads a module only once). If this subroutine should be called from a module that doesn't already load the modules, it's done some good after all.

Now let's improve how we deal with DNA in files.

8.5 Reading DNA from Files in FASTA Format