IT-SC 185
of datatype will be added to Perl in the future, or perhaps you want to do lookups from a database or a DBM file. Then all you have to do is change the internals of this one
subroutine. As long as the interface to the subroutine remains the same—that is to say, as long as it still takes one codon as an argument and returns a one-character amino acid—
you don't need to worry about how it accomplishes the translation from the standpoint of the rest of the programs. Our subroutine has become a black box. This is one
significant benefit of modularization and organization of programs with subroutines.
There's another good, and biological, reason why you should use a subroutine for the genetic code. There is actually more than one genetic code, because there are differences
in how DNA encodes amino acids among mammals, plants, insects, and yeast, especially in the mitochondria. So if you have modularized the genetic code, you can
easily modify your program to work with a range of organisms.
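As a rough sketch of what that modification might look like, the following keeps two genetic codes side by side and lets the caller pick one by name. The names make_genetic_code, codon2aa_for, and %genetic_codes are made up for this illustration and are not part of BeginPerlBioinfo.pm. The two amino acid strings follow the usual listing of the standard and vertebrate mitochondrial codes in TCAG codon order, with an underscore standing for a stop codon (an assumption about how stops are marked).

```perl
use strict;
use warnings;

# Build a codon => amino-acid hash from a 64-letter string listing the
# amino acids for codons in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
sub make_genetic_code {
    my ($amino_acids) = @_;
    my @bases = qw(T C A G);
    my %code;
    my $n = 0;
    for my $first (@bases) {
        for my $second (@bases) {
            for my $third (@bases) {
                $code{"$first$second$third"} = substr($amino_acids, $n++, 1);
            }
        }
    }
    return \%code;
}

# '_' marks a stop codon (an assumption made for this sketch).
my %genetic_codes = (
    standard => make_genetic_code(
        'FFLLSSSSYY__CC_WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'),
    vertebrate_mitochondrial => make_genetic_code(
        'FFLLSSSSYY__CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS__VVVVAAAADDEEGGGG'),
);

# Hypothetical variant of codon2aa that takes the name of the code to use.
sub codon2aa_for {
    my ($codon, $code_name) = @_;
    $code_name = 'standard' unless defined $code_name;
    my $code = $genetic_codes{$code_name}
        or die "Unknown genetic code: $code_name\n";
    $codon = uc $codon;
    exists $code->{$codon} or die "Bad codon: $codon\n";
    return $code->{$codon};
}

print codon2aa_for('TGA'), "\n";                              # _ (stop)
print codon2aa_for('TGA', 'vertebrate_mitochondrial'), "\n";  # W
```

Because the rest of a program would only ever call the one subroutine, switching organisms comes down to passing a different name.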
One of the benefits of hashes is that they are fast. Unfortunately, our subroutine declares the whole hash each time the subroutine is called, even for one lookup. This isn't so
efficient; in fact, it's kind of slow. There are other, much faster ways that involve declaring the genetic code hash only once as a global variable, but they would take us a
little far afield at this point. Our current version has the advantage of being easy to read. So, let's be officially happy with the hash version of codon2aa and put it into our
module in the file BeginPerlBioinfo.pm (see Chapter 6).
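For the curious, here is one minimal sketch of the declare-once idea, assuming the hash lives at file scope next to the subroutine. The names %genetic_code and codon2aa_fast are invented for this illustration, the underscore for stop codons is an assumption, and dying on an unknown codon is just one possible way to handle bad input.

```perl
use strict;
use warnings;

# Declared once, when the file is loaded, rather than on every call.
my %genetic_code = (
    TTT => 'F', TTC => 'F', TTA => 'L', TTG => 'L',
    TCT => 'S', TCC => 'S', TCA => 'S', TCG => 'S',
    TAT => 'Y', TAC => 'Y', TAA => '_', TAG => '_',
    TGT => 'C', TGC => 'C', TGA => '_', TGG => 'W',
    CTT => 'L', CTC => 'L', CTA => 'L', CTG => 'L',
    CCT => 'P', CCC => 'P', CCA => 'P', CCG => 'P',
    CAT => 'H', CAC => 'H', CAA => 'Q', CAG => 'Q',
    CGT => 'R', CGC => 'R', CGA => 'R', CGG => 'R',
    ATT => 'I', ATC => 'I', ATA => 'I', ATG => 'M',
    ACT => 'T', ACC => 'T', ACA => 'T', ACG => 'T',
    AAT => 'N', AAC => 'N', AAA => 'K', AAG => 'K',
    AGT => 'S', AGC => 'S', AGA => 'R', AGG => 'R',
    GTT => 'V', GTC => 'V', GTA => 'V', GTG => 'V',
    GCT => 'A', GCC => 'A', GCA => 'A', GCG => 'A',
    GAT => 'D', GAC => 'D', GAA => 'E', GAG => 'E',
    GGT => 'G', GGC => 'G', GGA => 'G', GGG => 'G',
);

# Hypothetical faster variant: each call is now a single hash lookup.
sub codon2aa_fast {
    my ($codon) = @_;
    $codon = uc $codon;
    exists $genetic_code{$codon} or die "Bad codon: $codon\n";
    return $genetic_code{$codon};
}

print codon2aa_fast('CGA'), "\n";    # R
```

The interface is unchanged (one codon in, one character out), so the rest of a program would not need to know which version it is calling.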
Now that we've got a satisfactory way to translate codons to amino acids, we'll start to use it in the next section and in the examples.
8.4 Translating DNA into Proteins
Example 8-1 shows how the new codon2aa subroutine translates a whole DNA
sequence into protein.
Example 8-1. Translate DNA into protein
    #!/usr/bin/perl
    # Translate DNA into protein

    use strict;
    use warnings;
    use BeginPerlBioinfo;     # see Chapter 6 about this module

    # Initialize variables
    my $dna = 'CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC';
    my $protein = '';
    my $codon;

    # Translate each three-base codon into an amino acid, and append to a protein
    for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
        $codon = substr($dna,$i,3);
        $protein .= codon2aa($codon);
    }

    print "I translated the DNA\n\n$dna\n\n  into the protein\n\n$protein\n\n";

    exit;
To make this work, you'll need the BeginPerlBioinfo.pm module for your subroutines in a separate file the program can find, as discussed in Chapter 6. You also have to add the
codon2aa subroutine to it. Alternatively, you can add the code for the subroutine codon2aa directly to the program in Example 8-1 and remove the reference to the
BeginPerlBioinfo.pm module. Here's the output from Example 8-1:
    I translated the DNA

    CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC

      into the protein

    RRLRTGLARVGR
You've seen all the elements in Example 8-1 before, except for the way it loops through the DNA with this statement:

    for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {

Recall that a for loop has three parts, delimited by the two semicolons. The first part initializes a counter:

    my $i=0

This statically scopes the $i variable so it's visible only inside this block, and any other $i elsewhere in the code (well, in this case, there aren't any, but it can happen) is now invisible inside the block. The third part of the for loop increments the counter after all the statements in the block are executed and before returning to the beginning of the loop:

    $i += 3

Since you're trying to march through the DNA three bases at a shot, you increment by three.
The second, middle part of the for loop tests whether the loop should continue:

    $i < (length($dna) - 2)

The point is that if there are none, one, or two bases left, you should quit, because there's not enough to make a codon. Now, the positions in a string of DNA of a certain length are numbered from 0 to length-1. So if the position counter $i has reached length-2, there are only two more bases, at positions length-2 and length-1, and you should quit. Only if the position counter $i is less than length-2 will you still have at least three bases left, enough for a codon. So the test succeeds only if:

    $i < (length($dna) - 2)
Notice also how the whole expression to the right of the less-than sign is enclosed in parentheses; we'll discuss this in Chapter 9, in Section 9.3.1.
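To see the loop test at work, here is a small sketch that runs the same loop over strings with zero, one, or two leftover bases. The helper name split_codons is made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split DNA into complete codons using the loop
# from Example 8-1; leftover bases at the end are ignored.
sub split_codons {
    my ($dna) = @_;
    my @codons;
    for (my $i = 0; $i < (length($dna) - 2); $i += 3) {
        push @codons, substr($dna, $i, 3);
    }
    return @codons;
}

# Try lengths 6, 5, 4, 2, and 0.
for my $dna ('ACGTAC', 'ACGTA', 'ACGT', 'AC', '') {
    my @codons = split_codons($dna);
    printf "%-8s => %s\n", "'$dna'", join(' ', @codons);
}
```

Only the six-base string yields two codons; the five-base and four-base strings yield one, and the last two yield none, because the test fails before substr is ever called on an incomplete codon.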
The line of code:

    $codon = substr($dna,$i,3);

actually extracts the 3-base codon from the DNA. The call to the substr function specifies a substring of $dna at position $i of length 3, and saves it in the variable $codon.
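A short sketch of what substr returns near the end of a string may help explain why the loop test matters. The five-base string and the variable names here are chosen only for illustration.

```perl
use strict;
use warnings;

my $dna = 'ACGTA';                 # 5 bases: one codon plus two leftovers

my $first = substr($dna, 0, 3);    # 'ACG', a complete codon
my $tail  = substr($dna, 3, 3);    # 'TA', only two bases remain

print "first: $first (", length($first), " bases)\n";
print "tail:  $tail (", length($tail), " bases)\n";
```

When fewer than three bases remain, substr quietly returns the shorter string, so it's the loop test, not substr, that keeps an incomplete codon from reaching codon2aa.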
If you know you'll need to do this DNA-to-protein translation a lot, you can turn Example 8-1 into a subroutine. Whenever you write a subroutine, you have to think about which arguments you may want to give the subroutine. So you realize there may
come a time when you'll have some large DNA sequence but only want to translate a given part of it. Should you add two arguments to the subroutine as beginning and end
points? You could, but decide not to. It's a judgment call, part of the art of decomposing a collection of code into useful fragments. But it might be better to have a subroutine that
just translates; then you can make it part of a larger subroutine that picks endpoints in the sequence, if needed. The thinking is that you'll usually just translate the whole thing, and
always typing in 0 for the start and length($dna)-1 at the end would be an annoyance. Of course, this depends on what you're doing, so this particular choice just
illustrates your thinking when you write the code. You should also remove the informative print statement at the end, because
it's more suited to a main program than a subroutine. Anyway, you've now thought through the design and just want a subroutine that takes one
argument containing DNA and returns a peptide translation:

    # dna2peptide
    #
    # A subroutine to translate DNA sequence into a peptide

    sub dna2peptide {

        my($dna) = @_;

        use strict;
        use warnings;
        use BeginPerlBioinfo;     # see Chapter 6 about this module

        # Initialize variables
        my $protein = '';

        # Translate each three-base codon to an amino acid, and append to a protein
        for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
            $protein .= codon2aa( substr($dna,$i,3) );
        }

        return $protein;
    }
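If you do later want to translate only part of a sequence, one possible approach is a separate helper that picks the endpoints and hands the result to dna2peptide. The sketch below shows only the endpoint-picking half; the name extract_region, its inclusive 0-based positions, and its error handling are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the part of $dna from position $start to
# position $end, inclusive, with positions numbered from 0.
sub extract_region {
    my ($dna, $start, $end) = @_;
    die "Bad region $start..$end\n"
        if $start < 0 or $end >= length($dna) or $start > $end;
    return substr($dna, $start, $end - $start + 1);
}

my $dna = 'CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC';

# Bases 3 through 11 are the second, third, and fourth codons.
print extract_region($dna, 3, 11), "\n";    # CGTCTTCGT

# With dna2peptide available from BeginPerlBioinfo.pm, the two compose:
#   my $peptide = dna2peptide( extract_region($dna, 3, 11) );
```

Keeping the two jobs in separate subroutines means the common case, translating everything, stays a one-argument call.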
Now add subroutine dna2peptide to the BeginPerlBioinfo.pm module. Notice that you've eliminated one of the variables in making the subroutine out of Example 8-1: the variable $codon. Why?

Well, one reason is because you can. In Example 8-1, you were using substr to extract the codon from $dna, saving it in variable $codon, and then passing
it into the subroutine codon2aa. This new way eliminates the middleman. Put the call to substr that extracts the codon as the argument to the subroutine codon2aa so that
the value is passed in just as before, but without having to copy it to the variable $codon first. This has somewhat improved efficiency and speed. Since copying strings is one of the
slower things computer programs do, eliminating a bunch of string copies is an easy and effective way to speed up a program.
But has it made the program less readable? You be the judge. I think it has, a little, but the comment right before the loop seems to make everything clear enough, for me,
anyway. It's important to have readable code, so if you really need to boost the speed of a subroutine, but find it makes the code harder to read, be sure to include enough
comments for the reader to be able to understand what's going on.
For the first time, use statements are being included in a subroutine instead of the main program:

    use strict;
    use warnings;
    use BeginPerlBioinfo;

This may be redundant with the statements in the main program, but it doesn't do any harm (Perl checks and loads a module only once). If this subroutine should be called from a
module that doesn't already load the modules, it's done some good after all.
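You can watch the load-once behavior yourself with a small sketch. It uses the core module List::Util as a stand-in for BeginPerlBioinfo so that it runs anywhere; the subroutine name total is made up for this illustration.

```perl
use strict;
use warnings;

use List::Util;    # a core module standing in for BeginPerlBioinfo

sub total {
    use List::Util;    # repeated inside the subroutine: harmless
    return List::Util::sum(@_);
}

# Perl records each loaded module file once in the %INC hash and skips
# loading it again on later use or require statements.
print "List/Util.pm loaded from $INC{'List/Util.pm'}\n";
print "total: ", total(1, 2, 3), "\n";    # total: 6
```

The module's file appears in %INC exactly once no matter how many use statements name it.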
Now let's improve how we deal with DNA in files.
8.5 Reading DNA from Files in FASTA Format