IT-SC 96
5.7 Writing to Files
Example 5-7 shows one more way to count nucleotides in a string of DNA. It uses a
Perl trick that was designed with exactly this kind of job in mind. It puts a global regular expression search in the test for a while loop, and as youll see, its a compact way of
counting characters in a string.
One of the nice things about Perl is that if you need to do something fairly regularly, the language has probably got a relatively succinct way to do it. The downside of this is that
Perl has a lot of things about it to learn.
The results of Example 5-7
, besides being printed to the screen, will also be written to a file. The code that accomplishes this writing to a file is as follows:
Also write the results to a file called countbase outputfile = countbase;
unless openCOUNTBASE, outputfile { print Cannot open file \outputfile\ to write
to\n\n; exit;
} print COUNTBASE A=a C=c G=g T=t errors=e\n;
closeCOUNTBASE;
As you see, to write to a file, you do an open call, just as when reading from a file, but with a difference: you prepend a greater-than sign to the filename. The filehandle
becomes a first argument to a print
statement but without a comma following it. This makes the
print statement direct its output into the file.
[6] [6]
In this case, if the file already exists, its emptied out first. Its possible to specify several other behaviors. As mentioned earlier, the Perl documentation has all the details of the open
function, which sets the options for reading from, and writing to, files as well as other actions.
Example 5-7 is the third version of the Perl program that examines each base in a
string of DNA.
Example 5-7. Determining frequency of nucleotides, take 3
usrbinperl -w Determining frequency of nucleotides, take 3
Get the DNA sequence data print Please type the filename of the DNA sequence data: ;
IT-SC 97
dna_filename = STDIN; chomp dna_filename;
Does the file exist? unless -e dna_filename {
print File \dna_filename\ doesn\t seem to exist\n;
exit; }
Can we open the file? unless openDNAFILE, dna_filename {
print Cannot open file \dna_filename\\n\n; exit;
} DNA = DNAFILE;
close DNAFILE; DNA = join , DNA;
Remove whitespace DNA =~ s\sg;
Initialize the counts. Notice that we can use scalar variables to hold numbers.
a = 0; c = 0; g = 0; t = 0; e = 0; Use a regular expression trick, and five while loops,
to find the counts of the four bases plus errors whileDNA =~ aig{a++}
whileDNA =~ cig{c++} whileDNA =~ gig{g++}
whileDNA =~ tig{t++} whileDNA =~ [acgt]ig{e++}
print A=a C=c G=g T=t errors=e\n; Also write the results to a file called countbase
outputfile = countbase; unless openCOUNTBASE, outputfile {
IT-SC 98
print Cannot open file \outputfile\ to write to\n\n;
exit; }
print COUNTBASE A=a C=c G=g T=t errors=e\n; closeCOUNTBASE;
exit the program exit;
Example 5-7 looks like this when you run it:
Please type the filename of the DNA sequence data: small.dna
A=40 C=27 G=24 T=17 errors=1
The output file countbase has the following contents after you run Example 5-7
: A=40 C=27 G=24 T=17 errors=1
The while loop: whiledna =~ aig{a++}
has as its conditional test, within the parentheses, a string-matching expression: dna =~ aig
This expression is looking for the regular expression a
, that is, the letter a
. Since it has the
i modifier, its a case-insensitive match, which means it matches
a or
A . It also has
the global modifier, which means match all the a
s in the string. Without the global modifier, it just keeps returning
true every time through the loop, if there is an a in
dna .
Now, this string-matching expression, in the context of a while loop, causes the while loop to execute its block on every match of the regular expression. So, append the one-
statement block:
{a++} to increment the counter at each match of the regular expression; in other words, youre
counting all the a
s. One other point should be made about this third version of the program. Youll notice
some of the statements have been changed and shortened this time around. Some variables have shorter names, some statements are lined up on one line, and the print
statement at the end is more concise. These are just alternative ways of writing. As you program, youll find yourself experimenting with different approaches: try some on for
size.
The way to count bases in this third version is flexible; for instance, it
IT-SC 99
allows you to count non-ACGT characters without specifying them individually. In later chapters, youll use those
while loops to good effect.
However, theres an even faster way to count bases. You can use the tr transliteration function from
Chapter 4 ; its faster, which is helpful if you have a lot of DNA to count:
a = dna =~ trAa; c = dna =~ trCc;
g = dna =~ trGg; t = dna =~ trTt;
The tr function returns the count of the specified characters it finds in the string, and if the set of replacement characters is empty, it doesnt actually change the string. So it
makes a good character counter. Notice that with tr, you have to spell out the upper- and lowercase letters. Also, because tr doesnt accept character classes, theres no direct way
to count nonbases. You could, however, say:
basecount = dna = ~ trACGTacgt; nonbase = length dna - basecount
The program however, runs faster using tr than using the while loops of Example 5-7
. You may find it a bit much to have three really, four versions of this base-counting
program, especially since much of the code in each version is identical. The only part of the program that really changed was the part that did the counting of the bases. Wouldnt
it have been convenient to have a way to just alter the part that counts the bases? In
Chapter 6 , youll see how subroutines allow you to partition your programs in just such
a way.
5.8 Exercises