IT-SC 36
beginning tutorial on programming, it's not reasonable to go into details about algorithmic methods. Your first task is just to learn how to write in some programming
language. But if you keep at it, you'll start to learn the techniques. A decent textbook to keep around as a reference is a good investment for a serious programmer (see
Appendix A).
In the current example, which counts regulatory elements in DNA, I suggest one way of proceeding: take each regulatory element in turn and search through the DNA for it
before proceeding to the next regulatory element. Other algorithms are also possible; in fact, this is one example of the general problem called string matching, which is one
of the most important in bioinformatics and whose study has produced a variety of clever algorithms.
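The element-by-element search described above can be sketched in Perl as follows. This is only a minimal illustration, not the program developed later in the book: the subroutine name count_elements_found, the sample DNA string, and the list of elements are all made up for the example, and it relies on Perl's built-in index function for an exact, case-sensitive substring search.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count how many of the given elements occur
# at least once in the DNA string (exact, case-sensitive matching).
sub count_elements_found {
    my ($dna, @elements) = @_;
    my $count = 0;
    foreach my $element (@elements) {
        # index returns -1 when the substring is not found
        if (index($dna, $element) != -1) {
            $count++;
        }
    }
    return $count;
}

# Made-up sample data, for illustration only
my $dna      = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
my @elements = ('TATAAA', 'GGGAAA', 'CATTAG');

print count_elements_found($dna, @elements), "\n";
```

Each element is counted at most once, however many times it appears in the DNA, which matches the "if element is in DNA, add one to the count" reading of the design.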
Algorithms are usually grouped by such problems or by technique, and there is a wealth of material available. For the practical programmer, some of the most valuable materials
are collections of algorithms written in specific languages that can be incorporated into your programs. Use Appendix A as a starting place. Using the collections of code and
books given there, it's possible to incorporate many algorithmic techniques in your Perl code with relative ease.
3.5.3 Pseudocode and Code
Now you have an overall design, including input, algorithm, and output. How do you actually turn this general idea into a design for a program?
A common implementation strategy is to begin by writing what is called pseudocode. Pseudocode is an informal program, in which there are no details, and formal syntax isn't
followed.[2] It doesn't actually run as a program; its purpose is to flesh out an idea of the overall design of a program in a quick and informal way.

[2] Syntax refers to the rules of grammar. English syntax decrees, "Go to school" not "School go to." Programming languages also have syntax rules.

For example, in an actual Perl program you might write a bit of code called a subroutine (see Chapter 6), in this case, a subroutine that gets an answer from a user typing at the
keyboard. Such a subroutine may look like this:

    sub getanswer {
        print "Type in your answer here :";
        my $answer = <STDIN>;
        chomp $answer;
        return $answer;
    }

But in pseudocode, you might just say:

    getanswer

and worry about the details later.
Here's an example of pseudocode for the program I've been discussing:

    get the name of DNAfile from the user
    read in the DNA from the DNAfile
    for each regulatory element
        if element is in DNA, then add one to the count
    print count

3.5.4 Comments
Comments are parts of Perl source code that are used as an aid to understanding what the program does. Anything from a # sign to the end of a line is considered a comment and is
ignored by the Perl interpreter. The exception is the first line of many Perl programs, which looks something like this:

    #!/usr/bin/perl

(see Section 4.2.3 in Chapter 4).
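As a small illustration of these rules, the sketch below shows a whole-line comment and a trailing comment. The variable name and the sequence are invented for the example.

```perl
#!/usr/bin/perl
# The first line above is the special case: on Unix-like systems it
# names the program that should run this file.
use strict;
use warnings;

# A whole-line comment: Perl ignores everything on this line.
my $dna = 'ACGT';    # A trailing comment: ignored from the # sign onward

print length($dna), "\n";
```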
Comments are of considerable importance in keeping code useful. They typically include a discussion of the overall purpose and design of the program, examples of how to use
the program, and detailed notes interspersed throughout the code explaining why that code is there and what it does. In general, a good programmer writes good comments as
an integral part of the program. You'll see comments in all the programming examples in this book.
This is important: your code has to be readable by humans as well as computers.
Comments can also be useful when debugging misbehaving programs. If you're having
trouble figuring out where a program is going wrong, you can try to selectively comment out different parts of the code. If you find a section that, when commented out, removes
the problem, you can then narrow down the part you've commented out until you have a fairly short section of code in which you know where the problem is. This is often a
useful debugging approach.
Comments can be used when you turn pseudocode into Perl source code. Pseudocode is not Perl code, so the Perl interpreter will complain about any pseudocode that is not
commented out. You can comment out the pseudocode by placing # signs at the beginning of all pseudocode lines:

    # get the name of DNAfile from the user
    # read in the DNA from the DNAfile
    # for each regulatory element
    #   if element is in DNA, then add one to the count
    # print count

As you expand your pseudocode design into Perl code, you can uncomment the Perl code
by removing the # signs. In this way you may have a mixture of Perl and pseudocode, but
you can run and test the Perl parts; the Perl interpreter simply ignores commented-out lines.
You can even leave the complete pseudocode design, commented out, intact in the program. This leaves an outline of the program's design that may come in handy when
you or someone else tries to read or modify the code.
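A partially expanded design might look like the sketch below. It is an assumption about how the expansion could go, not the finished program: the first two steps are still only commented pseudocode, a made-up sequence stands in for the file contents, and the list of regulatory elements is invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# get the name of DNAfile from the user
# read in the DNA from the DNAfile
# (not written yet: a made-up sequence stands in for the file contents)
my $dna = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# Made-up regulatory elements, for illustration only
my @elements = ('TATAAA', 'GGGAAA', 'CATTAG');

# for each regulatory element
#   if element is in DNA, then add one to the count
my $count = 0;
foreach my $element (@elements) {
    # \Q...\E makes the element match literally, character for character
    if ($dna =~ /\Q$element\E/) {
        $count++;
    }
}

# print count
print "$count\n";
```

The pseudocode lines stay in place as comments, so the outline of the design remains readable next to the Perl that implements it.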
We've now reached the point where we're ready for actual Perl programming. In Chapter 4
you will learn Perl syntax and begin programming in Perl. As you do, remember the initial phase of designing your program, followed by the cycle you will
spend most of your time in: editing the program, running the program, and revising the program.
Chapter 4. Sequences and Strings
In this chapter you will begin to write Perl programs that manipulate biological sequence data, that is, DNA and proteins. Once you have the sequences in the computer, you'll start
writing programs that do the following with the sequence data:
    Transcribe DNA to RNA
    Concatenate sequences
    Make the reverse complement of sequences
    Read sequence data from files
You'll also write programs that give information about your sequences. How GC-rich is your DNA? How hydrophobic is your protein? You'll see programming techniques you
can use to answer these and similar questions.
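To give a flavor of such a question, here is one possible way to measure GC content in Perl. It is a sketch under stated assumptions rather than the method used later in the book: the subroutine name gc_percent and the sample sequence are made up, and the count relies on the tr operator returning the number of characters it matched.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: percentage of G and C bases in a DNA string.
sub gc_percent {
    my ($dna) = @_;
    return 0 if length($dna) == 0;    # avoid dividing by zero
    # With an empty replacement list, tr counts matching characters
    # and leaves the string unchanged.
    my $gc = ($dna =~ tr/GCgc//);
    return 100 * $gc / length($dna);
}

# Made-up sample sequence, for illustration only
printf "%.1f%% GC\n", gc_percent('ACGGGAGGACGGGAAAATTACTACGGCATTAGC');
```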
The Perl skills you will learn in this chapter involve the basics of the language. Here are some of those basics:
    Scalar variables
    Array variables
    String operations such as substitution and translation
    Reading data from files
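As a preview, the following sketch touches three of these basics at once. The sequence is made up, and the particular way of combining the operations is just one reasonable choice, not necessarily the form the chapter's own programs take.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Scalar variable holding a made-up DNA sequence
my $dna = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# Substitution: transcribe DNA to RNA by replacing every T with U
(my $rna = $dna) =~ s/T/U/g;

# Translation: reverse the string, then swap each base for its complement
my $revcom = reverse $dna;
$revcom =~ tr/ACGTacgt/TGCAtgca/;

# Array variable holding several sequences
my @sequences = ($dna, $rna, $revcom);
print "$_\n" foreach @sequences;
```

Reading data from files is left out here so that the fragment runs without any input file.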
4.1 Representing Sequence Data
The majority of this book deals with manipulating symbols that represent the biological sequences of DNA and proteins. The symbols used in bioinformatics to represent these
sequences are the same symbols biologists have been using in the literature for this same purpose.
As stated earlier, DNA is composed of four building blocks: the nucleic acids, also called nucleotides or bases. Proteins are composed of 20 building blocks, the amino acids, also
called residues. Fragments of proteins are called peptides. Both DNA and proteins are essentially polymers, made from their building blocks attached end to end. So it's
possible to summarize the structure of a DNA molecule or protein by simply giving the sequence of bases or amino acids.
These are brief definitions; I'm assuming you are either already familiar with them or are willing to consult an introductory textbook on molecular biology for more specific details.
Table 4-1 shows bases; add a sugar and you get the nucleotides adenosine, guanosine,