IT-SC 271
sequencer produced on those days. After a few years, you could have quite a number of files.
Then, one day you discover a new sequence of DNA that seems to be implicated in cell division. You do a BLAST search (see Chapter 12) but find no significant hits for your new DNA. At that point you want to know whether you've seen this DNA before in any previous sequencing runs.[1]
What you need to do is run a comparison subroutine on each of the hundreds or thousands of files in all your various sequencing run subdirectories. But that's going to take several days of repetitive, boring work sitting at the computer screen. You can write a program in much less time than that! Then all you have to do is sit back and examine the results of any significant matches your program finds. To write the program, however, you have to know how to manipulate all the files and folders in Perl. The following sections show you how to do it.
[1] You may do a comparison by keeping copies of all your sequencing runs in one large BLAST library; building such a BLAST library can be done using the techniques shown in this section.
11.2.1 Opening Directories
A filesystem is organized in a tree structure. The metaphor is apt. Starting from anyplace on the tree, you can proceed up the branches and get to any leaves that stem from your
starting place. If you start from the root of the tree, you can reach all the leaves. Similarly, in a filesystem, if you start at a certain directory, you can reach all the files in all the
subdirectories that stem from your starting place, and if you start at the root (which, strangely enough, is also called the "top" of the filesystem), you can reach all the files.
You've already had plenty of practice opening, reading from, writing to, and closing files. I will show a simple method with which you can open a folder (also called a directory) and get the filenames of all the files in that folder. Following that, you'll see how to get the names of all files from all directories and subdirectories from a certain starting point.
Let's look at the Perlish way to list all the files in a folder, beginning with some pseudocode:
open folder
read contents of folder (files and subfolders)
print their names
Example 11-1 shows the actual Perl code.
Example 11-1. Listing the contents of a folder or directory
#!/usr/bin/perl
# Demonstrating how to open a folder and list its contents

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

my @files = ();
my $folder = 'pdb';

# open the folder
unless(opendir(FOLDER, $folder)) {
    print "Cannot open folder $folder\n";
    exit;
}

# read the contents of the folder (i.e. the files and subfolders)
@files = readdir(FOLDER);

# close the folder
closedir(FOLDER);

# print them out, one per line
print join( "\n", @files), "\n";

exit;
Since you're running this program on a folder that contains PDB files, this is what you'll see:
.
..
3c
44
pdb1a4o.ent
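Example 11-1 uses a bareword directory handle (FOLDER). As a minimal sketch of the same open, read, and close steps, the code below wraps them in a subroutine that uses a lexical handle and reports the system error message in $!. The subroutine name list_folder is hypothetical; it is not part of BeginPerlBioinfo.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# list_folder is a hypothetical helper, not part of BeginPerlBioinfo.
# It returns the raw entries of a folder, including "." and "..".
sub list_folder {
    my ($folder) = @_;

    # $! holds the reason the system gave for the failure
    opendir(my $dh, $folder)
        or die "Cannot open folder $folder: $!\n";

    my @entries = readdir($dh);
    closedir($dh);

    return @entries;
}

# List the current directory, one entry per line, in sorted order
my @entries = list_folder('.');
print join("\n", sort @entries), "\n";
```

Unlike Example 11-1, this version dies with an error message instead of printing a message and exiting quietly; which behavior is preferable depends on how the program is used.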
If you want to list the files in the current directory, you can give the directory name the special name "." for the current directory, like so:
my $folder = '.';
On Unix or Linux systems, the special files "." and ".." refer to the current directory and the parent directory, respectively. These aren't really files, at least not files you'd want to read; you can avoid listing them with the wonderful and amazing grep function. grep allows you to select elements from an array based on a test, such as a regular expression. Here's how to filter out the array entries "." and "..":
@files = grep( !/^\.\.?$/, @files);
grep selects all array elements that don't match the regular expression, due to the negation operator written as the exclamation mark. The regular expression /^\.\.?$/ is looking for an element that begins with (the beginning of a line is indicated with the ^ metacharacter) a period \. (escaped with a backslash since a period is a metacharacter), followed by 0 or 1 periods \.? (the ? matches 0 or 1 of the preceding items), and nothing more (indicated by the $ end-of-string metacharacter).
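To see the filter in isolation, here is a small sketch that applies the same regular expression to a hand-written list instead of a real directory. The entry names are made up for illustration; ".hidden" and "..." are included to show that only the exact entries "." and ".." are removed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up directory entries, for illustration only
my @entries = ('.', '..', '3c', '44', '.hidden', '...', 'pdb1a4o.ent');

# Keep every entry that does NOT match "one period, optionally followed
# by one more period, and nothing else"
my @files = grep { !/^\.\.?$/ } @entries;

print join("\n", @files), "\n";
```

The five entries that survive are 3c, 44, .hidden, ..., and pdb1a4o.ent. A looser pattern such as /^\./ would also discard .hidden, which is why the expression is anchored at both ends.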
In fact, this is so often used when reading a directory that it's usually combined into one step:
@files = grep( !/^\.\.?$/, readdir(FOLDER));
Okay, now all the files are listed. But wait: what if some of these files aren't files at all but are subfolders? You can use the handy file test operators to test each filename and then even open each subfolder and list the files in them. First, some pseudocode:
open folder
for each item in the folder
    if it's a file
        print its name
    else if it's a folder
        open the folder
        print the names of the contents of the folder
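The two file test operators this pseudocode relies on are -f (true for a regular file) and -d (true for a directory). The following is a minimal sketch of them on their own, applied to the entries of the current directory; the helper name entry_type is hypothetical and is used only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# entry_type is a hypothetical helper for illustration.
# It classifies a path with Perl's file test operators.
sub entry_type {
    my ($path) = @_;

    return 'file'   if -f $path;   # regular file
    return 'folder' if -d $path;   # directory
    return 'other';                # anything else, or a path that doesn't exist
}

my $folder = '.';

opendir(my $dh, $folder) or die "Cannot open folder $folder: $!\n";
my @entries = grep { !/^\.\.?$/ } readdir($dh);
closedir($dh);

# readdir returns bare names, so the folder name is prepended before testing
foreach my $entry (sort @entries) {
    print "$folder/$entry: ", entry_type("$folder/$entry"), "\n";
}
```

Testing the bare name without the folder prefix would check for an entry of that name in the current working directory, which is only correct when the folder being listed is ".".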
Example 11-2 shows the program.
Example 11-2. List contents of a folder and its subfolders
#!/usr/bin/perl
# Demonstrating how to open a folder and list its contents
# --distinguishing between files and subfolders, which
#   are themselves listed

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

my @files = ();
my $folder = 'pdb';

# Open the folder
unless(opendir(FOLDER, $folder)) {
    print "Cannot open folder $folder\n";
    exit;
}

# Read the folder, ignoring special entries "." and ".."
@files = grep( !/^\.\.?$/, readdir(FOLDER));

closedir(FOLDER);

# If file, print its name
# If folder, print its name and contents
#
# Notice that we need to prepend the folder name
foreach my $file (@files) {

    # If the folder entry is a regular file
    if (-f "$folder/$file") {
        print "$folder/$file\n";

    # If the folder entry is a subfolder
    }elsif( -d "$folder/$file") {

        my $folder = "$folder/$file";

        # open the subfolder and list its contents
        unless(opendir(FOLDER, $folder)) {
            print "Cannot open folder $folder\n";
            exit;
        }

        my @files = grep( !/^\.\.?$/, readdir(FOLDER));

        closedir(FOLDER);

        foreach my $file (@files) {
            print "$folder/$file\n";
        }
    }
}

exit;
Here's the output of Example 11-2:
pdb/3c/pdb43c9.ent
pdb/3c/pdb43ca.ent
pdb/44/pdb144d.ent
pdb/44/pdb144l.ent
pdb/44/pdb244d.ent
pdb/44/pdb244l.ent
pdb/44/pdb344d.ent
pdb/44/pdb444d.ent
pdb/pdb1a4o.ent
Notice how variable names such as $file and @files have been reused in this code, using lexical scoping in the inner blocks with my. If the overall structure of the program weren't so short and simple, this could get really hard to read. When the program says $file, does it mean this $file or that $file? This code is an example of how to get into trouble. It works, but it's hard to read, despite its brevity.
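One way to reduce the ambiguity, sketched here rather than taken from the original program, is to move the open, read, and close steps into a subroutine and give the outer and inner loop variables distinct names. The names folder_entries, list_two_levels, $entry, and $subentry are all hypothetical. Like Example 11-2, the sketch handles only one level of subfolders.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# folder_entries is a hypothetical helper: it returns the sorted entries
# of a folder without "." and "..", or an empty list if it cannot be opened.
sub folder_entries {
    my ($folder) = @_;

    my $dh;
    unless (opendir($dh, $folder)) {
        warn "Cannot open folder $folder: $!\n";
        return ();
    }
    my @entries = grep { !/^\.\.?$/ } readdir($dh);
    closedir($dh);

    @entries = sort @entries;
    return @entries;
}

# list_two_levels (also hypothetical) returns the regular files in $top,
# plus every entry found in the immediate subfolders of $top.
sub list_two_levels {
    my ($top) = @_;
    my @paths = ();

    foreach my $entry (folder_entries($top)) {
        my $path = "$top/$entry";

        if (-f $path) {
            push @paths, $path;
        } elsif (-d $path) {
            # $subentry cannot be mistaken for the outer $entry
            foreach my $subentry (folder_entries($path)) {
                push @paths, "$path/$subentry";
            }
        }
    }
    return @paths;
}

print "$_\n" foreach list_two_levels('.');
```

With the directory handling in one place, each loop reads as a single step, and no name is declared twice in nested scopes.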
In fact, there's a deeper problem with Example 11-2. It's not well designed. By extending Example 11-1, it can now list subdirectories. But what if there are further levels of subdirectories?
11.2.2 Recursion