    $verbose and print $helpful_but_verbose_message;

Of course, the if statement is more flexible, because it allows you to easily add more statements to the block, and elsif and else conditions to their own blocks. But for simple situations, the and operator works well. [1]

[1] You can even chain logical operators one after the other to build up more complicated expressions and use parentheses to group them. Personally, I don't like that style much, but in Perl, there's more than one way to do it.

The logical operator or evaluates and returns the left argument if it's true; if the left argument doesn't evaluate to true, the or operator then evaluates and returns the right argument. So here's another way to write a one-line statement that you'll often see in Perl programs:

    open(MYFILE, $file) or die "I cannot open file $file: $!";

This is basically equivalent to our frequent:

    unless (open(MYFILE, $file)) {
        print "I cannot open file $file\n";
        exit;
    }

Let's go back and take a look at the parseREBASE subroutine with the line:

    ( 1 .. /Rich Roberts/ ) and next;

The left argument is the range 1 .. /Rich Roberts/. When you're in that range of lines, the range operator returns a true value. Because it's true, the and operator goes on to evaluate its right argument, the next function, which immediately takes you back to the next iteration of the enclosing foreach loop. So if you're between the first line and the Rich Roberts line, you skip the rest of the loop.

Similarly, the line:

    /^\s*$/ and next;

takes you back to the next iteration of the foreach if the left argument, which matches a blank line, is true.

The other parts of this parseREBASE subroutine have already been discussed during the design phase.
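The one-line idioms above can be tried out on a small scale. The following is only a sketch under stated assumptions: the sample text is a made-up stand-in for a REBASE-style file, and the subroutine name data_lines is hypothetical, not part of parseREBASE. The lines are read from an in-memory filehandle because a constant operand of the range operator, such as the 1 here, is compared against the current input line number $., which is only set while reading from a filehandle.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up stand-in for a REBASE-style file: header lines up to
# "Rich Roberts", then data lines separated by blank lines.
my $text = <<'END';
REBASE sample header
Rich Roberts

AccII CGCG

AceI GCWGC
END

# Hypothetical helper: return the data lines of $contents,
# skipping the header and any blank lines.
sub data_lines {
    my ($contents) = @_;
    my @kept = ();

    # "or die" runs its right side only if open returns false
    open(my $fh, '<', \$contents) or die "Cannot open in-memory file: $!";

    while (<$fh>) {
        # True from line 1 through the line matching /Rich Roberts/
        ( 1 .. /Rich Roberts/ ) and next;

        # True for lines that are empty or contain only whitespace
        /^\s*$/ and next;

        chomp;
        push @kept, $_;
    }
    close $fh;

    return @kept;
}

my @lines   = data_lines($text);
my $verbose = 1;

# "and" runs its right side only if the left side is true
$verbose and print "Kept ", scalar(@lines), " data lines\n";
print "$_\n" foreach @lines;
```

Run as written, this should report two data lines and print the AccII and AceI lines. Setting $verbose to 0 suppresses only the first message.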

9.2.5 Finding the Restriction Sites

So now it's time to write a main program and see our code in action. Let's start with a little pseudocode to see what still needs to be done:

    Get DNA
        get_file_data
        extract_sequence_from_fasta_data

    Get the REBASE data into a hash, from file "bionet"
        parseREBASE('bionet');

    for each user query
        If query is defined in the hash
            Get positions of query in DNA
            Report on positions, if any

You now need to write a subroutine that finds the positions of the query in the DNA. Remember that trick of putting a global search in a while loop from Example 5-7 and take heart. No sooner said than:

    Given arguments $query and $dna

    while ( $dna =~ /$query/ig ) {
        save the position of the match
    }

    return @positions

When you used this trick before, you just counted how many matches there were, not what the positions were. Let's check the documentation for clues, specifically the list of built-in functions. It looks like the pos function will solve the problem. It gives the offset at which the last m//g search on a variable left off, that is, the position just past the end of the match.

Example 9-3 shows the main program followed by the required subroutine. It's a simple subroutine, given Perl functions like pos that make it easy.

Example 9-3. 
Make restriction map from user queries

    #!/usr/bin/perl
    # Make restriction map from user queries on names of restriction enzymes

    use strict;
    use warnings;
    use BeginPerlBioinfo;     # see Chapter 6 about this module

    # Declare and initialize variables
    my %rebase_hash = ();
    my @file_data = ();
    my $query = '';
    my $dna = '';
    my $recognition_site = '';
    my $regexp = '';
    my @locations = ();

    # Read in the file "sample.dna"
    @file_data = get_file_data("sample.dna");

    # Extract the DNA sequence data from the contents of the file "sample.dna"
    $dna = extract_sequence_from_fasta_data(@file_data);

    # Get the REBASE data into a hash, from file "bionet"
    %rebase_hash = parseREBASE('bionet');

    # Prompt user for restriction enzyme names, create restriction map
    do {
        print "Search for what restriction enzyme (or quit)?: ";

        $query = <STDIN>;
        chomp $query;

        # Exit if empty query
        if ($query =~ /^\s*$/) {
            exit;
        }

        # Perform the search in the DNA sequence
        if (exists $rebase_hash{$query}) {

            ($recognition_site, $regexp) = split(" ", $rebase_hash{$query});

            # Create the restriction map
            @locations = match_positions($regexp, $dna);

            # Report the restriction map to the user
            if (@locations) {
                print "Searching for $query $recognition_site $regexp\n";
                print "A restriction site for $query at locations:\n";
                print join(" ", @locations), "\n";
            } else {
                print "A restriction site for $query is not in the DNA:\n";
            }
        }
        print "\n";
    } until ($query =~ /quit/);

    exit;

    # Subroutine
    #
    # Find locations of a match of a regular expression in a string
    #
    # return an array of positions where the regular expression
    # appears in the string

    sub match_positions {

        my($regexp, $sequence) = @_;

        use strict;
        use BeginPerlBioinfo;     # see Chapter 6 about this module

        # Declare variables
        my @positions = ();

        # Determine positions of regular expression matches
        while ($sequence =~ /$regexp/ig) {
            push(@positions, pos($sequence) - length($&) + 1);
        }

        return @positions;
    }

Here is some sample output from Example 9-3:

    Search for what restriction enzyme (or quit)?: AceI
    Searching for AceI GCWGC GC[AT]GC
    A restriction site for AceI at locations:
    54 94 582 660 696 702 840 855 957

    Search for what restriction enzyme (or quit)?: AccII
    Searching for AccII CGCG CGCG
    A restriction site for AccII at locations:
    181

    Search for what restriction enzyme (or quit)?: AaeI
    A restriction site for AaeI is not in the DNA:

    Search for what restriction enzyme (or quit)?: quit

Notice the length($&) in the subroutine match_positions. $& is a special variable that's set after a successful regular-expression match. It stands for the sequence that matched the regular expression. Since pos gives the position of the first base following the match, you have to subtract the length of the matching sequence and then add one (so that bases are numbered from position 1 instead of position 0) to report the starting position of the match.

Other special variables include $`, which contains everything in the string before the successful match, and $', which contains everything in the string after the successful match. So, for example, '123456' =~ /34/ succeeds at setting these special variables like so: $` = 12, $& = 34, and $' = 56.

What we have here is admittedly bare bones, but it does work. See the exercises at the end of the chapter for ways to extend this code.
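To check the position arithmetic without the REBASE file or the BeginPerlBioinfo module, the subroutine can be exercised on its own. This is a minimal sketch, assuming a short made-up DNA string rather than the contents of sample.dna. The regular expressions are the ones shown for AccII and AceI in the sample output above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the 1-based start positions of every match of $regexp in $sequence.
sub match_positions {
    my ($regexp, $sequence) = @_;
    my @positions = ();

    while ($sequence =~ /$regexp/ig) {
        # pos() is the 0-based offset just past the match. Subtracting the
        # match length gives the 0-based start; adding 1 makes it 1-based.
        push @positions, pos($sequence) - length($&) + 1;
    }

    return @positions;
}

# Made-up DNA fragment, for illustration only
my $dna = 'AACGCGTTGCAGCCGCGA';

print join(' ', match_positions('CGCG', $dna)), "\n";        # prints "3 14"
print join(' ', match_positions('GC[AT]GC', $dna)), "\n";    # prints "9"

# The three match variables after a successful match
if ('123456' =~ /34/) {
    my ($before, $matched, $after) = ($`, $&, $');
    print "before=$before matched=$matched after=$after\n";
    # prints "before=12 matched=34 after=56"
}
```

Copying $`, $&, and $' into ordinary variables right after the match is a cautious habit, since the next successful match anywhere in the program overwrites them.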

9.3 Perl Operations