
Iterating through the lines and collecting the record types would be the simplest programming approach. Example 11-5 contains a subroutine, parsePDBrecordtypes, that parses the PDB record types from an array containing the lines of the PDB record. This is a short, clean subroutine that accomplishes what is needed. The comments describe what's happening pretty well, which, as you know, is a critical factor in writing good code. Basically, each line is examined for its record type and is then appended to the value of a hash entry with the record type as the key. The hash is returned from the subroutine.
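
To make the idea concrete, here is a minimal sketch of such a subroutine. It is not the book's listing, just one way to write it. It assumes that the record type occupies the first six columns of each line, with trailing spaces stripped, and the sample lines at the bottom are made up for illustration rather than taken from a real PDB entry.

```perl
use strict;
use warnings;

# Sketch: walk the lines of a PDB file and append each line to a
# hash entry keyed by its record type.
# Assumption: the record type is the first six columns of the line,
# with trailing whitespace removed (e.g. "ATOM  " becomes "ATOM").
sub parsePDBrecordtypes {
    my @file = @_;

    my %recordtypes = ();

    foreach my $line (@file) {
        # Skip blank lines so they do not create an empty key
        next if $line =~ /^\s*$/;

        # Columns 1-6 hold the record type
        my $recordtype = substr($line, 0, 6);
        $recordtype =~ s/\s+$//;

        # Append the whole line to the string stored under this type
        $recordtypes{$recordtype} .= $line;
    }

    return %recordtypes;
}

# Usage with a few made-up lines (not from a real PDB entry)
my @lines = (
    "HEADER    EXAMPLE PROTEIN\n",
    "SEQRES   1 A    3  MET LYS VAL\n",
    "SEQRES   1 B    2  GLY ALA\n",
);
my %types = parsePDBrecordtypes(@lines);
print "$_\n" foreach sort keys %types;
```

Because every line of a given type is appended to the same string, a multiline record type such as SEQRES ends up as one scalar containing embedded newlines.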

11.4.1 Extracting Primary Sequence

Now that the record types have been parsed out, let's examine the subroutine extractSEQRES, which extracts the primary amino acid sequence. You need to extract each chain separately and return an array of one or more strings of sequence corresponding to those chains, instead of just one sequence. The previous parse, in Example 11-4, left the required SEQRES record type, which stretches over several lines, in a scalar string that is the value of the key SEQRES in a hash.

Our success with the previous parsePDBrecordtypes subroutine, which iterated over lines instead of applying regular expressions to multiline strings, leads to the same approach here. The Perl split function enables you to turn a multiline string into an array of lines. As you iterate through the lines of the SEQRES record type, notice when a new chain is starting, save the previous chain in results, reset the sequence array, and reset the lastchain flag to the new chain. When you are done with all the lines, make sure to save the last sequence chain in the results array. Also verify, by exploring the Perl documentation for split, that the function does what you want with the arguments you gave it.

The third and final subroutine of Example 11-5 is called iub3to1. Since the sequence information in PDB is in three-character codes, you need this subroutine to change those sequences into one-character codes. It uses a straightforward hash lookup to perform the translation.

We've now decomposed the problem into a few complementary subroutines. It's always an interesting question how best to divide a problem into cooperating subroutines. You could put the call to iub3to1 inside the extractSEQRES subroutine; that might be a cleaner way to package these subroutines together, since, outside the PDB file format, you won't have use for the strings of amino acids in three-character codes. The important observation at this juncture is that a few short subroutines, tied together with a very short main program, were sufficient to do a great deal of parsing of PDB files.
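
The chain-splitting and translation steps described above can be sketched as follows. This is an illustrative version under stated assumptions, not necessarily the code of Example 11-5. It assumes the fixed SEQRES layout, with the chain identifier in column 12 and residue names starting in column 20. It returns each chain as a string of space-separated three-character codes, and mapping unknown codes to X is a choice made for this sketch only. The sample SEQRES lines are made up.

```perl
use strict;
use warnings;

# Sketch: split the multiline SEQRES string into lines and collect
# the residues of each chain into its own string.
# Assumption: chain ID is in column 12, residues start in column 20.
sub extractSEQRES {
    my ($seqres) = @_;

    my $lastchain = '';
    my @sequence  = ();   # residues of the chain being collected
    my @results   = ();   # one string per finished chain

    foreach my $line (split /\n/, $seqres) {
        # Ignore lines too short to hold any residue names
        next unless length($line) > 19;

        my $thischain = substr($line, 11, 1);
        my @residues  = split ' ', substr($line, 19, 52);

        # A change of chain ID means the previous chain is complete
        if (@sequence and $thischain ne $lastchain) {
            push @results, join(' ', @sequence);
            @sequence = ();
        }

        push @sequence, @residues;
        $lastchain = $thischain;
    }

    # Do not forget the last chain
    push @results, join(' ', @sequence) if @sequence;

    return @results;
}

# Sketch: translate three-character codes to one-character codes
# with a hash lookup.
sub iub3to1 {
    my ($input) = @_;

    my %three2one = (
        ALA => 'A', ARG => 'R', ASN => 'N', ASP => 'D', CYS => 'C',
        GLN => 'Q', GLU => 'E', GLY => 'G', HIS => 'H', ILE => 'I',
        LEU => 'L', LYS => 'K', MET => 'M', PHE => 'F', PRO => 'P',
        SER => 'S', THR => 'T', TRP => 'W', TYR => 'Y', VAL => 'V',
        ASX => 'B', GLX => 'Z',
    );

    my $seq = '';
    foreach my $code (split ' ', $input) {
        my $key = uc $code;
        # Assumption of this sketch: unknown codes become 'X'
        $seq .= exists($three2one{$key}) ? $three2one{$key} : 'X';
    }
    return $seq;
}

# A very short "main program" using made-up SEQRES lines
my $seqres = join '', (
    "SEQRES   1 A    3  MET LYS VAL\n",
    "SEQRES   1 B    2  GLY ALA\n",
);
foreach my $chain (extractSEQRES($seqres)) {
    print iub3to1($chain), "\n";
}
```

With these two sample lines, the loop prints MKV for chain A and GA for chain B. Keeping iub3to1 separate, as here, makes it easy to test on its own; calling it inside extractSEQRES would give the caller one-character sequences directly.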

11.4.2 Finding Atomic Coordinates