
E-mail:fujinomicrob.med.keio.ac.jp, Tel:+81-3-3353-1211ex.62692, Fax:+81-3-5360-1508
ACCESSION
ACCESSION AB031069
LOCUS
LOCUS AB031069 2487 bp mRNA PRI 27-MAY-2000
ORIGIN
ORIGIN
BASE
BASE COUNT 564 a 715 c 768 g 440 t

As you can see, the method works. Apart from the difficulty of reading the regular expressions, which becomes easier with practice, the code is very straightforward: just a few short subroutines.

10.4.5 Parsing the FEATURES Table

Let's take this one step further and parse the FEATURES table to its next level, composed of the source, gene, and CDS feature keys. See later in this section for a more complete list of these feature keys. In the exercises at the end of the chapter, you'll be challenged to descend further into the FEATURES table. To study the FEATURES table, you should first look over the NCBI gbrel.txt document mentioned previously. Then you should study the most complete documentation for the FEATURES table, available at http://www.ncbi.nlm.nih.gov/collab/FT/index.html.
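To make "the next level" concrete, here is a minimal sketch of splitting a FEATURES table into its individual features. It is written in Python purely for illustration and is not the chapter's own code. The sample text is a made-up excerpt, and the name split_features is hypothetical. The sketch assumes the usual GenBank flat-file layout, in which a feature key is indented five spaces and its continuation lines are indented 21 spaces.

```python
import re

# Hypothetical excerpt laid out like a GenBank FEATURES table
# (made-up values; not taken from a real record).
SAMPLE = (
    "FEATURES             Location/Qualifiers\n"
    "     source          1..120\n"
    "                     /organism=\"Homo sapiens\"\n"
    "     gene            10..90\n"
    "                     /gene=\"EXAMPLE1\"\n"
    "     CDS             10..90\n"
    "                     /gene=\"EXAMPLE1\"\n"
    "                     /codon_start=1\n"
)


def split_features(table):
    """Return a list of (key, text) pairs, one per feature, in file order."""
    features = []
    for line in table.splitlines():
        # A new feature starts with exactly 5 spaces followed by its key.
        match = re.match(r"^ {5}(\S+)", line)
        if match:
            features.append([match.group(1), line + "\n"])
        elif features and line.startswith(" " * 21):
            # Continuation line: it belongs to the most recent feature.
            features[-1][1] += line + "\n"
        # Anything else (such as the FEATURES header line) is ignored.
    return [(key, text) for key, text in features]


for key, text in split_features(SAMPLE):
    print(key)
```

The sketch never mentions source, gene, or CDS by name. It relies only on the indentation, so any other feature key that appears in a record is picked up the same way.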

10.4.5.1 Features

Although our GenBank entry is fairly simple and includes only three features, GenBank defines quite a few more. Notice that the parsing code will find all of them, because it's just looking at the structure of the document, not for specific features. The following is a list of the features defined for GenBank records. Although it's lengthy, I think it's important to read through it to get an idea of the range of information that may be present in a GenBank record.

allele
  Obsolete; see variation feature key
attenuator
  Sequence related to transcription termination
C_region
  Span of the C immunological feature
CAAT_signal
  CAAT box in eukaryotic promoters
CDS
  Sequence coding for amino acids in protein (includes stop codon)
conflict
  Independent sequence determinations differ
D-loop
  Displacement loop
D_segment
  Span of the D immunological feature
enhancer
  Cis-acting enhancer of promoter function
exon
  Region that codes for part of spliced mRNA
gene
  Region that defines a functional gene, possibly including upstream (promoter, enhancer, etc.) and downstream control elements, and for which a name has been assigned
GC_signal
  GC box in eukaryotic promoters
iDNA
  Intervening DNA eliminated by recombination
intron
  Transcribed region excised by mRNA splicing
J_region
  Span of the J immunological feature
LTR
  Long terminal repeat
mat_peptide
  Mature peptide coding region (doesn't include stop codon)
misc_binding
  Miscellaneous binding site
misc_difference
  Miscellaneous difference feature
misc_feature
  Region of biological significance that can't be described by any other feature
misc_recomb
  Miscellaneous recombination feature
misc_RNA
  Miscellaneous transcript feature not defined by other RNA keys
misc_signal
  Miscellaneous signal
misc_structure
  Miscellaneous DNA or RNA structure
modified_base
  The indicated base is a modified nucleotide
mRNA
  Messenger RNA
mutation
  Obsolete; see variation feature key
N_region
  Span of the N immunological feature
old_sequence
  Presented sequence revises a previous version
polyA_signal
  Signal for cleavage and polyadenylation
polyA_site
  Site at which polyadenine is added to mRNA
precursor_RNA
  Any RNA species that isn't yet the mature RNA product
prim_transcript
  Primary (unprocessed) transcript
primer
  Primer binding region used with PCR
primer_bind
  Noncovalent primer binding site
promoter
  A region involved in transcription initiation
protein_bind
  Noncovalent protein binding site on DNA or RNA
RBS
  Ribosome binding site
rep_origin
  Replication origin for duplex DNA
repeat_region
  Sequence containing repeated subsequences
repeat_unit
  One repeated unit of a repeat_region
rRNA
  Ribosomal RNA
S_region
  Span of the S immunological feature
satellite
  Satellite repeated sequence
scRNA
  Small cytoplasmic RNA
sig_peptide
  Signal peptide coding region
snRNA
  Small nuclear RNA
source
  Biological source of the sequence data represented by a GenBank record; mandatory feature, one or more per record; for organisms that have been incorporated within the NCBI taxonomy database, an associated /db_xref="taxon:NNNN" qualifier will be present (where NNNN is the numeric identifier assigned to the organism within the NCBI taxonomy database)
stem_loop
  Hairpin loop structure in DNA or RNA
STS
  Sequence Tagged Site: operationally unique sequence that identifies the combination of primer spans used in a PCR assay
TATA_signal
  TATA box in eukaryotic promoters
terminator
  Sequence causing transcription termination
transit_peptide
  Transit peptide coding region
transposon
  Transposable element (TN)
tRNA
  Transfer RNA
unsure
  Authors are unsure about the sequence in this region
V_region
  Span of the V immunological feature
variation
  A related population contains stable mutation
-
  Placeholder (hyphen)
-10_signal
  Pribnow box in prokaryotic promoters
-35_signal
  -35 box in prokaryotic promoters
3'clip
  3'-most region of a precursor transcript removed in processing
3'UTR
  3' untranslated region (trailer)
5'clip
  5'-most region of a precursor transcript removed in processing
5'UTR
  5' untranslated region (leader)

These feature keys can have their own additional features, which you'll see here and in the exercises.
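In a GenBank record, the extra information attached to a feature key consists of a location followed by qualifiers written as /name=value. As a rough illustration of how one feature might be broken down further, here is a small Python sketch. It is not the chapter's own code. The sample feature is made up, and the name parse_feature is hypothetical. The sketch assumes that qualifier lines start with a slash and that any other line continues the previous value. A wrapped value whose continuation line happens to begin with a slash would be misread, so treat this as a starting point only.

```python
import re

# Hypothetical text of one feature in GenBank layout (made-up values).
FEATURE = (
    "     CDS             10..90\n"
    "                     /gene=\"EXAMPLE1\"\n"
    "                     /codon_start=1\n"
    "                     /product=\"example protein with a long\n"
    "                     descriptive name\"\n"
)


def parse_feature(text):
    """Split one feature into its key, its location, and its qualifiers."""
    lines = text.splitlines()
    # The first line holds the feature key and the start of the location.
    key, location = lines[0].split(None, 1)
    qualifiers = {}
    name = None
    for line in lines[1:]:
        body = line.strip()
        match = re.match(r"/([^=\s]+)(?:=(.*))?$", body)
        if match:
            # A new qualifier; the same name may occur more than once,
            # so each name maps to a list of values.
            name = match.group(1)
            qualifiers.setdefault(name, []).append(match.group(2) or "")
        elif name is None:
            # No qualifier seen yet, so the location has wrapped.
            location += body
        else:
            # Continuation of the previous qualifier's value.
            qualifiers[name][-1] += " " + body
    # Remove the double quotes that surround text values.
    cleaned = {n: [v.strip('"') for v in vals] for n, vals in qualifiers.items()}
    return key, location, cleaned


key, location, qualifiers = parse_feature(FEATURE)
print(key, location, sorted(qualifiers))
```

Values are stored in lists because a qualifier name such as db_xref can appear several times within one feature.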

10.4.5.2 Parsing