IT-SC 254
E-mail:fujino@microb.med.keio.ac.jp, Tel:+81-3-3353-1211(ex.62692), Fax:+81-3-5360-1508
ACCESSION
ACCESSION AB031069
LOCUS
LOCUS AB031069 2487 bp mRNA PRI 27-MAY-2000
ORIGIN
ORIGIN
BASE
BASE COUNT 564 a 715 c 768 g 440 t
As you can see, the method works. Apart from the difficulty of reading the regular expressions, which will become easier with practice, the code is very straightforward: just a few short subroutines.
10.4.5 Parsing the FEATURES Table
Let's take this one step further and parse the features table to its next level, composed of the source, gene, and CDS feature keys. See later in this section for a more complete list of these feature keys. In the exercises at the end of the chapter, you'll be challenged to descend further into the FEATURES table.
To study the FEATURES table, you should first look over the NCBI gbrel.txt document mentioned previously. Then you should study the most complete documentation for the FEATURES table, available at http://www.ncbi.nlm.nih.gov/collab/FT/index.html.
10.4.5.1 Features
Although our GenBank entry is fairly simple and includes only three features, there are actually quite a few feature keys defined. Notice that the parsing code will find all of them, because it's just looking at the structure of the document, not for specific features.
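To make the point about structure concrete, here is a minimal sketch of splitting a features table by layout alone. It assumes that each feature begins on a line indented by exactly five spaces and that every other non-blank line continues the current feature. The subroutine name split_features and the sample data are hypothetical; the fragment is made up for illustration and is not the actual record.

```perl
use strict;
use warnings;

# Hypothetical, abbreviated FEATURES fragment, for illustration only.
my $features_table = <<'END';
     source          1..2487
                     /organism="Homo sapiens"
     gene            1..2487
                     /gene="EXAMPLE"
     CDS             80..1144
                     /gene="EXAMPLE"
                     /codon_start=1
END

# Split the table into features using layout alone. A feature starts on a
# line with exactly five leading spaces followed by a non-space character;
# any other non-blank line is appended to the current feature.
sub split_features {
    my ($table) = @_;
    my @features;
    for my $line (split /^/m, $table) {
        if ($line =~ /^ {5}\S/) {
            push @features, $line;      # start a new feature
        }
        elsif (@features and $line =~ /\S/) {
            $features[-1] .= $line;     # continuation line
        }
    }
    return @features;
}

my @features = split_features($features_table);

# Report the feature key of each feature that was found.
for my $feature (@features) {
    my ($key) = $feature =~ /^ {5}(\S+)/;
    print "Found feature: $key\n";
}
```

Nothing in the subroutine names a particular feature key, so any key from the list of feature keys would be picked up the same way.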
The following is a list of the features defined for GenBank records. Although it's lengthy, I think it's important to read through it to get an idea of the range of information that may be present in a GenBank record.
allele Obsolete; see variation feature key
attenuator Sequence related to transcription termination
C_region Span of the C immunological feature
CAAT_signal CAAT box in eukaryotic promoters
CDS Sequence coding for amino acids in protein (includes stop codon)
conflict Independent sequence determinations differ
D-loop Displacement loop
D_segment Span of the D immunological feature
enhancer Cis-acting enhancer of promoter function
exon Region that codes for part of spliced mRNA
gene Region that defines a functional gene, possibly including upstream (promoter, enhancer, etc.) and downstream control elements, and for which a name has been assigned
GC_signal GC box in eukaryotic promoters
iDNA Intervening DNA eliminated by recombination
intron Transcribed region excised by mRNA splicing
J_region Span of the J immunological feature
LTR Long terminal repeat
mat_peptide Mature peptide coding region (doesn't include stop codon)
misc_binding Miscellaneous binding site
misc_difference Miscellaneous difference feature
misc_feature Region of biological significance that can't be described by any other feature
misc_recomb Miscellaneous recombination feature
misc_RNA Miscellaneous transcript feature not defined by other RNA keys
misc_signal Miscellaneous signal
misc_structure Miscellaneous DNA or RNA structure
modified_base The indicated base is a modified nucleotide
mRNA Messenger RNA
mutation Obsolete; see variation feature key
N_region Span of the N immunological feature
old_sequence Presented sequence revises a previous version
polyA_signal Signal for cleavage and polyadenylation
polyA_site Site at which polyadenine is added to mRNA
precursor_RNA Any RNA species that isn't yet the mature RNA product
prim_transcript Primary (unprocessed) transcript
primer Primer binding region used with PCR
primer_bind Noncovalent primer binding site
promoter A region involved in transcription initiation
protein_bind Noncovalent protein binding site on DNA or RNA
RBS Ribosome binding site
rep_origin Replication origin for duplex DNA
repeat_region Sequence containing repeated subsequences
repeat_unit One repeated unit of a repeat_region
rRNA Ribosomal RNA
S_region Span of the S immunological feature
satellite Satellite repeated sequence
scRNA Small cytoplasmic RNA
sig_peptide Signal peptide coding region
snRNA Small nuclear RNA
source Biological source of the sequence data represented by a GenBank record; mandatory feature, one or more per record; for organisms that have been incorporated within the NCBI taxonomy database, an associated /db_xref="taxon:NNNNN" qualifier will be present (where NNNNN is the numeric identifier assigned to the organism within the NCBI taxonomy database)
stem_loop Hairpin loop structure in DNA or RNA
STS Sequence Tagged Site: operationally unique sequence that identifies the combination of primer spans used in a PCR assay
TATA_signal TATA box in eukaryotic promoters
terminator Sequence causing transcription termination
transit_peptide Transit peptide coding region
transposon Transposable element (TN)
tRNA Transfer RNA
unsure Authors are unsure about the sequence in this region
V_region Span of the V immunological feature
variation A related population contains stable mutation
- Placeholder (hyphen)
-10_signal Pribnow box in prokaryotic promoters
-35_signal -35 box in prokaryotic promoters
3'clip 3'-most region of a precursor transcript removed in processing
3'UTR 3' untranslated region (trailer)
5'clip 5'-most region of a precursor transcript removed in processing
5'UTR 5' untranslated region (leader)
These feature keys can have their own additional features, which you'll see here and in the exercises.
10.4.5.2 Parsing