Directory UMM :Data Elmu:jurnal:P:PlantScience:Plant Science_BioMedNet:181-200:

90

Genome annotation: which tools do we have for it?
Pierre Rouzé*, Nathalie Pavy† and Stephane Rombauts‡
Genome data have to be converted into knowledge to be
useful to biologists. Many valuable computational tools have
already been developed to help annotation of plant genome
sequences, and these may be improved further, for example by
identification of more gene regulatory elements. The lack of a
standard computer-assisted annotation platform for eukaryotic
genomes remains a major bottle-neck.
Addresses
*†Laboratoire Associé de l’INRA (France), *†‡ Department of Genetics,
Flanders Interuniversity Institute of Biotechnology, University of Ghent,
KL Ledeganckstraat 35, B-9000 Gent, Belgium
*e-mail: Pierre.Rouze@gengenp.rug.ac.be
†e-mail: Nathalie.Pavy@gengenp.rug.ac.be
‡e-mail: Stephane.Rombauts@gengenp.rug.ac.be
Current Opinion in Plant Biology 1999, 2:90–95
http://biomednet.com/elecref/1369526600200090
© Elsevier Science Ltd ISSN 1369-5266

Abbreviations
AGI
Arabidopsis genome initiative
EST
expressed sequence tag

For a genomic DNA sequence, annotation is the process by
which an anonymous sequence is documented, by positioning along the sequence the various sites and segments
that are involved in the genome functionality. Many elements of possible biological relevance that are not, or not
yet, linked to a particular gene may well be annotated, such
as matrix attachment regions (MARs) [5] and sequence
repeats. Nevertheless, the basic objects in genome annotation are obviously the genes and the related elements that
are involved in their structure, their expression, and their
function or the function of their encoded products.
Here, we will particularly consider recent annotation
tools that have been developed for plant nuclear
genomes, and will also present tools and results obtained
from other genomes, as much more information is available from bacterial, yeast and mammalian (especially
human) genome projects. Besides, plant nuclear genomes
are not the only ones. The genomes from the many plantinteracting organisms for which genomics projects are

ongoing [6], as well as organellar genomes, are useful
sources of information for plant biologists.

Introduction

Structural and functional annotation

High throughput genome sequencing projects are producing an enormous amount of raw sequence data. For plants,
the complete sequence of the Arabidopsis thaliana nuclear
genome is expected for the year 2000; about one-third of
the expected 120 Mb whole sequence is already available
[1,2••] and sequencing the rice genome has begun (see
S Goff this issue pp 86–89). Unannotated sequences are
not useful as such, except to find a given DNA sequence,
because rich database queries cannot be done and the relevance of homology searches cannot be evaluated. What
biologists are looking for is information derived from the
genome sequence: markers, genes, mRNA or deduced
protein sequence, not the sequence itself. To be meaningful to them, the genome sequences have to be converted
into biologically significant knowledge; annotation is the
first step toward knowledge acquisition. Contrary to a

recent opinion [3], we believe that this step should better
be done as soon as possible. Annotation is certainly a risky
process and at present its result is far from optimal, even
for well-documented bacterial genomes [4•]. But it is a
trade-off, with the benefit for the plant research community as the ability to mine the sequence data more efficiently,
and hence to plan classic and genome-wide experiments
which would gain a wider biological knowledge more
quickly, and would include a validation of the annotation
itself. As pointed out by Meinke et al. [2••], a long list of
fundamental and technological research advances have
already resulted from the Arabidopsis genome initiative
(AGI); for a number of these advances the availability of
annotated sequence played a important role.

Although gene annotation has been in existence as long as
the sequence databases themselves, and is familiar to any
biologists who has cloned and sequenced a particular gene,
annotating genomic sequences is in fact a new kind of task.
Many tools used for genomic sequences of individual
genes are inadequate, firstly because of the size of the

genome contig sequences. Sequence size in itself is often
a problem for pre-genome software, together with the
inability to deal with sequences containing several genes
on both strands. Secondly, and more importantly, the
sequence of a single gene often comes as a result of a carefully designed experimental approach which supplies
biological information, contrary to systematically obtained
genome sequences [3]. Functional genomics analysis in
plants [7•] should partially fill this gap by providing
genome-wide experimental data on function and expression of genes, and will undoubtedly contribute to a much
improved functional annotation in the near future.
Annotation is a two step process. The first step can be
described as structural annotation and consists of finding
the location of the biologically relevant sites and strings,
ending up with a coherent model for the whole sequence
where each object is properly defined and each object
component has a unique location. As the major objects are
genes, this step is often likened to gene finding, although
it is a true annotation step. After structural annotation has
been carried out, there is a generic description of the whole
sequence, which can now be used in a more powerful way

for database searches (e.g. for deduced protein sequences)

Genome annotation: which tools do we have for it? Rouzé, Pavy and Rombauts

and for experimental purposes (e.g. to design primers for
exons). The second step is an information processing step,
and is best described as functional annotation. It consists
of attributing a range of specific biologically relevant information (e.g. species, source, gene, gene product or domain
name and function) to the sequence as a whole, to each
compound object and to each individual gene component.
This information may reside inside the database proper,
for example in specified fields and in feature tables as in
the GenBank/EMBL/DDBJ public DNA databases or
may be obtained through links to other databases [8,9•].
SWISS-PROT, which now offers links to 29 different databases [10•], is the best example of functional annotation
and this expert curated protein database represents a turning point in interconnected databases.
Structural annotation: finding and locating the genes

Computational approaches to gene finding are a very active
field in bioinformatics with significant advances in the past

few years. Several reviews aimed at a variety of readerships
have been written recently on this issue
[11,12•,13•,14,15••,16••] and a useful bibliography on computational gene recognition is maintained on the Internet
by Li [17]. Fundamentally, there are two ways to find a
gene. The more obvious way is to search in sequence databases for a homologue. Besides giving the exon/intron
structure, the homology approach also gives some indication of gene function. This is done using the familiar
BLAST or FASTA suite of programs; BLASTN searching
in DNA databases is useful for the really close homologues
which include expressed sequence tags (ESTs) whereas
BLASTX searching in protein databases is much more sensitive. New flavours of BLAST have been introduced:
allowing gapping (BLAST2) and higher seach sensitivity
through an iterative process (PSI-BLAST) [18], taking into
account pattern information [19], or providing interactive
and specially tuned searching for large contig analysis
(PowerBLAST) [20]. These approaches, nevertheless,
depend on the existence of homologues in the database,
and on their correct annotation [3]. In practice, for the
Arabidopsis genome, a close homologue can be found for
about one third of the genes, and either distant or partial
similarities for an additional third, at best. Modeling gene

structure from homology alone is valid only for the first
third. Specific software such as PROCRUSTES [21•,22],
EST_GENOME [23], sim4 [24•] and EbEST [25] has
been written to find structure genes through comparison
with ESTs or distant cDNAs. AAT is an analysis and annotation tool with programs for comparing the query sequence
with a protein database [26], and is the only one routinely
used for Arabidopsis. Bork and Koonin [27] have given a
perceptive analysis of the issue of predicted protein
sequences and pointed to the potential problems and bottlenecks of this exercise these being the lack of an accepted
gene prediction system integrating a robust and updated
suite of sequence analysis methods first, and second, the
poor or misleading information in databases which easily
ends up in bad annotations which then propagate.

91

To search for genes with no homologue, and to confirm and
extend the ones with poor homologues, programs have
been developed to find genes only from the knowledge of
the sequence. They rely on characteristic gene properties

that leave clues in their sequence [11] which are mostly
linked to their ability to be expressed, for example regularity and nucleotide bias in the coding sequence, motifs
and composition bias for transcription and translation.
Many algorithms have been written to find either gene elements (e.g. start or splice sites), compound objects
(e.g. exons) or to model genes as a whole. Obviously the
task differs from organism to organism. For prokaryotes,
where the coding information is dense and uninterrupted,
it is easier than for higher eukaryotes; two gene prediction
programs are the current standards: GeneMark [28] and
Glimmer [29]. For eukaryotes, software development is
largely driven towards the human genome. This software,
as well as software devised for yeast or C. elegans cannot be
used as such for the plant genomes. Even if the basic molecular gene replication and expression mechanisms are
conserved, there are significant taxa peculiarities
(e.g. trans-splicing in worms) and the genome style variation is a reality [30•]. Software have to be specifically
developed for each genome, therefore, or adapted, at best
retrained, from one genome to the other. Software developed for Arabidopsis, a dicot, is often not adapted to rice, a
monocot. Likewise, the validation measures made for a
given software in a given species, for example humans
[31], do not necessarily apply in another species for which

this software may have been adapted, such as Arabidopsis.
Several annotation programs have recently been developed
or adapted for gene prediction in Arabidopsis, most of them
used by the various AGI consortiums (see Tables 1 and 2)
Splice Predictor [32], NetPlantGene [33] and NetGene2
[34•] for splice site prediction, NetStart [35] for translational
start, GeneMark [28], the ACeDB GeneFinder (P Green,
personal communication), GRAIL [36], MZEF [37•],
Solovyev’s gene-finder suite (FGENEA, FEXA and ASPL)
[38] for exon prediction, GENSCAN [39•] and
GeneMark.hmm [40] for gene modelling on both strands. No
evaluation on the respective performance of this software is
yet available, but we are in the process of assessing it.
Suprisingly, GeneMark.hmm, a software not yet utilized for
annotation, appears to be the most efficient gene predictor
for Arabidopsis (N Pavy, S Rombauts, VV Ramana Daluvuri,
C Mathe, P Dehais et al., unpublished data). GeneGenerator
[41•] is an integrative gene prediction software developed for
maize; a version adapted for Arabidopsis is in preparation
(J Kleffe, personal communication). Even if the performance

of this software in Arabidopsis is reasonably high for sites or
individual exons, however, it has been observed during careful manual annotation of a 400 Kb contig (N Terryn et al.,
personal communication) that gene modeling remains poor
whatever the program — many genes are wrongly fused or
split, border exons are often missing or added wrongly, overlapping genes are not rare, although this has never been
experimentally observed. Similar observations were made for

92

Genome studies and molecular genetics

Table 1
The Arabidopsis Genome Initiative consortia and their annotation tools.
(a) Web sites of the consortia involved in Arabidopsis sequencing and annotation groups.
Consortium
TIGR
CSHL
KAOS
SPP


ESSA
Genoscope

The Institute for Genomic Research
Cold Spring Harbor Laboratory
Kazusa Institute
Stanford
University of Pennsylvania
Plant Genome Expression Center
Munich Information center for Protein Sequences (MIPS)
Centre National de séquençage (France), annotation: MIPS

http://www.tigr.org/tdb/at/atgenome.html
http://nucleus.cshl.org/protarab/
http://zearth.kazusa.or.jp/arabi/
http://sequencewww.stanford.edu/SPP.html

http://www.mips.biochem.mpg.de/proj/thal/
http://www.genoscope.cns.fr/

(b) Prediction programs used by the different annotation groups.
Annotators

TIGR
CHSL
SPP
MIPS
KAOS

NetPlant
Gene
x
x
x
x
x

MZEF

Gene-Finder

x

x
x

GeneFinder GENSCAN
x
x
x

x

x
x
x
x
x

GRAIL
x
x
x
x
x

GeneMark

AAT

TRNA
SCAN-SE

x

x
x
x
x*
x†

x
x

*For tRNA search MIPS uses tRNA scanner [61]. † tRNA search is done at the Kazusa Institute by searching for similarity in a RNA database.

the human [13•] and C. elegans [42•] genomes. In order to
improve the prediction of gene borders and to gain further
biologically relevant information, the prediction of 5′ and 3′
untranslated regions [43] and especially of core promoters
have been attempted [44•,45••] with limited success so far
[15••]. Computer prediction of tRNA genes is efficient [46].
Other biological objects or remarkable features are often
reported, such as transposable elements and repeats in
Arabidopsis for which two useful databases have been developed [47]. Software for MAR prediction has recently been
developed, on the basis of statistics of motifs often found
associated with MARs vertebrates [5]; its value for plants
needs further checking.
Functional annotation — the link to biological knowledge

Functional annotation of gene database entries relies on the
expertise of scientists. For genome annotation, especially for
the first pass, which has to be performed shortly after the
release of a contig sequence, the annotator does not have the
time to act as an expert. As there is no a priori knowledge of
the sequence, every attribute of information has to be
assessed by analogy, using a variety of computer tools. Many
aspects of this process are well analysed in a recent exhaustive review [48••]. One should notice first that, up to now, it
is essentially the molecular function of the gene products
that are annotated this way, and only more rarely the biological function of the gene. The information on protein
function can be derived from full length sequence comparisons, or better through multiple alignments, pattern and
domain searches and predictions of location and structure for
which a large spectrum of computer tools — too numerous to
be cited here — are now available [16••], many of them
through the Internet. Predicting the molecular function of

the putative proteins encoded by the genes has many documented pitfalls [4•,27,48••], which include the risk of
propagating a wrong annotation [3]. The only way to prevent
this would be for experts to check manually each entry [4•].
Clearly, the quality of the previous structural annotation step
will affect the quality of functional annotation. This is especially true for the many cases where the structural annotation
comes from the interaction between intrinsic gene prediction
and extrinsic database searches. This often ends up with correct labels given to genes listed with the wrong structure, or
to wrong labels being given to genes with the correct structure, or both, as we observed ourselves for Arabidopsis.
As an alternative, or a help to the time-consuming manual
human expertise, active research towards automatic or computer-assisted procedures to fetch knowledge in literature
is ongoing [49–51], but is only practical at present if the literature source is a normalised text such as database entries.
For proper annotation, nomenclature is a major concern
which has been underevaluated. Fortunately for plants, the
International Society for Plant Molecular Biology has taken
an early initiative to work towards a kingdom-wide gene
nomenclature available in the MENDEL database [52•].

Generating genome-wide annotations
The conversion of raw genome data into knowledge, here
specifically sequence annotation, involves a series of interdependent tasks. Manual annotation is no longer a valid
solution and several systems have been designed to cope
with this flow of incoming genome sequences. Automatic
annotation such as that done by GeneQuiz [53] has proven
to be error prone in prokaryotes [4•]. Genotator [54] and
GAIA [55] are interactive graphical environments, allowing

Genome annotation: which tools do we have for it? Rouzé, Pavy and Rombauts

93

Table 2
Web sites for annotation programs.
Program

URL

PROCRUSTES

http://www-hto.usc.edu/software/procrustes/index.html

EbEST

http://ares.ifrc.mcw.edu/EBEST/ebest.html

AAT (analysis and annotation tool)

http://genome.cs.mtu.edu/aat.html

ALIGN (pairwise sequence alignment)

http://genome.cs.mtu.edu/align/align.html

MAP (multiple sequence alignment)

http://genome.cs.mtu.edu/map/map.html

GLIMMER

http://www.cs.jhu.edu/labs/compbio/glimmer.html

GRAIL

http://compbio.ornl.gov/Grail-1.3/intro.html

SPLICEPREDICTOR

http://gremlin1.zoo1.iastate.edu/~volker/SplicePredictor.html

NETPLANTGENE

http://www.cbs.dtu.dk/services/NetPGene/

NETGENE2

http://www.cbs.dtu.dk/services/NetGene2/

NETSTART

http://www.cbs.dtu.dk/services/NetStart/

MZEF

http://siclio.cshl.org/genefinder/

FGENEA, FEXA and ASPL

http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html

GENSCAN

http://genomic.stanford.edu/GENSCANW.html

GeneMark.hmm

http://genemark.biology.gatech.edu/GeneMark.hmmchoice.html

GeneMark

http://genemark.biology.gatech.edu/GeneMark/webgenemark.html

tRNAscan-SE

http://www.genetics.wustl.edu/eddy/ tRNAscan-se/

AtRepBase

http://nucleus.chsl.org/protarab/AtRepbase.htm

human expertise, that have recently been designed to face
the needs of eukaryote genome annotations. In a further
category, Imagene [56•] which has recently been used for
B. subtilis annotation, has the additional capacity to chain
the various component tasks needed for annotation, in scenarios piloted by their individual results. For Arabidopsis a
simpler management tool called ANNOTATOR is available [26] and is used by several AGI consortiums. As stated
by Bork and Koonin [27], ‘there is’ obviously ‘a lack of a
widely accepted, robust and continuously updated suite of
sequence analysis methods integrated into a coherent and
efficient prediction system’.
Proper annotation of genome data is crucial for genomics
[48••]. The huge potential of genome-wide approaches
resides indeed in the capacity to integrate the various
kinds of data into a higher level of biological knowledge for
one species and to allow comparison between genomes.
This implies that some work on semantics and ontology is
necessary to allow a full interconnectivity between databases [9•]. Gelbart [57] gives a striking example of such a
need when debating the definition of a gene, which has a
conceptual definition for geneticists but is ascribed to a
sequence entity in databases. This debate has practical
implications in the way the data should be represented in
the different databases: insertion mutagenesis, expression,
proteome and sequence databases that will all be components of an Arabidopsis knowledge base. Special care

should be paid to this issue in plants where relatively few
cases of alternative expression of genes are experimentally
documented compared to vertebrates, but where the real
extent of their occurrence is unknown.

Open problems and future trends
Although many advanced annotation tools exist, computational annotation of genome sequences is still in its
infancy. Gene finding software must still be significantly
improved, especially for plant genomes which are not the
main spot for developments. As an example, exon prediction programs have been improved taking into account
the clustering of Arabidopsis coding sequences into two
classes [58•]. The existing tools are intended to find the
mainstream highly expressed protein-encoding genes and
have neglected many other genes of biological significance, such as low expressed genes, RNA-encoding
genes (besides tRNAs), rare introns [59,60•], or objects
with less available information, such as promoters, transcriptional and translational control regions or MARs.
There is ample room for the improvement of gene modeling and gene function prediction using improved
identification and classification of promoters and transcriptional control regions. Annotation will evolve with
time to cope with and feed new experiments; it is of
major importance to track the way annotation has developed, and to clearly identify what has an experimental
basis and what is derived information.

94

Genome studies and molecular genetics

The status of Arabidopsis genome annotation will change dramatically with the availability of expression and insertion
data [7•], which will secure the structural and functional
annotation of a large fraction of the genes, and serve to
improve the computational methods for the rest awaiting
experimental support. Similarly, sequencing the rice genome
will also change deeply our understanding and annotation of
the genes which are common to both plants. The challenge is
in our capacity to integrate these many data, as well as to use
the dispersed expertise of the whole research community.

Note added in proof
The work referred to in the text as N Terryn et al. has now
been accepted for publication [62].

Acknowledgements
We gratefully acknowledge Roderic Guigo, Juergen Kleffe, Klaus Mayer
and Larry Parnell for communicating unpublished information, Patrice
Déhais and Luc Van Wiemeersch for continuous expert support with
hardware and software and Martine de Cock for editorial assistance. Pierre
Rouzé is a research director and Natalie Pavy a contract scientist of the
Institut National de Recherche Agronomique (France).

References and recommended reading
Papers of particular interest, published within the annual period of review,
have been highlighted as:

• of special interest
•• of outstanding interest
1.

Bevan M, Bancroft I, Bent E, Love K, Goodman H, Dean C, Bergkamp
R, Dirkse W, Van Staveren M, Stiekema W et al.: Analysis of 1.9 Mb
of contiguous sequence from chromosome 4 of Arabidopsis
thaliana. Nature 1998, 391:485-488.

2.
••

Meinke DZ, Cherry JM, Dean C, Rounsley SD, Koornneef M: Arabidopsis
thaliana: a model plant for genome analysis. Science 1998,
282:662-682.
This is a very useful review and progress report on the present status of
genomics research using Arabidopsis thaliana as a model plant.
3.

Wheelan SJ, Boguski MS: Late-night thoughts on the sequence
annotation problem. Genome Res 1998, 8:168-169.

Galperin MY, Koonin EV: Sources of systematic error in functional
annotation of genomes: domain rearrangement, non-orthologous
gene displacement, and operon disruption. In Silico Biol 1998,
1:0007.
By looking at bacterial genome annotations obtained through similarity
search, the authors identify several causes of inadequate annotation which
lead to ‘database explosion’: non-critical use of databases and functional
inference, low complexity regions, ignoring multi-domain organisation
(http://www.bioinfo.de/isb/1998/01/0007/). A useful paper, together with
[27], for those who would like to avoid similar mistakes.
4.


5.

6.

Singh GB, Kramer JA, Krawetz SA: Mathematical model to predict
regions of chromatin attachment to the nuclear matrix. Nucleic
Acids Res 1997, 25:1419-1425.
Preston GM, Haubold B, Rainey PB: Bacterial genomics and
adaptation to life on plants: implications for the evolution of
pathogenicity and symbiosis. Curr Opin Microbiol 1998,
1:589-597.

7.
Bouchez D, Höfte H: Functional genomics in plants. Plant Physiol

1998, 118:725-732.
The first comprehensive review giving a broad spectrum of the ‘toolbox for
the global study of gene function in plants’ and critically presenting the different methods presently available.
8.

Baker PG, Brass A: Recent developments in biological sequence
databases. Curr Opin Biotech 1998, 9:54-58

Frishman D, Heumann K, Lesk A, Mewes HW: Comprehensive,
comprehensible, distributed and intelligent databases: current
status. Bioinformatics 1998, 14:551-561.
A review focusing on the challenges, and the conceptual and technical
aspects of biological databases.
9.


10. Bairoch A, Apweiler R: The Swiss-Prot protein sequence data bank

and its supplement TrEMBL in 1999. Nucleic Acids Res 1999, 27:49-54.
A short paper presenting the reality of an expert-curated database and its
computer-generated complement, as a relational node between various
databases. The entire issue of the journal is worth consulting to discover the
wide range of existing biological databases.
11. Fickett JW: The gene identification problem: an overview for
developers. Comput Chem 1996, 20:103-118.
12. JM Claverie: Computational methods for the identification of genes in

vertebrate genomic sequences. Human Mol Genet 1997, 6:1735-1744.
A critical and comparative review on the software available for gene prediction, with a historical perspective over the last 15 years and an analytical presentation of the methods behind the software. The author concludes by
pointing to conservatism as the common limitation of current methods.
13. Guigo R: Computational gene identification: an open problem.

Comput Chem 1997, 21:215-222.
The author presents the current limitations of gene finding programs and
their low efficiency for gene modeling.
14. Krogh A: Gene finding: putting the parts together. In Guide to
Human Genome Computing, edn 2. Edited by Martin Bishop. Oxford:
Academic Press; 1998:261-274.
15. Burge CB, Karlin S: Finding the genes in genomic DNA. Curr Opin
•• Struct Biol 1998, 8:346-354.
This review presents the rationale of gene finding, and is useful to understand
the methodologies (especially GENSCAN) and analyse the current limitations.
16. Trends guide to Bioinformatics Trends Supplement
•• 1998.
This very useful supplement to the Trends series of journals contains a few
articles that are either fundamental or pragmatic and easy to read for their
target audience of biologists. They include features on the most important
every-day issues in using bioinformatics and deal with computational issues
arising from genome annotation.
17.

Computational gene recognition on World Wide Web URL:
http://linkage.rockefeller.edu/wli/gene/

18. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman
DJ: Gapped BLAST and PSI-BLAST: a new generation of protein
database search programs. Nucleic Acids Res 1997, 25:3389-3402.
19. Zhang Z, Schäffer AA, Miller W, Madden TL, Lipman DJ, Koonin EV,
Altschul SF: Protein sequence similarity searches using patterns
as seeds. Nucleic Acids Res 1998, 26:3986-3990.
20. Zhang J, Madden TL: PowerBLAST: a new network BLAST
application for interactive or automated sequence analysis and
annotation. Genome Res 1997, 7:649-656.
21. Sze SH, Pevzner PA: Las Vegas algorithms for gene recognition:

suboptimal and error-tolerant spliced alignment. J Comput Biol
1997, 4:297-309.
This is a paper presenting a more efficient algorithm for the spliced alignment method used in PROCRUSTES [22], the principle of which is to
search using the exon-intron boundaries that maximize the score of alignment of a cDNA with the genomic sequence.
22. Mironov AA, Roytberg MA, Pevzner PA, Gelfand MS: Performance
guarantee gene predictions via spliced alignment. Genomics
1998, 51:332-339.
23. Mott R: EST_GENOME: a program to align spliced DNA sequences
to unspliced genomic DNA. Comput Appl Biosci 1997, 13:477-478.
24. Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W: A computer

program for aligning a cDNA sequence with a genomic DNA
sequence. Genome Res 1998, 8:967-974.
This paper presents a new program which can be used to find splice sites by
homology searching, and compares its performance to previous similar tools.
25. Jiang J, Jacob HJ: EbEST: an automated tool using expressed
sequence tags to delineate gene structure. Genome Res 1998,
8:268-275.
26. Huang X, Adams MD, Zhou H, Kerlavage AR: A tool for analyzing
and annotating genomic sequences. Genomics 1997, 46:37-45.
27.

Bork P, Koonin EV: Predicting functions from protein sequences:
where are the bottlenecks? Nat Genet 1998, 18:313-318.

28. Borodovsky M, McIninch J: GENMARK: parallel gene recognition
for both DNA strands. Comput Chem 1993, 17:123-133.
29. Salzberg SL, Delcher AL, Kasif S, White O: Microbial gene
identification using interpolated Markov models. Nucleic Acids
Res 1998, 26:544-548.

Genome annotation: which tools do we have for it? Rouzé, Pavy and Rombauts

30. Karlin S: Global dinucleotide signatures and analysis of genomic

heterogeneity. Curr Opin Microbiol 1998, 1:598-610
This paper reviews the concept of ‘genomic signature’ introduced by the
author to refer to the remarkable stability of dinucleotide abundance in an
organism and its conservation in closely related taxa, allowing discrimination
between organisms and identification of horizontal transfer.
31. Burset M, Guigo R: Evaluation of gene structure prediction
programs. Genomics 1996, 34:353-367.
32. Brendel V, Kleffe J: Prediction of locally optimal splice sites in plant
pre-mRNA with applications to gene identification in Arabidopsis
thaliana genomic DNA. Nucleic Acids Res 1998, 26:4748-4757.
33. Hebsgaard SM, Korning PG, Tolstup N, Engelbrecht J, Rouzé P,
Brunak S: Splice site prediction in Arabidopsis thaliana pre-mRNA
by combining local and global sequence information. Nucleic
Acids Res 1996, 24:3439-3452.
34. Tolstrup N, Rouzé P, Brunak S: A branch point consensus from

Arabidopsis found by non-circular analysis allows for better
prediction of acceptor sites. Nucleic Acids Res 1997, 25:3159-3163.
In this paper, the authors show first that it is possible to localize the branch-point
in documented Arabidopsis intron sequences using a Hidden Markov Model of
the U2 snRNA site, and then that predicting intron branch-points in contigs in
addition to splice sites [33] increases the prediction performance for acceptor
sites. NetGene2 uses this method and in addition predicts the position of the
splice sites in the codon, which could be used later on in gene modeling.
35. Pedersen AG, Nielsen H: Neural network prediction of translation
initiation sites in eukaryotes: perspectives for EST and genome
analysis. In Proceedings of the Fifth International Conference on
Intelligent Systems for Molecular Biology. Menlo Park: AAAI Press;
1997:226-233.
36. Xu Y, Uberbacher EC: Automated gene identification in large-scale
genomic sequences. J Comput Biol 1997, 4:325-338.
Zhang MQ: Identification of protein coding regions in Arabidopsis
thaliana genome based on quadratic discriminant analysis. Plant
Mol Biol 1998, 37:803-806.
This paper is an application of a previous one dedicated to the human
genome, describing MZEF for Arabidopsis. The method is using quadratic
discriminant analysis, a generalisation of linear discriminant analysis, as used
by Solovyev and Salamov [38].

37.


95

46. Lowe TM, Eddy SR: tRNAscan-SE: a program for improved
detection of transfer RNA genes in genomic sequence. Nucleic
Acids Res 1997, 25:955-964.
47.

Annotation of transposons and genomic repeats on World Wide Web
URL: http://nucleus.cshl.org/protarab/TnAnnotation.htm

48. Bork P, Dandekar T, Diaz-Lazcoz Y, Eisenhaber F, Huynen M, Yuan Y:
•• Predicting function: from genes to genomes and back. J Mol Biol
1998, 283:707-725.
This paper raises the issue of genome-wide annotation, setting to bridge the
gap between phenotype and genotype as a goal. An interesting point made
in this review is on using the added value provided by completely sequenced
genomes for gene function prediction.
49. Ohta Y, Yamamoto Y, Okazaki T, Uchiyama I, Takagi T: Automatic
constructing of knowledge base from biological papers. In
Proceedings of the Fifth International Conference on Intelligent Systems
for Molecular Biology. Menlo Park: AAAI Press; 1997:218-225.
50. Andrade M, Valencia A: Automatic extraction of keywords from
scientific text: application to the knowledge domain of protein
families. Bioinformatics 1998, 14:600-607.
51. Proux D, Rechenmann F, Julliard L, Pillet V, Jacq B: Detecting gene
symbols and names in biological texts: a first step toward
pertinent information extraction. In Proceedings from the Ninth
Workshop on Genome Informatics. Edited by Miyano S, Takagi T.
Tokyo: Universal Academy Press Inc.; 1998:72-80.
52. Lonsdale D: Nomenclature regulation. Nature 1998,

391:118.
The International Society for Plant Molecular Biology initiative to develop a
kingdom-wide nomenclature can be found on the World Wide Web URL:
MENDEL: http://www.mendel.ac.uk/ or http://genome-www.stanford.edu/
Mendel/ see also ATDB: http://genome-www.stanford.edu/Arabidopsis/
nomencl.html for additional information on nomenclature.
53. Sharf M, Schneider R, Casar G, Bork P, Valencia A, Ouzounis C,
Sander C: GeneQuiz: a workbench for sequence analysis. In
Proceedings of the Second International Conference on Intelligent
Systems for Molecular Biology. Menlo Park: AAAI Press;
1994:348-353.
54. Harris NL: Genotator: a workbench for sequence annotation.
Genome Res 1997, 7:754-762.

38. Solovyev V, Salamov A: The gene-finder computer tools for analysis
of human and model organisms genome sequences. In Proceedings
of the Fifth International Conference on Intelligent Systems for
Molecular Biology. Menlo Park: AAAI Press; 1997:294-302.

55. Bailey LC, Fisher S, Schug J, Crabtree J, Gibson M, Overton GC:
GAIA: framework annotation of genomic sequence. Genome Res
1998, 8:234-250.

39. Burge C, Karlin S: Prediction of complete gene structures in

human genomic DNA. J Mol Biol 1997, 268:78-94.
This paper describes GENSCAN, the most efficient gene prediction software to date for vertebrates. GENSCAN makes use of many signals and
compositional statistical properties, with a dependency rule for splice sites.
It allows the search for multiple genes on both strands.

56. Médigue C, Rechenmann F, Danchin A, Viari A: Imagene: an

integrated computer environment for sequence annotation and
analysis. Bioinformatics 1999, 15:in press.
A new kind of integrated annotation environment is presented which is a task
management platform incorporating a data management system and a
graphical tool. A prototype of Imagene was used for B. subtilis genome
sequence annotation.

40. Lukashin AV, Borodovsky M: GeneMark.hmm: new solutions for
gene finding. Nucleic Acids Res 1998, 26:1107-1115.
41. Kleffe J, Hermann K, Vahrson W, Wittig B, Brendel V:

GeneGenerator — a flexible algorithm for gene prediction and its
application to maize sequences. Bioinformatics 1998, 14:232-243.
The first gene prediction tool for a monocot, using a low information conceptual approach.
42. The C. elegans Sequencing Consortium: Genome sequence of the

nematode C. elegans: a platform for investigating biology. Science
1998, 282:2012-2018.
The first report on the full sequence of a higher eukaryote, with a genome
size comparable to Arabidopsis. The accompanying papers from the same
issue are worth reading too, as an illustration of what genomics is offering to
biology now.
43. Milanesi L, Muselli M, Arrigo P: Hamming-clustering method for
signals prediction in 5¢ and 3¢ regions of eukaryotic genes.
Comput Appl Biosci 1996, 12:399-404.
44. Fickett JW, Hatzigeorgiou AG: Eukaryotic promoter recognition.

Genome Res 1997, 7:861-878.
A documented review on computational promoter recognition, with an excellent first part on biology and a second on a fair comparative analysis of the
available tools. The conclusion shows that there is much to be done yet.
45. Duret L, Bucher P: Searching for regulatory elements in human
•• noncoding sequences. Curr Opin Struct Biol 1997, 7:399-406.
Another, more conceptual, view on the same topic which presents the
power of a comparative phylogenetic approach to identify new unknown
regulatory elements.

57.

Gelbart W: Databases in genomic research. Science 1998,
282:659-661.

58. Mathé C, Peresetsky A, Déhais P, Van Montagu M, Rouzé P:

Classification of Arabidopsis thaliana gene sequences: clustering
of coding sequences into two groups according to codon usage
improves gene prediction. J Mol Biol 1999, 285:1977-1991.
This paper reports the clustering of Arabidopsis coding sequences into two
classes according to their codon usage, the main difference lying in the use
of T and C at the third codon position. The classes correlate with gene
expression. Using GeneMark, it was shown that using two gene models
instead of one improves gene prediction.
59. Wu HJ, Gaubier-Comella P, Delseny M, Grellet F, Van Montagu M,
Rouzé P: AU-AA splicing in Arabidopsis: non-canonical introns are
at least 109 years old. Nat Genet 1996, 14:383-384.
60. Sharpe PA, Burge CB: Classification of introns: U2-type or U12

type. Cell 1997, 91:875-879.
A synthetic presentation of the recent finding of a second class of spliceosomal introns found in animal and plants [59]. Their distinctive splicing and
signature are described together with the donor and the branch sites.
61. Fichant GA, Burks C: Identifying potential tRNA genes in genomic
DNA sequences. J Mol Biol 1991, 220:659-671.
62. Terryn N, Heijnen L, De Keyser A, Van Asseldonck M, DeClercq R:
Evidence for an ancient chromosomal duplication in Arabidopsis
thaliana by sequencing and analysing a 400 kb contig at the
APETALA 2 locus on chromosome 4. FEBS Lett 1999,
445:237-245.