Once a genome has been sequenced, the first question is "Where are the genes?". Genome annotation is the process of attaching biological information to sequences. It is an active area of research, and knowing which parts of a genome are coding helps scientists plan and carry out their wet-lab projects.
2. Definition:
It is the process of taking the raw DNA sequence
produced by the genome-sequencing projects and
adding the layers of analysis and interpretation
necessary to extract its biological significance and
place it into the context of our understanding of
biological processes.
3. • Today, the public international sequence databases contain
more than nine billion nucleotides and the flow of new
sequences is increasing dramatically. For scientists, the
challenge is to exploit this huge volume of sequence data.
• To extract biological knowledge from anonymous genomic
sequences is the main objective of genome annotation.
• The extensive use of computer tools is needed to minimize
the slow and costly human interventions. This is the reason
why annotation is often synonymous with prediction.
• The annotation work is divided into two steps: structural
annotation, which consists mainly of localizing gene elements;
and functional annotation, which aims at assigning a
biochemical function to the deduced gene products.
4. Structural annotation
The prediction of gene elements is a complex problem, and
getting it right is crucial because every downstream analysis
depends on it.
• Eukaryotic genes with their mosaic structure are more difficult
to find than prokaryotic ones which are simple open reading
frames. The presence of introns complicates the problem,
although the binding sites of the spliceosome may be used to
predict the exact position of the exon borders.
• Depending on the prediction tool, the output may be splice
sites, exons or whole gene models (gene modelling
software).
5. Gene prediction in Prokaryotes
• Prokaryotes have relatively small genomes with sizes ranging
from 0.5 to 10 Mbp.
• But gene density in the genomes is high with more than 90%
of a genome sequence containing coding sequence.
• In bacteria, the majority of genes start with ATG, which codes
for methionine. Occasionally, GTG and TTG are used as
alternative start codons. These codons alone do not necessarily
give a clear indication of the translation initiation site. The
ambiguity is resolved by the presence of the Shine-Dalgarno
sequence, a purine-rich stretch just upstream of the start codon
that is complementary to the 3' end of the 16S rRNA in the
ribosome.
• Many genes are transcribed together as one operon.
The end of the operon is marked by a transcription
termination signal called a rho-independent terminator.
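The start-codon and Shine-Dalgarno points above can be illustrated with a minimal sketch. It scans the three forward reading frames for stretches that begin with ATG, GTG or TTG and end at a stop codon, then checks for a Shine-Dalgarno-like motif upstream. The function names, the `AGGAGG` motif string, the 20 bp window and the `min_codons` cutoff are assumptions made for illustration, not part of any published gene finder.

```python
START_CODONS = {"ATG", "GTG", "TTG"}   # start codons named in the text
STOP_CODONS = {"TAA", "TAG", "TGA"}    # standard stop codons

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for forward-strand ORFs.

    `end` is exclusive and includes the stop codon. `min_codons` is a
    hypothetical length filter counted with the stop codon included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i                      # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF in this frame
    return orfs

def has_sd_motif(seq, start, motif="AGGAGG", window=20):
    """Crude check: is the assumed motif present within `window` bp upstream?"""
    upstream = seq.upper()[max(0, start - window):start]
    return motif in upstream

if __name__ == "__main__":
    dna = "TTAGGAGGTTTTTATGAAATAA"
    for start, end, frame in find_orfs(dna):
        print(start, end, frame, has_sd_motif(dna, start))
```

A real prokaryotic gene finder would also scan the reverse strand and score the spacing between the motif and the start codon; this sketch only shows the basic idea.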
6. Conventional determination of ORFs
• One method is based on the nucleotide composition of the
third position of codons. In coding regions, this position has
been observed to prefer G or C over A or T.
• By plotting the GC composition at this position, regions with
values significantly above the random level can be identified,
which are indicative of the presence of ORFs.
• There is a similar method called TESTCODE that exploits the
fact that the third codon nucleotides in a coding region tend
to repeat themselves.
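The GC-at-third-position idea can be sketched as a sliding-window profile. The function below is an illustrative assumption (the name `gc3_profile`, the window size and the step are not from the source); it returns the fraction of G or C at the third codon position for each window in one reading frame. Windows with values well above the background level would be candidate ORF regions.

```python
def gc3_profile(seq, frame=0, window_codons=30, step_codons=10):
    """Return [(nucleotide_offset, gc3_fraction), ...] for one reading frame.

    Window and step sizes are hypothetical defaults chosen for illustration.
    """
    seq = seq.upper()
    # Collect the third base of every complete codon in this frame.
    third_bases = [seq[i + 2] for i in range(frame, len(seq) - 2, 3)]
    profile = []
    for w in range(0, len(third_bases) - window_codons + 1, step_codons):
        chunk = third_bases[w:w + window_codons]
        gc3 = sum(base in "GC" for base in chunk) / window_codons
        profile.append((frame + 3 * w, gc3))
    return profile

if __name__ == "__main__":
    # Toy sequence: ten codons ending in G followed by ten codons ending in A.
    toy = "AAG" * 10 + "AAA" * 10
    print(gc3_profile(toy, window_codons=10, step_codons=10))
```

Running the profile for all three frames (and on the reverse complement) and comparing the curves is one simple way to see which frame, if any, stands out.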
7. Performance evaluation
• Accuracy can be described by two parameters: sensitivity
and specificity. Both are defined in terms of four outcomes:
true positive (TP), false positive (FP), false negative (FN)
and true negative (TN).
• TP: correctly predicted feature
• FP: incorrectly predicted feature
• FN: missed feature
• TN: correctly predicted absence of a feature
• Sensitivity is the proportion of true signals that are predicted
among all true signals.
• Specificity is the proportion of true signals among all signals
that are predicted. (Defined this way, it corresponds to what
other fields call precision.)
Sn = TP/(TP+FN)
Sp = TP/(TP+FP)
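As a worked example of the two formulas, the sketch below evaluates a prediction at the nucleotide level. Representing the annotation and the prediction as sets of coding positions is an assumption made for illustration, as is the function name `evaluate`.

```python
def evaluate(predicted, actual):
    """Return (Sn, Sp) for two sets of coding nucleotide positions.

    Sn = TP/(TP+FN) and Sp = TP/(TP+FP), as defined in the text.
    """
    tp = len(predicted & actual)    # coding positions correctly predicted
    fp = len(predicted - actual)    # predicted coding, actually non-coding
    fn = len(actual - predicted)    # coding positions that were missed
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tp / (tp + fp) if (tp + fp) else 0.0
    return sn, sp

if __name__ == "__main__":
    actual = set(range(100, 200))      # true exon covers positions 100..199
    predicted = set(range(150, 250))   # predicted exon covers 150..249
    print(evaluate(predicted, actual))
```

In this toy case half of the true exon is found and half of the prediction is correct, so both values come out at 0.5.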
9. Gene prediction in eukaryotes
• Eukaryotic nuclear genomes are much larger than prokaryotic
ones, with size ranging from 10 Mbp to 670 Gbp.
• They tend to have a very low gene density. For example, in
humans only 3% of the genome is coding sequence, with about 1
gene per 100 kbp on average.
• The nascent mRNA undergoes post-transcriptional
modification (capping, splicing and polyadenylation) before
becoming a mature mRNA for protein translation.
• The main issue in prediction of eukaryotic genes is the
identification of exons, introns and splicing sites.
10. Prediction can be made on the basis of:
• Presence of conserved sequences - Splice junctions of introns and
exons follow the GT-AG rule: an intron starts with GT and ends
with AG.
• Statistical patterns - Nucleotide composition and codon bias in the
coding regions of eukaryotes differ from those of the non-coding
regions.
• Translation start signal - Most vertebrate genes use ATG as the
translation start codon and have a conserved flanking sequence
called the Kozak sequence (CCGCCATGG).
• Presence of CpG islands - Most vertebrate genes have a high density
of CG dinucleotides near the transcription start site. Here ‘p’ refers
to the phosphodiester bond between the two nucleotides.
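Two of the signals listed above are easy to sketch in code. The first function enumerates candidate introns that satisfy the GT-AG rule; the second computes the observed/expected CpG ratio, (number of CG × length) / (number of C × number of G), which is a common way to quantify CpG enrichment. The function names and the `min_len` cutoff are illustrative assumptions, and real gene finders score the sequence context around each site rather than accepting every GT...AG pair.

```python
def candidate_introns(seq, min_len=6):
    """Return (start, end) pairs where seq[start:end] begins with GT and ends with AG.

    `min_len` is a hypothetical minimum intron length; `end` is exclusive.
    """
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptor_ends = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return [(d, e) for d in donors for e in acceptor_ends if e - d >= min_len]

def cpg_obs_exp(seq):
    """Observed/expected CpG ratio; returns 0.0 if the sequence has no C or no G."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)

if __name__ == "__main__":
    toy = "AAGGTAAGTTTTCAGGA"
    for start, end in candidate_introns(toy):
        print(start, end, toy[start:end])
    print(cpg_obs_exp("CGCGCGCG"))
```

The number of GT...AG pairs grows quickly with sequence length, which is one reason why signal detection alone is not enough and is combined with content statistics.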
11. Gene prediction programs
Ab initio-based programs:
• These discriminate exons from non-coding sequences and
subsequently join them together in the correct order.
• They rely on two features: gene signals and gene content.
• In addition to HMMs, discriminant analysis and neural
network-based algorithms are also used in gene prediction.
12. • Neural networks:
A neural network is a statistical model with a special
architecture for pattern recognition and classification. It is
built from multiple layers: an input layer, one or more hidden
layers and an output layer. The output is the probability of
the exon structure. GRAIL is a program based on a neural
network algorithm.
Fig: Architecture of a neural network for eukaryotic gene
prediction
13. • Prediction using HMMs:
- GENSCAN is a web-based program that uses fifth-order HMMs.
- HMMgene is also a web-based program. It uses a criterion called
the conditional maximum likelihood to discriminate coding
from non-coding features.
• Prediction using Discriminant Analysis:
- Some gene prediction algorithms rely on discriminant
analysis, either linear discriminant analysis (LDA) or
quadratic discriminant analysis (QDA).
- LDA works by plotting a 2-D graph of coding signals versus all
potential 3’ splice site positions and drawing a diagonal line
that best separates coding signals from non-coding signals,
based on knowledge learned from training data sets of known
gene structures.
14. - QDA draws a curved line based on a quadratic function to
separate coding and non-coding features.
Programs used are: FGENES (LDA) and its variants FGENESH,
FGENESH_C and FGENESH+, and MZEF (QDA)
Homology-based programs:
- These are based on the fact that exon structures and exon
sequences of related species are highly conserved.
15. - When coding frames in a query sequence are translated and
aligned with the closest protein homologs found in a
database, nearly perfectly matched regions can be used to
reveal the exon boundaries in the query.
- Programs used are:
GenomeScan, EST2Genome, SGP-1, TwinScan
• Consensus-based programs:
These programs combine the results of multiple prediction
programs and keep the predictions on which they agree.
Requiring agreement removes false positives, but it may also
discard true predictions made by only one program, which
leads to lowered sensitivity and missed predictions.
Examples of consensus-based programs are GeneComber and DIGIT.
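The trade-off described above can be seen in a minimal voting sketch. Each program's output is represented as a set of predicted coding positions, and a position is kept only if at least `min_votes` programs agree. This per-position voting scheme and the function name are assumptions made for illustration; it is not the algorithm used by GeneComber or DIGIT.

```python
from collections import Counter

def consensus_positions(predictions, min_votes):
    """Keep positions predicted as coding by at least `min_votes` programs."""
    votes = Counter()
    for predicted in predictions:
        votes.update(set(predicted))   # each program votes once per position
    return {pos for pos, n in votes.items() if n >= min_votes}

if __name__ == "__main__":
    program_a = set(range(0, 10))
    program_b = set(range(5, 15))
    program_c = set(range(8, 20))
    all_three = [program_a, program_b, program_c]
    print(sorted(consensus_positions(all_three, min_votes=2)))
    print(sorted(consensus_positions(all_three, min_votes=3)))
```

Raising `min_votes` shrinks the consensus: positions supported by a single program disappear first, which is how a correct but unsupported prediction can be lost.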
16. Functional annotation
• At the present time, the functional genome annotation is
based on the idea that some sequence similarities detected
between two proteins mean that they are homologs i.e. they
come from the same ancestor and share the same
biochemical function.
• Therefore, for each predicted gene, the protein is deduced
from the coding region and is compared through BlastP with
the protein databases.
• If the similarities detected are considered relevant, the name
(function) of the putative homologue protein is associated
with the prediction.
17. • In practice, naming tends to follow these rules:
when a predicted gene product is 100 % identical to an
already characterized protein, it receives the same name,
whereas sequences with stringent similarity to known
proteins are called ‘putative’ proteins of the same name.
• Sequences for which only similarities to ESTs are detected
are named ‘unknown’ proteins.
• Finally, genes without similar sequences and, hence, only
deduced from intrinsic prediction programs are labelled
‘hypothetical’.
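The naming rules above amount to a small decision procedure, sketched below. The input format (an optional `(name, percent_identity)` tuple for the best protein hit plus a flag for EST support) and the `min_identity` threshold standing in for "stringent similarity" are hypothetical; the source does not give a numeric cutoff.

```python
def label_gene_product(protein_hit=None, has_est_hit=False, min_identity=40.0):
    """Assign a product name following the rules described in the text.

    protein_hit: optional (name, percent_identity) for the best database match.
    min_identity: assumed cutoff for 'stringent similarity' (illustrative only).
    """
    if protein_hit is not None:
        name, identity = protein_hit
        if identity >= 100.0:
            return name                      # identical to a characterized protein
        if identity >= min_identity:
            return f"putative {name}"        # similar, but not identical
    if has_est_hit:
        return "unknown protein"             # only EST evidence
    return "hypothetical protein"            # predicted by intrinsic methods only

if __name__ == "__main__":
    print(label_gene_product(("alcohol dehydrogenase", 100.0)))
    print(label_gene_product(("alcohol dehydrogenase", 72.5)))
    print(label_gene_product(has_est_hit=True))
    print(label_gene_product())
```

In practice the decision would usually rest on alignment coverage and E-value as well as identity, which matches the text's point that some annotators confirm BLAST hits with full-length alignments.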
• Some annotators confirm and complete the Blast results by
full-length alignments between the query protein and the
closest homologue detected, and by looking for motifs and
family signatures.
18. Automatic genome annotation pipelines
• The primary goal of the pipeline process is to deliver highly
accurate and reliable genome annotations, using the widest
possible range of evidence from available databases.
• As pipelines have evolved, the trend has been to move away
from single algorithm methods and towards consensus-based
approaches.
• Pipelines integrate suites of bioinformatics software tools
with multiple databases to automatically manage the
analysis and storage of genomic sequence.
• Genomic sequences pass through several successive levels of
algorithms. Each layer of processing provides further
refinement of annotation detail.
19. Fig: The generic structure of an automatic genome annotation pipeline
and delivery system.
20. Genomic pipelines:
Several genomic pipelines exist worldwide. Publicly funded
projects include
• Ensembl at the European Bioinformatics Institute (EBI)/Sanger
Institute,
• NCBI Analysis Pipeline,
• Oak Ridge National Laboratory (ORNL) Genome Channel.