Genome Annotation
Definition:
It is the process of taking the raw DNA sequence
produced by the genome-sequencing projects and
adding the layers of analysis and interpretation
necessary to extract its biological significance and
place it into the context of our understanding of
biological processes.
• Today, the public international sequence databases contain
more than nine billion nucleotides, and the flow of new
sequences is increasing dramatically. For scientists, the
challenge is to exploit this huge volume of sequence data.
• To extract biological knowledge from anonymous genomic
sequences is the main objective of genome annotation.
• The extensive use of computer tools is needed to minimize
the slow and costly human interventions. This is the reason
why annotation is often synonymous with prediction.
• The annotation work is divided into two steps: structural
annotation, which consists mainly of localizing gene elements;
and functional annotation, which aims at assigning a
biochemical function to the deduced gene products.
Structural annotation
The prediction of gene elements is a complex problem, and
its outcome is critical because every subsequent analysis
depends on it.
• Eukaryotic genes with their mosaic structure are more difficult
to find than prokaryotic ones which are simple open reading
frames. The presence of introns complicates the problem,
although the binding sites of the spliceosome may be used to
predict the exact position of the exon borders.
• Depending on the prediction tool, the output may be the
splice sites, the exons or the whole gene (gene modelling
software).
Gene prediction in Prokaryotes
• Prokaryotes have relatively small genomes with sizes ranging
from 0.5 to 10 Mbp.
• But gene density in the genomes is high with more than 90%
of a genome sequence containing coding sequence.
• In bacteria, the majority of genes start with ATG, which codes
for methionine. Occasionally, GTG and TTG are used as
alternative start codons. These codons alone do not necessarily
give a clear indication of the translation initiation site. This
ambiguity is resolved by the presence of the Shine-Dalgarno
sequence, a purine-rich stretch located upstream of the start
codon and complementary to the 3' end of the 16S rRNA in the
ribosome.
• Many genes are transcribed together as one operon.
The end of the operon is characterized by a transcription
termination signal called the rho-independent terminator.
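The start-codon rule above can be sketched as a simple open reading frame scan. This is only an illustration, not how any particular gene finder works: the function name `find_orfs`, the `min_codons` threshold, and the restriction to the forward strand are assumptions made here, and no Shine-Dalgarno check is attempted.

```python
# Minimal forward-strand ORF scan using the bacterial start codons named above.
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) tuples; end is exclusive and includes the stop codon."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first start codon seen in this frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Keep the ORF only if it is long enough (stop codon counted).
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


if __name__ == "__main__":
    for start, end, frame in find_orfs("CCATGAAATTTTAGGG"):
        print(f"ORF in frame {frame}: positions {start}-{end}")
```

A real prokaryotic gene finder would also scan the reverse complement and use a much larger minimum length; the small default here only keeps the example short.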
Conventional determination of ORFs
• One method is based on the nucleotide composition of the
third position of codons. It has been observed that this
position shows a preference for G or C over A or T.
• By plotting the GC composition at this position, regions with
values significantly above the random level can be identified,
which are indicative of the presence of ORFs.
• There is a similar method called TESTCODE that exploits the
fact that the third codon nucleotides in a coding region tend
to repeat themselves.
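The third-position GC method can be sketched as a sliding-window calculation. The function names `gc3_fraction` and `gc3_profile`, the window size, and the step are hypothetical choices for illustration; the slides do not specify them, and this is not the TESTCODE algorithm.

```python
def gc3_fraction(seq, frame=0):
    """Fraction of G or C at the third codon position, reading from `frame`."""
    third = seq.upper()[frame + 2::3]  # every third base, starting at codon position 3
    if not third:
        return 0.0
    return sum(base in "GC" for base in third) / len(third)


def gc3_profile(seq, window=30, step=3, frame=0):
    """Return (window_start, GC3 fraction) pairs along the sequence.

    `step` should be a multiple of 3 so that every window stays in frame.
    """
    profile = []
    for start in range(frame, len(seq) - window + 1, step):
        profile.append((start, gc3_fraction(seq[start:start + window])))
    return profile


if __name__ == "__main__":
    toy = "AAG" * 4 + "AAT" * 4  # GC-rich third positions, then AT-rich ones
    for start, value in gc3_profile(toy, window=12, step=12):
        print(start, value)
```

Windows whose value lies well above what the overall base composition would predict are the candidate coding regions; choosing that threshold is left to the user.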
Performance evaluation
• Accuracy can be described by two parameters: sensitivity and
specificity. Four outcome categories are used to define them:
true positive (TP), false positive (FP), false negative (FN)
and true negative (TN).
• TP: correctly predicted feature
• FP: incorrectly predicted feature
• FN: missed feature
• TN: correctly predicted absence of a feature
• Sensitivity is the proportion of true signals predicted among
all possible true signals.
• Specificity is the proportion of true signals among all signals
that are predicted.
Sn = TP/(TP+FN)
Sp = TP/(TP+FP)
• Correlation coefficient (CC):
• The value of CC provides an overall measure of accuracy and
ranges from -1 to +1.
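The three measures can be computed directly from the four counts. The formula for CC did not survive in the slide text, so the sketch below assumes the form commonly used in gene-prediction benchmarks (the Matthews correlation coefficient); the function name `prediction_accuracy` and the example counts are invented for illustration.

```python
import math


def prediction_accuracy(tp, fp, fn, tn):
    """Return (Sn, Sp, CC) for the given counts.

    Sn = TP/(TP+FN) and Sp = TP/(TP+FP), as defined on the slide.
    CC is ASSUMED to be the Matthews correlation coefficient:
    (TP*TN - FP*FN) / sqrt((TP+FP)(TN+FN)(TP+FN)(TN+FP)).
    """
    sn = tp / (tp + fn) if (tp + fn) else float("nan")
    sp = tp / (tp + fp) if (tp + fp) else float("nan")
    denom = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    cc = (tp * tn - fp * fn) / denom if denom else float("nan")
    return sn, sp, cc


if __name__ == "__main__":
    # Hypothetical counts, e.g. nucleotides of a test sequence.
    sn, sp, cc = prediction_accuracy(tp=80, fp=20, fn=20, tn=880)
    print(f"Sn={sn:.3f} Sp={sp:.3f} CC={cc:.3f}")
```

A CC of +1 corresponds to a perfect prediction, 0 to a prediction no better than random, and -1 to a prediction that is always wrong.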
Gene prediction in eukaryotes
• Eukaryotic nuclear genomes are much larger than prokaryotic
ones, with size ranging from 10 Mbp to 670 Gbp.
• They tend to have a very low gene density. For example in
humans only 3% of the genome codes for genes, with about 1
gene per 100 kbp on average.
• The nascent mRNA undergoes post-transcriptional
modification (capping, splicing and polyadenylation) before
becoming a mature mRNA for protein translation.
• The main issue in prediction of eukaryotic genes is the
identification of exons, introns and splicing sites.
Prediction can be made on the basis of:
• Presence of conserved sequences - Splice junctions of introns and
exons follow the GT-AG rule. Most vertebrate genes use ATG as the
translation start codon and have a uniquely conserved flanking
sequence called the Kozak sequence (CCGCCATGG).
• Statistical patterns - Nucleotide composition and codon bias in
the coding regions of eukaryotes differ from those of the
non-coding regions.
• Presence of CpG islands - Most of these genes have a high density
of CG dinucleotides near the transcription start site. Here ‘p’
refers to the phosphodiester bond between the two nucleotides.
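Two of the signals listed above are easy to check programmatically. The sketch below tests whether an intron obeys the GT-AG rule and computes the GC content and observed/expected CpG ratio of a region. The function names are hypothetical, and the commonly cited island thresholds mentioned in the comment (length of at least 200 bp, GC content above 50%, ratio above 0.6) come from general practice rather than from these slides.

```python
def follows_gt_ag(intron):
    """True if the intron starts with GT and ends with AG (GT-AG rule)."""
    intron = intron.upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def cpg_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for a sequence.

    Observed/expected = (number of CG * length) / (number of C * number of G).
    A commonly used, assumed criterion calls a region a CpG island when it is
    at least 200 bp long with GC content > 0.5 and ratio > 0.6.
    """
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cg = seq.count("CG")
    gc_content = (c + g) / n if n else 0.0
    obs_exp = (cg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp


if __name__ == "__main__":
    print(follows_gt_ag("GTAAGTCCCAG"))
    print(cpg_stats("CGCGCGCG"))
```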
Gene prediction programs
Ab initio-based programs:
• These programs discriminate exons from non-coding sequences
and subsequently join them together in the correct order.
• They rely on two types of features: gene signals and gene content.
• In addition to HMMs, discriminant analysis and neural
network-based algorithms are also used in gene prediction.
• Neural networks:
A neural network is a statistical model with a special
architecture for pattern recognition and classification. It is
built from multiple layers: input, hidden and output layers.
The output is the probability of the exon structure. GRAIL is a
program based on a neural network algorithm.
Fig: Architecture of a neural network for eukaryotic gene
prediction
• Prediction using HMMs:
- GENSCAN is a web-based program that uses fifth-order HMMs.
- HMMgene is also a web-based program. It uses a criterion called
the conditional maximum likelihood to discriminate coding
from non-coding features.
• Prediction using Discriminant Analysis:
- Some gene prediction algorithms rely on discriminant
analysis, either linear discriminant analysis (LDA) or
quadratic discriminant analysis (QDA).
- LDA works by plotting a 2-D graph of coding signals versus all
potential 3’ splice site positions and drawing a diagonal line
that best separates coding signals from non-coding signals,
based on knowledge learned from training data sets of known
gene structures.
- QDA draws a curved line based on a quadratic function to
separate coding and non-coding features.
Programs used are: FGENES, FGENESH, FGENESH_C,
FGENESH+, MZEF
Homology-based programs:
- These are based on the fact that exon structures and exon
sequences of related species are highly conserved.
- When coding frames in a query sequence are translated and
aligned with the closest protein homologs found in a
database, nearly perfectly matched regions can be used to
reveal the exon boundaries in the query.
- Programs used are:
GenomeScan, EST2Genome, SGP-1, TwinScan
• Consensus-based programs:
These programs use consensus-based algorithms, which combine
the results of multiple programs and retain the predictions on
which they agree. However, this may lead to lowered sensitivity
and missed predictions.
Examples of consensus-based programs are GeneComber and DIGIT.
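The trade-off described for consensus-based programs can be illustrated with a per-base vote. This is a toy sketch, not the algorithm used by GeneComber or DIGIT: the input encoding ('C' for coding, 'N' for non-coding), the function name `consensus_coding`, and the `min_votes` parameter are all assumptions made for the example.

```python
def consensus_coding(predictions, min_votes=None):
    """Combine per-base coding calls from several programs by voting.

    predictions: equal-length strings of 'C' (coding) and 'N' (non-coding),
    one string per program. A base is called coding when at least
    `min_votes` programs agree; the default is a simple majority.
    """
    if not predictions:
        return ""
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")
    if min_votes is None:
        min_votes = len(predictions) // 2 + 1
    calls = []
    for column in zip(*predictions):  # one column per base position
        votes = column.count("C")
        calls.append("C" if votes >= min_votes else "N")
    return "".join(calls)


if __name__ == "__main__":
    programs = ["CCNN", "CCCN", "NCCN"]
    print(consensus_coding(programs))               # majority vote
    print(consensus_coding(programs, min_votes=3))  # unanimous vote
```

Requiring more votes removes calls made by only one program, but it also discards true coding bases that some programs missed, which is the loss of sensitivity noted above.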
Functional annotation
• At the present time, the functional genome annotation is
based on the idea that some sequence similarities detected
between two proteins mean that they are homologs i.e. they
come from the same ancestor and share the same
biochemical function.
• Therefore, for each predicted gene, the protein is deduced
from the coding region and is compared, using BLASTP, with
the protein databases.
• If the similarities detected are considered relevant, the name
(function) of the putative homologue protein is associated
with the prediction.
• The general naming convention is nevertheless the following:
when a predicted gene product is 100% identical to an
already characterized protein, it receives the same name,
whereas sequences with strong similarity to known
proteins are called ‘putative’ proteins of the same name.
• The sequences for which only similarities to ESTs are detected
are named ‘unknown’ proteins.
• Finally, genes without similar sequences and, hence, only
deduced from intrinsic prediction programs are labelled
‘hypothetical’.
• Some annotators confirm and complete the Blast results by
full-length alignments between the query protein and the
closest homologue detected, and by looking for motifs and
family signatures.
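The naming convention described above amounts to a small decision rule. The sketch below encodes it under stated assumptions: the function name `annotation_label` and its parameters are hypothetical, and whether a hit counts as `significant` is left to the caller, since the slides give no numeric threshold.

```python
def annotation_label(best_hit_name=None, percent_identity=0.0,
                     significant=False, est_support=False):
    """Return a product name following the convention described above.

    best_hit_name    : name of the closest characterized protein, if any
    percent_identity : identity to that protein, in percent
    significant      : True if the annotator judges the similarity relevant
    est_support      : True if the only similarities found are to ESTs
    """
    if best_hit_name and percent_identity == 100.0:
        return best_hit_name                  # identical to a known protein
    if best_hit_name and significant:
        return f"putative {best_hit_name}"    # strong but not perfect similarity
    if est_support:
        return "unknown protein"              # only EST evidence
    return "hypothetical protein"             # prediction-only evidence


if __name__ == "__main__":
    print(annotation_label("alcohol dehydrogenase", 100.0, True))
    print(annotation_label("alcohol dehydrogenase", 62.5, True))
    print(annotation_label(est_support=True))
    print(annotation_label())
```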
Automatic genome annotation pipelines
• The primary goal of the pipeline process is to deliver highly
accurate and reliable genome annotations, using the widest
possible range of evidence from available databases.
• As pipelines have evolved, the trend has been to move away
from single algorithm methods and towards consensus-based
approaches.
• Pipelines are the integration of suites of bioinformatics
software tools with multiple databases, to manage
automatically the analysis and storage of genomic sequence.
• Genomic sequences pass through several successive levels of
algorithms. Each layer of processing provides further
refinement of annotation detail.
Fig: The generic structure of an automatic genome annotation pipeline
and delivery system.
Genomic pipelines:
Several genomic pipelines exist worldwide. Publicly funded
projects include
• Ensembl at the European Bioinformatics Institute (EBI)/Sanger
Institute,
• NCBI Analysis Pipeline,
• Oak Ridge National Laboratories (ORNL) Genome Channel.
THANK YOU

More Related Content

What's hot

Gene bank by kk sahu
Gene bank by kk sahuGene bank by kk sahu
Gene bank by kk sahuKAUSHAL SAHU
 
Bioinformatics data mining
Bioinformatics data miningBioinformatics data mining
Bioinformatics data miningSangeeta Das
 
Protein 3 d structure prediction
Protein 3 d structure predictionProtein 3 d structure prediction
Protein 3 d structure predictionSamvartika Majumdar
 
Ab Initio Protein Structure Prediction
Ab Initio Protein Structure PredictionAb Initio Protein Structure Prediction
Ab Initio Protein Structure PredictionArindam Ghosh
 
SAGE (Serial analysis of Gene Expression)
SAGE (Serial analysis of Gene Expression)SAGE (Serial analysis of Gene Expression)
SAGE (Serial analysis of Gene Expression)talhakhat
 
methods for protein structure prediction
methods for protein structure predictionmethods for protein structure prediction
methods for protein structure predictionkaramveer prajapat
 
Scoring matrices
Scoring matricesScoring matrices
Scoring matricesAshwini
 
Multiple sequence alignment
Multiple sequence alignmentMultiple sequence alignment
Multiple sequence alignmentRamya S
 
Secondary Structure Prediction of proteins
Secondary Structure Prediction of proteins Secondary Structure Prediction of proteins
Secondary Structure Prediction of proteins Vijay Hemmadi
 
Nucleic Acid Sequence databases
Nucleic Acid Sequence databasesNucleic Acid Sequence databases
Nucleic Acid Sequence databasesPranavathiyani G
 
Introduction to NCBI
Introduction to NCBIIntroduction to NCBI
Introduction to NCBIgeetikaJethra
 
Complementary DNA (cDNA) Libraries
Complementary DNA 	(cDNA) LibrariesComplementary DNA 	(cDNA) Libraries
Complementary DNA (cDNA) LibrariesRamesh Pothuraju
 
Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-naveed ul mushtaq
 

What's hot (20)

Gene bank by kk sahu
Gene bank by kk sahuGene bank by kk sahu
Gene bank by kk sahu
 
Bioinformatics data mining
Bioinformatics data miningBioinformatics data mining
Bioinformatics data mining
 
Est database
Est databaseEst database
Est database
 
Protein 3 d structure prediction
Protein 3 d structure predictionProtein 3 d structure prediction
Protein 3 d structure prediction
 
Ab Initio Protein Structure Prediction
Ab Initio Protein Structure PredictionAb Initio Protein Structure Prediction
Ab Initio Protein Structure Prediction
 
SAGE (Serial analysis of Gene Expression)
SAGE (Serial analysis of Gene Expression)SAGE (Serial analysis of Gene Expression)
SAGE (Serial analysis of Gene Expression)
 
methods for protein structure prediction
methods for protein structure predictionmethods for protein structure prediction
methods for protein structure prediction
 
Scoring matrices
Scoring matricesScoring matrices
Scoring matrices
 
Multiple sequence alignment
Multiple sequence alignmentMultiple sequence alignment
Multiple sequence alignment
 
Protein fold recognition and ab_initio modeling
Protein fold recognition and ab_initio modelingProtein fold recognition and ab_initio modeling
Protein fold recognition and ab_initio modeling
 
YEAST TWO HYBRID SYSTEM
 YEAST TWO HYBRID SYSTEM YEAST TWO HYBRID SYSTEM
YEAST TWO HYBRID SYSTEM
 
Secondary Structure Prediction of proteins
Secondary Structure Prediction of proteins Secondary Structure Prediction of proteins
Secondary Structure Prediction of proteins
 
Structural genomics
Structural genomicsStructural genomics
Structural genomics
 
blast bioinformatics
blast bioinformaticsblast bioinformatics
blast bioinformatics
 
Nucleic Acid Sequence databases
Nucleic Acid Sequence databasesNucleic Acid Sequence databases
Nucleic Acid Sequence databases
 
Introduction to NCBI
Introduction to NCBIIntroduction to NCBI
Introduction to NCBI
 
European molecular biology laboratory (EMBL)
European molecular biology laboratory (EMBL)European molecular biology laboratory (EMBL)
European molecular biology laboratory (EMBL)
 
Complementary DNA (cDNA) Libraries
Complementary DNA 	(cDNA) LibrariesComplementary DNA 	(cDNA) Libraries
Complementary DNA (cDNA) Libraries
 
Kegg
KeggKegg
Kegg
 
Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-
 

Viewers also liked

BIOL335: How to annotate a genome
BIOL335: How to annotate a genomeBIOL335: How to annotate a genome
BIOL335: How to annotate a genomePaul Gardner
 
Gene identification and discovery
Gene identification and discoveryGene identification and discovery
Gene identification and discoveryAmit Ruchi Yadav
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomicshemantbreeder
 
Open Reading Frames
Open Reading FramesOpen Reading Frames
Open Reading FramesOsama Zahid
 
Fine structure of gene
Fine structure of geneFine structure of gene
Fine structure of geneSayali28
 
Equal in the Eyes of the Law 12.12.16
Equal in the Eyes of the Law 12.12.16Equal in the Eyes of the Law 12.12.16
Equal in the Eyes of the Law 12.12.16Katy Collins
 
EL USO DE LAS TICs EN EL TURISMO
EL USO DE LAS TICs EN EL TURISMO EL USO DE LAS TICs EN EL TURISMO
EL USO DE LAS TICs EN EL TURISMO Laura Vargas
 

Viewers also liked (8)

Gemome annotation
Gemome annotationGemome annotation
Gemome annotation
 
BIOL335: How to annotate a genome
BIOL335: How to annotate a genomeBIOL335: How to annotate a genome
BIOL335: How to annotate a genome
 
Gene identification and discovery
Gene identification and discoveryGene identification and discovery
Gene identification and discovery
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomics
 
Open Reading Frames
Open Reading FramesOpen Reading Frames
Open Reading Frames
 
Fine structure of gene
Fine structure of geneFine structure of gene
Fine structure of gene
 
Equal in the Eyes of the Law 12.12.16
Equal in the Eyes of the Law 12.12.16Equal in the Eyes of the Law 12.12.16
Equal in the Eyes of the Law 12.12.16
 
EL USO DE LAS TICs EN EL TURISMO
EL USO DE LAS TICs EN EL TURISMO EL USO DE LAS TICs EN EL TURISMO
EL USO DE LAS TICs EN EL TURISMO
 

Similar to Genome Annotation Process

BTC 506 Gene Identification using Bioinformatic Tools-230302130331.pptx
BTC 506 Gene Identification using Bioinformatic Tools-230302130331.pptxBTC 506 Gene Identification using Bioinformatic Tools-230302130331.pptx
BTC 506 Gene Identification using Bioinformatic Tools-230302130331.pptxChijiokeNsofor
 
Bioinformatics.Practical Notebook
Bioinformatics.Practical NotebookBioinformatics.Practical Notebook
Bioinformatics.Practical NotebookNaima Tahsin
 
Comparative and functional genomics
Comparative and functional genomicsComparative and functional genomics
Comparative and functional genomicsJalormi Parekh
 
Apollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research CommunityApollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research CommunityMonica Munoz-Torres
 
gene prediction programs
gene prediction programsgene prediction programs
gene prediction programsMugdhaSharma11
 
Functional annotation- prediction of genes.pptx
Functional annotation- prediction of genes.pptxFunctional annotation- prediction of genes.pptx
Functional annotation- prediction of genes.pptxSridharshinisathishk
 
Apollo : A workshop for the Manakin Research Coordination Network
Apollo: A workshop for the Manakin Research Coordination NetworkApollo: A workshop for the Manakin Research Coordination Network
Apollo : A workshop for the Manakin Research Coordination NetworkMonica Munoz-Torres
 
SAGE- Serial Analysis of Gene Expression
SAGE- Serial Analysis of Gene ExpressionSAGE- Serial Analysis of Gene Expression
SAGE- Serial Analysis of Gene ExpressionAashish Patel
 
Functional genomics
Functional genomicsFunctional genomics
Functional genomicsajay301
 
Apollo Introduction for i5K Groups 2015-10-07
Apollo Introduction for i5K Groups 2015-10-07Apollo Introduction for i5K Groups 2015-10-07
Apollo Introduction for i5K Groups 2015-10-07Monica Munoz-Torres
 
Assembly and gene_prediction
Assembly and gene_predictionAssembly and gene_prediction
Assembly and gene_predictionBas van Breukelen
 
Introduction to Apollo: i5K E affinis
Introduction to Apollo: i5K E affinisIntroduction to Apollo: i5K E affinis
Introduction to Apollo: i5K E affinisMonica Munoz-Torres
 

Similar to Genome Annotation Process (20)

Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
BTC 506 Gene Identification using Bioinformatic Tools-230302130331.pptx
BTC 506 Gene Identification using Bioinformatic Tools-230302130331.pptxBTC 506 Gene Identification using Bioinformatic Tools-230302130331.pptx
BTC 506 Gene Identification using Bioinformatic Tools-230302130331.pptx
 
Gene identification using bioinformatic tools.pptx
Gene identification using bioinformatic tools.pptxGene identification using bioinformatic tools.pptx
Gene identification using bioinformatic tools.pptx
 
Bioinformatics.Practical Notebook
Bioinformatics.Practical NotebookBioinformatics.Practical Notebook
Bioinformatics.Practical Notebook
 
Comparative and functional genomics
Comparative and functional genomicsComparative and functional genomics
Comparative and functional genomics
 
Apollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research CommunityApollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research Community
 
Gene prediction strategies
Gene prediction strategies Gene prediction strategies
Gene prediction strategies
 
gene prediction programs
gene prediction programsgene prediction programs
gene prediction programs
 
Functional annotation- prediction of genes.pptx
Functional annotation- prediction of genes.pptxFunctional annotation- prediction of genes.pptx
Functional annotation- prediction of genes.pptx
 
Genome analysis2
Genome analysis2Genome analysis2
Genome analysis2
 
artificial neural network-gene prediction
artificial neural network-gene predictionartificial neural network-gene prediction
artificial neural network-gene prediction
 
Apollo : A workshop for the Manakin Research Coordination Network
Apollo: A workshop for the Manakin Research Coordination NetworkApollo: A workshop for the Manakin Research Coordination Network
Apollo : A workshop for the Manakin Research Coordination Network
 
Functional genomics
Functional genomicsFunctional genomics
Functional genomics
 
SAGE- Serial Analysis of Gene Expression
SAGE- Serial Analysis of Gene ExpressionSAGE- Serial Analysis of Gene Expression
SAGE- Serial Analysis of Gene Expression
 
Functional genomics
Functional genomicsFunctional genomics
Functional genomics
 
Functional genomics
Functional genomicsFunctional genomics
Functional genomics
 
Apollo Introduction for i5K Groups 2015-10-07
Apollo Introduction for i5K Groups 2015-10-07Apollo Introduction for i5K Groups 2015-10-07
Apollo Introduction for i5K Groups 2015-10-07
 
Assembly and gene_prediction
Assembly and gene_predictionAssembly and gene_prediction
Assembly and gene_prediction
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
Introduction to Apollo: i5K E affinis
Introduction to Apollo: i5K E affinisIntroduction to Apollo: i5K E affinis
Introduction to Apollo: i5K E affinis
 

Recently uploaded

basic entomology with insect anatomy and taxonomy
basic entomology with insect anatomy and taxonomybasic entomology with insect anatomy and taxonomy
basic entomology with insect anatomy and taxonomyDrAnita Sharma
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.PraveenaKalaiselvan1
 
Four Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.pptFour Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.pptJoemSTuliba
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)riyaescorts54
 
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPirithiRaju
 
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...Universidade Federal de Sergipe - UFS
 
User Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationUser Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationColumbia Weather Systems
 
preservation, maintanence and improvement of industrial organism.pptx
preservation, maintanence and improvement of industrial organism.pptxpreservation, maintanence and improvement of industrial organism.pptx
preservation, maintanence and improvement of industrial organism.pptxnoordubaliya2003
 
Bioteknologi kelas 10 kumer smapsa .pptx
Bioteknologi kelas 10 kumer smapsa .pptxBioteknologi kelas 10 kumer smapsa .pptx
Bioteknologi kelas 10 kumer smapsa .pptx023NiWayanAnggiSriWa
 
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTXALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTXDole Philippines School
 
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxNandakishor Bhaurao Deshmukh
 
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPirithiRaju
 
Davis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologyDavis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologycaarthichand2003
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxpriyankatabhane
 
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...D. B. S. College Kanpur
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxEran Akiva Sinbar
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfSELF-EXPLANATORY
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxpriyankatabhane
 
Citronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayCitronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayupadhyaymani499
 
Carbon Dioxide Capture and Storage (CSS)
Carbon Dioxide Capture and Storage (CSS)Carbon Dioxide Capture and Storage (CSS)
Carbon Dioxide Capture and Storage (CSS)Tamer Koksalan, PhD
 

Recently uploaded (20)

basic entomology with insect anatomy and taxonomy
basic entomology with insect anatomy and taxonomybasic entomology with insect anatomy and taxonomy
basic entomology with insect anatomy and taxonomy
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
 
Four Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.pptFour Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.ppt
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
 
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
 
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
 
User Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationUser Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather Station
 
preservation, maintanence and improvement of industrial organism.pptx
preservation, maintanence and improvement of industrial organism.pptxpreservation, maintanence and improvement of industrial organism.pptx
preservation, maintanence and improvement of industrial organism.pptx
 
Bioteknologi kelas 10 kumer smapsa .pptx
Bioteknologi kelas 10 kumer smapsa .pptxBioteknologi kelas 10 kumer smapsa .pptx
Bioteknologi kelas 10 kumer smapsa .pptx
 
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTXALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
 
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
 
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
 
Davis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologyDavis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technology
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptx
 
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptx
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
 
Citronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayCitronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyay
 
Carbon Dioxide Capture and Storage (CSS)
Carbon Dioxide Capture and Storage (CSS)Carbon Dioxide Capture and Storage (CSS)
Carbon Dioxide Capture and Storage (CSS)
 

Genome Annotation Process

  • 2. Definition: It is the process of taking the raw DNA sequence produced by the genome-sequencing projects and adding the layers of analysis and interpretation necessary to extract its biological significance and place it into the context of our understanding of biological processes.
  • 3. • Today, the public international sequence databases contain more than nine billion nucleotides and the flow of new sequences is increasing dramatically. For scientists, the challenge is to exploit this huge amount of sequences. • To extract biological knowledge from anonymous genomic sequences is the main objective of genome annotation. • The extensive use of computer tools is needed to minimize the slow and costly human interventions. This is the reason why annotation is often synonymous with prediction. • The annotation work is divided into two steps: structural annotation, which consists mainly of localizing gene elements; and functional annotation, which aims at assigning a biochemical function to the deduced gene products.
  • 4. Structural annotation The prediction of the gene elements is a complex problem and its issue is primordial because of its consequences on all the following analyses. • Eukaryotic genes with their mosaic structure are more difficult to find than prokaryotic ones which are simple open reading frames. The presence of introns complicates the problem, although the binding sites of the spliceosome may be used to predict the exact position of the exon borders. • According to the prediction tools, the result of the prediction concerns the splice sites, the exons or the whole gene (gene modelling software).
  • 5. Gene prediction in Prokaryotes • Prokaryotes have relatively small genomes with sizes ranging from 0.5 to 10 Mbp. • But gene density in the genomes is high with more than 90% of a genome sequence containing coding sequence. • In bacteria majority of genes start with ATG which codes for methionine. Occasionally, GTG and TTG are used as alternative start codons. These codons not necessarily give a clear indication of the translation initiation site. This is overcome by the presence of Shine- Delgarno sequence, which is a stretch of purine rich sequence complementary to 16S rRNA in the ribosome. • Many genes are transcribed together as one operon. The end of the operon is characterized by a transcription termination signal called rho- independent terminator.
  • 6. Conventional determination of ORFs • One method is based on the nucleotide composition of the third codon position. It has been observed that this position shows a preference for G or C over A or T. • By plotting the GC composition at this position, regions with values significantly above the random level can be identified, which indicate the presence of ORFs. • A similar method, TESTCODE, exploits the fact that the third codon nucleotides in a coding region tend to repeat themselves.
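The third-position GC method above can be sketched as a sliding-window profile. The function name `gc3_profile` and the window and step sizes are illustrative assumptions; the slides do not specify them.

```python
def gc3_profile(seq, frame=0, window_codons=30, step_codons=10):
    """Return (nucleotide_offset, gc3_fraction) for sliding codon windows.

    Only the third base of each codon in the given reading frame is
    inspected; gc3_fraction is the share of those bases that are G or C.
    """
    seq = seq.upper()
    # Third base of every complete codon in this frame.
    third = [seq[i + 2] for i in range(frame, len(seq) - 2, 3)]
    profile = []
    for start in range(0, len(third) - window_codons + 1, step_codons):
        window = third[start:start + window_codons]
        gc = sum(base in "GC" for base in window) / window_codons
        profile.append((frame + 3 * start, gc))
    return profile


if __name__ == "__main__":
    toy = "AAG" * 10 + "AAT" * 10   # GC-rich third positions, then AT-rich
    for offset, gc3 in gc3_profile(toy, window_codons=10, step_codons=10):
        print(offset, gc3)
```

Windows whose value sits well above the genome-wide background would be the candidate coding regions; what counts as "significantly above" has to be calibrated per genome.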
  • 7. Performance evaluation • Accuracy can be described by two parameters: sensitivity and specificity. Four quantities are used to define them: true positive (TP), false positive (FP), false negative (FN), and true negative (TN). • TP: correctly predicted feature • FP: incorrectly predicted feature • FN: missed feature • TN: correctly predicted absence of a feature • Sensitivity is the proportion of true signals predicted among all possible true signals. • Specificity is the proportion of true signals among all signals that are predicted. Sn = TP/(TP+FN) Sp = TP/(TP+FP)
  • 8. • Correlation coefficient (CC): • The value of CC provides an overall measure of accuracy and ranges from -1 to +1.
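The accuracy measures above can be computed directly from the four counts. This sketch assumes that CC is the standard Matthews correlation coefficient, CC = (TP·TN − FP·FN) / √((TP+FP)(TN+FN)(TP+FN)(TN+FP)), which is the form commonly used when evaluating gene predictions and which ranges from -1 to +1. The function name and the zero-denominator handling are illustrative choices.

```python
import math


def accuracy_metrics(tp, fp, fn, tn):
    """Return (Sn, Sp, CC) from the four prediction counts.

    Sn = TP/(TP+FN) and Sp = TP/(TP+FP), as defined in the text.
    CC is assumed to be the Matthews correlation coefficient.
    A zero denominator yields 0.0 (an illustrative convention).
    """
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tp / (tp + fp) if (tp + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, cc


if __name__ == "__main__":
    sn, sp, cc = accuracy_metrics(tp=80, fp=20, fn=20, tn=880)
    print(f"Sn={sn:.3f} Sp={sp:.3f} CC={cc:.3f}")
```

Note that Sp as defined here is what other fields call precision; it differs from the TN/(TN+FP) definition of specificity used in diagnostics.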
  • 9. Gene prediction in eukaryotes • Eukaryotic nuclear genomes are much larger than prokaryotic ones, with sizes ranging from 10 Mbp to 670 Gbp. • They tend to have a very low gene density. For example, in humans only about 3% of the genome codes for genes, with about 1 gene per 100 kbp on average. • The nascent mRNA undergoes post-transcriptional modification before becoming a mature mRNA for protein translation. • The main issue in the prediction of eukaryotic genes is the identification of exons, introns, and splice sites.
  • 10. Prediction can be made on the basis of: • Presence of conserved sequences - Splice junctions of introns and exons follow the GT-AG rule. • Statistical patterns - Nucleotide composition and codon bias in the coding regions of eukaryotes differ from those of the non-coding regions. Most vertebrate genes use ATG as the translation start codon and have a uniquely conserved sequence around it called the Kozak sequence (CCGCCATGG). • Presence of CpG islands - Most of these genes have a high density of CG dinucleotides near the transcription start site. Here ‘p’ refers to the phosphodiester bond between the two nucleotides.
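The CpG-island signal above can be quantified with a small sketch. The observed/expected CpG ratio and the thresholds used here (length of at least 200 bp, GC content above 0.5, ratio above 0.6) are commonly quoted criteria that do not appear in the slides, so treat them, and the function names, as assumptions.

```python
def cpg_stats(seq):
    """Return (gc_content, observed/expected CpG ratio) for a sequence.

    The ratio is (count of CG dinucleotides * length) / (count C * count G).
    """
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")
    gc_content = (c + g) / n if n else 0.0
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the assumed length, GC-content, and obs/exp thresholds."""
    gc_content, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_content > min_gc and obs_exp > min_obs_exp


if __name__ == "__main__":
    print(cpg_stats("CGCGCGCG"))
    print(looks_like_cpg_island("CG" * 100))
```

In practice the test would be applied to sliding windows upstream of candidate genes rather than to a whole sequence at once.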
  • 11. Gene prediction programs Ab initio-based programs: • These discriminate exons from non-coding sequences and subsequently join them together in the correct order. • They rely on two features: gene signals and gene content. • In addition to HMMs, discriminant analysis and neural-network-based algorithms are also used in gene prediction.
  • 12. • Neural networks: A neural network is a statistical model with a special architecture for pattern recognition and classification. It is built from multiple layers: input, hidden, and output layers. The output is the probability of the exon structure. GRAIL is a program based on a neural network algorithm. Fig: Architecture of a neural network for eukaryotic gene prediction
  • 13. • Prediction using HMMs: - GENSCAN is a web-based program built on fifth-order HMMs. - HMMgene is also a web-based program. It uses a criterion called conditional maximum likelihood to discriminate coding from non-coding features. • Prediction using Discriminant Analysis: - Some gene prediction algorithms rely on discriminant analysis, either linear discriminant analysis (LDA) or quadratic discriminant analysis (QDA). - LDA works by plotting a 2-D graph of coding signals versus all potential 3’ splice site positions and drawing a diagonal line that best separates coding signals from non-coding signals, based on knowledge learned from training data sets of known gene structures.
  • 14. - QDA draws a curved line based on a quadratic function to separate coding and non-coding features. Programs used are: FGENES, FGENESH, FGENESH_C, FGENESH+, MZEF Homology-based programs: - These are based on the fact that exon structures and exon sequences of related species are highly conserved.
  • 15. - When coding frames in a query sequence are translated and aligned with the closest protein homologs found in a database, nearly perfectly matched regions can be used to reveal the exon boundaries in the query. - Programs used are: GenomeScan, EST2Genome, SGP-1, TwinScan • Consensus-based programs: These programs combine the results of multiple prediction programs and retain the predictions on which the programs agree. However, this may lead to lowered sensitivity and missed predictions. Examples of consensus-based programs are: GeneComber, DIGIT
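The trade-off described for consensus-based programs can be seen in a toy per-base voting scheme. This is a hypothetical illustration, not the algorithm used by GeneComber or DIGIT; the function name, the interval convention (0-based, end exclusive), and the `min_votes` parameter are assumptions.

```python
def consensus_exons(predictions, length, min_votes):
    """Keep positions predicted as exon by at least min_votes programs.

    predictions: one list of (start, end) exon intervals per program,
                 0-based with end exclusive.
    Returns the merged consensus intervals in the same convention.
    """
    votes = [0] * length
    for exons in predictions:
        covered = set()
        for start, end in exons:
            covered.update(range(max(0, start), min(end, length)))
        for pos in covered:          # each program votes once per base
            votes[pos] += 1

    intervals, start = [], None
    for pos in range(length):
        if votes[pos] >= min_votes:
            if start is None:
                start = pos
        elif start is not None:
            intervals.append((start, pos))
            start = None
    if start is not None:
        intervals.append((start, length))
    return intervals


if __name__ == "__main__":
    programs = [[(10, 50)], [(20, 60)], [(20, 50), (80, 90)]]
    print(consensus_exons(programs, length=100, min_votes=3))
    print(consensus_exons(programs, length=100, min_votes=1))
```

Requiring agreement from all three programs drops the exon at 80-90 that only one program found, which is the lowered sensitivity the text warns about; accepting any single vote keeps it but also keeps every false positive.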
  • 16. Functional annotation • At present, functional genome annotation is based on the idea that sequence similarity detected between two proteins means that they are homologs, i.e. they come from the same ancestor and share the same biochemical function. • Therefore, for each predicted gene, the protein is deduced from the coding region and compared with the protein databases using BlastP. • If the similarities detected are considered relevant, the name (function) of the putative homologous protein is associated with the prediction.
  • 17. • The general tendency is nevertheless as follows: when a predicted gene product is 100% identical to an already characterized protein, it receives the same name, whereas sequences with stringent similarity to known proteins are called ‘putative’ proteins of the same name. • Sequences for which only similarities to ESTs are detected are named ‘unknown’ proteins. • Finally, genes without similar sequences, and hence deduced only from intrinsic prediction programs, are labelled ‘hypothetical’. • Some annotators confirm and complete the Blast results by full-length alignments between the query protein and the closest homologue detected, and by looking for motifs and family signatures.
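The naming convention above amounts to a small decision rule, sketched below. The function name and arguments are hypothetical, and the sketch assumes the caller has already decided whether a protein hit is similar enough to count, since the slides give no numeric threshold for that.

```python
def label_gene_product(protein_hit=None, identity=0.0, est_hit=False):
    """Return an annotation label following the convention in the text.

    protein_hit: name of the closest characterized protein, or None when
                 no relevant protein similarity was found (the relevance
                 decision is left to the caller).
    identity:    percent identity to that protein (0-100).
    est_hit:     True when the only similarity found is to ESTs.
    """
    if protein_hit is not None:
        if identity >= 100.0:
            return protein_hit                 # identical: same name
        return f"putative {protein_hit}"       # similar: 'putative'
    if est_hit:
        return "unknown protein"               # EST support only
    return "hypothetical protein"              # predicted ab initio only


if __name__ == "__main__":
    print(label_gene_product("DNA gyrase subunit A", identity=100.0))
    print(label_gene_product("DNA gyrase subunit A", identity=78.5))
    print(label_gene_product(est_hit=True))
    print(label_gene_product())
```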
  • 18. Automatic genome annotation pipelines • The primary goal of a pipeline is to deliver highly accurate and reliable genome annotations, using the widest possible range of evidence from available databases. • As pipelines have evolved, the trend has been to move away from single-algorithm methods and towards consensus-based approaches. • Pipelines integrate suites of bioinformatics software tools with multiple databases to manage the analysis and storage of genomic sequence automatically. • Genomic sequences pass through several successive levels of algorithms. Each layer of processing further refines the annotation detail.
  • 19. Fig: The generic structure of an automatic genome annotation pipeline and delivery system.
  • 20. Genomic pipelines: Several genomic pipelines exist worldwide. Publicly funded projects include: • Ensembl at the European Bioinformatics Institute (EBI)/Sanger Institute, • the NCBI Analysis Pipeline, • the Oak Ridge National Laboratory (ORNL) Genome Channel.