Translating research data into Gene Ontology annotations
1. Translating research data into
Gene Ontology annotations
Pascale Gaudet
SIB – Swiss Institute of Bioinformatics
GO Consortium
2. Gene Ontology Consortium
What we provide
Ontology + Annotations = Model of biology

Ontology
A structured representation of biology, composed of:
• Classes
• Relations
• Definitions

Annotations
Statements about the functions of specific gene products.
3 aspects:
• Molecular function
• Biological process
• Cellular component
Examples:
IGHA1 (Immunoglobulin heavy constant alpha 1)
- Antigen binding
- Adaptive immune response
- Extracellular
QARS (Gln tRNA synthetase)
- Glutamine-tRNA ligase activity
- Translation
- Cytoplasm

Model of biology
Representation of current knowledge in a manner that is:
• Human understandable
• Machine computable
3. GO “annotations”
§ An annotation is a statement linking a gene to
some aspect of its function (a GO ontology term)
§ Each annotation is based on some evidence,
recorded as part of the annotation:
- Evidence code (type of evidence)
- Reference (published journal article)
Examples:
Annotation 1: INSR + ‘receptor activity’
Annotation 2: INSR + ‘plasma membrane’
Annotation 3: INSR + ‘insulin receptor signaling pathway’
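The three INSR examples above can be written down as records. The following is a minimal sketch, not the GO Consortium's file format: the class name, the field names, the evidence code "IDA" and the "PMID:PLACEHOLDER" reference are all assumptions made for illustration, since the slide gives only the gene and the term.

```python
from dataclasses import dataclass

# The three GO aspects named in this deck.
ASPECTS = ("molecular function", "biological process", "cellular component")


@dataclass(frozen=True)
class Annotation:
    """One statement linking a gene to a GO term, with its evidence."""
    gene: str           # gene or gene product symbol, e.g. "INSR"
    go_term: str        # label of the GO term
    aspect: str         # one of ASPECTS
    evidence_code: str  # type of evidence (placeholder values below)
    reference: str      # supporting publication (placeholder values below)

    def __post_init__(self):
        # Reject anything that is not one of the three aspects.
        if self.aspect not in ASPECTS:
            raise ValueError(f"unknown GO aspect: {self.aspect!r}")


# Evidence codes and references here are hypothetical placeholders.
annotations = [
    Annotation("INSR", "receptor activity", "molecular function",
               "IDA", "PMID:PLACEHOLDER"),
    Annotation("INSR", "plasma membrane", "cellular component",
               "IDA", "PMID:PLACEHOLDER"),
    Annotation("INSR", "insulin receptor signaling pathway",
               "biological process", "IDA", "PMID:PLACEHOLDER"),
]


def terms_for(gene, records):
    """Return the GO term labels annotated to a gene, in input order."""
    return [a.go_term for a in records if a.gene == gene]
```

Calling terms_for("INSR", annotations) returns the three term labels, one per aspect.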
4. Semantics of a GO annotation
The association of a GO class with a gene product
is a statement that means:
§ molecular function: molecular activities of gene
products
§ cellular component: where gene products are active
§ biological process: pathways and larger processes
made up of the activities of multiple gene products.
§ In other words, annotations represent the
normal, in vivo biological role of gene products
5. How are annotations generated?

Manual - Literature-based (> 500,000 annotations)
An expert reviews the literature and assigns
functions, processes and cellular components to
gene products.

Manual - Sequence-based (> 3M annotations)
An expert analyses a sequence and makes a
prediction concerning the gene function based on
known functions of related sequences.
The predictions can be based on the known
function of evolutionarily related sequences
(phylogenetic relationships).

Algorithmic (unreviewed) (> 65M annotations)
A computer program analyses a sequence and
makes a prediction based on some decision criteria,
for example:
- protein domain (InterPro2GO)
- sequence similarity (BLAST2GO)
6. Evidence types
Chibucos MC, Siegele DA, Hu JC, Giglio M. (2017) Evidence and conclusion ontology PMID: 27812948

Manual - Literature-based
EXP: experimental evidence
IDA: inferred from direct assay
IPI: inferred from physical interaction
IMP: inferred from mutant phenotype

Manual - Sequence-based
ISS: inferred from sequence similarity
ISO: inferred from sequence orthology
IBA: inferred from biological aspect of ancestor

Algorithmic (unreviewed)
IEA: inferred from electronic annotation
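When filtering annotations it is often useful to know which of the three groups an evidence code belongs to. The lookup below is a small sketch covering only the eight codes listed on this slide; the function names are hypothetical, and the assignment of codes to groups follows the slide's three column headings.

```python
# Group labels as used on this slide.
LITERATURE = "Manual - Literature-based"
SEQUENCE = "Manual - Sequence-based"
ALGORITHMIC = "Algorithmic (unreviewed)"

# Evidence code -> (meaning, group). Only the codes shown on the slide.
EVIDENCE_CODES = {
    "EXP": ("experimental evidence", LITERATURE),
    "IDA": ("inferred from direct assay", LITERATURE),
    "IPI": ("inferred from physical interaction", LITERATURE),
    "IMP": ("inferred from mutant phenotype", LITERATURE),
    "ISS": ("inferred from sequence similarity", SEQUENCE),
    "ISO": ("inferred from sequence orthology", SEQUENCE),
    "IBA": ("inferred from biological aspect of ancestor", SEQUENCE),
    "IEA": ("inferred from electronic annotation", ALGORITHMIC),
}


def evidence_group(code):
    """Return the group for an evidence code; KeyError if not listed."""
    return EVIDENCE_CODES[code][1]


def is_reviewed(code):
    """True for manually assigned codes, False for algorithmic ones."""
    return evidence_group(code) != ALGORITHMIC


def codes_in_group(group):
    """Return the codes belonging to a group, sorted alphabetically."""
    return sorted(c for c, (_, g) in EVIDENCE_CODES.items() if g == group)
```

For example, a pipeline that wants only expert-reviewed statements could keep the records for which is_reviewed(code) is true.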
7. Who produces GO annotations?
• Model organism databases (SGD, FlyBase,
WormBase, MGI, etc.)
• Generalist databases, e.g. UniProtKB, IntAct
• Domain-specific projects: Cardiovascular project
(UCL), synapse project (VU), etc.
• Anyone who wishes to contribute their expertise
and data to the project
8. Best practices for generating
literature-based GO annotations
§ Ensure consistency of usage across a
broad consortium of contributors
§ Improve inferencing capabilities
9. Focus on the research hypothesis
§ Use prior knowledge to understand the hypothesis
being tested and its relation to the experimental
observation
Same assay result, different conclusions for GO:

DFFB (O76075)
- Known roles: DNase
- Hypothesis: The nuclease activity of DFFB is required for
nuclear DNA fragmentation during apoptosis
- Assay result: Apoptotic DNA fragmentation increased in the
presence of DFFB
- Conclusion for GO: DFFB mediates nuclear DNA fragmentation
during apoptosis
= apoptotic DNA fragmentation (GO:0006309)

FOXL2 (P58012)
- Known roles: Transcription factor
- Hypothesis: Mutations in FOXL2 are known to cause premature
ovarian failure, which may be due to increased apoptosis
- Assay result: Apoptotic DNA fragmentation increased in the
presence of FOXL2
- Conclusion for GO: FOXL2 increases the rate of apoptosis
= positive regulation of apoptotic process (GO:0043065)
10. Annotate the conclusion, not the assay
1) Rubidium is often used to assay potassium transport,
because the radioactive form is more readily available;
- the physiologically relevant substrate is potassium
2) Protein kinases are often tested with non-physiologically
relevant substrates, such as histone
- if the authors do not discuss the physiological relevance,
one cannot annotate the substrate
11. On the in vivo relevance of phenotypes
• Phenotypes can help understand the function of proteins
• Phenotypes can provide insights into mechanisms leading to disease
• The scope of the GO, though, is to capture the normal function of
proteins
Indirect effects of a mutation:
- RNA polymerase affects essentially all cellular processes (cell
proliferation, development, etc.) but does not mediate these
processes
Lack of hypothesis for a role of a protein in a process:
- Knockdown of Tmem234 in zebrafish results in defects in pronephric
glomerulus formation. Annotation by IMP to glomerulus formation is
not supported by any cellular/molecular data
12. Get the wider perspective
• Favor a gene-by-gene or pathway-by-pathway
approach for curation rather than paper-by-paper
• Read recent publications
• Remove incorrect annotations based on invalidated
hypotheses
13. Guidelines for high-quality
annotations
• Annotate the conclusion of the experiment
• Use the biological context to interpret the
experiments
• Carefully select publications. Read recent
publications
• Ensure consistency with existing annotations
• Keep annotations up to date: remove obsolete
annotations
14. Other approaches for quality
control
• Annotation consistency exercises
• Taxonomic constraints
• Co-occurrence of annotations
• Phylogenetic annotations
• User feedback
- from GO website
- from PubMed
- from databases
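As an illustration of the taxonomic-constraints check listed above, the sketch below flags annotations whose term is restricted to a taxon that the annotated species does not belong to. Everything in it is a toy assumption: the lineage table, the two "only in taxon" rules, the gene names and the function names are made up for illustration and are not taken from the GO Consortium's actual constraint files.

```python
# Toy lineage table: species -> set of taxa it belongs to (illustrative).
LINEAGE = {
    "Homo sapiens": {"Mammalia", "Metazoa", "Eukaryota"},
    "Saccharomyces cerevisiae": {"Fungi", "Eukaryota"},
    "Escherichia coli": {"Bacteria"},
}

# Toy "only in taxon" rules: GO term label -> taxon it is restricted to.
ONLY_IN_TAXON = {
    "lactation": "Mammalia",
    "nucleus": "Eukaryota",
}


def violates_constraint(term, species):
    """True if the term is restricted to a taxon the species is not in."""
    required = ONLY_IN_TAXON.get(term)
    if required is None:
        return False  # no constraint recorded for this term
    return required not in LINEAGE[species]


def flag_annotations(annotations):
    """Return the (gene, term, species) triples that break a constraint."""
    return [a for a in annotations if violates_constraint(a[1], a[2])]


# Hypothetical annotations: gene names are placeholders.
example = [
    ("geneA", "lactation", "Homo sapiens"),
    ("geneB", "lactation", "Saccharomyces cerevisiae"),
    ("geneC", "nucleus", "Escherichia coli"),
    ("geneD", "nucleus", "Saccharomyces cerevisiae"),
]
flagged = flag_annotations(example)
```

In this toy data set, the yeast "lactation" and the E. coli "nucleus" annotations are the ones flagged for review. A flag is a prompt for a curator to look again, not an automatic deletion.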
18. Acknowledgments
• GO PIs
- Judy Blake
- Mike Cherry
- Suzanna Lewis
- Paul Sternberg
- Paul Thomas
• GO Handbook contributors
- Christophe Dessimoz
- Jim Hu
- Nives Skunca
- Sylvain Poux
• Funding
- NIH HG002273 (GO)