Apollo is a web-based application that supports and enables collaborative genome curation in real time, allowing teams of curators to improve on existing automated gene models through an intuitive interface. Apollo allows researchers to break down large amounts of data into manageable portions to mobilize groups of researchers with shared interests.
The i5k, an initiative to sequence the genomes of 5,000 insect and related arthropod species, is a broad and inclusive effort that seeks to involve scientists from around the world in their genome curation process, and Apollo is serving as the platform to empower this community.
This presentation is an introduction to Apollo for the members of the i5K Pilot Project working on species of the order Calanoida (copepod).
Introduction to Apollo - i5k Research Community – Calanoida (copepod)
1. Introduction to Apollo
Collaborative genome annotation editing
A webinar for the i5K Research Community – Calanoida (copepod)
Monica Munoz-Torres | @monimunozto
Berkeley Bioinformatics Open-Source Projects (BBOP)
Environmental Genomics & Systems Biology Division, Lawrence Berkeley National Laboratory
i5k Pilot Project Species Calls | 17 October, 2016
http://GenomeArchitect.org
2. Outline
• Today you will discover effective ways to extract valuable information about a genome through curation efforts.
3. After this talk you will...
• Better understand ‘curation’ in the context of genome annotation:
assembled genome → automated annotation → manual annotation
• Become familiar with Apollo’s environment and functionality.
• Learn to identify homologs of known genes of interest in your newly
sequenced genome.
• Learn how to corroborate and modify automatically annotated gene
models using all available evidence in Apollo.
4. Experimental design, sampling → Sequencing → Assembly → Automated Annotation → Manual Annotation → Official / Merged Gene Set → Comparative analyses → Synthesis & dissemination.
This is our focus.
5. We must care about curation
Marbach et al. 2011. Nature Methods | Shutterstock.com | Alexander Wild
The gene set of an organism informs a variety of studies:
• Characterization: Gene number, GC%, TEs, repeats.
• Functional assignments.
• Molecular evolution, sequence conservation.
• Gene families.
• Metabolic pathways.
• What makes an organism what it is?
What makes a bee a “bee”?
6. Genome Curation
Identifies elements that best
represent the underlying biology
and eliminates elements that
reflect systemic errors of
automated analyses.
Assigns function through
comparative analysis of similar
genome elements from closely
related species using literature,
databases, and experimental
data.
Apollo
Gene Ontology
Resources
7. A few things to remember
when conducting manual annotation
BIO-REFRESHER
• KEEP A GLOSSARY HANDY
from contig to splice site
• WHAT IS A GENE?
defining your goal
• TRANSCRIPTION
mRNA in detail
• TRANSLATION
reading frames, etc.
• GENOME CURATION
steps involved
8. The gene: a “moving target”
“The gene is a union
of genomic
sequences encoding
a coherent set of
potentially
overlapping
functional products.”
Gerstein et al., 2007. Genome Res
9. BIO-REFRESHER
mRNA
"Gene structure" by Daycd - Wikimedia Commons
• Although mRNAs exist only briefly, understanding them is crucial,
as they will become the center of your work.
10. BIO-REFRESHER
Reading frames
• In eukaryotes, only one reading frame per section of DNA is biologically
relevant at a time: it has the potential to be transcribed into RNA and
translated into protein. This is called the OPEN READING FRAME (ORF).
– ORF = Start signal + coding sequence (divisible by 3) + Stop signal
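The ORF definition above can be sketched as a simple check. This is a minimal illustration, not part of Apollo: the function name `is_orf` is hypothetical, and the use of ATG as the start signal and TAA/TAG/TGA as stop signals assumes the standard genetic code.

```python
# Stop codons of the standard genetic code (assumption for this sketch).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_orf(seq: str) -> bool:
    """Return True if seq is: start signal + in-frame codons + stop signal."""
    seq = seq.upper()
    # The whole ORF, start and stop included, must be divisible by 3.
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    # A stop codon before the last position would end translation early.
    return not any(codon in STOP_CODONS for codon in codons[1:-1])
```

For example, `is_orf("ATGAAATAA")` holds, while a sequence whose length is not a multiple of 3 fails the check.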
11. BIO-REFRESHER
Splice sites
• The spliceosome catalyzes the removal of introns and the ligation of
flanking exons.
• Splicing signals (from the point of view of an intron):
– One splice signal (site) on the 5’ end: usually GT (less common: GC)
– And a 3’ end splice site: usually AG
– Canonical splice sites look like this: …]5’-GT/AG-3’[…
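The splice signals listed above can be checked on an intron sequence read 5’ to 3’. The sketch below is only an illustration under that assumption; `classify_intron` and its return labels are hypothetical names, not an Apollo function.

```python
def classify_intron(intron: str) -> str:
    """Classify an intron (read 5'->3') by its first and last two bases."""
    if len(intron) < 4:
        raise ValueError("intron too short to carry both splice signals")
    donor = intron[:2].upper()      # splice signal on the 5' end
    acceptor = intron[-2:].upper()  # splice signal on the 3' end
    if acceptor == "AG" and donor == "GT":
        return "canonical"
    if acceptor == "AG" and donor == "GC":
        return "less common"
    return "non-canonical"
```

An intron such as `GTAAGTTTTCAG` is reported as canonical, while one starting with GC and ending with AG falls in the less common class mentioned above.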
12. BIO-REFRESHER
Exons and Introns
• Introns can interrupt the reading frame of a gene by inserting a sequence:
– Between two consecutive codons
– Between the first and second nucleotide of a codon
– Or between the second and third nucleotide of a codon
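These three positions are commonly called intron phase 0, 1 and 2. As a small sketch, the phase of each intron can be derived from the lengths of the coding portions of the exons that precede it; `intron_phases` and its input format are assumptions made for this illustration.

```python
def intron_phases(cds_exon_lengths):
    """Return the phase of each intron between consecutive coding exons.

    cds_exon_lengths: coding lengths (in bp) of the exons, in transcript order.
    Phase 0: the intron falls between two codons.
    Phase 1: after the first nucleotide of a codon.
    Phase 2: after the second nucleotide of a codon.
    """
    phases = []
    coding_bases = 0
    # Every exon except the last one is followed by an intron.
    for length in cds_exon_lengths[:-1]:
        coding_bases += length
        phases.append(coding_bases % 3)
    return phases
```

With coding exon lengths of 90, 100 and 50 bp, the first intron is in phase 0 (90 is a multiple of 3) and the second is in phase 1 (190 leaves a remainder of 1).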
"Exon and Intron classes”. Licensed under Fair use via Wikipedia
14. GENE PREDICTION & ANNOTATION
PREDICTION & ANNOTATION
• Identification and annotation of genome features:
• primarily focuses on protein-coding genes.
• also identifies RNAs (tRNA, rRNA, long and small non-coding
RNAs (ncRNA)), regulatory motifs, repetitive elements, etc.
• happens in 2 phases:
1. Computation phase
2. Annotation phase
15. GENE PREDICTION & ANNOTATION
COMPUTATION PHASE
a. Experimental data are aligned to the genome: expressed sequence tags,
RNA-sequencing reads, proteins (also from other species).
b. Gene predictions are generated:
- ab initio: based on nucleotide sequence and composition
e.g. Augustus, GENSCAN, geneid, fgenesh, etc.
- evidence-driven: also identifying domains and motifs
e.g. SGP2, JAMg, fgenesh++, etc.
Result: the single most likely coding sequence, no UTRs, no isoforms.
Yandell & Ence. Nature Rev 2012 doi:10.1038/nrg3174
16. GENE PREDICTION & ANNOTATION
ANNOTATION PHASE
Experimental data (evidence) and predictions are synthesized into gene
annotations.
Result: gene models that generally include UTRs, isoforms, evidence trails.
Yandell & Ence. Nature Rev 2012 doi:10.1038/nrg3174
5’ UTR 3’ UTR
18. ANNOTATION
needs some refinement
No one is perfect, least of all automated annotation.
New technologies bring new challenges:
• Assembly errors can cause fragmented
annotations
• Limited coverage makes precise
identification a difficult task
20. GENOME CURATION
an inherently collaborative task
GENE PREDICTION & ANNOTATION
So many sequences, not enough hands.
Apis mellifera | Alexander Wild | www.alexanderwild.com
21. We have provided continuous training and support for hundreds of
geographically dispersed scientists to conduct manual annotation
efforts in order to recover coding sequences in agreement with all
available biological evidence.
Collaboration is key!
APOLLO
• Collaborative work distills invaluable knowledge.
• A little training goes a long way!
Wet lab scientists can easily learn to maximize the
generation of accurate, biologically supported gene models.
23. APOLLO: versatile genome annotation editing
• Apollo is a web-based genome annotation editor, integrated with JBrowse
• Supports real time collaboration & generates analysis-ready data
USER-CREATED ANNOTATIONS
EVIDENCE TRACKS
ANNOTATOR PANEL
24. BECOMING ACQUAINTED WITH APOLLO
General process of curation
1. Select or find a region of interest, e.g. scaffold.
2. Select appropriate evidence tracks to review the gene
model.
3. Determine whether a feature in an existing evidence track
will provide a reasonable gene model to start working.
4. If necessary, adjust the gene model.
5. Check your edited gene model for integrity and accuracy by
comparing it with available homologs.
6. Comment and finish.
25. Apollo- version at i5K Workspace@NAL
4. Becoming Acquainted with Web Apollo.
The Sequence Selection Window
26. Sort
Apollo- version at i5K Workspace@NAL
“Old Track Select Page”
4. Becoming Acquainted with Web Apollo.
27. APOLLO
annotation editing environment
BECOMING ACQUAINTED WITH APOLLO
Color by CDS frame,
toggle strands, set color
scheme and highlights.
- Upload evidence files
(GFF3, BAM, BigWig),
- combination track
- sequence search track
Query the genome using
BLAT.
Navigation and zoom.
Search for a gene
model or a scaffold.
Get coordinates and “rubber
band” selection for zooming.
Login
User-created
annotations.
New
annotator
panel.
Evidence
Tracks
Stage and
cell-type
specific
transcription
data.
http://genomearchitect.org/web_apollo_user_guide
28. BECOMING ACQUAINTED WITH APOLLO
USER NAVIGATION
Annotator
panel.
• Choose appropriate evidence from list of “Tracks” on annotator panel.
• Select & drag elements from evidence track into the ‘User-created Annotations’ area.
• Hovering over annotation in progress brings up an information pop-up.
• Creating a new annotation
33. Editing functionality
Example: Adding an exon supported by experimental data
• RNAseq reads show evidence in support of a transcribed product that was not predicted.
• Add exon by dragging up one of the RNAseq reads.
41. • A confirmation box will warn you if the receiving transcript is not on the
same strand as the feature where the new exon originated.
• Check ‘Start’ and ‘Stop’ signals after each edit.
ADDING EXONS
BECOMING ACQUAINTED WITH APOLLO SIMPLE CASES
43. To modify an exon boundary and match
data in the evidence tracks: select
both the [offending] exon and the
feature with the expected boundary,
then right click on the annotation to
select ‘Set 3’ end’ or ‘Set 5’ end’ as
appropriate.
In some cases all the data may disagree with the annotation, in
other cases some data support the annotation and some of the
data support one or more alternative transcripts. Try to annotate
as many alternative transcripts as are well supported by the data.
MATCHING EXON BOUNDARY TO EVIDENCE
BECOMING ACQUAINTED WITH APOLLO SIMPLE CASES
46. Non-canonical splices are indicated by
an orange circle with a white
exclamation point inside, placed over
the edge of the offending exon.
Canonical splice sites:
forward strand:
5’-…exon]GT / AG[exon…-3’
reverse strand, not reverse-complemented:
3’-…exon]GA / TG[exon…-5’
SPLICE SITES
Zoom to review non-canonical
splice site warnings. Although
these may not always have to be
corrected (e.g. GC donor), they
should be flagged with a
comment.
Exon/intron splice site error warning
Curated model
BECOMING ACQUAINTED WITH APOLLO SIMPLE CASES
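The reverse-strand notation on this slide can be reproduced with a short sketch. It assumes the reverse strand is displayed as the complement of the forward strand, read left to right without reversing, so a canonical intron of a reverse-strand gene appears as GA…TG. The function names are hypothetical and are not Apollo's.

```python
# Translation table for complementing DNA without reversing it.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def displayed_reverse_strand(forward: str) -> str:
    """Complement the forward strand, keeping the left-to-right order."""
    return forward.translate(COMPLEMENT)


def reverse_intron_is_canonical(forward_intron: str) -> bool:
    """Check a reverse-strand intron given its forward-strand sequence.

    Read left to right on the displayed reverse strand, a canonical
    intron starts with GA (AG acceptor) and ends with TG (GT donor).
    """
    shown = displayed_reverse_strand(forward_intron).upper()
    return shown.startswith("GA") and shown.endswith("TG")
```

For a forward-strand stretch `CTAAAAAC`, the displayed reverse strand is `GATTTTTG`, which reads GT…AG when taken 5’ to 3’ along the reverse strand.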
49. Evidence may support joining two or more different gene models.
Warning: protein alignments may have incorrect splice sites and lack non-conserved regions!
1. In ‘User-created Annotations’ area shift-click to select an intron from each gene model and
right click to select the ‘Merge’ option from the menu.
2. Drag supporting evidence tracks over the candidate models to corroborate overlap, or
review edge matching and coverage across models.
3. Check the resulting translation by querying a protein database e.g. UniProt, NCBI nr. Add
comments to record that this annotation is the result of a merge.
Red lines around exons:
‘edge-matching’ allows annotators to confirm whether the
evidence is in agreement without examining each exon at the
base level.
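The idea behind edge-matching can be sketched as a comparison of exon boundaries between an annotation and an evidence feature. This is only an illustration of the concept, not how Apollo implements it; `matching_edges` and the `(start, end)` tuple format are assumptions.

```python
def matching_edges(annotation_exons, evidence_exons):
    """Report, per annotated exon, whether each edge is shared with evidence.

    Both arguments are lists of (start, end) exon coordinates.
    Returns a list of (start, end, start_matches, end_matches) tuples.
    """
    evidence_starts = {start for start, _ in evidence_exons}
    evidence_ends = {end for _, end in evidence_exons}
    return [
        (start, end, start in evidence_starts, end in evidence_ends)
        for start, end in annotation_exons
    ]
```

An exon whose two flags are both true agrees with the evidence at both boundaries; a false flag points to the edge that deserves a closer look at the base level.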
COMPLEX CASES
merge two gene predictions on the same scaffold
BECOMING ACQUAINTED WITH APOLLO COMPLEX CASES
51. DNA Track
‘User-created Annotations’ Track
COMPLEX CASES
annotate frameshifts and correct single-base errors
Always remember: when annotating gene models using Apollo, you are looking at a ‘frozen’ version of
the genome assembly and you will not be able to modify the assembly itself.
BECOMING ACQUAINTED WITH APOLLO COMPLEX CASES
54. 1. Apollo allows annotators to make single base modifications or frameshifts that are reflected in
the sequence and structure of any transcripts overlapping the modification. These
manipulations do NOT change the underlying genomic sequence.
2. If you determine that you need to make one of these changes, zoom in to the nucleotide level
and right click over a single nucleotide on the genomic sequence to access a menu that
provides options for creating insertions, deletions or substitutions.
3. The ‘Create Genomic Insertion’ feature will require you to enter the necessary string of
nucleotide residues that will be inserted to the right of the cursor’s current location. The
‘Create Genomic Deletion’ option will require you to enter the length of the deletion, starting
with the nucleotide where the cursor is positioned. The ‘Create Genomic Substitution’ feature
asks for the string of nucleotide residues that will replace the ones on the DNA track.
4. Once you have entered the modifications, Apollo will recalculate the corrected transcript and
protein sequences, which will appear when you use the right-click menu ‘Get Sequence’
option. Since the underlying genomic sequence is reflected in all annotations that include the
modified region, you should alert the curators of your organism’s database using the
‘Comments’ section to report the CDS edits.
5. In special cases such as selenocysteine-containing proteins (read-throughs), right-click over the
offending/premature ‘Stop’ signal and choose the ‘Set readthrough stop codon’ option from
the menu.
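The three sequence alterations described above can be sketched on a plain string. This is a hedged illustration of the behavior as the slide words it, not Apollo code: `apply_edit`, its arguments and the 0-based `pos` convention are assumptions. As with Apollo, the original sequence is left untouched and an altered copy is returned.

```python
def apply_edit(genome: str, kind: str, pos: int, value) -> str:
    """Return an altered copy of genome; the input string is not modified.

    pos is the 0-based index of the nucleotide under the cursor.
    """
    if kind == "insertion":
        # value: residues inserted to the right of the cursor position.
        return genome[:pos + 1] + value + genome[pos + 1:]
    if kind == "deletion":
        # value: number of bases removed, starting with the base at pos.
        return genome[:pos] + genome[pos + value:]
    if kind == "substitution":
        # value: residues that replace the ones starting at pos.
        return genome[:pos] + value + genome[pos + len(value):]
    raise ValueError(f"unknown edit kind: {kind}")
```

Transcript and protein sequences would then be recomputed from the altered copy, while the frozen assembly stays as it was.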
COMPLEX CASES
annotating frameshifts and correcting single-base errors & selenocysteines
BECOMING ACQUAINTED WITH APOLLO COMPLEX CASES
62. • Check ‘Start’ and ‘Stop’ sites.
• Check splice sites: most splice sites display
these residues …]5’-GT/AG-3’[…
• Check if you can annotate UTRs, for example
using RNA-Seq data:
– align it against relevant genes/gene family
– blastp against NCBI’s RefSeq or nr
• Check for gaps in the genome.
• Additional functionality may be necessary:
– merging 2 gene predictions - same scaffold
– ‘merging’ 2 gene predictions - different
scaffolds
– splitting a gene prediction
– annotating frameshifts
– annotating selenocysteines, correcting
single-base and other assembly errors, etc.
• Add:
– Important project information in the form of
comments
– IDs from public databases e.g. GenBank (via
DBXRef), gene symbol(s), common name(s),
synonyms, top BLAST hits, orthologs with
species names, and everything else you can
think of, because you are the expert.
– Comments about the kinds of changes you
made to the gene model of interest, if any.
– Any appropriate functional assignments, e.g.
via BLAST, RNA-Seq data, literature searches,
etc.
CHECKLIST
for accuracy and integrity
MANUAL ANNOTATION CHECKLIST
64. i5K Workspace@NAL
The collaborative curation process at i5k
1. A computationally predicted consensus gene set has been generated
using multiple lines of evidence; e.g. HVIT_v0.5.3-Models
2. i5K Projects will integrate consensus computational predictions with
manual annotations to produce an updated Official Gene Set (OGS):
Warning!
• If it’s not on either track, it won’t make the OGS!
• If it’s there and it shouldn’t be, it will still make the OGS!
65. The ‘Replace Models’ rules
BECOMING ACQUAINTED WITH APOLLO http://tinyurl.com/apollo-i5k-replace
68. What’s new?...
finding inspiration in PubMed.
Example
“Molecular analysis of bed bug populations from across the USA and Europe
found that >80% and >95% of the respective populations contained V419L
and/or L925I mutations in the voltage-gated sodium channel gene, indicating
widespread distribution of target-site-based pyrethroid resistance.”
Homalodisca vitripennis | Alexander Wild | www.alexanderwild.com
Halyomorpha halys | Fondazione Edmund Mach - Italy
Now for our species of interest. . .
70. What do we know about this genome?
• Currently publicly available data at NCBI:
• >37,000 nucleotide seqs → scaffolds, mitochondrial genes
• 344 amino acid seqs → mitochondrion
• 47 ESTs
• 0 conserved domains identified
• 0 “gene” entries submitted
• Data at i5K Workspace@NAL (annotation hosted at USDA)
- 10,832 scaffolds: 23,288 transcripts: 12,906 proteins
Example
72. PubMed Search: what’s new?
Example
“Ten populations differed by at least 550-fold in sensitivity to
pyrethroids.”
“Sequencing the primary pyrethroid target site, the voltage-
gated sodium channel (vgsc), shows that point mutations and
their spread in natural populations were responsible for
differences in pyrethroid sensitivity.”
“The finding that a non-target aquatic species has acquired
resistance to pesticides used only on terrestrial pests is
troubling evidence of the impact of chronic pesticide
transport from land-based applications into aquatic systems.”
73. How many sequences are there, publicly available,
for our gene of interest?
Example
• Para, (voltage-gated sodium channel alpha
subunit; Nasonia vitripennis).
• NaCP60E (Sodium channel protein 60 E; D.
melanogaster).
– MF: voltage-gated cation channel activity
(IDA, GO:0022843).
– BP: olfactory behavior (IMP,
GO:0042048), sodium ion
transmembrane transport
(ISS, GO:0035725).
– CC: voltage-gated sodium channel
complex (IEA, GO:0001518).
And what do we know about them?
74. Retrieving sequences for a
sequence similarity search.
Example
>vgsc-Segment3-DomainII
RVFKLAKSWPTLNLLISIMGKTVGALGNLTFVLCIIIFIFAVMGMQLFGKNYTEKVTKFKWSQD
GQMPRWNFVDFFHSFMIVFRVLCGEWIESMWDCMYVGDFSCVPFFLATVVIGNLVVSFMHR
76. BLAT search
results
Example
• High-scoring segment pairs (hsp)
are listed in tabulated format.
• Clicking on one line of results
sends you to those coordinates.
77. BLAST at i5K
https://i5k.nal.usda.gov/blast
Example
>vgsc-Segment3-DomainII
RVFKLAKSWPTLNLLISIMGKTVGALGNLTFVLCIIIFIFAVMGMQLFGKNYTEKVTKFKWSQD
GQMPRWNFVDFFHSFMIVFRVLCGEWIESMWDCMYVGDFSCVPFFLATVVIGNLVVSFMHR
79. BLAST at i5K: hsps in “BLAST+ Results” track
Example
80. Creating a new gene model: drag and drop
Example
• Apollo automatically calculates longest ORF.
• In this case, ORF includes the high-scoring segment pairs (hsp),
marked here in blue.
• Note that gene is transcribed from reverse strand.
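To make the idea of a longest ORF concrete, here is a minimal sketch that scans the three reading frames of a transcript sequence. It is not Apollo's implementation: `longest_orf` is a hypothetical name, ATG and the standard stop codons are assumed, and only the strand passed in is scanned, so for a gene on the reverse strand the reverse complement would be supplied.

```python
# Stop codons of the standard genetic code (assumption for this sketch).
STOPS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str) -> str:
    """Return the longest ATG...stop stretch found in any of three frames."""
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = None  # index of the first ATG of the ORF being read
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if i + 3 - start > len(best):
                    best = seq[start:i + 3]
                start = None
    return best
```

An empty string is returned when no complete ORF is present; ORFs left open at the end of the sequence are ignored in this sketch.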
86. Editing: merge the three models
Example
Merge by dropping an
exon or gene model
onto another.
Merge by selecting
two exons (holding
down “Shift”) and
using the right click
menu.
or…
88. Editing: correct offending splice site
Example
Modify exon / intron
boundary:
- Drag the end of the
exon to the nearest
canonical splice site.
or
- Use right-click menu.
90. Editing: delete exon not supported by evidence
Example
Delete first exon from
HaztTmpM006233
91. Editing: add an exon supported by RNAseq
Example
• RNAseq reads show evidence in support of transcribed product, which was not predicted.
• Add exon at coordinates 97946-98012 by dragging up one of the RNAseq reads.
97. PUBLIC DEMO
APOLLO ON THE WEB
instructions
At i5K
1. Register for access to Apollo at the i5K Workspace@NAL at
https://i5k.nal.usda.gov/web-apollo-registration
2. Contact the coordinator for each species community to receive
more information about how to contribute. Contact info is available
on each organism’s page.
98. PUBLIC DEMO
APOLLO ON THE WEB
instructions
Public Honey bee demo available at:
http://GenomeArchitect.org/WebApolloDemo
Username:
demo@demo.com
Password:
demo
101. Apollo Development
Nathan Dunn
Technical Lead Eric Yao
Christine Elsik’s Lab,
University of Missouri
Suzi Lewis
Principal Investigator
BBOP
Moni Munoz-Torres
Project Manager
Deepak Unni
JBrowse. Ian Holmes’ Lab
University of California, Berkeley
102. • Berkeley Bioinformatics Open-source Projects (BBOP),
Berkeley Lab: Apollo and Gene Ontology teams.
Suzanna E. Lewis (PI).
• § Christine G. Elsik (PI). University of Missouri.
• * Ian Holmes (PI). University of California Berkeley.
• Arthropod genomics community & i5K Steering
Committee.
• Stephen Ficklin, GenSAS, Washington State University
• Apollo is supported by NIH grants 5R01GM080203
from NIGMS, and 5R01HG004483 from NHGRI. Also
supported by the Director, Office of Science, Office of
Basic Energy Sciences, of the U.S. Department of
Energy under Contract No. DE-AC02-05CH11231
• For your attention, thank you!
Apollo
Nathan Dunn
Deepak Unni §
Gene Ontology
Chris Mungall
Seth Carbon
Heiko Dietze
BBOP
Learn more about Apollo at http://GenomeArchitect.org
Thank you!
NAL at USDA
Monica Poelchau
Mei-Ju Chen
Christopher Childers
Gary Moore
HGSC at BCM
fringy Richards
Kim Worley
JBrowse Eric Yao *
106. Update: Transforming coordinates
Bringing exons closer together to facilitate
annotation of gene models with long introns.
1,275 bp
Concept for Apollo v2.1 – Northern Spring 2016
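One way to picture bringing exons closer together is a projection that keeps exon bases and collapses the introns between them. The sketch below illustrates that concept only, under assumptions of its own: `project`, the inclusive `(start, end)` exon tuples and the optional `padding` between exons are hypothetical and do not describe Apollo's code.

```python
def project(pos, exons, padding=0):
    """Map a genomic position inside an exon to a folded-view coordinate.

    exons: list of inclusive (start, end) genomic coordinates.
    padding: number of columns kept between consecutive exons.
    """
    offset = 0
    for start, end in sorted(exons):
        if start <= pos <= end:
            return offset + (pos - start)
        # Skip this exon, plus the spacer that stands in for the intron.
        offset += (end - start + 1) + padding
    raise ValueError("position is not inside any exon")
```

With exons at 100-199 and 1500-1599 and a padding of 10, position 1500 lands at folded coordinate 110 instead of sitting 1,300 bases away from the first exon.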
107. Transforming coordinates
Assembly artifacts may cause gene models to be split
across two or more scaffolds. To facilitate annotation,
Apollo allows the generation of an artificial space where
the annotation can be completed.
Genome Assembly: Scaffold 1 | Scaffold 2 | . . . | Scaffold n
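The artificial space described on this slide can be pictured as scaffolds laid end to end, each with an offset into a shared coordinate system. The following is a minimal sketch of that idea, assuming 0-based positions; `build_offsets`, `to_artificial` and the scaffold names are hypothetical and are not taken from Apollo.

```python
def build_offsets(scaffolds):
    """Compute the start offset of each scaffold in the joined space.

    scaffolds: list of (name, length) pairs, in the order they are joined.
    """
    offsets = {}
    total = 0
    for name, length in scaffolds:
        offsets[name] = total
        total += length
    return offsets


def to_artificial(name, pos, offsets):
    """Convert a 0-based position on a scaffold to the joined coordinate."""
    return offsets[name] + pos
```

For two scaffolds of 1,000 and 500 bases joined in that order, position 25 on the second scaffold maps to 1025, so a gene model split across both can be described in one coordinate system.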