Apollo is a web-based application that supports and enables collaborative genome curation in real time, allowing teams of curators to improve on existing automated gene models through an intuitive interface. Apollo allows researchers to break down large amounts of data into manageable portions to mobilize groups of researchers with shared interests.
A Workshop at the Stowers Institute for Medical Research.
Observational constraints on mergers creating magnetism in massive stars
Apollo Collaborative genome annotation editing
1. Apollo
Collaborative genome annotation editing
A workshop for the Stowers Institute Research Community
Monica Munoz-Torres, PhD | @monimunozto
Berkeley Bioinformatics Open-Source Projects (BBOP)
Environmental Genomics & Systems Biology Division
Lawrence Berkeley National Laboratory
Kansas City, MO | 12 December, 2016
http://GenomeArchitect.org
2. Outline
• Today we will discuss
effective ways to extract
valuable information
about a genome through
curation efforts.
3. After this talk you will...
• Better understand curation in the context of genome annotation:
assembled genome à automated annotation à manual annotation
• Become familiar with Apollo’s environment and functionality.
• Learn to identify homologs of known genes of interest in your newly
sequenced genome.
• Learn how to corroborate and modify automatically annotated gene
models using all available evidence in Apollo.
7. A few things to remember during curation of gene models
7BIO-REFRESHER
• KEEP A GLOSSARY HANDY
from contig to splice site
• WHAT IS A GENE?
defining your goal
• TRANSCRIPTION
mRNA in detail
• TRANSLATION
reading frames, etc.
• GENOME CURATION
steps involved
8. The gene: a moving target
“The gene is a union
of genomic
sequences encoding
a coherent set of
potentially
overlapping
functional products.”
Gerstein et al., 2007. Genome Res
9. 9
"Gene structure" by Daycd- Wikimedia Commons
BIO-REFRESHER
mRNA
• Although of brief existence, understanding mRNAs is crucial,
as they will become the center of your work.
10. 10BIO-REFRESHER
Reading frames
v In eukaryotes, only one reading frame per section of DNA is biologically
relevant at a time: it has the potential to be transcribed into RNA and
translated into protein. This is called the OPEN READING FRAME (ORF)
• ORF = Start signal + coding sequence (divisible by 3) + Stop signal
11. 11BIO-REFRESHER
Splice sites
v The spliceosome catalyzes the removal of introns and the ligation of
flanking exons.
v Splicing signals (from the point of view of an intron):
• One splice signal (site) on the 5’ end: usually GT (less common: GC)
• And a 3’ end splice site: usually AG
• Canonical splice sites look like this: …]5’-GT/AG-3’[…
12. 12BIO-REFRESHER
Exons and Introns
v Introns can interrupt the reading frame of a gene by inserting a sequence
between two consecutive codons
v Between the first and second nucleotide of a codon
v Or between the second and third nucleotide of a codon
"Exon and Intron classes”. Licensed under Fair use via Wikipedia
13. 13BIO-REFRESHER
Obstacles
to transcription and translation
v The presence of premature Stop codons in the message is possible. A
process called non-sense mediated decay checks for them and corrects
them to avoid: incomplete splicing, DNA mutations, transcription errors,
and leaky scanning of ribosome – causing changes in the reading frame
(frame shifts).
v Insertions and deletions (indels) can cause frame shifts when indel is not
divisible by three. As a result, the peptide can be abnormally long, or
abnormally short – depending when the first in-frame Stop signal is
located.
15. 15GENE PREDICTION & ANNOTATION
PREDICTION & ANNOTATION
v Identification and annotation of genome features:
• primarily focuses on protein-coding genes.
• also identifies RNAs (tRNA, rRNA, long and small non-coding
RNAs (ncRNA)), regulatory motifs, repetitive elements, etc.
• happens in 2 phases:
1. Computation phase
2. Annotation phase
16. 16GENE PREDICTION & ANNOTATION
COMPUTATION PHASE
a. Experimental data are aligned to the genome: expressed sequence tags,
RNA-sequencing reads, proteins (also from other species).
b. Gene predictions are generated:
- ab initio: based on nucleotide sequence and composition
e.g. Augustus, GENSCAN, geneid, fgenesh, etc.
- evidence-driven: identifying also domains and motifs
e.g. SGP2, JAMg, fgenesh++, etc.
Result: the single most likely coding sequence, no UTRs, no isoforms.
Yandell & Ence. Nature Rev 2012 doi:10.1038/nrg3174
17. 17GENE PREDICTION & ANNOTATION
ANNOTATION PHASE
Experimental data (evidence) and predictions are synthetized into gene
annotations.
Result: gene models that generally include UTRs, isoforms, evidence trails.
Yandell & Ence. Nature Rev 2012 doi:10.1038/nrg3174
5’ UTR 3’ UTR
19. 19BIO-REFRESHER
Good genes are required!
1. Generate gene models
v A few rounds of gene prediction.
2. Annotate gene models
v Function, expression patterns,
metabolic network memberships.
3. Manually review them
v Structure & Function.
20. Best representation of biology &
removal of elements reflecting
errors in automated analyses.
Functional assignments through
comparative analysis using literature,
databases, and experimental data.
Apollo
Gene Ontology
Curation improves quality
21. 21BIO-REFRESHER
Curation is inherently collaborative
• It is impossible for a single individual to curate an entire
genome with precise biological fidelity.
• Curators need second opinions and insights from colleagues
with domain and gene family expertise.
23. Apollo: genome annotation editing
• Collaborative, instantaneous, web-based, built on top of JBrowse.
• Supports real time collaboration & generates analysis-ready data
USER-CREATED
ANNOTATIONS
EVIDENCE
TRACKS
ANNOTATOR
PANEL
GenomeArchitect.org
24. BECOMING ACQUAINTED WITH APOLLO
General process of curation
1. Select or find a region of interest (e.g. scaffold).
2. Select appropriate evidence tracks to review the genome
element to annotate (e.g. gene model).
3. Determine whether a feature in an existing evidence track
will provide a reasonable gene model to start working.
4. If necessary, adjust the gene model.
5. Check your edited gene model for integrity and accuracy by
comparing it with available homologs.
6. Comment and finish.
34. 34 | BECOMING ACQUAINTED WITH APOLLO
USER NAVIGATION
Annotator
panel.
• Choose appropriate evidence from list of “Tracks” on annotator panel.
• Select & drag elements from evidence track into the ‘User-created Annotations’ area.
• Hovering over annotation in progress brings up an information pop-up.
• Creating a new annotation
39. Editing functionality
Example: Adding an exon supported by experimental data
• RNAseq reads show evidence in support of a transcribed product that was not predicted.
• Add exon by dragging up one of the RNAseq reads.
46. • A confirmation box will warn you if the receiving transcript is not on the
same strand as the feature where the new exon originated.
• Check ‘Start’ and ‘Stop’ signals after each edit.
ADDING EXONS
BECOMING ACQUAINTED WITH APOLLO SIMPLE CASES
48. To modify an exon boundary and match
data in the evidence tracks: select
both the [offending] exon and the
feature with the expected boundary,
then right click on the annotation to
select ‘Set 3’ end’ or ‘Set 5’ end’ as
appropriate.
In some cases all the data may disagree with the annotation, in
other cases some data support the annotation and some of the
data support one or more alternative transcripts. Try to annotate
as many alternative transcripts as are well supported by the data.
MATCHING EXON BOUNDARY TO EVIDENCE
BECOMING ACQUAINTED WITH APOLLO SIMPLE CASES
54. Evidence may support joining two or more different gene models.
Warning: protein alignments may have incorrect splice sites and lack non-conserved regions!
1. In ‘User-created Annotations’ area shift-click to select an intron from each gene model and
right click to select the ‘Merge’ option from the menu.
2. Drag supporting evidence tracks over the candidate models to corroborate overlap, or
review edge matching and coverage across models.
3. Check the resulting translation by querying a protein database e.g. UniProt, NCBI nr. Add
comments to record that this annotation is the result of a merge.
Red lines around exons:
‘edge-matching’ allows annotators to confirm whether the
evidence is in agreement without examining each exon at the
base level.
COMPLEX CASES
merge two gene predictions on the same scaffold
BECOMING ACQUAINTED WITH APOLLO COMPLEX CASES
56. DNA Track
‘User-created Annotations’ Track
COMPLEX CASES
annotate frameshifts and correct single-base errors
Always remember: when annotating gene models using Apollo, you are looking at a ‘frozen’ version of
the genome assembly and you will not be able to modify the assembly itself.
BECOMING ACQUAINTED WITH APOLLO COMPLEX CASES
67. • Check ‘Start’ and ‘Stop’ sites.
• Check splice sites: most splice sites display
these residues …]5’-GT/AG-3’[…
• Check if you can annotate UTRs, for example
using RNA-Seq data:
– align it against relevant genes/gene family
– blastp against NCBI’s RefSeq or nr
• Check for gaps in the genome.
• Additional functionality may be necessary:
– merging 2 gene predictions - same scaffold
– ‘merging’ 2 gene predictions - different
scaffolds
– splitting a gene prediction
– annotating frameshifts
– annotating selenocysteines, correcting
single-base and other assembly errors, etc.
67 |
• Add:
– Important project information in the form of
comments
– IDs from public databases e.g. GenBank (via
DBXRef), gene symbol(s), common name(s),
synonyms, top BLAST hits, orthologs with
species names, and everything else you can
think of, because you are the expert.
– Comments about the kinds of changes you
made to the gene model of interest, if any.
– Any appropriate functional assignments, e.g.
via BLAST, RNA-Seq data, literature searches,
etc.
CHECKLIST
for accuracy and integrity
MANUAL ANNOTATION CHECKLIST
69. What’s new?...
finding inspiration in the literature.
Example 69
“Molecular analysis of bed bug populations from across the USA and Europe
found that >80% and >95% of the respective populations contained V419L
and/or L925I mutations in the voltage-gated sodium channel gene, indicating
widespread distribution of target-site-based pyrethroid resistance.”
73. PubMed Search: what’s new?
Example 73
“Ten populations differed by at least 550-fold in sensitivity to
pyrethroids.”
“Sequencing the primary pyrethroid target site, the voltage-
gated sodium channel (vgsc), shows that point mutations and
their spread in natural populations were responsible for
differences in pyrethroid sensitivity.”
“The finding that a non-target aquatic species has acquired
resistance to pesticides used only on terrestrial pests is
troubling evidence of the impact of chronic pesticide
transport from land-based applications into aquatic systems.”
74. How many sequences are there, publicly available,
for our gene of interest?
Example 74
• Para, (voltage-gated sodium channel alpha
subunit; Nasonia vitripennis).
• NaCP60E (Sodium channel protein 60 E; D.
melanogaster).
– MF: voltage-gated cation channel activity
(IDA, GO:0022843).
– BP: olfactory behavior (IMP,
GO:0042048), sodium ion
transmembrane transport
(ISS,GO:0035725).
– CC: voltage-gated sodium channel
complex (IEA, GO:0001518).
And what do we know about them?
75. Retrieving sequences for a
sequence similarity search.
Example 75
>vgsc-Segment3-DomainII
RVFKLAKSWPTLNLLISIMGKTVGALGNLTFVLCIIIFIFAVMGMQLFGKNYTEKVTKFKWSQD
GQMPRWNFVDFFHSFMIVFRVLCGEWIESMWDCMYVGDFSCVPFFLATVVIGNLVVSFMHR
77. BLAT search
results
Example 77
• High-scoring segment pairs (hsp)
are listed in tabulated format.
• Clicking on one line of results
sends you to those coordinates.
78. Creating a new gene model: drag and drop
Example 78
• Apollo automatically calculates longest ORF.
• In this case, ORF includes the high-scoring segment pairs (hsp),
marked here in blue.
• Note that gene is transcribed from reverse strand.
84. Editing: merge the three models
Example 84
Merge by dropping an
exon or gene model
onto another.
Merge by selecting
two exons (holding
down “Shift”) and
using the right click
menu.
or…
86. Editing: correct offending splice site
Example 86
Modify exon / intron
boundary:
- Drag the end of the
exon to the nearest
canonical splice site.
or
- Use right-click menu.
88. Editing: delete exon not supported by evidence
Example 88
Delete first exon from
HaztTmpM006233
89. Editing: add an exon supported by RNAseq
Example 89
• RNAseq reads show evidence in support of transcribed product, which was not predicted.
• Add exon at coordinates 97946-98012 by dragging up one of the RNAseq reads.
96. Practice using the Apis mellifera genome
GenomeArchitect.org
1. Evidence in support of protein coding gene
models.
1.1 Consensus Gene Sets:
Official Gene Set v3.2
Official Gene Set v1.0
1.2 Consensus Gene Sets comparison:
OGSv3.2 genes that merge OGSv1.0 and
RefSeq genes
OGSv3.2 genes that split OGSv1.0 and RefSeq
genes
1.3 Protein Coding Gene Predictions Supported
by Biological Evidence:
NCBI Gnomon
Fgenesh++ with RNASeq training data
Fgenesh++ without RNASeq training data
NCBI RefSeq Protein Coding Genes and Low
Quality Protein Coding Genes
1.4 Ab initio protein coding gene predictions:
Augustus Set 12, Augustus Set 9, Fgenesh,
GeneID, N-SCAN, SGP2
1.5 Transcript Sequence Alignment:
NCBI ESTs, Apis cerana RNA-Seq, Forager Bee
Brain Illumina Contigs, Nurse Bee Brain Illumina
Contigs, Forager RNA-Seq reads, Nurse RNA-Seq
reads, Abdomen 454 Contigs, Brain and Ovary
454 Contigs, Embryo 454 Contigs, Larvae 454
Contigs, Mixed Antennae 454 Contigs, Ovary 454
Contigs, Testes 454 Contigs, Forager RNA-Seq
HeatMap, Forager RNA-Seq XY Plot, Nurse RNA-
Seq HeatMap, Nurse RNA-Seq XY Plot
97. Practice using the Apis mellifera genome
GenomeArchitect.org
1. Evidence in support of protein coding gene
models (Continued).
1.6 Protein homolog alignment:
Acep_OGSv1.2
Aech_OGSv3.8
Cflo_OGSv3.3
Dmel_r5.42
Hsal_OGSv3.3
Lhum_OGSv1.2
Nvit_OGSv1.2
Nvit_OGSv2.0
Pbar_OGSv1.2
Sinv_OGSv2.2.3
Znev_OGSv2.1
Metazoa_Swissprot
2. Evidence in support of non protein coding
gene models
2.1 Non-protein coding gene predictions:
NCBI RefSeq Noncoding RNA
NCBI RefSeq miRNA
2.2 Pseudogene predictions:
NCBI RefSeq Pseudogene
98. Apollo Development
Nathan Dunn
Technical Lead Eric Yao
Christine Elsik’s Lab,
University of Missouri
Suzi Lewis
Principal Investigator
BBOP
Moni Munoz-Torres
Project Manager
Deepak Unni
JBrowse. Ian Holmes’ Lab
University of California, Berkeley
99. Thank you!
Berkeley Bioinformatics Open-Source Projects,
Environmental Genomics & Systems Biology,
Lawrence Berkeley National Laboratory
Funding
• Apollo is supported by NIH grants 5R01GM080203 from NIGMS,
and 5R01HG004483 from NHGRI.
• BBOP is also supported by the Director, Office of Science,
Office of Basic Energy Sciences, of the U.S. Department of
Energy under Contract No. DE-AC02-05CH11231
Collaborators
• Ian Holmes, Eric Yao, UC Berkeley (JBrowse)
• Chris Elsik, Deepak Unni, U of Missouri (Apollo)
• Monica Poelchau, USDA/NAL (Apollo)
• i5k Community
berkeleybop.org
UNIVERSITY OF
CALIFORNIA
Suzanna Lewis & Chris Mungall
Seth Carbon (Noctua / AmiGO)
Nathan Dunn (Apollo)
Monica Munoz-Torres (Apollo / GO)
Jeremy Nguyen Xuan (Monarch Init.)
100. PUBLIC DEMO
100 |
APOLLO ON THE WEB
instructions
Public Honey bee demo available at:
genomearchitect.org/demo/
Username:
demo@demo.com
Password:
demo