A 12-Step Program for Biology to Survive and Thrive in the Era of Data-Intensive Science
1. A 12-step program for biology to survive and thrive in the era of data-intensive science
C. Titus Brown
Genome Center & Data Science Initiative
Mar 18, 2016
Slides are on slideshare.net/c.titus.brown/
3. My guiding question
What is going to be happening in the next 5 years with biological data generation?
(And can I make progress on some of the coming problems?)
4. DNA sequencing rates continue to grow.
Stephens et al., 2015 - 10.1371/journal.pbio.1002195
“Fighting Ebola With a Palm-Sized DNA Sequencer”
See: http://www.theatlantic.com/science/archive/2015/09/ebola-sequencer-dna-minion/405466/
13. “DeepDOM” cruise: examination of dissolved organic matter & microbial metabolism vs. physical parameters – potential collab.
Via Elizabeth Kujawinski
Another challenge beyond volume and velocity – variety.
17. Shotgun sequencing and coverage
“Coverage” is simply the average number of reads that overlap each true base in the genome.
Here the coverage is ~10: draw a vertical line straight down through the stack of reads and count how many it crosses.
18. Random sampling => deep sampling needed
Typically 10-100x coverage is needed for robust recovery (30-300 Gbp for a human genome).
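The arithmetic behind those numbers is worth making explicit. A minimal sketch (the 3.2 Gbp genome size and 150 bp read length are assumed, illustrative values, not from the slides):

```python
# Back-of-the-envelope sequencing-depth arithmetic.
# Assumed values: ~3.2 Gbp human genome, 150 bp reads.
GENOME_SIZE = 3.2e9   # bases (approximate human genome)
READ_LENGTH = 150     # bases per read (assumed Illumina-style)

def bases_needed(coverage, genome_size=GENOME_SIZE):
    """Total sequenced bases needed for a target average coverage."""
    return coverage * genome_size

def reads_needed(coverage, genome_size=GENOME_SIZE, read_length=READ_LENGTH):
    """Number of reads needed for a target average coverage."""
    return int(bases_needed(coverage, genome_size) / read_length)

for c in (10, 100):
    print(f"{c}x coverage: {bases_needed(c) / 1e9:.0f} Gbp, "
          f"~{reads_needed(c) / 1e6:.0f} M reads")
```

At 10x this comes out to roughly 32 Gbp, matching the low end of the 30-300 Gbp range quoted above.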
25. (Digital normalization is a computational version of library normalization.)
Suppose you have a dilution factor of A (10) to B (1). To get 10x coverage of B, you need 100x of A. Overkill!
That extra 100x of A consumes disk space and, because of sequencing errors, memory.
We can discard it for you…
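The dilution arithmetic is trivial but worth spelling out (a sketch; the 10:1 ratio is the slide's example):

```python
# If species A is `dilution_factor` times more abundant than species B,
# reaching a target coverage on B forces proportionally deeper coverage on A.
def coverage_of_abundant(target_coverage_rare, dilution_factor):
    """Coverage accumulated on the abundant species A by the time the
    rare species B reaches the target coverage."""
    return target_coverage_rare * dilution_factor

print(coverage_of_abundant(10, 10))  # 10x of B forces 100x of A
```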
26. Some key points --
• Digital normalization is streaming.
• Digital normalization is computationally efficient: lower memory than other approaches, parallelizable/multicore, and single-pass.
• Currently it is used primarily as a prefilter for assembly, but it relies on an underlying abstraction (the De Bruijn graph) that is also used in variant calling.
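The core loop can be sketched in a few lines. This is a toy version (exact dict-based k-mer counts, tiny k and cutoff); the real khmer implementation uses a probabilistic CountMin-style table to keep memory bounded:

```python
from collections import defaultdict
from statistics import median

def diginorm(reads, k=4, cutoff=2):
    """Single-pass digital-normalization sketch: keep a read only if the
    median abundance of its k-mers seen so far is below the cutoff."""
    counts = defaultdict(int)  # exact counts here; khmer uses a sketch
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue
        if median(counts[km] for km in kmers) < cutoff:
            kept.append(read)
            for km in kmers:       # only kept reads update the table
                counts[km] += 1
    return kept

# High-coverage duplicates of one sequence collapse down, while a
# novel read still gets through.
reads = ["ACGTACGTAC"] * 10 + ["TTTTGGGGCC"]
print(len(diginorm(reads)))
```

Because discarded reads never update the count table, the pass is truly streaming: each read is examined once and a keep/discard decision is made immediately.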
27. Assembly now scales with information content, not data size.
• 10-100 fold decrease in memory requirements
• 10-100 fold speedup in analysis
28. Diginorm is widely useful:
1. Assembly of the H. contortus parasitic nematode genome, a “high
polymorphism/variable coverage” problem.
(Schwarz et al., 2013; pmid 23985341)
2. Reference-free assembly of the lamprey (P. marinus) transcriptome, a “big
assembly” problem. (in prep)
3. Osedax symbiont metagenome, a “contaminated metagenome” problem.
(Goffredi et al., 2013; pmid 24225886)
30. Computational problems now scale with information content rather than data set size.
Most samples can be reconstructed via de novo assembly on commodity computers.
31. Applying digital normalization in a new project: the horse transcriptome
Tamer Mansour, w/ the Bellone, Finno, Penedo, & Murray labs.
32. Input data

Tissue       Library             length  #samples  #frag (M)  #bp (Gb)
BrainStem    PE fr.firststrand   101     8         166.73     33.68
Cerebellum   PE fr.firststrand   100     24        411.48     82.3
Muscle       PE fr.firststrand   126     12        301.94     76.08
Retina       PE fr.unstranded    81      2         20.3       3.28
SpinalCord   PE fr.firststrand   101     16        403        81.4
Skin         PE fr.unstranded    81      2         18.54      3
             SE fr.unstranded    81      2         16.57      1.34
             SE fr.unstranded    95      3         105.51     10.02
Embryo ICM   PE fr.unstranded    100     3         126.32     25.26
             SE fr.unstranded    100     3         115.21     11.52
Embryo TE    PE fr.unstranded    100     3         129.84     25.96
             SE fr.unstranded    100     3         102.26     10.23
Total                                    81        1917.7     364.07
33. equCab current status – NCBI annotation
Total no. of genes: 25565 (protein-coding genes: 19686)

Feature          Acc  Annotation  GFF    RefSeq DB
Coding RNA       NM   BestRefSeq  764    1097
Coding RNA       XM   Gnomon      31578  31346
Non-coding RNA   NR   BestRefSeq  348    726
Non-coding RNA   XR   Gnomon      3311   3310
Total                             36001  36479

32342 coding transcripts encoded by 19686 genes (average ~1.6 transcripts per gene).
There are 3034 pseudogenes (with no annotated transcripts).

Status       count
Reviewed     4
Validated    267
Provisional  540
Predicted    7
Inferred     279
Tamer Mansour
34. Pipeline (flowchart): library prep; read trimming; mapping to ref; merge replicates; transcriptome assembly; merge by tissue; pool/diginorm; merge all assemblies; filter & compare assemblies; filter knowns; compare to public annotations; predict ORFs; predict ncRNA; variant analysis; update dbVar; haplotype assembly.
Tamer Mansour
35. Digital normalization & (e.g.) the horse transcriptome
The computational demands of Cufflinks:
- Read binning (processing time)
- Construction of gene models: no. of genes, no. of splice junctions, no. of reads per locus, sequencing errors, and locus complexity such as gene overlap and multiple isoforms (processing time & memory utilization)
With diginorm:
- Significant reduction of binning time
- Relative increase in the resources required for gene-model construction as more samples and tissues are merged
- Possible false recombinant isoforms?
Tamer Mansour
36. Effect of digital normalization
** Should be very valuable for detection of ncRNA
Tamer Mansour
37. The ORF problem
Hestand et al., 2014: “we identified 301,829 positions with SNPs or small indels within
these transcripts relative to EquCab2. Interestingly, 780 variants extend the open reading
frame of the transcript and appear to be small errors in the equine reference genome”
Tamer Mansour
38. We merged the assemblies into six tissue-specific transcription
profiles for cerebellum, brainstem, spinal cord, retina, muscle and
skin. The final merger of all assemblies overlaps with 63% and 73% of
NCBI and Ensembl loci, respectively, capturing about 72% and 81%
of their coding bases. Comparing our assembly to the most recent
transcriptome annotation shows ~85% overlapping loci. In addition, at
least 40% of our annotated loci represent novel transcripts.
Tamer Mansour
39. Diginorm can also process data as it
comes in – streaming decision making.
40. What do we do when we get new
data??
• How do we efficiently process, update our existing
resources?
• How do we evaluate whether or not our prior conclusions
need to change or be updated?
– # of genes, & their annotations;
– Differential expression based on new isoforms;
• This is a problem everyone has
…and it’s not going away…
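One way to frame the update question computationally (a sketch, not the pipeline from the talk): keep the abundance table from the first pass, stream each new batch of reads against it, and only trigger re-analysis when the batch actually contains novel information. The class and parameters below are hypothetical:

```python
from collections import defaultdict
from statistics import median

class UpdatableCounts:
    """Incrementally maintained k-mer table: new batches stream against
    the existing table instead of forcing a from-scratch recomputation."""

    def __init__(self, k=4, cutoff=2):
        self.k, self.cutoff = k, cutoff
        self.counts = defaultdict(int)

    def _kmers(self, read):
        return [read[i:i + self.k] for i in range(len(read) - self.k + 1)]

    def add_batch(self, reads):
        """Fold a batch of new reads into the table. Returns True if the
        batch held novel (low-abundance) reads -- i.e., prior conclusions
        may need to be revisited."""
        novel = 0
        for read in reads:
            kmers = self._kmers(read)
            if kmers and median(self.counts[km] for km in kmers) < self.cutoff:
                novel += 1
                for km in kmers:
                    self.counts[km] += 1
        return novel > 0

db = UpdatableCounts()
db.add_batch(["ACGTACGTAC"] * 10)        # initial data
print(db.add_batch(["ACGTACGTAC"] * 5))  # nothing new
print(db.add_batch(["TTTTGGGGCC"]))      # novel sequence
```

The point is the decision signal: most new batches of an already-saturated sample return False, so downstream re-annotation and differential-expression reruns can be skipped.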
41. The data challenge in biology
So we can sequence everything – so what?
What does it mean?
How can we do better biology with the data?
How can we understand?
42. A 12-step program for biology (??)
(This was a not terribly successful
attempt to be entertaining.)
43. 1. Think repeatability and scaling
What works for one data set
doesn't work as well for three,
and doesn't work at all for 100.
44. 2. Think streaming / few-pass analysis
Multi-pass: data → mapping → sorting → calling → answer
versus
1-pass: data → answer
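The same contrast shows up in something as small as a variance calculation; this illustrative analogy (not from the talk) contrasts a store-everything, two-pass route with Welford's one-pass, constant-memory method:

```python
def two_pass_variance(data):
    """Naive route: hold all data, one pass for the mean, one for variance."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / n

def one_pass_variance(stream):
    """Welford's online algorithm: single pass, O(1) memory, works on a
    stream that can never be rewound -- the same property diginorm needs."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / n

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(two_pass_variance(data), one_pass_variance(iter(data)))
```

Both routes give the same answer, but only the one-pass version can be applied to data as it arrives off the sequencer.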
45. 3. Invest in computational training
Summer NGS workshop (2010-2017)
46. 4. Move beyond PDFs
This is only part of the story!
Subramanian et al., doi: 10.1128/JVI.01163-13
47. 5. Focus on a biological question
Generating data for the sake of having data leads you into
a data analysis maze – “I’m sure there’s something
interesting in there… somewhere.”
48. 6. Spend more effort on the unknowns!
The problem of lopsided gene characterization is pervasive: e.g., the brain "ignorome".
"...ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains. The major distinguishing characteristic between these sets of genes is date of discovery, early discovery being associated with greater research momentum—a genomic bandwagon effect."
Ref.: Pandey et al. (2014), PLoS One 11, e88889. Via Erich Schwarz
49. 7. Invest in data integration.
Figure 2. Summary of challenges associated with the data integration in the proposed project.
Figure via E. Kujawinski
50. 8. Split your information into layers
Protein coding >> ncRNA >> ???
** Should be very valuable for detection of ncRNA
*** But what the heck do we do with ncRNA information?
Tamer Mansour
51. 9. Move to an update model.
(Diagram: current information + new data!!!! → update results? Yes? → ?????)
52. Candidates for additional steps…
• Invest in data sharing and better “reference” infrastructure.
• Build better tools for computationally exploring hypotheses.
• Invest in “unsupervised” analysis of data (machine learning).
• Learn/apply multivariate stats.
• Invest in social media & preprints & “open”
53. My future plans?
• Protocols and (distributed) platform for data discovery & sharing.
• Data analysis and integration in marine biogeochemistry & microbial physiology.
54. Fig. 1: The cycle from data to discovery, through models, and back to experiment, generating knowledge as the cycle is repeated. Parts of that cycle are standard in particular disciplines, but putting together the full cycle requires transdisciplinary expertise.
55. Training program at UC Davis:
• Regular intensive workshops, half-day or longer.
• Aimed at research practitioners (grad students & more
senior); open to all (including outside community).
• Novice (“zero entry”) on up.
• Low cost for students.
• Leverage global training initiatives.
(Google “dib training” for details; join the announce list!)