2015 illinois-talk
1. A Wager for 2016: How
Software Will Beat Hardware
in Biological Data Analysis
C. Titus Brown
Associate Professor
PHR, School of Veterinary Medicine, UC Davis
This talk on slideshare: slideshare.net/c.titus.brown/
2. This talk idea started with an argument on the
Internet.
xkcd.com/386/ - “Duty Calls”
4. The obligatory slide about abundant
sequencing data.
http://www.genome.gov/sequencingcosts/
Also see: https://biomickwatson.wordpress.com/2015/03/25/the-cost-of-sequencing-is-
still-going-down/
5. Big Sequencing Data and Biology
1) Listen to the physicists: “look, we know how
to analyze data from CERN and Sloan Digital
Sky Survey. Just do what we did.”
2) Listen to the Silicon Valley folk: “Hadoop,
and Spark, dude. Just map-reduce it.”
3) Develop custom approaches.
6. Shotgun sequencing
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of
foolishness
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness
7. Resequencing analysis
We know a reference genome (specific edition), and
want to find variants (differences - blue) in a
background of errors (red)
8. The scale of the problem (1)
Lots of data per “book”
• A human genome contains approximately 6
billion bases of DNA.
• Covering the entire genome using random
sampling requires ~150 billion bases of
sequencing
9. The scale of the problem (2)
Many “editions” in e.g. cancer
If you want to look at 1000 individual tumor
cells and build an evolutionary history of
changes, you need 150 Gbp per cell: 150Tbp.
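The scale figures above follow from simple arithmetic; a quick sanity check in Python (the ~25x sampling depth is an assumption implied by the slides' 150 Gbp per-genome figure):

```python
# Figures from the slides; ~25x depth is implied by 150 Gbp / 6 Gbp.
GENOME_BP = 6e9     # diploid human genome, ~6 billion bases
COVERAGE = 25       # assumed random-sampling depth

per_genome_bp = GENOME_BP * COVERAGE   # 150 Gbp per genome/cell
total_bp = 1000 * per_genome_bp        # 1,000 tumor cells -> 150 Tbp

print(f"per genome: {per_genome_bp / 1e9:.0f} Gbp")
print(f"1,000 tumor cells: {total_bp / 1e12:.0f} Tbp")
```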
10. The scale of the problem (3)
Many sequencers, many analyses.
• 10,000 sequencers worldwide (?)
• Worldwide sequencing capacity ??, but
~300,000 human genomes in 2014…
• Many research groups, each with own
question(s) - ~1m data sets each year?
• Cheap! ~$10-20k for a 100 Gbp data set.
11. Resequencing analysis
We know a reference genome (specific edition), and
want to find variants (differences - blue) in a
background of errors (red)
12. Mapping: locate reads in reference
(pass 1)
http://en.wikipedia.org/wiki/File:Mapping_Reads.png
13. Variant detection after mapping
(pass 2, 3, and 4)
http://www.kenkraaijeveld.nl/genomics/bioinformatics/
14. The current variant calling approach:
Map reads → Convert to binary → Sort binary format by genome pos'n → "Pile up" and call variants → Extract reads for tricky bits → Realign/assemble (optional)
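The "pile up and call variants" step can be illustrated with a toy sketch (this is not any production caller's algorithm; `pileup_call` and its thresholds are invented for this example, and indels, base qualities, and genotype likelihoods are all ignored):

```python
from collections import Counter

def pileup_call(reference, mapped_reads, min_depth=5, min_frac=0.8):
    """Toy pileup caller. mapped_reads is a list of (start, sequence)
    pairs already placed on the reference (no indels, no qualities)."""
    columns = [Counter() for _ in reference]
    for start, seq in mapped_reads:
        for offset, base in enumerate(seq):
            columns[start + offset][base] += 1

    variants = []
    for pos, counts in enumerate(columns):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too shallow: can't separate variants from errors
        base, n = counts.most_common(1)[0]
        if base != reference[pos] and n / depth >= min_frac:
            variants.append((pos, reference[pos], base))
    return variants
```

Even this toy shows why the standard pipeline needs reads sorted by genome position before piling up: each reference column must see all of the reads that overlap it.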
15. Current approach: pros and cons
Pros:
• Modular and flexible.
• Open source! Well supported! Mature!
• Some of it parallelizes easily!
Cons:
• 4+ passes across the data
• Very I/O intensive (hence unsuitable for cloud).
16. Some numbers:
• 1,000 single cells from a tumor ~ 150 Tbp of data.
• HiSeq X10 can do the sequencing in ~3 weeks.
• The variant calling requires ~2,000 CPU weeks…
• …so, given ~2,000 computers, can do this all in
one month.
…but, multiply problem by # of possible patients...
17. Big Sequencing Data and Biology
1) Listen to the physicists: “look, we know how to
analyze data from CERN and Sloan Digital Sky
Survey. Just do what we did.”
2) Listen to the Silicon Valley folk: “Hadoop, and Spark,
dude. Just map-reduce it.”
3) Develop better custom approaches, swiping ideas
from Silicon Valley and physicists as needed.
18. So, back to the Internet argument:
it ended with a bet.
In two years (Nov 2016), my 9 year old daughter
will be able to analyze a full human genome
sequence on her desktop computer.
https://twitter.com/ctitusbrown/status/535191544119451648
19. “Never compete unless you have an unfair advantage.”
1. My daughter is awesome.
2. We know how to do it
already*
(* some assembly required)
3. Heng Li just posted a
preprint yesterday!
“FermiKit”, http://arxiv.org/abs/1504.06574
20. Remainder of talk – outline.
1. “Data” vs “information”
2. Streaming approaches to lossy compression
and building compressible graphs for soil
metagenomics.
3. Sequencing errors and variants using graphs.
25. Shotgun sequencing and coverage
“Coverage” is simply the average number of reads that overlap
each true base in genome.
Here, the coverage is ~10 – just draw a line straight down from the top
through all of the reads.
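That definition is easy to make concrete (a toy sketch with invented read placements; real pipelines compute depth from alignments):

```python
def per_base_coverage(genome_len, reads):
    """reads: (start, length) placements of reads on the genome."""
    cov = [0] * genome_len
    for start, length in reads:
        for pos in range(start, min(start + length, genome_len)):
            cov[pos] += 1
    return cov

# Three 50 bp reads on a 100 bp "genome": 150 bp sequenced => 1.5x average.
reads = [(0, 50), (10, 50), (40, 50)]
cov = per_base_coverage(100, reads)
avg = sum(cov) / len(cov)
```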
26. Random sampling => deep sampling needed
Typically 10-100x needed for robust recovery (30-300 Gbp for human)
27. Actual coverage varies widely from the average.
Low coverage introduces unavoidable breaks.
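Why the breaks are unavoidable: under the idealized assumption that reads land uniformly at random (a Poisson model; real data has additional biases), a given base is never sampled with probability e^(-C), so even deep sequencing leaves gaps:

```python
import math

def expected_uncovered(genome_bp, coverage):
    """Expected number of bases sampled zero times, assuming reads
    land uniformly at random: P(uncovered) = e^(-coverage)."""
    return genome_bp * math.exp(-coverage)

for c in (5, 10, 30):
    print(f"{c:>2}x: ~{expected_uncovered(6e9, c):,.0f} uncovered bases")
```

At ~10x a human-sized genome still has hundreds of thousands of expected uncovered bases, which is one reason 10-100x depth is quoted above.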
28. But! Shotgun sequencing is very redundant!
Lots of the high coverage simply isn’t needed.
(unnecessary data)
35. Graph size now scales with information content.
Most samples can be reconstructed via de
novo assembly on commodity computers.
36. Diginorm ~ “lossy compression”
Nearly perfect from an information theoretic
perspective:
– Discards ~95% of the data for genomes.
– Loses < 0.02% of the information.
38. Streaming lossy compression:
for read in dataset:
    if estimated_coverage(read) < CUTOFF:
        yield read
This is literally a three line algorithm. Not kidding.
It took four years to figure out which three lines, though…
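Filled out a little: diginorm estimates a read's coverage as the median count of its k-mers among the reads kept so far. The sketch below uses a plain dictionary where khmer uses a constant-memory Count-Min sketch (and canonical k-mers), so it is illustrative only:

```python
from collections import defaultdict
from statistics import median

def diginorm(reads, k=20, cutoff=20):
    """Digital normalization sketch: keep a read only if its estimated
    coverage (median count of its k-mers, so far) is below the cutoff.
    A dict stands in for khmer's fixed-memory Count-Min sketch."""
    counts = defaultdict(int)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue  # read shorter than k
        if median(counts[km] for km in kmers) < cutoff:
            for km in kmers:
                counts[km] += 1
            yield read
```

Feeding the same region past the cutoff simply stops yielding reads, which is exactly the "discard redundant high coverage" behavior from the earlier slides.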
39. Diginorm can detect information
saturation in a stream.
Zhang et al., submitted.
46. Preliminary benchmarks -
• Can do variant calling on E. coli in about 5
minutes, in 40 MB of RAM, with a single
thread, with no optimization.
• Scaling to human should be readily feasible.
• …I have another 18 months before I lose the
bet.
47. My real point -
• We need well founded, and flexible, and algorithmically
efficient, and high performance components for
sequence data manipulation in biology.
• We are building these on top of a streaming and low
memory paradigm.
• We are building out a scripting library for composing
these operations.
48. Scaling compute, or algorithms?
There are some problems that require big computers &
many processors.
Genomic data analysis shouldn’t be one of them, based
on information content alone!
(This is probably good, given the scale of the need.)
Many other biological problems do require big compute,
however.
49. Reminder: the real challenge is
understanding
We have gotten distracted by shiny toys: sequencing!!
Data!!
Data is now plentiful! But:
We typically have no knowledge of what > 50% of an
e.g. environmental metagenome “means”,
functionally.
http://ivory.idyll.org/blog/2014-function-of-unknown-genes.html
50. I was going to give you my 5 year
vision…
…but I don’t have 20/20 eyesight.
Via @adrianholovaty
51. I was going to give you my 5 year
vision…
…but I don’t have 20/20 eyesight.
(20/20? 2020? 2015 + 5?)
(My wife has asked that I apologize for this
joke.)
Via @adrianholovaty
52. Data integration as a next
challenge
In 5-10 years, we will have nigh-infinite data.
(Genomic, transcriptomic, proteomic, metabolomic,
…?)
How do we explore these data sets?
Registration, cross-validation, integration with
models…
53. Carbon cycling in the ocean -
“DeepDOM” cruise, Kujawinski & Longnecker et al.
54. Integrating many different data types to
build understanding.
Figure 2. Summary of challenges associated with the data integration in the proposed project.
“DeepDOM” cruise: examination of dissolved organic matter & microbial
metabolism vs physical parameters – potential collab.
56. A few thoughts on practical next
steps.
• Enable scientists with better tools.
• Train a bioinformatics “middle class.”
• Accelerate science via the open science “network
effect”.
57. That is… what do we do now?
Once you have all this data, what do you do?
"Business as usual simply cannot work.”
- David Haussler, 2014
Looking at millions to billions of (human) genomes in
the next 5-10 years.
58. Enabling scientists with better tools -
Build robust, flexible computational
frameworks for data exploration, and make
them open and remixable.
Develop theory, algorithms, & software
together, and train people in their use.
(Stop pretending that we can develop “black
boxes” that will give you the right answer.)
59. Education and training - towards a
bioinformatics “middle class”
Biology is underprepared for data-intensive investigation.
We must teach and train the next generations.
=> Build a cohort of “data intensive biologists” who can use
data and tools as an intrinsic and unremarkable part of their
research.
~10-20 workshops / year, novice -> masterclass; open
materials.
dib-training.rtfd.org/
60. Can open science trigger a
“network effect”?
http://prasoondiwakar.com/wordpress/trivia/the-network-effect
61. So: can we drive data sharing via a decentralized
model, e.g. a distributed graph database?
Compute server
(Galaxy?
Arvados?)
Web interface + API
Data/
Info
Raw data sets
Public
servers
"Walled
garden"
server
Private
server
Graph query layer
Upload/submit
(NCBI, KBase)
Import
(MG-RAST,
SRA, EBI)
ivory.idyll.org/blog/2014-moore-ddd-award.html
62. My larger research vision:
100% buzzword compliant™
Enable and incentivize sharing by providing immediate utility;
frictionless sharing.
Permissionless innovation for e.g. new data mining
approaches.
Plan for poverty with federated infrastructure built on open &
cloud.
Solve people’s current problems, while remaining agile for
the future.
ivory.idyll.org/blog/2014-moore-ddd-award.html
A sketch showing the relationship between the number of sequence reads and the number of edges in the graph. Because the underlying genome is fixed in size, the number of edges due to the underlying genome plateaus once every part of the genome is covered. Conversely, since errors tend to be random and more or less unique, the number of spurious edges scales linearly with the number of sequence reads. Once coverage is high enough to clearly distinguish true edges (which come from the underlying genome), they will usually be outnumbered by spurious edges (which arise from errors) by a substantial factor.
High coverage is essential.
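The plateau-versus-linear behavior in that sketch can be demonstrated with a small simulation (all parameters here are invented for illustration; nodes are k-mers, standing in for graph edges):

```python
import random

random.seed(1)
GENOME_LEN, READ_LEN, ERR_RATE, K = 10_000, 100, 0.01, 21

genome = "".join(random.choice("ACGT") for _ in range(GENOME_LEN))
true_kmers = {genome[i:i + K] for i in range(GENOME_LEN - K + 1)}

def noisy_read():
    """A random read; ~1% of positions get a random (possibly same) base."""
    start = random.randrange(GENOME_LEN - READ_LEN + 1)
    return "".join(
        random.choice("ACGT") if random.random() < ERR_RATE else base
        for base in genome[start:start + READ_LEN]
    )

seen, results, n_reads = set(), [], 0
for target in (1_000, 10_000):
    while n_reads < target:
        read = noisy_read()
        seen.update(read[i:i + K] for i in range(READ_LEN - K + 1))
        n_reads += 1
    # True nodes plateau at ~genome size; spurious nodes keep growing.
    results.append((len(seen & true_kmers), len(seen - true_kmers)))
```

With 10x more reads the true-node count barely moves while the spurious count keeps climbing, matching the sketch.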
Goal is to do first stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.
Taking advantage of structure within reads.
Passionate about training; necessary for advancement of the field; also deeply self-interested, because I find out what the real problems are. (“Some people can do assembly” is not “everyone can do assembly.”)
Analyze data in cloud; import and export important; connect to other databases.
Work with other Moore DDD folk on the data mining aspect. Start with cross validation, move to more sophisticated in-server implementations.