A 12-step program for biology to survive
and thrive in the era of data-intensive
science
C. Titus Brown
Genome Center & Data Science Initiative
Mar 18, 2016
Slides are on slideshare.net/c.titus.brown/
My path:
Math undergrad
Evolutionary modeling
Developmental biology
Computer science & microbiology
Sea urchin GRNs
Chick neural crest
Veterinary medicine?
Bioinformatics algorithms
"Data-Intensive Biology"
Marek’s disease
Soil metagenomics
Ascidian GRNs
Lamprey mRNAseq
My guiding question
What is going to be happening in the next 5
years with biological data generation?
(And can I make progress on some of the
coming problems?)
DNA sequencing rates continue
to grow.
Stephens et al., 2015 - 10.1371/journal.pbio.1002195
(2015 was a good year)
Oxford Nanopore sequencing
Slide via Torsten Seeman
Nanopore technology
Slide via Torsten Seeman
Scaling up --
Slide via Torsten Seeman
http://ebola.nextflu.org/
“Fighting Ebola With a Palm-Sized DNA Sequencer”
See: http://www.theatlantic.com/science/archive/2015/09/ebola-sequencer-dna-minion/405466/
“DeepDOM” cruise: examination
of dissolved organic matter &
microbial metabolism vs physical
parameters – potential collab.
Via Elizabeth Kujawinski
Another challenge beyond volume and velocity – variety.
CRISPR
The challenge with genome editing is fast
becoming what to edit rather than how to do it.
A point for reflection…
Increasingly, the best guide to the next 10 years
of biology is science fiction ...
Digital normalization
Statement of problem:
We can’t run de novo assembly on the
transcriptome data sets we have!
Shotgun sequencing and coverage
“Coverage” is simply the average number of reads that overlap
each true base in the genome.
Here, the coverage is ~10 – just draw a line straight down from the top
through all of the reads.
Random sampling => deep sampling needed
Typically 10-100x needed for robust recovery (30-300 Gbp for human)
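The coverage arithmetic above can be sketched in a few lines. This is a minimal illustration, assuming a round human genome size of about 3 Gbp; the function names and the constant are hypothetical and not taken from the slides.

```python
def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of reads overlapping each base: C = N * L / G."""
    return num_reads * read_length / genome_size


def bases_needed(genome_size: int, target_coverage: int) -> int:
    """Total sequenced bases required to reach a target average coverage."""
    return genome_size * target_coverage


# Assumed round figure for the human genome (~3 Gbp).
HUMAN_GENOME_BP = 3_000_000_000

low = bases_needed(HUMAN_GENOME_BP, 10)
high = bases_needed(HUMAN_GENOME_BP, 100)
print(f"{low / 1e9:.0f}-{high / 1e9:.0f} Gbp")  # prints "30-300 Gbp"
```

Because reads are sampled at random, the average hides a spread: some bases fall well below the mean, which is why deep sampling is needed.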
Digital normalization
(Digital normalization is
a computational version of library
normalization)
Suppose you have a
dilution factor of A (10) to
B (1). To get 10x of B you
need to get 100x of A!
Overkill!!
This 100x will consume disk
space and, because of
errors, memory.
We can discard it for you…
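The idea of discarding redundant reads can be sketched as a single streaming pass: keep a read only if the median count of its k-mers, among reads kept so far, is still below a coverage cutoff. This is a minimal sketch, not the khmer implementation. It assumes an exact in-memory `Counter`, whereas a real tool would likely use a fixed-memory probabilistic counter, and the function names, `k`, `cutoff`, and the two toy reads are all hypothetical.

```python
from collections import Counter
from statistics import median
from typing import Iterable, Iterator


def kmers(seq: str, k: int) -> list[str]:
    """All overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize_by_median(reads: Iterable[str], k: int = 20,
                        cutoff: int = 20) -> Iterator[str]:
    """Yield reads whose estimated coverage is still below the cutoff."""
    counts: Counter = Counter()  # exact counts; an assumption for clarity
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k
        # Median k-mer count approximates the coverage seen so far.
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)
            yield read
        # Otherwise the read is redundant and is dropped.


# Toy version of the A:B = 10:1 case: 100 copies of A, 10 copies of B.
read_a = "ACGTTGCAAGGC"
read_b = "TTAGGCCATACG"
reads = [read_a] * 100 + [read_b] * 10

kept = list(normalize_by_median(reads, k=5, cutoff=10))
print(kept.count(read_a), kept.count(read_b))  # prints "10 10"
```

In this toy case the 100x of A is cut down to 10x while all 10x of B is retained, so downstream work scales with the distinct content rather than with the number of reads.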
Some key points --
• Digital normalization is streaming.
• Digital normalization is computationally efficient (lower memory than
other approaches; parallelizable/multicore; single-pass)
• Currently, primarily used for prefiltering for assembly, but relies on
underlying abstraction (De Bruijn graph) that is also used in variant
calling.
Assembly now scales with information content, not data
size.
• 10-100 fold decrease in memory
requirements
• 10-100 fold speed up in analysis
Diginorm is widely useful:
1. Assembly of the H. contortus parasitic nematode genome, a “high
polymorphism/variable coverage” problem.
(Schwarz et al., 2013; pmid 23985341)
2. Reference-free assembly of the lamprey (P. marinus) transcriptome, a “big
assembly” problem. (in prep)
3. Osedax symbiont metagenome, a “contaminated metagenome” problem
(Goffredi et al., 2013; pmid 24225886)
Anecdata: diginorm is used in
Illumina long-read sequencing (?)
Computational problems now scale with information
content rather than data set size.
Most samples can be reconstructed via de
novo assembly on commodity computers.
Applying digital normalization in
a new project –
the horse transcriptome
Tamer Mansour w/Bellone, Finno, Penedo, &
Murray labs.
Input data
Tissue      Library            Length  #samples  #frag (M)  #bp (Gb)
BrainStem   PE fr.firststrand  101     8         166.73     33.68
Cerebellum  PE fr.firststrand  100     24        411.48     82.3
Muscle      PE fr.firststrand  126     12        301.94     76.08
Retina      PE fr.unstranded   81      2         20.3       3.28
SpinalCord  PE fr.firststrand  101     16        403        81.4
Skin        PE fr.unstranded   81      2         18.54      3
            SE fr.unstranded   81      2         16.57      1.34
            SE fr.unstranded   95      3         105.51     10.02
Embryo ICM  PE fr.unstranded   100     3         126.32     25.26
            SE fr.unstranded   100     3         115.21     11.52
Embryo TE   PE fr.unstranded   100     3         129.84     25.96
            SE fr.unstranded   100     3         102.26     10.23
Total                                  81        1917.7     364.07
equCabs current status -
NCBI Annotation
Total no. of genes    25565
Protein-coding genes  19686

Feature         Acc  Annotation  GFF    RefSeq DB
Coding RNA      NM   BestRefSeq  764    1097
Coding RNA      XM   Gnomon      31578  31346
Non-coding RNA  NR   BestRefSeq  348    726
Non-coding RNA  XR   Gnomon      3311   3310
Total                            36001  36479
32342 coding transcripts encoded by 19686 genes
(average 1.6 transcripts per gene)
There are 3034 pseudogenes
(with no annotated transcripts)
Status count
reviewed 4
Validated 267
Provisional 540
Predicted 7
inferred 279
Tamer Mansour
[Workflow diagram; boxes shown: Library prep; Read trimming; Mapping to ref;
Merge rep.; Trans Ass.; Merge by Tiss.; Predict ORF; Variant Ana.; Update dbvar;
Haplotype ass.; Pool/diginorm; Predict ncRNA; Filter & Compare Ass.;
Filter knowns; Compare to public ann.; Merge All Ass.; Mapping to ref; Trans Ass.]
Tamer Mansour
Digital normalization & (e.g.)
horse transcriptome
The computational demands for Cufflinks
- Read binning (processing time)
- Construction of gene models (processing time & memory utilization), which
depends on the number of genes, the number of splice junctions, the number of
reads per locus, sequencing errors, and the complexity of the locus, such as
gene overlap and multiple isoforms
Diginorm
- Significant reduction of binning time
- Relative increase of the resources
required for gene model construction
when merging more samples and tissues
- ? false recombinant isoforms
Tamer Mansour
Effect of digital normalization
** Should be very valuable for detection of ncRNA
Tamer Mansour
The ORF problem
Hestand et al 2014: “we identified 301,829 positions with SNPs or small indels within
these transcripts relative to EquCab2. Interestingly, 780 variants extend the open reading
frame of the transcript and appear to be small errors in the equine reference genome”
Tamer Mansour
We merged the assemblies into six tissue-specific transcription
profiles for cerebellum, brainstem, spinal cord, retina, muscle and
skin. The final merger of all assemblies overlaps with 63% and 73% of
NCBI and Ensembl loci, respectively, capturing about 72% and 81%
of their coding bases. Comparing our assembly to the most recent
transcriptome annotation shows ~85% overlapping loci. In addition, at
least 40% of our annotated loci represent novel transcripts.
Tamer Mansour
Diginorm can also process data as it
comes in – streaming decision making.
What do we do when we get new
data??
• How do we efficiently process, update our existing
resources?
• How do we evaluate whether or not our prior conclusions
need to change or be updated?
– # of genes, & their annotations;
– Differential expression based on new isoforms;
• This is a problem everyone has
…and it’s not going away…
The data challenge in biology
So we can sequence everything – so what?
What does it mean?
How can we do better biology with the data?
How can we understand?
A 12-step program for biology (??)
(This was a not terribly successful
attempt to be entertaining.)
1. Think repeatability and scaling
x 100
What works for one data set,
Doesn’t work as well for three,
And doesn’t work at all for 100.
2. Think streaming / few-pass analysis
Data -> Mapping -> Sorting -> Calling -> Answer
versus
Data -> 1-pass -> Answer
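As a generic illustration of the one-pass idea (not the mapping and calling pipeline named on the slide), summary statistics can be computed while the data streams by, without storing or re-reading it. The sketch below uses Welford's online algorithm for mean and variance; the function name and the example values are hypothetical.

```python
from typing import Iterable, Tuple


def one_pass_mean_variance(values: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, sample variance) from a single pass over values."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / (n - 1) if n > 1 else 0.0
    return n, mean, variance


# Works on any iterator, e.g. read lengths produced lazily from a file.
lengths = iter([2, 4, 4, 4, 5, 5, 7, 9])
n, mean, var = one_pass_mean_variance(lengths)
print(n, mean)  # prints "8 5.0"
```

Memory use stays constant no matter how many values arrive, which is the property that lets an analysis keep up with growing data volumes.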
3. Invest in computational training
Summer NGS workshop (2010-2017)
4. Move beyond PDFs
This is
only part
of the
story!
Subramanian et al., doi: 10.1128/JVI.01163-13
5. Focus on a biological question
Generating data for the sake of having data leads you into
a data analysis maze – “I’m sure there’s something
interesting in there… somewhere.”
"...ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression
networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains.
The major distinguishing characteristic between these sets of genes is date of discovery, early
discovery being associated with greater research momentum—a genomic bandwagon effect."
Ref.: Pandey et al. (2014), PLoS One 11, e88889. Via Erich Schwarz
The problem of lopsided gene
characterization is pervasive:
e.g., the brain "ignorome"
6. Spend more effort on the unknowns!
7. Invest in data integration.
Figure 2. Summary of challenges associated with the data integration in the proposed project.
Figure via E. Kujawinski
8. Split your information into layers
Protein coding >> ncRNA >> ???
** Should be very valuable for detection of ncRNA
*** But what the heck do we do with ncRNA information?
Tamer Mansour
9. Move to an update model.
Current
information
New data!!!!
Update
results?
Yes?
?????
Candidates for additional steps…
• Invest in data sharing and better “reference”
infrastructure.
• Build better tools for computationally exploring
hypotheses.
• Invest in “unsupervised” analysis of data (machine
learning)
• Learn/apply multivariate stats.
• Invest in social media & preprints & “open”
My future plans?
• Protocols and (distributed) platform for data
discovery & sharing.
• Data analysis and integration in marine
biogeochemistry & microbial physiology
Fig. 1: The cycle from data to discovery, over models back to experiment, that generates knowledge as
the cycle is repeated. Parts of that cycle are standard in particular disciplines, but putting together the
full cycle requires transdisciplinary expertise.
Training program at UC Davis:
• Regular intensive workshops, half-day or longer.
• Aimed at research practitioners (grad students & more
senior); open to all (including outside community).
• Novice (“zero entry”) on up.
• Low cost for students.
• Leverage global training initiatives.
(Google “dib training” for details; join the announce list!)
Thanks for listening!
Please contact me at ctbrown@ucdavis.edu!

complex analysis best book for solving questions.pdf
 
Ultrastructure and functions of Chloroplast.pptx
Ultrastructure and functions of Chloroplast.pptxUltrastructure and functions of Chloroplast.pptx
Ultrastructure and functions of Chloroplast.pptx
 
EGYPTIAN IMPRINT IN SPAIN Lecture by Dr Abeer Zahana
EGYPTIAN IMPRINT IN SPAIN Lecture by Dr Abeer ZahanaEGYPTIAN IMPRINT IN SPAIN Lecture by Dr Abeer Zahana
EGYPTIAN IMPRINT IN SPAIN Lecture by Dr Abeer Zahana
 
PLASMODIUM. PPTX
PLASMODIUM. PPTXPLASMODIUM. PPTX
PLASMODIUM. PPTX
 
6.2 Pests of Sesame_Identification_Binomics_Dr.UPR
6.2 Pests of Sesame_Identification_Binomics_Dr.UPR6.2 Pests of Sesame_Identification_Binomics_Dr.UPR
6.2 Pests of Sesame_Identification_Binomics_Dr.UPR
 
Advances in AI-driven Image Recognition for Early Detection of Cancer
Advances in AI-driven Image Recognition for Early Detection of CancerAdvances in AI-driven Image Recognition for Early Detection of Cancer
Advances in AI-driven Image Recognition for Early Detection of Cancer
 
Environmental acoustics- noise criteria.pptx
Environmental acoustics- noise criteria.pptxEnvironmental acoustics- noise criteria.pptx
Environmental acoustics- noise criteria.pptx
 
DNA isolation molecular biology practical.pptx
DNA isolation molecular biology practical.pptxDNA isolation molecular biology practical.pptx
DNA isolation molecular biology practical.pptx
 
cybrids.pptx production_advanges_limitation
cybrids.pptx production_advanges_limitationcybrids.pptx production_advanges_limitation
cybrids.pptx production_advanges_limitation
 
Gas-ExchangeS-in-Plants-and-Animals.pptx
Gas-ExchangeS-in-Plants-and-Animals.pptxGas-ExchangeS-in-Plants-and-Animals.pptx
Gas-ExchangeS-in-Plants-and-Animals.pptx
 
Introduction of Human Body & Structure of cell.pptx
Introduction of Human Body & Structure of cell.pptxIntroduction of Human Body & Structure of cell.pptx
Introduction of Human Body & Structure of cell.pptx
 
DETECTION OF MUTATION BY CLB METHOD.pptx
DETECTION OF MUTATION BY CLB METHOD.pptxDETECTION OF MUTATION BY CLB METHOD.pptx
DETECTION OF MUTATION BY CLB METHOD.pptx
 
Oxo-Acids of Halogens and their Salts.pptx
Oxo-Acids of Halogens and their Salts.pptxOxo-Acids of Halogens and their Salts.pptx
Oxo-Acids of Halogens and their Salts.pptx
 
AZOTOBACTER AS BIOFERILIZER.PPTX
AZOTOBACTER AS BIOFERILIZER.PPTXAZOTOBACTER AS BIOFERILIZER.PPTX
AZOTOBACTER AS BIOFERILIZER.PPTX
 
Measures of Central Tendency.pptx for UG
Measures of Central Tendency.pptx for UGMeasures of Central Tendency.pptx for UG
Measures of Central Tendency.pptx for UG
 

A 12-Step Program for Biology to Survive and Thrive in the Era of Data-Intensive Science

  • 1. A 12-step program for biology to survive and thrive in the era of data-intensive science. C. Titus Brown, Genome Center & Data Science Initiative, Mar 18, 2016. Slides are on slideshare.net/c.titus.brown/
  • 3. My guiding question What is going to be happening in the next 5 years with biological data generation? (And can I make progress on some of the coming problems?)
  • 4. DNA sequencing rates continue to grow. Stephens et al., 2015 - 10.1371/journal.pbio.1002195
  • 5. (2015 was a good year)
  • 12. “Fighting Ebola With a Palm-Sized DNA Sequencer” See: http://www.theatlantic.com/science/archive/2015/09/ebola-sequencer-dna-minion/405466/
  • 13. “DeepDOM” cruise: examination of dissolved organic matter & microbial metabolism vs physical parameters – potential collab. Via Elizabeth Kujawinski Another challenge beyond volume and velocity – variety.
  • 14. CRISPR: The challenge with genome editing is fast becoming what to edit rather than how to do it.
  • 15. A point for reflection… Increasingly, the best guide to the next 10 years of biology is science fiction ...
  • 16. Digital normalization Statement of problem: We can’t run de novo assembly on the transcriptome data sets we have!
  • 17. Shotgun sequencing and coverage: “Coverage” is simply the average number of reads that overlap each true base in the genome. Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
  • 18. Random sampling => deep sampling needed. Typically 10-100x coverage is needed for robust recovery (30-300 Gbp for human).
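The coverage arithmetic on slides 17 and 18 can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the ~3 Gbp human genome size is an assumption implied by the slide's "30-300 Gbp" figure rather than stated on it.

```python
# Assumed approximate human genome size (implied by "30-300 Gbp" at 10-100x).
HUMAN_GENOME_BP = 3_000_000_000


def coverage(num_reads, read_length, genome_size):
    """Average number of reads overlapping each base of the genome."""
    return num_reads * read_length / genome_size


def bases_needed(target_coverage, genome_size):
    """Total sequenced bases required to reach a target average coverage."""
    return target_coverage * genome_size


if __name__ == "__main__":
    for target in (10, 100):
        gbp = bases_needed(target, HUMAN_GENOME_BP) / 1e9
        print(f"{target}x coverage of human needs about {gbp:.0f} Gbp")
```

Because reads are sampled randomly, the average hides a lot of variation between positions, which is why the slide asks for deep sampling rather than 1x.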
  • 25. (Digital normalization is a computational version of library normalization.) Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory. We can discard it for you…
  • 26. Some key points -- • Digital normalization is streaming. • Digital normalization is computationally efficient (lower memory than other approaches; parallelizable/multicore; single-pass). • Currently, it is primarily used for prefiltering for assembly, but it relies on an underlying abstraction (the De Bruijn graph) that is also used in variant calling.
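The streaming idea behind digital normalization can be sketched as follows: keep a read only if the median count of its k-mers, among the reads kept so far, is still below a cutoff. This is a toy illustration under several assumptions, not the actual diginorm implementation: it uses an exact in-memory `Counter` rather than a memory-efficient counting structure, it ignores reverse complements and sequencing errors, and the values of `k`, `cutoff`, and the two read sequences are invented for the example.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalize(reads, k=5, cutoff=10):
    """Single pass over reads; yield those whose median k-mer count is below cutoff."""
    counts = Counter()
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: nothing to count
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)  # only kept reads add to the counts
            yield read


if __name__ == "__main__":
    # Hypothetical 10:1 mixture, as in the dilution example on slide 25.
    read_a = "ACGTTGCAAG"
    read_b = "TTGGCCAATC"
    reads = [read_a] * 100 + [read_b] * 10
    kept = list(digital_normalize(reads, k=5, cutoff=10))
    print(kept.count(read_a), kept.count(read_b))
```

In this toy mixture the abundant sequence is cut from 100 copies down to the cutoff while every copy of the rare one is retained, and the decision for each read is made as it arrives, without a second pass over the data.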
  • 27. Assembly now scales with information content, not data size. • 10-100 fold decrease in memory requirements • 10-100 fold speed up in analysis
  • 28. Diginorm is widely useful: 1. Assembly of the H. contortus parasitic nematode genome, a “high polymorphism/variable coverage” problem (Schwarz et al., 2013; pmid 23985341). 2. Reference-free assembly of the lamprey (P. marinus) transcriptome, a “big assembly” problem (in prep). 3. Osedax symbiont metagenome, a “contaminated metagenome” problem (Goffredi et al., 2013; pmid 24225886).
  • 29. Anecdata: diginorm is used in Illumina long-read sequencing (?)
  • 30. Computational problems now scale with information content rather than data set size. Most samples can be reconstructed via de novo assembly on commodity computers.
  • 31. Applying digital normalization in a new project – the horse transcriptome Tamer Mansour w/Bellone, Finno, Penedo, & Murray labs.
  • 32. Input data
      Tissue      Library            Length  #samples  #frag (M)  #bp (Gb)
      BrainStem   PE fr.firststrand     101         8     166.73     33.68
      Cerebellum  PE fr.firststrand     100        24     411.48     82.3
      Muscle      PE fr.firststrand     126        12     301.94     76.08
      Retina      PE fr.unstranded       81         2      20.3       3.28
      SpinalCord  PE fr.firststrand     101        16     403        81.4
      Skin        PE fr.unstranded       81         2      18.54      3
                  SE fr.unstranded       81         2      16.57      1.34
                  SE fr.unstranded       95         3     105.51     10.02
      Embryo ICM  PE fr.unstranded      100         3     126.32     25.26
                  SE fr.unstranded      100         3     115.21     11.52
      Embryo TE   PE fr.unstranded      100         3     129.84     25.96
                  SE fr.unstranded      100         3     102.26     10.23
      Total                                        81    1917.7     364.07
  • 33. equCabs current status - NCBI Annotation
      Total no. of genes: 25565
      Protein-coding genes: 19686
      Feature         Acc  Annotation    GFF    RefSeq DB
      Coding RNA      NM   BestRefSeq     764       1097
      Coding RNA      XM   Gnomon       31578      31346
      Non-coding RNA  NR   BestRefSeq     348        726
      Non-coding RNA  XR   Gnomon        3311       3310
      Total                             36001      36479
      32342 coding transcripts encoded by 19686 genes (an average of 1.6 transcripts per gene).
      There are 3034 pseudogenes (with no annotated transcripts).
      Status       Count
      Reviewed         4
      Validated      267
      Provisional    540
      Predicted        7
      Inferred       279
      Tamer Mansour
  • 34. [Workflow diagram; step labels: Library prep; Read trimming; Mapping to ref; Merge rep.; Trans Ass.; Merge by Tiss.; Predict ORF; Variant Ana; Update dbvar; Haplotype ass; Pool/diginorm; Predict ncRNA; Filter & Compare Ass.; filter knowns; Compare to public ann.; Merge All Ass.; Mapping to ref; Trans Ass.] Tamer Mansour
  • 35. Digital normalization & (e.g.) horse transcriptome
      The computational demands for cufflinks:
      - Read binning (processing time)
      - Construction of gene models (processing time & memory utilization), which depends on the no. of genes, no. of splicing junctions, no. of reads per locus, sequencing errors, and the complexity of the locus, such as gene overlap and multiple isoforms
      Diginorm:
      - Significant reduction of binning time
      - Relative increase of the resources required for gene model construction when merging more samples and tissues
      - ? false recombinant isoforms
      Tamer Mansour
  • 36. Effect of digital normalization ** Should be very valuable for detection of ncRNA Tamer Mansour
  • 37. The ORF problem Hestand et al 2014: “we identified 301,829 positions with SNPs or small indels within these transcripts relative to EquCab2. Interestingly, 780 variants extend the open reading frame of the transcript and appear to be small errors in the equine reference genome” Tamer Mansour
  • 38. We merged the assemblies into six tissue-specific transcription profiles for cerebellum, brainstem, spinal cord, retina, muscle and skin. The final merger of all assemblies overlaps with 63% and 73% of NCBI and Ensembl loci, respectively, capturing about 72% and 81% of their coding bases. Comparing our assembly to the most recent transcriptome annotation shows ~85% overlapping loci. In addition, at least 40% of our annotated loci represent novel transcripts. Tamer Mansour
  • 39. Diginorm can also process data as it comes in – streaming decision making.
  • 40. What do we do when we get new data?? • How do we efficiently process, update our existing resources? • How do we evaluate whether or not our prior conclusions need to change or be updated? – # of genes, & their annotations; – Differential expression based on new isoforms; • This is a problem everyone has …and it’s not going away…
  • 41. The data challenge in biology So we can sequence everything – so what? What does it mean? How can we do better biology with the data? How can we understand?
  • 42. A 12-step program for biology (??) (This was a not terribly successful attempt to be entertaining.)
  • 43. 1. Think repeatability and scaling x 100. What works for one data set doesn’t work as well for three, and doesn’t work at all for 100.
  • 44. 2. Think streaming / few-pass analysis. Data → Mapping → Sorting → Calling → Answer versus 1-pass: Data → Answer
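As a generic illustration of what "1-pass" means (not an example taken from the talk), the sketch below computes the mean and variance of a stream of numbers, such as read lengths, while looking at each value exactly once and keeping only three running quantities in memory. It uses Welford's update, which is a swapped-in textbook technique; the function name and the sample values are assumptions made for this example.

```python
def streaming_mean_variance(values):
    """One pass over an iterable; returns (count, mean, population variance)."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / n if n else 0.0
    return n, mean, variance


if __name__ == "__main__":
    # Works on any iterator, so the data never has to be held in memory.
    read_lengths = iter([100, 101, 126, 81])
    print(streaming_mean_variance(read_lengths))
```

A multi-pass version would first compute the mean and then re-read the data to compute the deviations; the single-pass form gives the same answer and can be updated when more data arrives.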
  • 45. 3. Invest in computational training Summer NGS workshop (2010-2017)
  • 46. 4. Move beyond PDFs This is only part of the story! Subramanian et al., doi: 10.1128/JVI.01163-13
  • 47. 5. Focus on a biological question Generating data for the sake of having data leads you into a data analysis maze – “I’m sure there’s something interesting in there… somewhere.”
  • 48. 6. Spend more effort on the unknowns! The problem of lopsided gene characterization is pervasive: e.g., the brain "ignorome". "...ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains. The major distinguishing characteristic between these sets of genes is date of discovery, early discovery being associated with greater research momentum—a genomic bandwagon effect." Ref.: Pandey et al. (2014), PLoS One 9, e88889. Via Erich Schwarz
  • 49. 7. Invest in data integration. Figure 2. Summary of challenges associated with the data integration in the proposed project. Figure via E. Kujawinski
  • 50. 8. Split your information into layers Protein coding >> ncRNA >> ??? ** Should be very valuable for detection of ncRNA *** But what the heck do we do with ncRNA information? Tamer Mansour
  • 51. 9. Move to an update model. Current information New data!!!! Update results? Yes? ?????
  • 52. Candidates for additional steps… • Invest in data sharing and better “reference” infrastructure. • Build better tools for computationally exploring hypotheses. • Invest in “unsupervised” analysis of data (machine learning) • Learn/apply multivariate stats. • Invest in social media & preprints & “open”
  • 53. My future plans? • Protocols and (distributed) platform for data discovery & sharing. • Data analysis and integration in marine biogeochemistry & microbial physiology
  • 54. Fig. 1: The cycle from data to discovery, over models back to experiment, that generates knowledge as the cycle is repeated. Parts of that cycle are standard in particular disciplines, but putting together the full cycle requires transdisciplinary expertise.
  • 55. Training program at UC Davis: • Regular intensive workshops, half-day or longer. • Aimed at research practitioners (grad students & more senior); open to all (including outside community). • Novice (“zero entry”) on up. • Low cost for students. • Leverage global training initiatives. (Google “dib training” for details; join the announce list!)
  • 56. Thanks for listening! Please contact me at ctbrown@ucdavis.edu!