A data intensive future:
How can biology take full advantage of
the coming data deluge?
C. Titus Brown
School of Veterinary Medicine;
Genome Center & Data Science Initiative
11/13/15
Outline
0. Background
1. Research: what do we do with infinite data?
2. Development: software and infrastructure.
3. Open science & reproducibility.
4. Training
0. Background
In which I present the perspective that we face
increasingly large data sets, from diverse
samples, generated in real time, with many
different data types.
DNA sequencing rates continue
to grow.
Stephens et al., 2015 - 10.1371/journal.pbio.1002195
Oxford Nanopore sequencing
Slide via Torsten Seemann
Nanopore technology
Slide via Torsten Seemann
Scaling up --
Slide via Torsten Seemann
http://ebola.nextflu.org/
“Fighting Ebola With a Palm-Sized DNA Sequencer”
See: http://www.theatlantic.com/science/archive/2015/09/ebola-sequencer-dna-minion/405466/
“DeepDOM” cruise: examination
of dissolved organic matter &
microbial metabolism vs physical
parameters – potential collab.
Via Elizabeth Kujawinski
1. Research
In which I discuss advances made towards
analyzing infinite amounts of genomic data, and
the perspectives engendered thereby: to wit,
streaming and sketches.
Conway TC, Bromage AJ. Bioinformatics 2011;27:479-486
© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
De Bruijn graphs (sequencing graphs) scale with
data size, not information size.
Why do sequence graphs scale
badly?
Memory usage ~ “real” variation + number of errors
Number of errors ~ size of data set
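A toy illustration of why errors inflate the graph: a single substitution in a read can create up to k k-mers that do not occur in the genome, and each becomes a new graph node. This is a minimal sketch, not taken from the talk; the genome string, K, and the helper names are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def novel_kmers(true_seq, read, k):
    """K-mers present in the read but absent from the true sequence."""
    return kmers(read, k) - kmers(true_seq, k)


# Hypothetical toy genome (no repeated 4-mers) and two reads drawn from it.
GENOME = "ACGTTGCAAGCTTAGGCT"
K = 4
clean_read = GENOME[2:14]
# Same read with one substitution (A -> T) at read position 6.
bad_read = clean_read[:6] + "T" + clean_read[7:]

print(len(novel_kmers(GENOME, clean_read, K)))  # 0: an error-free read adds nothing new
print(len(novel_kmers(GENOME, bad_read, K)))    # 4: one error adds K new k-mers here
```

Since the number of errors grows with the number of reads, so does the number of spurious k-mers, even after the real variation has been fully sampled.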
Practical memory measurements
Velvet measurements (Adina Howe)
Our solution: lossy compression
Shotgun sequencing and coverage
“Coverage” is simply the average number of reads that overlap
each true base in the genome.
Here, the coverage is ~10 – just draw a line straight down from the top
through all of the reads.
Random sampling => deep sampling needed
Typically 10-100x needed for robust recovery (30-300 Gbp for human)
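The arithmetic behind those numbers can be written out as a short sketch. It assumes a human genome of roughly 3 Gbp, which is what 30-300 Gbp at 10-100x implies; the function names are illustrative.

```python
def mean_coverage(num_reads, read_length, genome_size):
    """Average number of reads overlapping each base: C = N * L / G."""
    return num_reads * read_length / genome_size


def bases_needed(target_coverage, genome_size):
    """Total sequenced bases required to reach a target mean coverage."""
    return target_coverage * genome_size


# Assumption: ~3 Gbp human genome, consistent with 30-300 Gbp at 10-100x.
HUMAN_GENOME_BP = 3_000_000_000

low = bases_needed(10, HUMAN_GENOME_BP)
high = bases_needed(100, HUMAN_GENOME_BP)
print(low / 1e9, high / 1e9)  # 30.0 300.0 (Gbp)
```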
Digital normalization
Graph size now scales with information content.
Most samples can be reconstructed via de
novo assembly on commodity computers.
Diginorm ~ “lossy compression”
Nearly perfect from an information theoretic
perspective:
– Discards 95% or more of data for genomes.
– Loses < 0.02% of information.
This changes the way analyses
scale.
Streaming lossy compression:
for read in dataset:
    if estimated_coverage(read) < CUTOFF:
        yield read
This is literally a three line algorithm. Not kidding.
It took four years to figure out which three lines, though…
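The three lines above can be fleshed out into a runnable sketch. This one assumes coverage is estimated as the median count of a read's k-mers among the reads kept so far, and it uses an exact in-memory Counter as a stand-in for a memory-efficient counting structure. K, CUTOFF, kmers and normalize are illustrative names and toy values, not the khmer API.

```python
from collections import Counter
from statistics import median

K = 4        # k-mer size; toy value (assumption, real analyses use larger k)
CUTOFF = 3   # coverage cutoff; toy value (assumption)


def kmers(seq, k=K):
    """List the k-mers of seq in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, cutoff=CUTOFF, k=K):
    """Yield only reads whose estimated coverage is still below the cutoff."""
    counts = Counter()  # exact counts; stand-in for a probabilistic counter
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: no k-mers to estimate from
        # Estimated coverage = median count of this read's k-mers so far.
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)  # only kept reads contribute counts
            yield read


# Ten copies of one read plus one unrelated read (hypothetical data).
reads = ["ACGTTGCAAG"] * 10 + ["TTAGGCTAAC"]
kept = list(normalize(reads))
print(len(kept))  # 4: three copies of the redundant read, one of the novel read
```

Redundant reads stop being emitted once their region reaches the cutoff, while a read from a not-yet-seen region still passes through, which is why the output size tracks information content rather than input size.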
Diginorm can detect information
saturation in a stream.
Zhang et al., submitted.
This generically permits semi-streaming
analytical approaches.
Zhang et al., submitted.
e.g. E. coli analysis => ~1.2 pass, sublinear
memory
Zhang et al., submitted.
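One way to read "semi-streaming" is sketched below: reads that arrive after their region has reached saturation are analyzed immediately, and only the early reads are set aside for a second, partial pass. This is an assumption-laden illustration, not the implementation from Zhang et al.; semi_stream, coverage, the analyze callback, and the toy cutoff are all hypothetical.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """List the k-mers of seq in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def coverage(read_kmers, counts):
    """Median k-mer count, or 0 for reads too short to contain a k-mer."""
    return median(counts[km] for km in read_kmers) if read_kmers else 0


def semi_stream(reads, analyze, cutoff=3, k=4):
    """Analyze saturated reads on the fly; revisit only the early ones."""
    counts = Counter()
    deferred = []   # reads seen before their region reached the cutoff
    results = []
    total = 0
    for read in reads:
        total += 1
        read_kmers = kmers(read, k)
        if coverage(read_kmers, counts) < cutoff:
            counts.update(read_kmers)
        if coverage(read_kmers, counts) >= cutoff:
            results.append(analyze(read, counts))   # first (streaming) pass
        else:
            deferred.append(read)
    for read in deferred:                           # second, partial pass
        results.append(analyze(read, counts))
    passes = (total + len(deferred)) / total if total else 0.0
    return results, passes


results, passes = semi_stream(["ACGTTGCAAG"] * 10, analyze=lambda r, c: len(r))
print(len(results), passes)  # 10 1.2
```

The 1.2 here comes from the toy input (two of ten reads are deferred); it is not a reproduction of the E. coli measurement. In general the fraction of deferred reads shrinks as coverage grows, which is what keeps the total close to a single pass.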
Another simple algorithm.
Zhang et al., submitted.
Single pass, reference free, tunable, streaming online
variant calling.
Error detection <=> variant calling
Real time / streaming data
analysis.
Raw data (real time, from sequencer?)
Error trimming
Variant calling
De novo assembly
My real point -
• We need well founded, and flexible, and algorithmically
efficient, and high performance components for
sequence data manipulation in biology.
• We are building some of these on a streaming and low
memory paradigm.
• We are building out a scripting library for composing
these operations.
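As a rough picture of what composing streaming operations could look like, here is a minimal sketch using plain Python generators. Every name in it (drop_ambiguous, min_length, compose) is hypothetical and chosen for illustration; none of it is the scripting library mentioned above.

```python
def drop_ambiguous(reads):
    """Stream stage: skip reads containing an ambiguous base (N)."""
    for read in reads:
        if "N" not in read:
            yield read


def min_length(reads, n):
    """Stream stage: skip reads shorter than n bases."""
    for read in reads:
        if len(read) >= n:
            yield read


def compose(source, *stages):
    """Chain stages lazily; each stage takes and returns an iterator."""
    stream = source
    for stage in stages:
        stream = stage(stream)
    return stream


raw = iter(["ACGT", "ACNT", "AC", "GGCTA"])
pipeline = compose(raw, drop_ambiguous, lambda s: min_length(s, 3))
print(list(pipeline))  # ['ACGT', 'GGCTA']
```

Because every stage is a generator, nothing is read until the final consumer asks for it, and no stage holds more than one read in memory at a time.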
2. Software and infrastructure
Alas, practical data analysis depends on
software and computers, which leads to
depressingly practical considerations for
gentleperson scientists.
Software
It’s all well and good to develop new data
analysis approaches, but their utility is greater
when they are implemented in usable software.
Writing, maintaining, and progressing research
software is hard.
The khmer software package
• Demo implementation of research data structures &
algorithms;
• 10.5k lines of C++ code, 13.7k lines of Python code;
• khmer v2.0 has 87% statement coverage under test;
• ~3-4 developers, 50+ contributors, ~1000s of users (?)
The khmer software package, Crusoe et al., 2015. http://f1000research.com/articles/4-900/v1
khmer is developed as a true open
source package
• github.com/dib-lab/khmer;
• BSD license;
• Code review, two-person sign off on changes;
• Continuous integration (tests are run on each
change request);
Challenges:
Research vs stability!
Stable software for users, & platform for future
research;
vs research “culture”
(funding and careers)
How is continued software dev feasible?!
Representative half-arsed lab software development
Version that worked once, for some publication.
Grad student 1 research
Grad student 2 research
Incompatible and broken code
A not-insane way to do software development
Stable version
Grad student 1 research
Grad student 2 research
Stable, tested code
Run tests (repeated at each step)
Infrastructure issues
Suppose that we have a nice ecosystem of bioinformatics &
data analysis tools.
Where and how do we run them?
Consider:
1. Biologists hate funding computational infrastructure.
2. Researchers are generally incompetent at building and
maintaining usable infrastructure.
3. Centralized infrastructure fails in the face of infinite data.
Decentralized infrastructure for
bioinformatics?
Compute server (Galaxy? Arvados?)
Web interface + API
Data/Info
Raw data sets
Public servers
"Walled garden" server
Private server
Graph query layer
Upload/submit (NCBI, KBase)
Import (MG-RAST, SRA, EBI)
ivory.idyll.org/blog/2014-moore-ddd-award.html
3. Open science and
reproducibility
In which I start from the point that most
researchers* cannot replicate their own
computational analyses, much less reproduce
those published by anyone else.
*This doesn’t apply to anyone in this
audience; you’re all outliers!
My lab & the diginorm paper.
• All our code was on github;
• Much of our data analysis was in the cloud (on
Amazon EC2);
• Our figures were made in IPython Notebook.
• Our paper was in LaTeX.
Brown et al., 2012 (arXiv)
IPython Notebook: data + code =>
IPython Notebook
To reproduce our paper:
git clone <khmer> && python setup.py install
git clone <pipeline>
cd pipeline
wget <data> && tar xzf <data>
make && cd ../notebook && make
cd ../ && make
This is standard process in the lab --
Our papers now have:
• Source hosted on github;
• Data hosted there or on AWS;
• Long running data analysis => ‘make’
• Graphing and data digestion => IPython Notebook (also in github)
Zhang et al. doi: 10.1371/journal.pone.0101271
Research process
Generate new results; encode in Makefile
Summarize in IPython Notebook
Push to github
Discuss, explore
Literate graphing & interactive
exploration
Camille Scott
Why bother??
“There is no scientific knowledge of the individual.”
(Aristotle)
More pragmatically, we are tired of struggling to
reproduce other people’s results.
And, in the end, it’s not all that much extra work.
What does this have to do with
open science?
This is a longer & larger conversation, but:
All of our processes enable easy and efficient pre-publication
sharing. Source code, analyses, preprints…
When we share early, our ideas have a significant competitive
advantage in the research marketplace of ideas.
4. Training
In which I note that methods and tools do little
without a trained hand wielding them, and a
trained eye examining the results.
Perspectives on training
• Prediction: The single biggest challenge
facing biology over the next 20 years is the
lack of data analysis training (see: NIH DIWG
report)
• Data analysis is not turning the crank; it is an
intellectual exercise on par with
experimental design or paper writing.
• Training is systematically undervalued in
academia (!?)
UC Davis and training
My goal here is to support the coalescence and
growth of a local community of practice around
“data intensive biology”.
Summer NGS workshop (2010-2017)
General parameters:
• Regular intensive workshops, half-day or longer.
• Aimed at research practitioners (grad students & more
senior); open to all (including outside community).
• Novice (“zero entry”) on up.
• Low cost for students.
• Leverage global training initiatives.
Thus far & near future
~12 workshops on bioinformatics in 2015.
Trying out soon:
• Half-day intro workshops;
• Week-long advanced workshops;
• Co-working hours.
dib-training.readthedocs.org/
The End.
• If you think 5-10 years out, we face significant practical
issues for data analysis in biology.
• We need new algorithms/data structures, AND good
implementations, AND better computational practice,
AND training.
• This can be either viewed with despair… or seen as an
opportunity to seize the competitive advantage!
(How I view it varies from day to day.)
Thanks for listening!
Please contact me at ctbrown@ucdavis.edu!
Note: I work here now!

dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...dkNET
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformationAreesha Ahmad
 

Recently uploaded (20)

Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
IDENTIFICATION OF THE LIVING- forensic medicine
IDENTIFICATION OF THE LIVING- forensic medicineIDENTIFICATION OF THE LIVING- forensic medicine
IDENTIFICATION OF THE LIVING- forensic medicine
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
Introduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptxIntroduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptx
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
 
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATIONSTS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 

2015 genome-center

  • 1. A data intensive future: How can biology take full advantage of the coming data deluge? C. Titus Brown, School of Veterinary Medicine; Genome Center & Data Science Initiative. 11/13/15
  • 2. Outline 0. Background 1. Research: what do we do with infinite data? 2. Development: software and infrastructure. 3. Open science & reproducibility. 4. Training
  • 3. 0. Background In which I present the perspective that we face increasingly large data sets, from diverse samples, generated in real time, with many different data types.
  • 4. DNA sequencing rates continue to grow. Stephens et al., 2015 - 10.1371/journal.pbio.1002195
  • 11. “Fighting Ebola With a Palm-Sized DNA Sequencer” See: http://www.theatlantic.com/science/archive/2015/09/ebola-sequencer-dna-minion/405466/
  • 12. “DeepDOM” cruise: examination of dissolved organic matter & microbial metabolism vs physical parameters – potential collab. Via Elizabeth Kujawinski
  • 13. 1. Research In which I discuss advances made towards analyzing infinite amounts of genomic data, and the perspectives engendered thereby: to wit, streaming and sketches.
  • 14. De Bruijn graphs (sequencing graphs) scale with data size, not information size. Conway TC, Bromage AJ. Bioinformatics 2011;27:479-486. © The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
  • 15. Why do sequence graphs scale badly? Memory usage ~ “real” variation + number of errors Number of errors ~ size of data set
  • 16. Practical memory measurements Velvet measurements (Adina Howe)
  • 17. Our solution: lossy compression. Conway TC, Bromage AJ. Bioinformatics 2011;27:479-486.
  • 18. Shotgun sequencing and coverage “Coverage” is simply the average number of reads that overlap each true base in the genome. Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
  • 19. Random sampling => deep sampling needed Typically 10-100x needed for robust recovery (30-300 Gbp for human)
  • 26. Graph size now scales with information content. Most samples can be reconstructed via de novo assembly on commodity computers.
  • 27. Diginorm ~ “lossy compression” Nearly perfect from an information theoretic perspective: – Discards 95% or more of the data for genomes. – Loses < 0.02% of information.
  • 28. This changes the way analyses scale. Conway TC, Bromage AJ. Bioinformatics 2011;27:479-486.
  • 29. Streaming lossy compression:
        for read in dataset:
            if estimated_coverage(read) < CUTOFF:
                yield read
    This is literally a three line algorithm. Not kidding. It took four years to figure out which three lines, though…
  • 30. Diginorm can detect information saturation in a stream. Zhang et al., submitted.
  • 31. This generically permits semi-streaming analytical approaches. Zhang et al., submitted.
  • 32. e.g. E. coli analysis => ~1.2 pass, sublinear memory Zhang et al., submitted.
  • 33. Another simple algorithm. Zhang et al., submitted.
  • 34. Single pass, reference free, tunable, streaming online variant calling. Error detection  variant calling
  • 35. Real time / streaming data analysis. Raw data (real time, from sequencer?) Error trimming Variant calling De novo assembly
  • 36. My real point - • We need well founded, and flexible, and algorithmically efficient, and high performance components for sequence data manipulation in biology. • We are building some of these on a streaming and low memory paradigm. • We are building out a scripting library for composing these operations.
  • 37. 2. Software and infrastructure Alas, practical data analysis depends on software and computers, which leads to depressingly practical considerations for gentleperson scientists.
  • 38. Software It’s all well and good to develop new data analysis approaches, but their utility is greater when they are implemented in usable software. Writing, maintaining, and progressing research software is hard.
  • 39. The khmer software package • Demo implementation of research data structures & algorithms; • 10.5k lines of C++ code, 13.7k lines of Python code; • khmer v2.0 has 87% statement coverage under test; • ~3-4 developers, 50+ contributors, ~1000s of users (?) The khmer software package, Crusoe et al., 2015. http://f1000research.com/articles/4-900/v1
  • 40. khmer is developed as a true open source package • github.com/dib-lab/khmer; • BSD license; • Code review, two-person sign off on changes; • Continuous integration (tests are run on each change request);
  • 41. Challenges: Research vs stability! Stable software for users, & platform for future research; vs research “culture” (funding and careers)
  • 42. How is continued software dev feasible?! Representative half-arsed lab software development. [Diagram labels: Version that worked once, for some publication; Grad student 1 research; Grad student 2 research; Incompatible and broken code.]
  • 43. A not-insane way to do software development. [Diagram labels: Stable version; Grad student 1 research; Grad student 2 research; Run tests (repeated at each step); Stable, tested code.]
  • 44. Infrastructure issues Suppose that we have a nice ecosystem of bioinformatics & data analysis tools. Where and how do we run them? Consider: 1. Biologists hate funding computational infrastructure. 2. Researchers are generally incompetent at building and maintaining usable infrastructure. 3. Centralized infrastructure fails in the face of infinite data.
  • 45. Decentralized infrastructure for bioinformatics? [Diagram labels: Compute server (Galaxy? Arvados?); Web interface + API; Data/Info; Raw data sets; Public servers; "Walled garden" server; Private server; Graph query layer; Upload/submit (NCBI, KBase); Import (MG-RAST, SRA, EBI).] ivory.idyll.org/blog/2014-moore-ddd-award.html
  • 46. 3. Open science and reproducibility In which I start from the point that most researchers* cannot replicate their own computational analyses, much less reproduce those published by anyone else. *This doesn’t apply to anyone in this audience; you’re all outliers!
  • 47. My lab & the diginorm paper. • All our code was on github; • Much of our data analysis was in the cloud (on Amazon EC2); • Our figures were made in IPython Notebook. • Our paper was in LaTeX. Brown et al., 2012 (arXiv)
  • 48. IPython Notebook: data + code => IPython Notebook
  • 49. To reproduce our paper:
        git clone <khmer> && python setup.py install
        git clone <pipeline>
        cd pipeline
        wget <data> && tar xzf <data>
        make && cd ../notebook && make
        cd ../ && make
  • 50. This is standard process in the lab -- Our papers now have: • Source hosted on github; • Data hosted there or on AWS; • Long running data analysis => ‘make’ • Graphing and data digestion => IPython Notebook (also in github) Zhang et al. doi: 10.1371/journal.pone.0101271
  • 51. Research process: Generate new results; encode in Makefile; Summarize in IPython Notebook; Push to github; Discuss, explore
  • 52. Literate graphing & interactive exploration Camille Scott
  • 53. Why bother?? “There is no scientific knowledge of the individual.” (Aristotle) More pragmatically, we are tired of struggling to reproduce other people’s results. And, in the end, it’s not all that much extra work.
  • 54. What does this have to do with open science? This is a longer & larger conversation, but: All of our processes enable easy and efficient pre-publication sharing. Source code, analyses, preprints… When we share early, our ideas have a significant competitive advantage in the research marketplace of ideas.
  • 55. 4. Training In which I note that methods and tools do little without a trained hand wielding them, and a trained eye examining the results.
  • 56. Perspectives on training • Prediction: The single biggest challenge facing biology over the next 20 years is the lack of data analysis training (see: NIH DIWG report) • Data analysis is not turning the crank; it is an intellectual exercise on par with experimental design or paper writing. • Training is systematically undervalued in academia (!?)
  • 57. UC Davis and training My goal here is to support the coalescence and growth of a local community of practice around “data intensive biology”.
  • 58. Summer NGS workshop (2010-2017)
  • 59. General parameters: • Regular intensive workshops, half-day or longer. • Aimed at research practitioners (grad students & more senior); open to all (including outside community). • Novice (“zero entry”) on up. • Low cost for students. • Leverage global training initiatives.
  • 60. Thus far & near future ~12 workshops on bioinformatics in 2015. Trying out soon: • Half-day intro workshops; • Week-long advanced workshops; • Co-working hours. dib-training.readthedocs.org/
  • 61. The End. • If you think 5-10 years out, we face significant practical issues for data analysis in biology. • We need new algorithms/data structures, AND good implementations, AND better computational practice, AND training. • This can be either viewed with despair… or seen as an opportunity to seize the competitive advantage! (How I view it varies from day to day.)
  • 62. Thanks for listening! Please contact me at ctbrown@ucdavis.edu! Note: I work here now!
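
The three-line loop on slide 29 can be fleshed out into something runnable. What follows is a minimal sketch under stated assumptions, not the khmer implementation: it estimates a read's coverage as the median count of its k-mers among the reads kept so far, and it uses an exact in-memory `Counter` where a real tool would likely use a fixed-memory probabilistic counter. The names `kmers` and `normalize`, and the values of `K` and `CUTOFF`, are hypothetical and chosen only for illustration.

```python
from collections import Counter
from statistics import median

K = 4        # hypothetical k-mer size, kept tiny for illustration
CUTOFF = 3   # hypothetical coverage cutoff


def kmers(read, k=K):
    """Return every overlapping substring of length k in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def normalize(reads, k=K, cutoff=CUTOFF):
    """Yield only reads whose estimated coverage is still below the cutoff.

    Assumption: coverage is estimated as the median count of the read's
    k-mers among the reads kept so far. Counts are exact (a Counter), so
    memory grows with the number of distinct k-mers retained.
    """
    counts = Counter()
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k: nothing to estimate from
        if median(counts[w] for w in words) < cutoff:
            counts.update(words)  # only kept reads add to the counts
            yield read


if __name__ == "__main__":
    reads = ["ACGTTGCAAG"] * 10 + ["TTTTGGGGCC"]
    for kept in normalize(reads):
        print(kept)
```

With these toy inputs, the ten identical reads are cut down to three (the cutoff), while the single novel read passes straight through: redundant data is discarded and new information is kept, in one pass.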

Editor's Notes

  1. A sketch showing the relationship between the number of sequence reads and the number of edges in the graph. Because the underlying genome is fixed in size, as the number of sequence reads increases, the number of edges in the graph due to the underlying genome will plateau once every part of the genome is covered. Conversely, since errors tend to be random and more or less unique, their number scales linearly with the number of sequence reads. Once enough sequence reads are present to have enough coverage to clearly distinguish true edges (which come from the underlying genome), they will usually be outnumbered by spurious edges (which arise from errors) by a substantial factor.
  2. Goal is to do first stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.
  3. Taking advantage of structure within reads.
  4. Analyze data in cloud; import and export important; connect to other databases.
  5. Lure them in with bioinformatics and then show them that Michigan, in the summertime, is quite nice!
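
The scaling argument in editor's note 1 can be put into numbers with a rough back-of-the-envelope model. This is a sketch under assumptions that are not in the slides: true edges follow a Lander-Waterman style saturation curve, G(1 - e^(-c)), where c is coverage, and each sequencing error contributes roughly k novel k-mers. The function names and the example parameters (a 5 Mbp genome, 100 bp reads, a 1% error rate, k = 20) are hypothetical.

```python
import math


def true_edges(n_reads, genome_size, read_len):
    """Approximate edges from the real genome: saturates at genome_size.

    Assumes reads land uniformly at random, so the fraction of the genome
    covered at coverage c is about 1 - exp(-c).
    """
    coverage = n_reads * read_len / genome_size
    return genome_size * (1.0 - math.exp(-coverage))


def error_edges(n_reads, read_len, error_rate, k):
    """Approximate spurious edges: grows linearly with the number of reads.

    Assumes errors are random and unique, each creating about k new k-mers.
    """
    return n_reads * read_len * error_rate * k


if __name__ == "__main__":
    genome_size, read_len, error_rate, k = 5_000_000, 100, 0.01, 20
    for n_reads in (10_000, 100_000, 1_000_000, 10_000_000):
        real = true_edges(n_reads, genome_size, read_len)
        spurious = error_edges(n_reads, read_len, error_rate, k)
        print(f"{n_reads:>10} reads: true ~{real:.3g}, spurious ~{spurious:.3g}")
```

Under these assumed parameters the true-edge count flattens out near the genome size while the spurious count keeps climbing, which is the plateau-versus-linear picture the note describes.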