C. Titus Brown
ctbrown@ucdavis.edu
Associate Professor
Population Health and Reproduction
School of Veterinary Medicine
University of California, Davis
Concepts and tools for exploring
very large sequencing data sets.
Some background & motivation:
 We primarily build tools to look at large sequencing
data sets.
 Our interest is in enabling scientists to move quickly
to hypotheses from data.
My goals
 Enable hypothesis-driven biology through better
hypothesis generation & refinement.
 Devalue “interest level” of sequence analysis and put
myself out of a job.
 Be a good mutualist!
Narrative arc
1. Shotgun metagenomics: can we reconstruct
community genomes?
2. Underlying technology-enabled approach – tools
and platforms are good.
3. My larger plan for world domination through
technology and training – a kinder, gentler world
(?).
Shotgun metagenomics
 Collect samples;
 Extract DNA;
 Feed into sequencer;
 Computationally analyze.
Wikipedia: Environmental shotgun
sequencing.png
Shotgun sequencing & assembly
http://eofdreams.com/library.html;
http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/;
http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/
To assemble, or not to assemble?
Goals: reconstruct phylogenetic content and predict
functional potential of ensemble.
 Should we analyze short reads directly?
OR
 Do we assemble short reads into longer contigs first,
and then analyze the contigs?
Howe et al., 2014
Assemblies yield much
more significant
similarity matches.
Assembly: good for annotation!
But! Isn’t assembly problematic?
 Chimeric misassemblies?
 Uneven coverage?
 Strain variation?
 Computationally challenging?
I. Benchmarking metagenome assembly
 Most assembly papers analyze novel data sets and
then have to argue that their result is ok (guilty!)
 Very few assembly benchmarks have been done.
 Even fewer (trustworthy) computational
time/memory comparisons have been done.
 And even fewer “assembly recipes” have been
written down clearly.
Shakya et al., 2013; pmid 23387867
A mock community!
 ~60 genomes, all sequenced;
 Lab mixed with 10:1 ratio of most abundant to least
abundant;
 2x101 bp reads, 107 million reads total (Illumina);
 10.5 Gbp of sequence in toto.
 The paper also compared 16s primer sets & 454
shotgun metagenome data => reconstruction.
Shakya et al., 2013; pmid 23387867
Paper conclusions
 “Metagenomic sequencing outperformed most SSU
rRNA gene primer sets used in this study.”
 “The Illumina short reads provided very good estimates
of taxonomic distribution above the species level, with
only a two- to threefold overestimation of the actual
number of genera and orders.”
 “For the 454 data … the use of the default parameters
severely overestimated higher level diversity (~20-fold
for bacterial genera and identified > 100 spurious
eukaryotes).”
Shakya et al., 2013; pmid 23387867
How about assembly??
 Shakya et al. did not do assembly: there was no standard
for analysis at the time, and they were not assembly experts.
 But we work on assembly!
 And we’ve been working on a tutorial/process for
doing it!
1. Adapter trim & quality filter
2. Diginorm to C=10
3. Trim high-coverage reads at low-abundance k-mers
4. Diginorm to C=5
5. Too big to assemble? Partition graph, split into "groups", reinflate groups (optional).
   Small enough to assemble? Go straight to assembly.
6. Assemble!!!
7. Map reads to assembly
8. Annotate contigs with abundances
9. MG-RAST, etc.
The Kalamazoo Metagenomics Protocol
Derived from approach used in Howe et al., 2014
Computational protocol for assembly
Kalamazoo Metagenomics Protocol => benchmarking!
Assemble with Velvet, IDBA, SPAdes
Benchmarking process
 Apply various filtering treatments to the data (x3)
 Basic quality trimming and filtering
 + digital normalization
 + partitioning
 Apply different assemblers to the data for each
treatment (x3)
 IDBA
 SPAdes
 Velvet
 Measure compute time/memory req’d.
 Compare assembly results to “known” answer with
Quast.
Recovery, by assembler
                            Velvet      IDBA        SPAdes
Treatment                   Quality     Quality     Quality
Total length (>= 0 bp)      1.6E+08     2.0E+08     2.0E+08
Total length (>= 1000 bp)   1.6E+08     1.9E+08     1.9E+08
Largest contig              561,449     979,948     1,387,918
# misassembled contigs      631         1032        752
Genome fraction (%)         72.949      90.969      90.424
Duplication ratio           1.004       1.007       1.004
Conclusion: SPAdes and IDBA achieve similar results.
Dr. Sherine Awad
Treatments do not alter results very much.
IDBA
                            Default       Diginorm      Partition
Total length (>= 0 bp)      2.0E+08       2.0E+08       2.0E+08
Total length (>= 1000 bp)   1.9E+08       2.0E+08       1.9E+08
Largest contig              979,948       1,469,321     551,171
# misassembled contigs      1032          916           828
Unaligned length            10,709,716    10,637,811    10,644,357
Genome fraction (%)         90.969        91.003        90.082
Duplication ratio           1.007         1.008         1.007
Dr. Sherine Awad
Treatments do save compute time.
            Velvet                   IDBA                     SPAdes
            Time (h:m:s)  RAM (GB)   Time (h:m:s)  RAM (GB)   Time (h:m:s)  RAM (GB)
Quality     60:42:52      1,594      33:53:46      129        67:02:16      400
Diginorm    6:48:46       827        6:34:24       104        15:53:10      127
Partition   4:30:36       1,156      8:30:29       93         7:54:26       129
(Run on Michigan State HPC)
Dr. Sherine Awad
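To make the savings concrete, here is a small Python sketch that turns the timing table into speedup factors relative to the quality-filtered baseline. The numbers are copied from the table; the names `RUNS`, `to_hours`, `speedup` and `ram_ratio` are made up for this illustration.

```python
# Wall-clock time (h:m:s) and RAM (GB) per assembler and treatment,
# copied from the timing table on this slide.
RUNS = {
    "Velvet": {"Quality": ("60:42:52", 1594),
               "Diginorm": ("6:48:46", 827),
               "Partition": ("4:30:36", 1156)},
    "IDBA": {"Quality": ("33:53:46", 129),
             "Diginorm": ("6:34:24", 104),
             "Partition": ("8:30:29", 93)},
    "SPAdes": {"Quality": ("67:02:16", 400),
               "Diginorm": ("15:53:10", 127),
               "Partition": ("7:54:26", 129)},
}


def to_hours(hms):
    """Convert an 'h:m:s' string to fractional hours."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours + minutes / 60 + seconds / 3600


def speedup(assembler, treatment):
    """How many times faster a treatment runs than the Quality baseline."""
    baseline = to_hours(RUNS[assembler]["Quality"][0])
    return baseline / to_hours(RUNS[assembler][treatment][0])


def ram_ratio(assembler, treatment):
    """RAM used by a treatment as a fraction of the Quality baseline."""
    return RUNS[assembler][treatment][1] / RUNS[assembler]["Quality"][1]


if __name__ == "__main__":
    for name in RUNS:
        for treatment in ("Diginorm", "Partition"):
            print(f"{name:7s} {treatment:9s} "
                  f"speedup {speedup(name, treatment):5.1f}x, "
                  f"RAM {ram_ratio(name, treatment):.0%} of baseline")
```

On these numbers, digital normalization cuts run time by roughly 9x for Velvet, 5x for IDBA and 4x for SPAdes.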
Need to understand:
 What is not being assembled and why?
 Low coverage?
 Strain variation?
 Something else?
 Effects of strain variation: no assembly.
 Additional contigs being assembled –
contamination? Spurious assembly?
Assembly conclusions
 90% recovery is not bad; relatively few
misassemblies, too.
 This was not a highly polymorphic community BUT
it did have several closely related strains; more
generally, we see that strains do generate chimeras,
but not between different species.
 …challenging to execute even with a
tutorial/protocol.
We need much deeper sampling!
Sharon et al., 2015 (Genome Res)
Overlap between synthetic long reads and short reads.
Benchmarking & protocols
 Our work is completely reproducible and open.
 You can re-run our benchmarks yourself if you want!
 We will be adding new assemblers in as time
permits.
 Protocol is open, versioned, citable… but also still a
work in progress :)
II: Shotgun sequencing and coverage
“Coverage” is simply the average number of reads that overlap
each true base in the genome.
Here, the coverage is ~10 – just draw a line straight down from the
top through all of the reads.
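A minimal Python sketch of that definition, under the simplifying assumption that each read is described by a start position and a length on a known genome. The function names are illustrative only.

```python
def expected_coverage(n_reads, read_length, genome_size):
    """Average coverage: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def per_base_depth(genome_size, alignments):
    """Count how many reads overlap each position.

    alignments is a list of (start, length) pairs, 0-based; this is the
    'draw a line straight down through the reads' picture, done for
    every base.
    """
    depth = [0] * genome_size
    for start, length in alignments:
        for pos in range(start, min(start + length, genome_size)):
            depth[pos] += 1
    return depth


if __name__ == "__main__":
    # 100 reads of 100 bp over a 1,000 bp genome give ~10x coverage.
    print(expected_coverage(100, 100, 1000))
    print(per_base_depth(10, [(0, 5), (3, 5)]))
```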
Assembly depends on high coverage
HMP mock community
Main questions --
I. How do we know if we’ve sequenced enough?
II. Can we predict how much more we need to
sequence to see <insert some feature here>?
Note: necessary sequencing depth cannot
accurately be predicted solely from
SSU/amplicon data
Method 1: looking for WGS saturation
We can track how many sequences we
keep of the sequences we’ve seen, to
detect saturation.
Data from Shakya et al., 2013 (pmid: 23387867)
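The idea can be sketched in a few lines of Python. This is an illustration, not the khmer implementation: it uses an exact `Counter` where a real tool would use a low-memory counting structure, it ignores reverse complements, and the name `saturation_curve` and its parameters are hypothetical. A read is "kept" when the median count of its k-mers is still below the cutoff C, as in digital normalization; once most new reads are discarded, the curve flattens and the data set is saturated.

```python
from collections import Counter
from statistics import median


def saturation_curve(reads, k=20, cutoff=10, report_every=1000):
    """Return (reads_seen, reads_kept) checkpoints for a stream of reads."""
    counts = Counter()  # exact k-mer counts; stand-in for a sketch structure
    kept = 0
    curve = []
    for seen, read in enumerate(reads, start=1):
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        # Keep the read only if it still looks low-coverage.
        if kmers and median(counts[km] for km in kmers) < cutoff:
            counts.update(kmers)
            kept += 1
        if seen % report_every == 0:
            curve.append((seen, kept))
    return curve


if __name__ == "__main__":
    # Toy input: the same read over and over saturates almost immediately.
    toy_reads = ["ACGTTGCA"] * 20
    print(saturation_curve(toy_reads, k=4, cutoff=5, report_every=5))
```

Plotting reads kept against reads seen gives the saturation curve; a flat tail means additional sequencing is adding little that is new at that coverage cutoff.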
We can detect saturation of
shotgun sequencing
C=10, for assembly
Estimating metagenome nt richness:
# bp at saturation / coverage
 MM5 deep carbon: 60 Mbp
 Iowa prairie soil: 12 Gbp
 Amazon Rain Forest Microbial Observatory soil: 26
Gbp
Assumes: few entirely erroneous reads (upper bound); at
saturation (lower bound).
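As a worked version of that formula, here is a tiny Python helper. The function name is hypothetical and the input is a toy number, not one of the data sets listed above.

```python
def estimate_richness(bp_at_saturation, coverage):
    """Estimated nucleotide richness: bp retained at saturation / coverage."""
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    return bp_at_saturation / coverage


if __name__ == "__main__":
    # Toy example: 5,000 bp retained after normalizing to C=10
    # suggests about 500 bp of distinct sequence.
    print(estimate_richness(5000, 10))
```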
WGS saturation approach:
 Tells us when we have enough sequence.
 Can’t be predictive… if you haven’t sampled
something, you can’t say anything about it.
Can we correlate deep amplicon sequencing with
shallower WGS?
Correlating 16s and shotgun WGS
How much of 16s do you see… with how much shotgun sequencing?
Data from Shakya et al., 2013 (pmid: 23387867)
WGS saturation ~matches 16s saturation
< rRNA copy number >
Method is robust to organisms unsampled by amplicon sequencing.
Insensitive to amplicon primer bias.
Robust to genome size differences, eukaryotes, phage.
Data from Shakya et al., 2013 (pmid: 23387867)
Can examine specific OTUs
Data from Shakya et al., 2013 (pmid: 23387867)
OTU abundance is ~correct.
Data from Shakya et al., 2013 (pmid: 23387867)
Running on real communities --
Concluding thoughts on metagenomes -
 The main obstacle to recovering genomic details of
communities is shallow sampling.
 Considerably deeper sampling is needed – 1000x
(petabasepair sampling)
 This will inevitably happen!
 …I would like to make sure the compute technology
is there, when it does.
More general: computation needs to scale!
Navin et al., 2011
Cancer investigation ~ metagenome investigation
Some basic math:
 1000 single cells from a tumor…
 …sequenced to 40x haploid coverage with Illumina…
 …yields 120 Gbp each cell…
 …or 120 Tbp of data.
 HiSeq X10 can do the sequencing in ~3 weeks.
 The variant calling will require 2,000 CPU weeks…
 …so, given ~2,000 computers, can do this all in one month.
 …but this will soon be done ~100s-1000s of times a month.
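The arithmetic in the bullets above can be checked with a few lines of Python. The ~3 Gbp haploid human genome size is an assumption added here to make the 120 Gbp figure fall out; adding the sequencing and compute times together is one reading of how the "one month" total is reached.

```python
HAPLOID_GENOME_BP = 3_000_000_000  # assumption: ~3 Gbp haploid human genome

cells = 1000
coverage = 40

bp_per_cell = coverage * HAPLOID_GENOME_BP  # 120 Gbp per cell
total_bp = cells * bp_per_cell              # 120 Tbp for the whole tumor

sequencing_weeks = 3                        # HiSeq X10 estimate from the slide
cpu_weeks = 2000                            # variant-calling cost from the slide
computers = 2000
compute_weeks = cpu_weeks / computers       # wall-clock weeks if perfectly parallel
total_weeks = sequencing_weeks + compute_weeks

if __name__ == "__main__":
    print(f"{bp_per_cell / 1e9:.0f} Gbp per cell")
    print(f"{total_bp / 1e12:.0f} Tbp in total")
    print(f"about {total_weeks:.0f} weeks end to end")
```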
Similar math applies:
 Pathogen detection in blood;
 Environmental sequencing;
 Sequencing rare DNA from circulating blood.
 Two issues:
 Volume of data & compute infrastructure;
 Latency in turnaround.
Streaming algorithms are good for biggish data…
[Diagram: Data → 1-pass → Answer]
[Diagram: Raw data (~10-100 GB) → Compression (~2 GB) → Analysis → "Information" (~1 GB) → Database & integration]
Lossy compression can substantially
reduce data size while retaining
information needed for later (re)analysis.
…as is lossy compression.
Moving all sequence analysis generically to
semi-streaming:
~1.2 pass, sublinear memory
Paper at: https://github.com/ged-lab/2014-streaming
Moving some sequence analysis to streaming.
(a) two-pass; reduced memory
First pass: digital normalization - reduced set of k-mers.
Second pass: spectral analysis of data with reduced k-mer set.
(b) few-pass; reduced memory
First pass: collection of low-abundance reads + analysis of saturated reads.
Second pass: analysis of collected low-abundance reads.
(c) online; streaming
First pass: collection of low-abundance reads + analysis of saturated reads.
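A rough Python sketch of the few-pass variant, to show why it costs only a little more than one pass. This simplifies the approach described in the paper and is not the khmer code: `analyze` is a hypothetical per-read callback, an exact `Counter` stands in for a low-memory counting structure, and reverse complements are ignored.

```python
from collections import Counter
from statistics import median


def semi_streaming(reads, analyze, k=20, cutoff=10):
    """Analyze saturated reads as they stream by; defer the rest.

    Returns (results, number_of_deferred_reads).
    """
    counts = Counter()
    deferred = []
    results = []

    # First pass: every read is seen once.
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue  # read shorter than k
        if median(counts[km] for km in kmers) < cutoff:
            # Low abundance so far: count it and set it aside for later.
            counts.update(kmers)
            deferred.append(read)
        else:
            # Saturated region: enough information to analyze right away.
            results.append(analyze(read, counts))

    # Second, partial pass: only the deferred reads are revisited.
    for read in deferred:
        results.append(analyze(read, counts))

    return results, len(deferred)


if __name__ == "__main__":
    toy_reads = ["ACGTTGCA"] * 8
    results, n_deferred = semi_streaming(
        toy_reads, lambda read, counts: len(read), k=4, cutoff=3)
    print(len(results), "reads analyzed;", n_deferred, "needed a second look")
```

When most reads come from well-covered regions, the deferred set is a small fraction of the input, which is where a figure like "~1.2 pass" comes from.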
Five super-awesome technologies…
1. Low-memory k-mer counting
(Zhang et al., PLoS One, 2014)
2. Compressible assembly graphs
(Pell et al., PNAS, 2012)
3. Streaming lossy compression of sequence data
(Brown et al., arXiv, 2012)
4. A semi-streaming framework for sequence analysis
5. Graph-alignment approaches for fun and profit.
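To give a flavor of item 1, here is a generic Count-Min sketch in Python. It illustrates the general technique of approximate counting in fixed memory and is not khmer's actual data structure; the class name, the table sizes and the use of SHA-256 as the hash function are all choices made for this example.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: fixed memory, may overcount, never undercounts."""

    def __init__(self, width=1000, depth=4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        """Yield one (row, column) slot per table for this item."""
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for row, col in self._slots(item):
            self.tables[row][col] += 1

    def count(self, item):
        # Collisions can only inflate a slot, so the minimum is the best guess.
        return min(self.tables[row][col] for row, col in self._slots(item))


if __name__ == "__main__":
    sketch = CountMinSketch()
    for kmer in ["ACGT", "ACGT", "ACGT", "TTTT"]:
        sketch.add(kmer)
    print(sketch.count("ACGT"), sketch.count("TTTT"), sketch.count("GGGG"))
```

Memory use is set by `width * depth` regardless of how many distinct k-mers are seen; the price is that counts can be too high when k-mers collide.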
…implemented in one super-awesome software
package…
github.com/ged-lab/khmer/
BSD licensed
Openly developed using good practice.
> 30 external contributors.
Thousands of downloads/month.
100+ citations in 4 years.
We think > 5000 people are using it; have heard
from 100s. Bundled with software that ~100k people
are using.
What’s next?
In transition! MSU to UC Davis.
 So, uh, I joined a Vet Med school -
“Companion animals have genomes too!”
 Expanding my work more to genomic…
 Coincident with moving to Davis, I also became a
Moore Foundation Data Driven Discovery
Investigator.
Tackling data availability…
In 5-10 years, we will have nigh-infinite data.
(Genomic, transcriptomic, proteomic, metabolomic,
…?)
We currently have no good way of querying,
exploring, investigating, or mining these data sets,
especially across multiple locations.
Moreover, most data is unavailable until after
publication…
…which, in practice, means it will be lost.
…and data integration.
Once you have all the data, what do you do?
"Business as usual simply cannot work."
Looking at millions to billions of genomes.
(David Haussler, 2014)
Funded: distributed graph database server
[Diagram components: Compute server (Galaxy? Arvados?); Web interface + API; Data/Info; Raw data sets; Public servers; "Walled garden" server; Private server; Graph query layer; Upload/submit (NCBI, KBase); Import (MG-RAST, SRA, EBI)]
ivory.idyll.org/blog/2014-moore-ddd-award.html
The larger research vision:
100% buzzword compliant™
Enable and incentivize sharing by providing
immediate utility; frictionless sharing.
Permissionless innovation for e.g. new data
mining approaches.
Plan for poverty with federated infrastructure
built on open & cloud.
Solve people’s current problems, while
remaining agile for the future.
ivory.idyll.org/blog/2014-moore-ddd-award.html
Education and training
Biology is underprepared for data-intensive
investigation.
We must teach and train the next generations.
~10-20 workshops / year, novice -> masterclass; open
materials.
Deeply self-interested:
What problems does everyone have, now?
(Assembly)
What problems do leading-edge researchers have?
(Data integration)
dib-training.rtfd.org/
Thanks!
Please contact me at ctbrown@ucdavis.edu!

More Related Content

What's hot

2012 hpcuserforum talk
2012 hpcuserforum talk2012 hpcuserforum talk
2012 hpcuserforum talkc.titus.brown
 
Data analysis & integration challenges in genomics
Data analysis & integration challenges in genomicsData analysis & integration challenges in genomics
Data analysis & integration challenges in genomicsmikaelhuss
 
2014 anu-canberra-streaming
2014 anu-canberra-streaming2014 anu-canberra-streaming
2014 anu-canberra-streamingc.titus.brown
 
2014 khmer protocols
2014 khmer protocols2014 khmer protocols
2014 khmer protocolsc.titus.brown
 
Jan2016 dnanexus giab uses andrew carroll
Jan2016 dnanexus giab uses andrew carrollJan2016 dnanexus giab uses andrew carroll
Jan2016 dnanexus giab uses andrew carrollGenomeInABottle
 
Aug2015 analysis team spiral genetics
Aug2015 analysis team spiral geneticsAug2015 analysis team spiral genetics
Aug2015 analysis team spiral geneticsGenomeInABottle
 
Ngs de novo assembly progresses and challenges
Ngs de novo assembly progresses and challengesNgs de novo assembly progresses and challenges
Ngs de novo assembly progresses and challengesScott Edmunds
 
Jan2016 bio nano han cao
Jan2016 bio nano han caoJan2016 bio nano han cao
Jan2016 bio nano han caoGenomeInABottle
 
Cross-Kingdom Standards in Genomics, Epigenomics and Metagenomics
Cross-Kingdom Standards in Genomics, Epigenomics and MetagenomicsCross-Kingdom Standards in Genomics, Epigenomics and Metagenomics
Cross-Kingdom Standards in Genomics, Epigenomics and Metagenomics Christopher Mason
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorialc.titus.brown
 
HPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLER
HPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLERHPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLER
HPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLERcscpconf
 
Intro to metagenomic binning
Intro to metagenomic binningIntro to metagenomic binning
Intro to metagenomic binningA. Murat Eren
 
Hail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton Seed
Hail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton SeedHail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton Seed
Hail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton SeedSpark Summit
 
Genomic Big Data Management, Integration and Mining - Emanuel Weitschek
Genomic Big Data Management, Integration and Mining - Emanuel WeitschekGenomic Big Data Management, Integration and Mining - Emanuel Weitschek
Genomic Big Data Management, Integration and Mining - Emanuel WeitschekData Driven Innovation
 

What's hot (20)

2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
 
2012 hpcuserforum talk
2012 hpcuserforum talk2012 hpcuserforum talk
2012 hpcuserforum talk
 
Data analysis & integration challenges in genomics
Data analysis & integration challenges in genomicsData analysis & integration challenges in genomics
Data analysis & integration challenges in genomics
 
2014 anu-canberra-streaming
2014 anu-canberra-streaming2014 anu-canberra-streaming
2014 anu-canberra-streaming
 
2014 khmer protocols
2014 khmer protocols2014 khmer protocols
2014 khmer protocols
 
Facilitating Scientific Discovery through Crowdsourcing and Distributed Parti...
Facilitating Scientific Discovery through Crowdsourcing and Distributed Parti...Facilitating Scientific Discovery through Crowdsourcing and Distributed Parti...
Facilitating Scientific Discovery through Crowdsourcing and Distributed Parti...
 
Jan2016 dnanexus giab uses andrew carroll
Jan2016 dnanexus giab uses andrew carrollJan2016 dnanexus giab uses andrew carroll
Jan2016 dnanexus giab uses andrew carroll
 
Jan2016 pac bio giab
Jan2016 pac bio giabJan2016 pac bio giab
Jan2016 pac bio giab
 
2015 pag-metagenome
2015 pag-metagenome2015 pag-metagenome
2015 pag-metagenome
 
Aug2015 analysis team spiral genetics
Aug2015 analysis team spiral geneticsAug2015 analysis team spiral genetics
Aug2015 analysis team spiral genetics
 
Ngs de novo assembly progresses and challenges
Ngs de novo assembly progresses and challengesNgs de novo assembly progresses and challenges
Ngs de novo assembly progresses and challenges
 
Jan2016 bio nano han cao
Jan2016 bio nano han caoJan2016 bio nano han cao
Jan2016 bio nano han cao
 
Big data nebraska
Big data nebraskaBig data nebraska
Big data nebraska
 
Cross-Kingdom Standards in Genomics, Epigenomics and Metagenomics
Cross-Kingdom Standards in Genomics, Epigenomics and MetagenomicsCross-Kingdom Standards in Genomics, Epigenomics and Metagenomics
Cross-Kingdom Standards in Genomics, Epigenomics and Metagenomics
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial
 
Genome Assembly 2018
Genome Assembly 2018Genome Assembly 2018
Genome Assembly 2018
 
HPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLER
HPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLERHPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLER
HPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLER
 
Intro to metagenomic binning
Intro to metagenomic binningIntro to metagenomic binning
Intro to metagenomic binning
 
Hail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton Seed
Hail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton SeedHail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton Seed
Hail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton Seed
 
Genomic Big Data Management, Integration and Mining - Emanuel Weitschek
Genomic Big Data Management, Integration and Mining - Emanuel WeitschekGenomic Big Data Management, Integration and Mining - Emanuel Weitschek
Genomic Big Data Management, Integration and Mining - Emanuel Weitschek
 

Viewers also liked

Where to focus event innovation? - An audience led approach
Where to focus event innovation? - An audience led approachWhere to focus event innovation? - An audience led approach
Where to focus event innovation? - An audience led approachLive Union
 
Hohmann Learning spaces Warwick english
Hohmann Learning spaces Warwick englishHohmann Learning spaces Warwick english
Hohmann Learning spaces Warwick englishTina Hohmann
 
Celebrating 30 years
Celebrating 30 yearsCelebrating 30 years
Celebrating 30 yearskfitzsy
 
R E V I V A L C O L L E G E S A Presentation
R E V I V A L   C O L L E G E  S A PresentationR E V I V A L   C O L L E G E  S A Presentation
R E V I V A L C O L L E G E S A PresentationIvin
 
Growing Through China: A Comprehensive Look at Market Opportunities
Growing Through China: A Comprehensive Look at Market Opportunities Growing Through China: A Comprehensive Look at Market Opportunities
Growing Through China: A Comprehensive Look at Market Opportunities Kegler Brown Hill + Ritter
 
SNS Strategy Guide
SNS Strategy GuideSNS Strategy Guide
SNS Strategy Guidezerofe
 
Eyeblaster Analytics Bulleting Online Video
Eyeblaster  Analytics  Bulleting  Online VideoEyeblaster  Analytics  Bulleting  Online Video
Eyeblaster Analytics Bulleting Online VideoEyeblaster Spain
 
Kakapo Keynote
Kakapo KeynoteKakapo Keynote
Kakapo KeynoteTakahe One
 
2014 marine-microbes-grc
2014 marine-microbes-grc2014 marine-microbes-grc
2014 marine-microbes-grcc.titus.brown
 
Come misurare i risultati sui social media
Come misurare i risultati sui social mediaCome misurare i risultati sui social media
Come misurare i risultati sui social mediaLoris Castagnini
 
2013 stamps-intro-assembly
2013 stamps-intro-assembly2013 stamps-intro-assembly
2013 stamps-intro-assemblyc.titus.brown
 
Persdev asr
Persdev asrPersdev asr
Persdev asrchander3
 
Tendencias En Comunicacion Digital Eyeblaster Oded Lida Ded09
Tendencias En Comunicacion Digital  Eyeblaster Oded Lida Ded09Tendencias En Comunicacion Digital  Eyeblaster Oded Lida Ded09
Tendencias En Comunicacion Digital Eyeblaster Oded Lida Ded09Eyeblaster Spain
 
The Loop Limketkai_rooms
The Loop Limketkai_roomsThe Loop Limketkai_rooms
The Loop Limketkai_roomsjessecadelina
 

Viewers also liked (20)

Where to focus event innovation? - An audience led approach
Where to focus event innovation? - An audience led approachWhere to focus event innovation? - An audience led approach
Where to focus event innovation? - An audience led approach
 
Alcohol # 1 concern march 16 2016
Alcohol # 1 concern march 16 2016Alcohol # 1 concern march 16 2016
Alcohol # 1 concern march 16 2016
 
Hohmann Learning spaces Warwick english
Hohmann Learning spaces Warwick englishHohmann Learning spaces Warwick english
Hohmann Learning spaces Warwick english
 
Celebrating 30 years
Celebrating 30 yearsCelebrating 30 years
Celebrating 30 years
 
R E V I V A L C O L L E G E S A Presentation
R E V I V A L   C O L L E G E  S A PresentationR E V I V A L   C O L L E G E  S A Presentation
R E V I V A L C O L L E G E S A Presentation
 
test_flash
test_flashtest_flash
test_flash
 
Growing Through China: A Comprehensive Look at Market Opportunities
Growing Through China: A Comprehensive Look at Market Opportunities Growing Through China: A Comprehensive Look at Market Opportunities
Growing Through China: A Comprehensive Look at Market Opportunities
 
Canada
CanadaCanada
Canada
 
2014 Workers' Compensation Seminar
2014 Workers' Compensation Seminar2014 Workers' Compensation Seminar
2014 Workers' Compensation Seminar
 
SNS Strategy Guide
SNS Strategy GuideSNS Strategy Guide
SNS Strategy Guide
 
Eyeblaster Analytics Bulleting Online Video
Eyeblaster  Analytics  Bulleting  Online VideoEyeblaster  Analytics  Bulleting  Online Video
Eyeblaster Analytics Bulleting Online Video
 
Kakapo Keynote
Kakapo KeynoteKakapo Keynote
Kakapo Keynote
 
2014 marine-microbes-grc
2014 marine-microbes-grc2014 marine-microbes-grc
2014 marine-microbes-grc
 
Come misurare i risultati sui social media
Come misurare i risultati sui social mediaCome misurare i risultati sui social media
Come misurare i risultati sui social media
 
2013 stamps-intro-assembly
2013 stamps-intro-assembly2013 stamps-intro-assembly
2013 stamps-intro-assembly
 
Persdev asr
Persdev asrPersdev asr
Persdev asr
 
Tendencias En Comunicacion Digital Eyeblaster Oded Lida Ded09
Tendencias En Comunicacion Digital  Eyeblaster Oded Lida Ded09Tendencias En Comunicacion Digital  Eyeblaster Oded Lida Ded09
Tendencias En Comunicacion Digital Eyeblaster Oded Lida Ded09
 
Peuples inconnus
Peuples inconnusPeuples inconnus
Peuples inconnus
 
2015 pag-chicken
2015 pag-chicken2015 pag-chicken
2015 pag-chicken
 
The Loop Limketkai_rooms
The Loop Limketkai_roomsThe Loop Limketkai_rooms
The Loop Limketkai_rooms
 

Similar to 2015 ohsu-metagenome

2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible researchYannick Wurm
 
Building bioinformatics resources for the global community
Building bioinformatics resources for the global communityBuilding bioinformatics resources for the global community
Building bioinformatics resources for the global communityExternalEvents
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibilityc.titus.brown
 
Use of spark for proteomic scoring seattle presentation
Use of spark for  proteomic scoring   seattle presentationUse of spark for  proteomic scoring   seattle presentation
Use of spark for proteomic scoring seattle presentationlordjoe
 
Long vs short read sequencing. Long read sequencing technology is po.pdf
Long vs short read sequencing. Long read sequencing technology is po.pdfLong vs short read sequencing. Long read sequencing technology is po.pdf
Long vs short read sequencing. Long read sequencing technology is po.pdfbalrajashok
 
Introduction to 16S rRNA gene multivariate analysis
Introduction to 16S rRNA gene multivariate analysisIntroduction to 16S rRNA gene multivariate analysis
Introduction to 16S rRNA gene multivariate analysisJosh Neufeld
 
Web Apollo Workshop University of Exeter
Web Apollo Workshop University of ExeterWeb Apollo Workshop University of Exeter
Web Apollo Workshop University of ExeterMonica Munoz-Torres
 
Discover How Scientific Data is Used for the Public Good with Natural Languag...
Discover How Scientific Data is Used for the Public Good with Natural Languag...Discover How Scientific Data is Used for the Public Good with Natural Languag...
Discover How Scientific Data is Used for the Public Good with Natural Languag...BaoTramDuong2
 
Interactive Analysis of Large-Scale Sequencing Genomics Data Sets using a Rea...
Interactive Analysis of Large-Scale Sequencing Genomics Data Sets using a Rea...Interactive Analysis of Large-Scale Sequencing Genomics Data Sets using a Rea...
Interactive Analysis of Large-Scale Sequencing Genomics Data Sets using a Rea...Dominic Suciu
 
Drosophila Three-Point Test Cross Lab Write-Up Instructions.docx
Drosophila Three-Point Test Cross Lab Write-Up Instructions.docxDrosophila Three-Point Test Cross Lab Write-Up Instructions.docx
Drosophila Three-Point Test Cross Lab Write-Up Instructions.docxharold7fisher61282
 
20100516 bioinformatics kapushesky_lecture08
20100516 bioinformatics kapushesky_lecture0820100516 bioinformatics kapushesky_lecture08
20100516 bioinformatics kapushesky_lecture08Computer Science Club
 
Lecture on the annotation of transposable elements
Lecture on the annotation of transposable elementsLecture on the annotation of transposable elements
Lecture on the annotation of transposable elementsfmaumus
 
RNASeq Experiment Design
RNASeq Experiment DesignRNASeq Experiment Design
RNASeq Experiment DesignYaoyu Wang
 

Similar to 2015 ohsu-metagenome (20)

2015 osu-metagenome
2015 osu-metagenome2015 osu-metagenome
2015 osu-metagenome
 
2016 davis-plantbio
2016 davis-plantbio2016 davis-plantbio
2016 davis-plantbio
 
2016 bergen-sars
2016 bergen-sars2016 bergen-sars
2016 bergen-sars
 
2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research
 
Building bioinformatics resources for the global community
Building bioinformatics resources for the global communityBuilding bioinformatics resources for the global community
Building bioinformatics resources for the global community
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibility
 
Use of spark for proteomic scoring seattle presentation
Use of spark for  proteomic scoring   seattle presentationUse of spark for  proteomic scoring   seattle presentation
Use of spark for proteomic scoring seattle presentation
 
2014 ucl
2014 ucl2014 ucl
2014 ucl
 
Long vs short read sequencing. Long read sequencing technology is po.pdf
Long vs short read sequencing. Long read sequencing technology is po.pdfLong vs short read sequencing. Long read sequencing technology is po.pdf
Long vs short read sequencing. Long read sequencing technology is po.pdf
 
Introduction to 16S rRNA gene multivariate analysis
Introduction to 16S rRNA gene multivariate analysisIntroduction to 16S rRNA gene multivariate analysis
Introduction to 16S rRNA gene multivariate analysis
 
Web Apollo Workshop University of Exeter
Web Apollo Workshop University of ExeterWeb Apollo Workshop University of Exeter
Web Apollo Workshop University of Exeter
 
Discover How Scientific Data is Used for the Public Good with Natural Languag...
Discover How Scientific Data is Used for the Public Good with Natural Languag...Discover How Scientific Data is Used for the Public Good with Natural Languag...
Discover How Scientific Data is Used for the Public Good with Natural Languag...
 
Interactive Analysis of Large-Scale Sequencing Genomics Data Sets using a Rea...
Interactive Analysis of Large-Scale Sequencing Genomics Data Sets using a Rea...Interactive Analysis of Large-Scale Sequencing Genomics Data Sets using a Rea...
Interactive Analysis of Large-Scale Sequencing Genomics Data Sets using a Rea...
 
Drosophila Three-Point Test Cross Lab Write-Up Instructions.docx
Drosophila Three-Point Test Cross Lab Write-Up Instructions.docxDrosophila Three-Point Test Cross Lab Write-Up Instructions.docx
Drosophila Three-Point Test Cross Lab Write-Up Instructions.docx
 
Apollo Workshop at KSU 2015
Apollo Workshop at KSU 2015Apollo Workshop at KSU 2015
Apollo Workshop at KSU 2015
 
20100516 bioinformatics kapushesky_lecture08
20100516 bioinformatics kapushesky_lecture0820100516 bioinformatics kapushesky_lecture08
20100516 bioinformatics kapushesky_lecture08
 
Lecture on the annotation of transposable elements
Lecture on the annotation of transposable elementsLecture on the annotation of transposable elements
Lecture on the annotation of transposable elements
 
RNASeq Experiment Design
RNASeq Experiment DesignRNASeq Experiment Design
RNASeq Experiment Design
 
Iplant pag
Iplant pagIplant pag
Iplant pag
 
Cloud bioinformatics 2
Cloud bioinformatics 2Cloud bioinformatics 2
Cloud bioinformatics 2
 

More from c.titus.brown

More from c.titus.brown (16)

2016 davis-biotech
2016 davis-biotech2016 davis-biotech
2016 davis-biotech
 
2015 msu-code-review
2015 msu-code-review2015 msu-code-review
2015 msu-code-review
 
2015 mcgill-talk
2015 mcgill-talk2015 mcgill-talk
2015 mcgill-talk
 
2015 pycon-talk
2015 pycon-talk2015 pycon-talk
2015 pycon-talk
 
2015 opencon-webcast
2015 opencon-webcast2015 opencon-webcast
2015 opencon-webcast
 
2015 vancouver-vanbug
2015 vancouver-vanbug2015 vancouver-vanbug
2015 vancouver-vanbug
 
2015 balti-and-bioinformatics
2015 balti-and-bioinformatics2015 balti-and-bioinformatics
2015 balti-and-bioinformatics
 
2014 nyu-bio-talk
2014 nyu-bio-talk2014 nyu-bio-talk
2014 nyu-bio-talk
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
 
2014 abic-talk
2014 abic-talk2014 abic-talk
2014 abic-talk
 
2014 mmg-talk
2014 mmg-talk2014 mmg-talk
2014 mmg-talk
 
2014 nci-edrn
2014 nci-edrn2014 nci-edrn
2014 nci-edrn
 
2014 wcgalp
2014 wcgalp2014 wcgalp
2014 wcgalp
 
2014 moore-ddd
2014 moore-ddd2014 moore-ddd
2014 moore-ddd
 
2014 ismb-extra-slides
2014 ismb-extra-slides2014 ismb-extra-slides
2014 ismb-extra-slides
 
2014 bosc-keynote
2014 bosc-keynote2014 bosc-keynote
2014 bosc-keynote
 

2015 ohsu-metagenome

  • 1. C. Titus Brown (ctbrown@ucdavis.edu), Associate Professor, Population Health and Reproduction, School of Veterinary Medicine, University of California, Davis. Concepts and tools for exploring very large sequencing data sets.
  • 2. Some background & motivation:
      - We primarily build tools to look at large sequencing data sets.
      - Our interest is in enabling scientists to move quickly to hypotheses from data.
  • 3. My goals
      - Enable hypothesis-driven biology through better hypothesis generation & refinement.
      - Devalue “interest level” of sequence analysis and put myself out of a job.
      - Be a good mutualist!
  • 4. Narrative arc
      1. Shotgun metagenomics: can we reconstruct community genomes?
      2. Underlying technology-enabled approach – tools and platforms are good.
      3. My larger plan for world domination through technology and training – a kinder, gentler world (?).
  • 5. Shotgun metagenomics
      - Collect samples;
      - Extract DNA;
      - Feed into sequencer;
      - Computationally analyze.
      (Image: Wikipedia, Environmental shotgun sequencing.png)
  • 6. Shotgun sequencing & assembly http://eofdreams.com/library.html; http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/; http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/
  • 7. To assemble, or not to assemble? Goals: reconstruct phylogenetic content and predict functional potential of ensemble.
      - Should we analyze short reads directly?
      OR
      - Do we assemble short reads into longer contigs first, and then analyze the contigs?
  • 8. Assembly: good for annotation! Assemblies yield much more significant similarity matches (Howe et al., 2014).
  • 9. But! Isn’t assembly problematic?
      - Chimeric misassemblies?
      - Uneven coverage?
      - Strain variation?
      - Computationally challenging?
  • 10. I. Benchmarking metagenome assembly
      - Most assembly papers analyze novel data sets and then have to argue that their result is ok (guilty!)
      - Very few assembly benchmarks have been done.
      - Even fewer (trustworthy) computational time/memory comparisons have been done.
      - And even fewer “assembly recipes” have been written down clearly.
  • 11. Shakya et al., 2013; pmid 23387867
  • 12. A mock community!
      - ~60 genomes, all sequenced;
      - Lab mixed with 10:1 ratio of most abundant to least abundant;
      - 2x101 reads, 107 million reads total (Illumina);
      - 10.5 Gbp of sequence in toto.
      - The paper also compared 16s primer sets & 454 shotgun metagenome data => reconstruction.
      Shakya et al., 2013; pmid 23387867
  • 13. Paper conclusions
      - “Metagenomic sequencing outperformed most SSU rRNA gene primer sets used in this study.”
      - “The Illumina short reads provided a very good estimates of taxonomic distribution above the species level, with only a two- to threefold overestimation of the actual number of genera and orders.”
      - “For the 454 data … the use of the default parameters severely overestimated higher level diversity (~ 20-fold for bacterial genera and identified > 100 spurious eukaryotes).”
      Shakya et al., 2013; pmid 23387867
  • 14. How about assembly??
      - Shakya et al. did not do assembly; no standard for analysis at the time, not experts.
      - But we work on assembly!
      - And we’ve been working on a tutorial/process for doing it!
  • 15. The Kalamazoo Metagenomics Protocol (derived from approach used in Howe et al., 2014)
      1. Adapter trim & quality filter
      2. Diginorm to C=10
      3. Trim high-coverage reads at low-abundance k-mers
      4. Diginorm to C=5
      5. Too big to assemble? Partition graph, split into "groups", reinflate groups (optional)
      6. Small enough to assemble? Assemble!!!
      7. Map reads to assembly
      8. Annotate contigs with abundances
      9. MG-RAST, etc.
  • 17. Kalamazoo Metagenomics Protocol => benchmarking!
      1. Adapter trim & quality filter
      2. Diginorm to C=10
      3. Trim high-coverage reads at low-abundance k-mers
      4. Diginorm to C=5
      5. Too big to assemble? Partition graph, split into "groups", reinflate groups (optional)
      6. Small enough to assemble? Assemble!!! (Assemble with Velvet, IDBA, SPAdes)
      7. Map reads to assembly
      8. Annotate contigs with abundances
      9. MG-RAST, etc.
  • 18. Benchmarking process
      - Apply various filtering treatments to the data (x3)
          - Basic quality trimming and filtering
          - + digital normalization
          - + partitioning
      - Apply different assemblers to the data for each treatment (x3)
          - IDBA
          - SPAdes
          - Velvet
      - Measure compute time/memory req’d.
      - Compare assembly results to “known” answer with Quast.
  • 19. Recovery, by assembler (Quality treatment for all three)
                                   Velvet      IDBA        SPAdes
      Total length (>= 0 bp)       1.6E+08     2.0E+08     2.0E+08
      Total length (>= 1000 bp)    1.6E+08     1.9E+08     1.9E+08
      Largest contig               561,449     979,948     1,387,918
      # misassembled contigs       631         1032        752
      Genome fraction (%)          72.949      90.969      90.424
      Duplication ratio            1.004       1.007       1.004
      Conclusion: SPAdes and IDBA achieve similar results. (Dr. Sherine Awad)
  • 20. Treatments do not alter results very much. (IDBA)
                                   Default     Diginorm    Partition
      Total length (>= 0 bp)       2.0E+08     2.0E+08     2.0E+08
      Total length (>= 1000 bp)    1.9E+08     2.0E+08     1.9E+08
      Largest contig               979,948     1,469,321   551,171
      # misassembled contigs       1032        916         828
      Unaligned length             10,709,716  10,637,811  10,644,357
      Genome fraction (%)          90.969      91.003      90.082
      Duplication ratio            1.007       1.008       1.007
      (Dr. Sherine Awad)
  • 21. Treatments do save compute time. (Run on Michigan State HPC)
                   Velvet                    IDBA                      SPAdes
                   Time (h:m:s)  RAM (GB)    Time (h:m:s)  RAM (GB)    Time (h:m:s)  RAM (GB)
      Quality      60:42:52      1,594       33:53:46      129         67:02:16      400
      Diginorm     6:48:46       827         6:34:24       104         15:53:10      127
      Partition    4:30:36       1,156       8:30:29       93          7:54:26       129
      (Dr. Sherine Awad)
  • 22. Need to understand:
      - What is not being assembled and why?
          - Low coverage?
          - Strain variation?
          - Something else?
      - Effects of strain variation: no assembly.
      - Additional contigs being assembled – contamination? Spurious assembly?
  • 23. Assembly conclusions
      - 90% recovery is not bad; relatively few misassemblies, too.
      - This was not a highly polymorphic community BUT it did have several closely related strains; more generally, we see that strains do generate chimeras, but not between different species.
      - …challenging to execute even with a tutorial/protocol.
  • 24. We need much deeper sampling! Sharon et al., 2015 (Genome Res) Overlap between synthetic long reads and short reads.
  • 25. Benchmarking & protocols
      - Our work is completely reproducible and open.
      - You can re-run our benchmarks yourself if you want!
      - We will be adding new assemblers in as time permits.
      - Protocol is open, versioned, citable… but also still a work in progress :)
  • 26. II: Shotgun sequencing and coverage. “Coverage” is simply the average number of reads that overlap each true base in the genome. Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
  • 27. Assembly depends on high coverage HMP mock community
  • 28. Main questions --
      I. How do we know if we’ve sequenced enough?
      II. Can we predict how much more we need to sequence to see <insert some feature here>?
      Note: necessary sequencing depth cannot accurately be predicted solely from SSU/amplicon data.
  • 29. Method 1: looking for WGS saturation We can track how many sequences we keep of the sequences we’ve seen, to detect saturation.
  • 30. We can detect saturation of shotgun sequencing. Data from Shakya et al., 2013 (pmid: 23387867)
  • 31. We can detect saturation of shotgun sequencing: C=10, for assembly. Data from Shakya et al., 2013 (pmid: 23387867)
  • 32. Estimating metagenome nt richness: # bp at saturation / coverage
      - MM5 deep carbon: 60 Mbp
      - Iowa prairie soil: 12 Gbp
      - Amazon Rain Forest Microbial Observatory soil: 26 Gbp
      Assumes: few entirely erroneous reads (upper bound); at saturation (lower bound).
  • 33. WGS saturation approach:
      - Tells us when we have enough sequence.
      - Can’t be predictive… if you haven’t sampled something, you can’t say anything about it.
      Can we correlate deep amplicon sequencing with shallower WGS?
  • 34. Correlating 16s and shotgun WGS How much of 16s do you see… with how much shotgun sequencing?
  • 35. Data from Shakya et al., 2013 (pmid: 23387867) WGS saturation ~matches 16s saturation < rRNA copy number >
  • 36. Method is robust to organisms unsampled by amplicon sequencing. Insensitive to amplicon primer bias. Robust to genome size differences, eukaryotes, phage. Data from Shakya et al., 2013 (pmid: 23387867)
  • 37. Can examine specific OTUs. Data from Shakya et al., 2013 (pmid: 23387867)
  • 38. OTU abundance is ~correct. Data from Shakya et al., 2013 (pmid: 23387867)
  • 39. Running on real communities --
  • 40. Running on real communities --
  • 41. Concluding thoughts on metagenomes -
      - The main obstacle to recovering genomic details of communities is shallow sampling.
      - Considerably deeper sampling is needed – 1000x (petabasepair sampling)
      - This will inevitably happen!
      - …I would like to make sure the compute technology is there, when it does.
  • 42. More general: computation needs to scale! Navin et al., 2011
  • 43. Cancer investigation ~ metagenome investigation. Some basic math:
      - 1000 single cells from a tumor…
      - …sequenced to 40x haploid coverage with Illumina…
      - …yields 120 Gbp each cell…
      - …or 120 Tbp of data.
      - HiSeq X10 can do the sequencing in ~3 weeks.
      - The variant calling will require 2,000 CPU weeks…
      - …so, given ~2,000 computers, can do this all in one month.
      - …but this will soon be done ~100s-1000s of times a month.
  • 44. Similar math applies:
      - Pathogen detection in blood;
      - Environmental sequencing;
      - Sequencing rare DNA from circulating blood.
      - Two issues: Volume of data & compute infrastructure; Latency in turnaround.
  • 45. Streaming algorithms are good for biggish data… (Diagram: Data, 1-pass, Answer.)
  • 46. …as is lossy compression. Lossy compression can substantially reduce data size while retaining information needed for later (re)analysis. (Diagram: Raw data (~10-100 GB); Compression (~2 GB); Analysis; "Information" (~1 GB); Database & integration.)
  • 47. Moving all sequence analysis generically to semi-streaming: ~1.2 pass, sublinear memory Paper at: https://github.com/ged-lab/2014-streaming
  • 48. Moving some sequence analysis to streaming. ~1.2 pass, sublinear memory. Paper at: https://github.com/ged-lab/2014-streaming
      (a) two-pass; reduced memory. First pass: digital normalization - reduced set of k-mers. Second pass: spectral analysis of data with reduced k-mer set.
      (b) few-pass; reduced memory. First pass: collection of low-abundance reads + analysis of saturated reads. Second pass: analysis of collected low-abundance reads.
      (c) online; streaming. First pass: collection of low-abundance reads + analysis of saturated reads.
  • 49. Five super-awesome technologies…
      1. Low-memory k-mer counting (Zhang et al., PLoS One, 2014)
      2. Compressible assembly graphs (Pell et al., PNAS, 2012)
      3. Streaming lossy compression of sequence data (Brown et al., arXiv, 2012)
      4. A semi-streaming framework for sequence analysis
      5. Graph-alignment approaches for fun and profit.
  • 50. …implemented in one super-awesome software package… github.com/ged-lab/khmer/
      - BSD licensed.
      - Openly developed using good practice.
      - > 30 external contributors.
      - Thousands of downloads/month.
      - 100+ citations in 4 years.
      - We think > 5000 people are using it; have heard from 100s.
      - Bundled with software that ~100k people are using.
  • 51. What’s next? In transition! MSU to UC Davis.
      - So, uh, I joined a Vet Med school - “Companion animals have genomes too!”
      - Expanding my work more to genomic…
      - Coincident with moving to Davis, I also became a Moore Foundation Data Driven Discovery Investigator.
  • 52. Tackling data availability… In 5-10 years, we will have nigh-infinite data. (Genomic, transcriptomic, proteomic, metabolomic, …?) We currently have no good way of querying, exploring, investigating, or mining these data sets, especially across multiple locations. Moreover, most data is unavailable until after publication… …which, in practice, means it will be lost.
  • 53. …and data integration. Once you have all the data, what do you do? "Business as usual simply cannot work." Looking at millions to billions of genomes. (David Haussler, 2014)
  • 54. Funded: distributed graph database server. (Diagram components: Compute server (Galaxy? Arvados?); Web interface + API; Data/Info; Raw data sets; Public servers; "Walled garden" server; Private server; Graph query layer; Upload/submit (NCBI, KBase); Import (MG-RAST, SRA, EBI).) ivory.idyll.org/blog/2014-moore-ddd-award.html
  • 55. The larger research vision: 100% buzzword compliant™
      - Enable and incentivize sharing by providing immediate utility; frictionless sharing.
      - Permissionless innovation for e.g. new data mining approaches.
      - Plan for poverty with federated infrastructure built on open & cloud.
      - Solve people’s current problems, while remaining agile for the future.
      ivory.idyll.org/blog/2014-moore-ddd-award.html
  • 56. Education and training
      - Biology is underprepared for data-intensive investigation.
      - We must teach and train the next generations.
      - ~10-20 workshops / year, novice -> masterclass; open materials.
      - Deeply self-interested: What problems does everyone have, now? (Assembly) What problems do leading-edge researchers have? (Data integration)
      dib-training.rtfd.org/
  • 57. Thanks! Please contact me at ctbrown@ucdavis.edu!
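
Slides 15, 29 and 32 describe digital normalization ("diginorm to C=10"), tracking how many reads are kept out of those seen, and estimating richness as bp at saturation divided by coverage. The following is a minimal sketch of that idea, not the khmer implementation: it assumes exact k-mer counts in a Python `Counter` (khmer uses a low-memory probabilistic counter), ignores reverse complements, and the names `kmers`, `diginorm` and `estimate_richness` are made up for illustration.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return every overlapping substring of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def diginorm(reads, k=20, cutoff=10):
    """Keep a read only while its median k-mer count is below cutoff.

    Returns the kept reads and a list of (reads seen, reads kept) points.
    """
    counts = Counter()  # exact counts: an assumption made for clarity
    kept = []
    curve = []
    for seen, read in enumerate(reads, start=1):
        read_kmers = kmers(read, k)
        if read_kmers and median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)  # only kept reads add to the counts
            kept.append(read)
        curve.append((seen, len(kept)))
    return kept, curve


def estimate_richness(kept_reads, cutoff):
    """Slide 32 estimate: bases kept at saturation divided by coverage."""
    return sum(len(read) for read in kept_reads) / cutoff


if __name__ == "__main__":
    reads = ["ACGTTGCA"] * 20  # toy data: one 8 bp "genome" read 20 times
    kept, curve = diginorm(reads, k=4, cutoff=3)
    print(curve[-1])  # (reads seen, reads kept)
    print(estimate_richness(kept, cutoff=3))
```

In the toy run the kept count stops growing once every k-mer has reached the cutoff, which is the flattening that the saturation plots on slides 30 and 31 look for. The richness estimate is only meaningful at that point, and erroneous reads would inflate it, as slide 32 notes.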

Editor's Notes

  1. ~Easy to say how much you need for a single genome.
  2. Note: 16s is higher copy number, more sensitive than WGS.
  3. otu5 is acidobacterium; one species, Acidobacterium capsulatum, with one rRNA; 4.6% of BA community, 4.7% of Illumina reads. otu2 is chlorobium; five species, total of 10 rRNA; 9.1% of Illumina. Correction factor of 5.
  4. Applicable to many basic sequence analysis problems: error removal, species sorting, and de novo sequence assembly.
  5. Hard to tell how many people are using it because it’s freely available in several locations.
  6. Analyze data in cloud; import and export important; connect to other databases.
  7. Work with other Moore DDD folk on the data mining aspect. Start with cross validation, move to more sophisticated in-server implementations.
  8. Passionate about training; necessary for advancement of the field; also deeply self-interested because I find out what the real problems are. (“Some people can do assembly” is not “everyone can do assembly”)
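
Note 1 says it is easy to work out how much sequencing a single genome needs, and slide 43 does that arithmetic for tumor cells. Here is a small sketch of the calculation. The function names are hypothetical, and the 3 Gbp haploid human genome size is a round-number assumption that the slides imply (40x giving 120 Gbp) but do not state.

```python
import math

HAPLOID_HUMAN_BP = 3_000_000_000  # assumed round figure, ~3 Gbp


def bases_needed(genome_size_bp, coverage):
    """Total bases to sequence for a target average coverage."""
    return genome_size_bp * coverage


def reads_needed(genome_size_bp, coverage, read_length_bp):
    """Number of reads of a given length needed for that coverage."""
    return math.ceil(genome_size_bp * coverage / read_length_bp)


def mean_coverage(total_bases_sequenced, genome_size_bp):
    """Average number of reads overlapping each base (slide 26)."""
    return total_bases_sequenced / genome_size_bp


if __name__ == "__main__":
    per_cell = bases_needed(HAPLOID_HUMAN_BP, coverage=40)
    print(per_cell / 1e9, "Gbp per cell")          # slide 43: 120 Gbp
    print(per_cell * 1000 / 1e12, "Tbp in total")  # slide 43: 120 Tbp
```

The same arithmetic does not carry over directly to a metagenome, because the total genome size and the abundance of each member are unknown in advance. That gap is why the talk turns to saturation tracking.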