Memory- and time-efficient 
approaches to sequence 
analysis with streaming 
algorithms 
C. Titus Brown 
ctb@msu.edu
Part I: Digital normalization
Problem: De Bruijn assembly graphs scale 
with data size, not information. 
Conway T C , Bromage A J Bioinformatics 2011;27:479-486 
© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, 
please email: journals.permissions@oup.com
This is the effect of errors: 
Single nucleotide variations cause long branches; 
They don’t rejoin quickly.
Can we change this scaling behavior? 
An apparent digression: 
Much of next-gen sequencing is redundant.
Shotgun sequencing and 
coverage 
“Coverage” is simply the average number of reads that overlap 
each true base in the genome. 
Here, the coverage is ~10 – just draw a line straight down from the 
top through all of the reads.
Random sampling => deep sampling 
needed 
Typically 10-100x needed for robust recovery (300 Gbp for human)
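As a rough worked example of the coverage arithmetic above, here is a minimal sketch. The helper names and the ~3 Gbp human genome size are illustrative assumptions, not taken from the slides; at 100x that size gives the 300 Gbp figure quoted above.

```python
def coverage(num_reads, read_length, genome_size):
    """Average number of reads overlapping each base (hypothetical helper)."""
    return num_reads * read_length / genome_size


def bases_needed(genome_size, target_coverage):
    """Total sequenced bases required to reach a target average coverage."""
    return genome_size * target_coverage


# Assumption: a ~3 Gbp human genome sequenced to 100x average coverage.
HUMAN_GENOME_BP = 3_000_000_000
print(bases_needed(HUMAN_GENOME_BP, 100) / 1e9, "Gbp")
```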
Can we eliminate this redundancy?
Digital normalization
Basic diginorm algorithm 
We can build the approach on anything that lets us estimate coverage of a read. 
for read in dataset:
    if estimated_coverage(read) < CUTOFF:
        # low-coverage read: count its k-mers and keep it
        update_kmer_counts(read)
        save(read)
    else:
        pass  # discard read
Note, single pass; sublinear memory.
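The loop above can be made runnable with a few stand-ins. This is a minimal sketch, not the khmer implementation: it uses an exact `Counter` (so memory grows with the number of distinct k-mers rather than staying fixed), ignores reverse complements, and the tiny `k` and `cutoff` values are assumptions chosen for illustration.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Return all overlapping k-mers of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def estimated_coverage(read, counts, k):
    """Estimate coverage as the median count of the read's k-mers."""
    values = [counts[kmer] for kmer in kmers(read, k)]
    return median(values) if values else 0


def normalize(dataset, k=4, cutoff=3):
    """Single pass: keep a read only while its estimated coverage is low."""
    counts = Counter()  # exact counts; a stand-in for a fixed-memory counter
    kept = []
    for read in dataset:
        if estimated_coverage(read, counts, k) < cutoff:
            counts.update(kmers(read, k))
            kept.append(read)
        # otherwise the read is discarded
    return kept


# Ten copies of one read and two of another: the abundant read is capped.
reads = ["ACGTACGGA"] * 10 + ["TTGACCATG"] * 2
print(len(normalize(reads)))
```

In this toy input only the first three copies of the abundant read are kept, while both copies of the rare read survive.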
The median k-mer count in a “sentence” is a 
~good estimator of coverage. 
This gives us a 
reference-free 
measure of 
coverage.
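One reason the median is a reasonable choice is that a single sequencing error only disturbs the k-mers that overlap it. The sketch below illustrates this on a made-up 20-base sequence; the sequence, `K = 4`, and the 20x sampling depth are all assumptions for the example.

```python
from collections import Counter
from statistics import median

K = 4  # illustrative k-mer size (assumption)


def kmers(seq, k=K):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def median_kmer_count(read, counts, k=K):
    """Reference-free coverage estimate: median count of the read's k-mers."""
    return median(counts[kmer] for kmer in kmers(read, k))


true_seq = "ACGTTGCAAGGCTTACCGAT"  # made-up sequence
counts = Counter()
for _ in range(20):  # pretend the region was sampled 20 times
    counts.update(kmers(true_seq))

# Introduce one substitution (G -> A) at position 10.
read_with_error = true_seq[:10] + "A" + true_seq[11:]

print(median_kmer_count(true_seq, counts))
print(median_kmer_count(read_with_error, counts))
print(min(counts[kmer] for kmer in kmers(read_with_error)))
```

The error creates K novel k-mers with count zero, so the minimum drops to zero, but the median is unchanged because most k-mers in the read are still correct.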
Digital normalization is 
streaming
Digital normalization retains information, while 
discarding data and errors
Digital normalization is 
streaming error correction
Contig assembly now scales with underlying genome 
size 
• Transcriptomes, microbial genomes (incl. MDA), and most metagenomes can be assembled in under 50 GB of RAM, with ~identical or improved results.
Victory! (?) 
A few “minor” drawbacks… 
1. Repeats are eliminated preferentially. 
2. Genuine graph tips are truncated. 
3. Polyploidy is downsampled. 
4. It’s not clear what happens to polymorphism. 
(For these reasons, we have been pursuing alternate 
approaches.) 
Partially discussed in Brown et al., 2012 (arXiv)
But still quite useful… 
1. Assembling soil metagenomes. 
Howe et al., PNAS, 2014 (w/Tiedje) 
2. Understanding bone-eating worm symbionts. 
Goffredi et al., ISME, 2014. 
3. An ultra-deep look at the lamprey transcriptome. 
Scott et al., in preparation (w/Li) 
4. Understanding development in Molgulid ascidians. 
Stolfi et al., eLife, 2014; etc.
…and widely used (?) 
Estimated ~1000 users of our software. 
Diginorm algorithm now included in Trinity software 
from Broad Institute (~10,000 users) 
Illumina TruSeq long-read technology now 
incorporates our approach (~100,000 users)
Part II: Wait, did you say 
streaming?
Diginorm can detect graph 
saturation
Graph saturation 
for read in dataset:
    if estimated_coverage(read) < CUTOFF:
        # low-coverage read: count its k-mers and keep it
        update_kmer_counts(read)
        save(read)
    else:
        pass  # high coverage read: do something clever!
“Few-pass” approach 
By 20% of the way through a 100x data set, more 
than half the reads are saturated to 20x.
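A saturation curve like the one described can be tracked during the same single pass. The following is a sketch under assumptions: exact counting with `Counter`, a hypothetical `saturation_curve` helper, and toy values for `k`, `cutoff`, and `window`. A trailing partial window is simply dropped.

```python
from collections import Counter
from statistics import median


def saturation_curve(dataset, k=4, cutoff=3, window=5):
    """Return the fraction of saturated reads in each window of the stream.

    A read counts as saturated when its median k-mer count has already
    reached `cutoff`; only unsaturated reads update the counts.
    """
    counts = Counter()
    curve = []
    saturated = seen = 0
    for read in dataset:
        read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not read_kmers:
            continue  # read shorter than k: skipped in this sketch
        if median(counts[kmer] for kmer in read_kmers) < cutoff:
            counts.update(read_kmers)
        else:
            saturated += 1
        seen += 1
        if seen == window:
            curve.append(saturated / window)
            saturated = seen = 0
    return curve


print(saturation_curve(["ACGTACGGA"] * 10))
```

With ten copies of one read, the first window is partly saturated and the second is fully saturated, which is the shape the slide describes at much larger scale.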
(A) Streaming error detection for 
metagenomes and transcriptomes 
• Illumina reads have an error rate between 0.1% and 1%.
• These errors confound mapping, assembly, etc.
(Think: what if you had error-free reads? Life would be much better.)
Spectral error detection for genomes 
Chaisson et al., 2009 
True k-mers 
Erroneous k-mers
Spectral error detection on 
reads -- 
Error location!
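To make the idea of locating an error from the k-mer spectrum concrete, here is a small sketch. It assumes a high-coverage read, exact k-mer counts, and isolated substitution errors; `find_error_positions` and the `min_count` threshold are hypothetical names and values, not part of the original slides.

```python
from collections import Counter


def find_error_positions(read, counts, k=4, min_count=2):
    """Infer error positions from runs of low-abundance k-mers.

    A substitution at position p makes every k-mer overlapping p rare.
    If the rare run starts inside the read, the error is the last base of
    the first rare k-mer; if the run starts at position 0, the error is
    the first base of the last rare k-mer in that run.
    """
    low = [counts[read[i:i + k]] < min_count
           for i in range(len(read) - k + 1)]
    positions = []
    i = 0
    while i < len(low):
        if not low[i]:
            i += 1
            continue
        start = i
        while i < len(low) and low[i]:
            i += 1
        end = i - 1
        positions.append(start + k - 1 if start > 0 else end)
    return positions


true_seq = "ACGTTGCAAGGCTTACCGAT"  # made-up sequence
counts = Counter()
for _ in range(20):  # pretend 20x coverage of the true sequence
    counts.update(true_seq[i:i + 4] for i in range(len(true_seq) - 3))

read = true_seq[:10] + "A" + true_seq[11:]  # substitution at position 10
print(find_error_positions(read, counts))
```

Errors that sit closer together than k bases merge into one run, so this sketch would report only one position for them.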
…spectral error detection for reads => 
transcriptome, metagenome 
True k-mers 
Erroneous k-mers 
Chaisson et al., 2009
Spectral error detection on 
variable coverage data 
How many of the errors can we pinpoint exactly? 
Dataset         f saturated   Specificity   Sensitivity
Genome          100%          71.4%         77.9%
Transcriptome   92%           67.7%         63.8%
Metagenome      96%           71.2%         68.9%
Real E. coli    100%          51.1%         72.4%
(B) Streaming error trimming for all shotgun 
data 
We can trim reads at first error. 
Dataset         f saturated   Error rate   Total bases trimmed   Errors remaining
Genome          100%          0.63%        31.90%                0.00%
Transcriptome   92%           0.65%        34.34%                0.07%
Metagenome      96%           0.62%        31.70%                0.04%
Real E. coli    100%          1.59%        12.96%                0.05%
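Trimming a read at its first error can be sketched as cutting just before the last base of the first low-abundance k-mer. This is an illustration under assumptions (exact counts, a hypothetical `trim_at_first_error` helper, toy `k` and `min_count`), not the trimming code used to produce the table.

```python
from collections import Counter


def trim_at_first_error(read, counts, k=4, min_count=2):
    """Keep the prefix of the read that precedes the first rare k-mer's last base."""
    for i in range(len(read) - k + 1):
        if counts[read[i:i + k]] < min_count:
            # The first rare k-mer starts at i; its last base (i + k - 1)
            # is the suspected error, so keep everything before it.
            return read[:i + k - 1]
    return read  # no rare k-mers: nothing to trim


true_seq = "ACGTTGCAAGGCTTACCGAT"  # made-up sequence
counts = Counter()
for _ in range(20):  # pretend 20x coverage of the true sequence
    counts.update(true_seq[i:i + 4] for i in range(len(true_seq) - 3))

read = true_seq[:10] + "A" + true_seq[11:]  # substitution at position 10
print(trim_at_first_error(read, counts))
```

Everything after the first error is discarded, including correct bases, which is consistent with a large fraction of bases being trimmed when reads contain errors.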
(C) Streaming error correction 
• Once you can do error detection and trimming on a streaming basis, why not error correction?
• …using a new approach…
Streaming error correction of genomic, transcriptomic, 
metagenomic data via graph alignment 
Jason Pell, Jordan Fish, Michael Crusoe
Pair-HMM-based graph 
alignment 
Jordan Fish and Michael Crusoe
…a bit more complex... 
Jordan Fish and Michael Crusoe
Error correction on simulated E. 
coli data 
            TP            FP           TN            FN
            (corrected)   (mistakes)   (OK)          (missed)
Streaming   3,494,631     3,865        460,601,171   5,533
1% error rate, 100x coverage. 
Michael Crusoe, Jordan Fish, Jason Pell
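For readers who want summary rates, the four counts above can be turned into the usual ratios. The sketch below only applies the standard definitions to the numbers on the slide; the `rates` helper is a hypothetical name, and the slide does not state what unit the counts are in.

```python
def rates(tp, fp, tn, fn):
    """Standard confusion-matrix ratios."""
    return {
        "sensitivity": tp / (tp + fn),  # share of errors that were corrected
        "specificity": tn / (tn + fp),  # share of correct items left alone
        "precision": tp / (tp + fp),    # share of corrections that were right
    }


# Counts from the simulated E. coli table above.
result = rates(tp=3_494_631, fp=3_865, tn=460_601_171, fn=5_533)
for name, value in result.items():
    print(f"{name}: {value:.6f}")
```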
A few additional thoughts -- 
• Sequence-to-graph alignment is a very general concept.
• Could replace mapping, variant calling, BLAST, HMMER…
“Ask me for anything but time!” 
-- Napoleon Bonaparte
(D) Calculating read error rates 
by position within read 
• Shotgun data is randomly sampled;
• Any variation in mismatches with the reference by position is likely due to errors or bias.
Reads -> Assemble -> Map reads to assembly -> Calculate position-specific mismatches
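The final step of this workflow, tallying mismatches by read position, can be sketched as follows. The input format is an assumption made for illustration: each read is paired with the reference (or assembly) segment it mapped to, ungapped and in the same orientation. A real pipeline would first extract these pairs from the mapper's output.

```python
def mismatch_profile(aligned_pairs):
    """Return the mismatch rate at each read position.

    `aligned_pairs` is an iterable of (read, reference_segment) strings,
    assumed to be ungapped alignments of equal length.
    """
    mismatches, totals = [], []
    for read, ref in aligned_pairs:
        for pos, (read_base, ref_base) in enumerate(zip(read, ref)):
            if pos == len(totals):  # first read to reach this position
                totals.append(0)
                mismatches.append(0)
            totals[pos] += 1
            mismatches[pos] += read_base != ref_base
    return [m / t for m, t in zip(mismatches, totals)]


pairs = [
    ("ACGT", "ACGT"),
    ("ACGA", "ACGT"),
    ("ACTT", "ACGT"),
    ("ACGA", "ACGT"),
]
print(mismatch_profile(pairs))
```

In this toy input the mismatch rate rises toward the end of the read, the kind of position-dependent pattern the slide attributes to errors or bias.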
Sequencing run error profiles 
Via bowtie mapping against reference -- 
Reads from Shakya et al., pmid 23387867
We can do this sub-linearly from data w/no 
reference! 
Reads from Shakya et al., pmid 23387867
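One way to picture a reference-free profile is to combine the saturation test with k-mer-based error location: once a read is saturated, rare k-mers in it are probably errors, and their positions can be tallied. The sketch below is an assumption-laden illustration (exact counts, only the first suspected error per read, hypothetical helper names and thresholds), not the method behind the plot.

```python
from collections import Counter
from statistics import median


def first_error_position(read, counts, k, min_count):
    """Position of the first suspected error, or None if no k-mer is rare."""
    low = [counts[read[i:i + k]] < min_count
           for i in range(len(read) - k + 1)]
    if True not in low:
        return None
    start = low.index(True)
    if start > 0:
        return start + k - 1  # last base of the first rare k-mer
    # The rare run begins at the read start: the error ends that run.
    return low.index(False) - 1 if False in low else len(low) - 1


def error_profile(dataset, k=4, cutoff=3, min_count=2):
    """Tally suspected error positions over saturated reads, in one pass."""
    counts = Counter()
    errors_by_position = Counter()
    saturated_reads = 0
    for read in dataset:
        read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not read_kmers:
            continue  # read shorter than k: skipped in this sketch
        if median(counts[kmer] for kmer in read_kmers) < cutoff:
            counts.update(read_kmers)  # still building up coverage
            continue
        saturated_reads += 1
        position = first_error_position(read, counts, k, min_count)
        if position is not None:
            errors_by_position[position] += 1
    return errors_by_position, saturated_reads


true_seq = "ACGTTGCAAGGCTTACCGAT"           # made-up sequence
err_read = true_seq[:10] + "A" + true_seq[11:]  # substitution at position 10
profile, n = error_profile([true_seq] * 3 + [err_read, true_seq, err_read])
print(dict(profile), n)
```

Dividing each tally by the number of saturated reads gives a per-position error rate without mapping to a reference.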
Reference-free error profile 
analysis 
1. Requires no prior information! 
2. Immediate feedback on sequencing quality (for cores 
& users) 
3. Fast, lightweight (~100 MB, ~2 minutes) 
4. Works for any shotgun sample (genomic, 
metagenomic, transcriptomic). 
5. Not affected by polymorphisms.
Reference-free error profile 
analysis 
7. …if we know where the errors are, we can trim them. 
8. …if we know where the errors are, we can correct them. 
9. …if we look at differences by graph position instead of by 
read position, we can call variants. 
=> Streaming, online variant calling?
Future thoughts / streaming 
How far can we take this?
Streaming approach supports more compute-intensive 
interludes – remapping, etc. 
Rimmer et al., 2014
Streaming online reference-free variant calling. 
Single pass, reference free, tunable, streaming 
online variant calling.
Streaming with reads… 
[Diagram labels: a stream of reads (Sequence...), Graph, Variants.]
Analysis is done after 
sequencing. 
[Timeline labels: Sequencing, Analysis.]
Streaming with bases 
[Diagram labels: reads arriving as k bases..., then k+1, k+2, ...; Graph; Variants.]
Integrate sequencing and 
analysis 
[Timeline labels: Sequencing, Analysis.]
Are we done yet?
Directions for streaming graph 
analysis 
• Generate error profile for shotgun reads;
• Variable coverage error trimming;
• Streaming low-memory error correction for genomes, metagenomes, and transcriptomes;
• Strain variant detection & resolution;
• Streaming variant analysis.
Michael Crusoe, Jordan Fish & Jason Pell
Our software is open source 
Methods that aren’t broadly available are limited in their 
utility! 
• Everything I talked about is in our github repository, http://github.com/ged-lab/khmer
• …it’s not necessarily trivial to use…
• …but we’re happy to help.
We have recipes!
Planned work: distributed graph database server 
[Architecture diagram labels: Web interface + API; Compute server (Galaxy? Arvados?); Data/Info; Raw data sets; Public servers; "Walled garden" server; Private server; Graph query layer; Upload/submit (NCBI, KBase); Import (MG-RAST, SRA, EBI).]
ivory.idyll.org/blog/2014-moore-ddd-talk.html
Thanks for listening!

  • 4. This is the effect of errors: Single nucleotide variations cause long branches
  • 5. This is the effect of errors: Single nucleotide variations cause long branches; They don’t rejoin quickly.
  • 6. Can we change this scaling behavior? Conway T C, Bromage A J, Bioinformatics 2011;27:479-486
  • 7. An apparent digression: Much of next-gen sequencing is redundant.
  • 8. Shotgun sequencing and coverage “Coverage” is simply the average number of reads that overlap each true base in genome. Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
  • 9. Random sampling => deep sampling needed Typically 10-100x needed for robust recovery (300 Gbp for human)
  • 10. An apparent digression: Much of next-gen sequencing is redundant. Can we eliminate this redundancy?
  • 17. Basic diginorm algorithm. We can build the approach on anything that lets us estimate coverage of a read.
        for read in dataset:
            if estimated_coverage(read) < CUTOFF:
                update_kmer_counts(read)
                save(read)
            else:
                pass  # discard read
    Note: single pass; sublinear memory.
  • 18. The median k-mer count in a “sentence” is a ~good estimator of coverage. This gives us a reference-free measure of coverage.
  • 25. Digital normalization retains information, while discarding data and errors
  • 26. Digital normalization is streaming error correction
  • 27. Digital normalization retains information, while discarding data and errors
  • 28. Contig assembly now scales with underlying genome size. Transcriptomes, microbial genomes incl. MDA, and most metagenomes can be assembled in under 50 GB of RAM, with ~identical or improved results.
  • 29. Victory! (?) Conway T C, Bromage A J, Bioinformatics 2011;27:479-486
  • 30. A few “minor” drawbacks… 1. Repeats are eliminated preferentially. 2. Genuine graph tips are truncated. 3. Polyploidy is downsampled. 4. It’s not clear what happens to polymorphism. (For these reasons, we have been pursuing alternate approaches.) Partially discussed in Brown et al., 2012 (arXiv)
  • 31. But still quite useful… 1. Assembling soil metagenomes. Howe et al., PNAS, 2014 (w/Tiedje) 2. Understanding bone-eating worm symbionts. Goffredi et al., ISME, 2014. 3. An ultra-deep look at the lamprey transcriptome. Scott et al., in preparation (w/Li) 4. Understanding development in Molgulid ascidians. Stolfi et al, eLife 2014; etc.
  • 32. …and widely used (?) Estimated ~1000 users of our software. Diginorm algorithm now included in Trinity software from Broad Institute (~10,000 users) Illumina TruSeq long-read technology now incorporates our approach (~100,000 users)
  • 33. Part II: Wait, did you say streaming?
  • 34. Diginorm can detect graph saturation
  • 35. Graph saturation
        for read in dataset:
            if estimated_coverage(read) < CUTOFF:
                update_kmer_counts(read)
                save(read)
            else:
                pass  # high coverage read: do something clever!
  • 36. “Few-pass” approach By 20% of the way through 100x data set, more than half the reads are saturated to 20x
  • 37. Graph saturation
        for read in dataset:
            if estimated_coverage(read) < CUTOFF:
                update_kmer_counts(read)
                save(read)
            else:
                pass  # high coverage read: do something clever!
  • 38. (A) Streaming error detection for metagenomes and transcriptomes. Illumina has between 0.1% and 1% error rate. These errors confound mapping, assembly, etc. (Think: what if you had error-free reads? Life would be much better.)
  • 39. Spectral error detection for genomes (Chaisson et al., 2009). [Figure labels: True k-mers; Erroneous k-mers]
  • 40. Spectral error detection on reads -- Error location!
  • 41. …spectral error detection for reads => transcriptome, metagenome (Chaisson et al., 2009). [Figure labels: True k-mers; Erroneous k-mers]
  • 42. Spectral error detection on variable coverage data. How many of the errors can we pinpoint exactly?
        Dataset          f saturated   Specificity   Sensitivity
        Genome           100%          71.4%         77.9%
        Transcriptome    92%           67.7%         63.8%
        Metagenome       96%           71.2%         68.9%
        Real E. coli     100%          51.1%         72.4%
  • 43. (B) Streaming error trimming for all shotgun data. We can trim reads at first error.
        Dataset          f saturated   Error rate   Total bases trimmed   Errors remaining
        Genome           100%          0.63%        31.90%                0.00%
        Transcriptome    92%           0.65%        34.34%                0.07%
        Metagenome       96%           0.62%        31.70%                0.04%
        Real E. coli     100%          1.59%        12.96%                0.05%
  • 44. (C) Streaming error correction. Once you can do error detection and trimming on a streaming basis, why not error correction? …using a new approach…
  • 45. Streaming error correction of genomic, transcriptomic, metagenomic data via graph alignment Jason Pell, Jordan Fish, Michael Crusoe
  • 46. Pair-HMM-based graph alignment Jordan Fish and Michael Crusoe
  • 47. …a bit more complex... Jordan Fish and Michael Crusoe
  • 48. Error correction on simulated E. coli data
                     TP (corrected)   FP (mistakes)   TN (OK)       FN (missed)
        Streaming    3,494,631        3,865           460,601,171   5,533
    1% error rate, 100x coverage. Michael Crusoe, Jordan Fish, Jason Pell
  • 51. A few additional thoughts -- Sequence-to-graph alignment is a very general concept. Could replace mapping, variant calling, BLAST, HMMER… “Ask me for anything but time!” -- Napoleon Bonaparte
  • 52. (D) Calculating read error rates by position within read. Shotgun data is randomly sampled; any variation in mismatches with reference by position is likely due to errors or bias. Workflow: Reads → Assemble → Map reads to assembly → Calculate position-specific mismatches.
  • 53. Sequencing run error profiles Via bowtie mapping against reference -- Reads from Shakya et al., pmid 23387867
  • 54. We can do this sub-linearly from data w/no reference! Reads from Shakya et al., pmid 23387867
  • 55. Reference-free error profile analysis 1. Requires no prior information! 2. Immediate feedback on sequencing quality (for cores & users) 3. Fast, lightweight (~100 MB, ~2 minutes) 4. Works for any shotgun sample (genomic, metagenomic, transcriptomic). 5. Not affected by polymorphisms.
  • 56. Reference-free error profile analysis 7. …if we know where the errors are, we can trim them. 8. …if we know where the errors are, we can correct them. 9. …if we look at differences by graph position instead of by read position, we can call variants. => Streaming, online variant calling?
  • 57. Future thoughts / streaming How far can we take this?
  • 58. Streaming approach supports more compute-intensive interludes – remapping, etc. Rimmer et al., 2014
  • 59. Streaming online reference-free variant calling. Single pass, reference free, tunable, streaming online variant calling.
  • 60. Streaming with reads… [Figure: a stream of reads, each labeled "Sequence...", feeds a graph, which yields variants.]
  • 61. Analysis is done after sequencing. Sequencing Analysis
  • 62. Streaming with bases. [Figure: successive cycles of bases (k, k+1, k+2, ...) feed a graph, which yields variants.]
  • 63. Integrate sequencing and analysis Sequencing Analysis Are we done yet?
  • 64. Directions for streaming graph analysis: generate error profile for shotgun reads; variable coverage error trimming; streaming low-memory error correction for genomes, metagenomes, and transcriptomes; strain variant detection & resolution; streaming variant analysis. Michael Crusoe, Jordan Fish & Jason Pell
  • 65. Our software is open source. Methods that aren’t broadly available are limited in their utility! Everything I talked about is in our github repository, http://github.com/ged-lab/khmer …it’s not necessarily trivial to use… …but we’re happy to help.
  • 67. Planned work: distributed graph database server. [Figure labels: Web interface + API; Compute server (Galaxy? Arvados?); Data/Info; Raw data sets; Public servers; "Walled garden" server; Private server; Graph query layer; Upload/submit (NCBI, KBase); Import (MG-RAST, SRA, EBI).] ivory.idyll.org/blog/2014-moore-ddd-talk.html
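
The basic diginorm loop from slide 17, combined with the median k-mer count estimator from slide 18, can be sketched in a few lines of Python. This is only an illustrative sketch, not the khmer implementation: the values of `K` and `CUTOFF`, the helper `kmers`, and the function `normalize` are assumptions made for the example. An exact `Counter` stands in for whatever fixed-memory counting structure a real tool would use, so this version does not reproduce the sublinear memory behavior, and reverse-complement strands are ignored.

```python
from collections import Counter
from statistics import median

K = 4        # hypothetical k-mer size; real data would use a much larger k
CUTOFF = 3   # hypothetical coverage cutoff

# Exact counts of k-mers from the reads kept so far (assumption: a real
# tool would use a fixed-memory, approximate counter instead).
kmer_counts = Counter()


def kmers(read, k=K):
    """Yield every overlapping k-mer in the read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def estimated_coverage(read):
    """Median count of the read's k-mers among the reads kept so far."""
    counts = [kmer_counts[km] for km in kmers(read)]
    return median(counts) if counts else 0


def update_kmer_counts(read):
    """Add the read's k-mers to the running counts."""
    kmer_counts.update(kmers(read))


def normalize(dataset):
    """Single pass: keep a read only while its estimated coverage is below CUTOFF."""
    kept = []
    for read in dataset:
        if estimated_coverage(read) < CUTOFF:
            update_kmer_counts(read)
            kept.append(read)
        # otherwise the read is discarded
    return kept


# Toy data set: one sequence sampled 10 times, another sampled twice.
reads = ["ACGTACGGTC"] * 10 + ["TTGACCATGA"] * 2
kept = normalize(reads)
```

In this toy run the heavily sampled sequence is cut down to `CUTOFF` copies while the lightly sampled one is kept in full, which is the sense in which redundant data is discarded while low-coverage information is retained.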

Editor's Notes

  1. A sketch showing the relationship between the number of sequence reads and the number of edges in the graph. Because the underlying genome is fixed in size, as the number of sequence reads increases, the number of edges in the graph due to the underlying genome will plateau once every part of the genome is covered. Conversely, since errors tend to be random and more or less unique, their number scales linearly with the number of sequence reads. Once enough sequence reads are present to have enough coverage to clearly distinguish true edges (which come from the underlying genome), they will usually be outnumbered by spurious edges (which arise from errors) by a substantial factor.
  2. A sketch showing the relationship between the number of sequence reads and the number of edges in the graph. Because the underlying genome is fixed in size, as the number of sequence reads increases, the number of edges in the graph due to the underlying genome will plateau once every part of the genome is covered. Conversely, since errors tend to be random and more or less unique, their number scales linearly with the number of sequence reads. Once enough sequence reads are present to have enough coverage to clearly distinguish true edges (which come from the underlying genome), they will usually be outnumbered by spurious edges (which arise from errors) by a substantial factor.
  3. Note that any such measure will do.
  4. Goal is to do first stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.
  5. A sketch showing the relationship between the number of sequence reads and the number of edges in the graph. Because the underlying genome is fixed in size, as the number of sequence reads increases, the number of edges in the graph due to the underlying genome will plateau once every part of the genome is covered. Conversely, since errors tend to be random and more or less unique, their number scales linearly with the number of sequence reads. Once enough sequence reads are present to have enough coverage to clearly distinguish true edges (which come from the underlying genome), they will usually be outnumbered by spurious edges (which arise from errors) by a substantial factor.
  6. The point is to enable biology; volume and velocity of data from sequencers is blocking.
  7. Update from Jordan
  8. Analyze data in cloud; import and export important; connect to other databases.
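
The trimming idea from slide 43 (trim each read at its first error, where errors are located through low-abundance k-mers as in the spectral error detection slides) can be illustrated roughly as follows. This is a simplified sketch under stated assumptions, not the method used in the talk's software: `K`, `MIN_COUNT`, and the helper names are hypothetical, k-mers are counted over the whole toy data set up front rather than on a streaming basis, and every read is assumed to be high coverage.

```python
from collections import Counter

K = 4          # hypothetical k-mer size
MIN_COUNT = 2  # hypothetical threshold: rarer k-mers are treated as erroneous


def kmers(read, k=K):
    """Return the list of overlapping k-mers in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads):
    """Count every k-mer across all reads (exact counts, for illustration)."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read))
    return counts


def trim_at_first_error(read, counts):
    """Keep the prefix of the read that precedes the first low-abundance k-mer's last base."""
    for i, km in enumerate(kmers(read)):
        if counts[km] < MIN_COUNT:
            # The k-mer starting at i is suspect; the base it newly
            # introduces (index i + K - 1) is the likely error.
            return read[:i + K - 1]
    return read


# Five error-free copies plus one copy with a substitution at index 7 (G -> A).
reads = ["ACGTACGGTC"] * 5 + ["ACGTACGATC"]
counts = count_kmers(reads)
trimmed = [trim_at_first_error(r, counts) for r in reads]
```

Error-free reads pass through unchanged, while the read carrying the substitution is cut just before the erroneous base. An error within the first `K - 1` bases would only be localized approximately by this sketch, since the first k-mer covers all of them.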