K-mers in Metagenomics
by Donovan Parks
2 of 27
metagenomics
environmental sample
→ extract and sequence DNA
→ QC and error correct reads (K-mers!)
→ assemble (K-mers!)
→ bin genomes (K-mers!)
→ assign taxonomy (and function) (K-mers!)
→ refine genomes (K-mers!)
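
Because K-mers show up at nearly every stage of this workflow, a minimal sketch of what extracting them from a read looks like may be useful. The function name `kmers` and the toy read are illustrative assumptions and are not taken from any tool named in these slides.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping substring of length k from seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


# Hypothetical toy read; a read of length L contains L - k + 1 K-mers.
read = "ACGTACGTAC"
counts = Counter(kmers(read, 4))
print(sum(counts.values()))  # total number of 4-mers extracted
print(counts)
```

Real tools typically add steps this sketch leaves out, such as merging a K-mer with its reverse complement.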
Assigning Taxonomic Labels to
Metagenomic DNA Sequences
4 of 27
a plethora of approaches
• Homology: BLAST, MEGAN
• Composition: Kraken, CLARK, Naïve Bayes
• Hybrid: PhymmBL, FCP, PhyloPythia
• Phylogenetic: Treephyler, AMPHORA, GraftM
• Marker genes: 16S profiling, MetaPhlAn, PhyloSift
[Figure labels: classify all reads; classify subset]
5 of 27
exploiting genomic (K-mer) signatures
• PhymmBL (K ≤ 8): interpolated Markov model
• PhyloPythia (K ≈ 6): multiclass support vector machine
• Naïve Bayes (K ≈ 15): probability of observing a K-mer
• Kraken (K ≈ 31): exact K-mer matching
• CLARK (K ≈ 31): exact matching of discriminative K-mers
[Figure labels: dense profiles; sparse profiles]
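
The dense versus sparse distinction follows from simple counting: there are 4^K possible K-mers, but a read of length L contains at most L - K + 1 of them. The sketch below computes that upper bound. The helper name `profile_occupancy` and the 100 bp read length are assumptions chosen for illustration.

```python
def profile_occupancy(read_length, k):
    """Upper bound on the fraction of the 4**k possible K-mers one read can contain."""
    possible = 4 ** k
    observed_max = max(read_length - k + 1, 0)
    return min(observed_max, possible) / possible


# Small K: a read can touch a sizeable share of the profile (dense).
# Large K: almost every entry of the profile is zero (sparse).
for k in (4, 6, 8, 15, 31):
    print(k, 4 ** k, profile_occupancy(100, k))
```

This is one way to see why small-K methods can model full frequency vectors, while K ≈ 31 methods rely on exact lookups instead.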
6 of 27
Kraken: K-mer LCA database
Wood and Salzberg, Genome Biology, 2014
Reference genomes (2,256 RefSeq genomes) → extract K-mers (default, K = 31) → lowest common ancestor (LCA) database

K-mer        LCA
ACC … GT     g__Escherichia
ACG … GT     s__E. coli
AGT … AA     p__Proteobacteria
…
TGA … TT     d__Bacteria
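
The idea of a K-mer to LCA table can be sketched in a few lines. Everything below is a toy under stated assumptions: the taxonomy, the genome strings, and the helper names (`lineage`, `lca`, `build_lca_db`) are hypothetical, K = 4 is used only so the example fits on screen, and none of the engineering that makes the real Kraken database fast is reproduced.

```python
# Hypothetical toy taxonomy: child -> parent (the root has parent None).
PARENT = {
    "d__Bacteria": None,
    "p__Proteobacteria": "d__Bacteria",
    "g__Escherichia": "p__Proteobacteria",
    "s__E. coli": "g__Escherichia",
    "s__E. fergusonii": "g__Escherichia",
    "g__Salmonella": "p__Proteobacteria",
    "s__S. enterica": "g__Salmonella",
}


def lineage(taxon):
    """Return the path from taxon up to the root, taxon first."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def lca(a, b):
    """Lowest common ancestor of two taxa in the toy tree."""
    ancestors_of_a = set(lineage(a))
    for taxon in lineage(b):
        if taxon in ancestors_of_a:
            return taxon
    raise ValueError("taxa share no ancestor")


def build_lca_db(genomes, k):
    """Map each K-mer to the LCA of all genomes that contain it."""
    db = {}
    for taxon, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            db[kmer] = lca(db[kmer], taxon) if kmer in db else taxon
    return db


# Made-up sequences, not real genomes.
genomes = {
    "s__E. coli": "ACGTTGCA",
    "s__E. fergusonii": "ACGTTGGG",
    "s__S. enterica": "TTACGTAA",
}
db = build_lca_db(genomes, k=4)
print(db["TGCA"], db["CGTT"], db["ACGT"])
```

A K-mer unique to one genome keeps a species label, while a K-mer shared across genera is pushed up to the phylum, which mirrors the mix of ranks in the table above.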
7 of 27
Kraken: classification tree
Wood and Salzberg, Genome Biology, 2014
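
The classification step can be sketched along the lines of the published description: each K-mer in a read is looked up in the LCA table, the hit taxa and their ancestors form a small tree weighted by hit counts, and the read is assigned to the end of the heaviest root-to-leaf path. The taxonomy, the lookup table, and the function name `classify` below are hypothetical, and ties are broken arbitrarily rather than as in the real tool.

```python
from collections import Counter

# Hypothetical toy taxonomy (child -> parent) and K-mer -> LCA table.
PARENT = {
    "d__Bacteria": None,
    "p__Proteobacteria": "d__Bacteria",
    "g__Escherichia": "p__Proteobacteria",
    "s__E. coli": "g__Escherichia",
    "g__Salmonella": "p__Proteobacteria",
}
LCA_DB = {
    "ACGT": "p__Proteobacteria",
    "CGTT": "g__Escherichia",
    "GTTG": "g__Escherichia",
    "TTGC": "s__E. coli",
    "TACG": "g__Salmonella",
}


def classify(read, k, lca_db, parent):
    """Return the taxon ending the highest-weight root-to-leaf path, or None."""
    hits = Counter(
        lca_db[read[i:i + k]]
        for i in range(len(read) - k + 1)
        if read[i:i + k] in lca_db
    )
    if not hits:
        return None  # no K-mer matched: the read stays unclassified

    def path_score(taxon):
        # Sum the hit counts of the taxon and all of its ancestors.
        score = 0
        while taxon is not None:
            score += hits.get(taxon, 0)
            taxon = parent[taxon]
        return score

    # A hit taxon always scores higher than any of its hit ancestors,
    # so the maximum is reached at a leaf of the pruned tree.
    return max(hits, key=path_score)


print(classify("ACGTTGC", 4, LCA_DB, PARENT))
print(classify("GGGGG", 4, LCA_DB, PARENT))
```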
8 of 27
assessment of methods
Results from Ounit et al., BMC Genomics, 2015
and Wood and Salzberg, Genome Biology, 2014

Classifier             Precision (%)  Sensitivity (%)  Speed (reads/min)
Megablast              99.0           79.0             -
Naïve Bayes (K = 15)   82.3           82.3             8
Naïve Bayes (K = 11)   59.0           59.0             20
PhymmBL                82.3           82.3             -
CLARK                  99.3           77.2             3.1 million
Kraken (K = 31)        99.3           77.8             2.3 million
Kraken (K = 20)        80.2           82.7             1.5 million

• Precision: (correct classifications) / (total classifications)
• Sensitivity: (correct classifications) / (total reads)
• Speed: reads per minute
• Results for simple simulated dataset
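
The two definitions above differ only in their denominators, which a short sketch makes concrete. The function name and the four-read toy dataset are assumptions for illustration; unclassified reads are represented here as `None`.

```python
def precision_sensitivity(truth, predicted):
    """truth: read id -> true label; predicted: read id -> label or None (unclassified)."""
    classified = {r: p for r, p in predicted.items() if p is not None}
    correct = sum(1 for r, p in classified.items() if truth[r] == p)
    precision = correct / len(classified) if classified else 0.0
    sensitivity = correct / len(truth)
    return precision, sensitivity


# Toy data: three reads classified (two correctly), one left unclassified.
truth = {"r1": "A", "r2": "A", "r3": "B", "r4": "B"}
predicted = {"r1": "A", "r2": "B", "r3": "B", "r4": None}
print(precision_sensitivity(truth, predicted))
```

When a classifier labels every read, the two denominators coincide and precision equals sensitivity, which is consistent with the identical values in the Naïve Bayes and PhymmBL rows.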
9 of 27
impact of K and reference database size

Classifier             Precision (%)  Sensitivity (%)  Speed (reads/min)
Megablast              99.0           79.0             -
Naïve Bayes (K = 15)   82.3           82.3             8
Naïve Bayes (K = 11)   59.0           59.0             20
PhymmBL                82.3           82.3             -
CLARK                  99.3           77.2             3.1 million
Kraken (K = 31)        99.3           77.8             2.3 million
Kraken (K = 20)        80.2           82.7             1.5 million
Kraken-GB (K = 31)     99.5           93.8             -

• Performance is sensitive to K
• Kraken-GB: 8,517 reference genomes instead of 2,256
10 of 27
impact of taxonomic novelty
Results from Wood and Salzberg, Genome Biology, 2014

                 Taxonomic novelty
Measured rank    Species  Genus  Family
Domain           24.4     7.9    2.8
Phylum           23.9     7.2    2.5
Class            24.7     7.1    2.0
Order            24.1     6.8    2.0
Family           25.4     8.5    -
Genus            26.3     -      -

• Sensitivity decreases rapidly with taxonomic novelty
11 of 27
Kraken: some practical numbers
• Applied to metagenome from coalbed methane well
• ~82 million paired-end reads (2 x 100 bp)
• ~30 minutes to process with 8 threads ☺
• Reference database requires ~70 GB of RAM ☹
• Classified 7.7% of reads ☹
[Figure: relative abundance (%), 16S profile vs. Kraken]
12 of 27
take away points
• K-mers widely used to assign taxonomy to metagenomic reads
• Active area of research
• Resolution limited by reference genomes
• 16S profiling still the gold standard
• change is coming…
Recovering Population Genomes from
Metagenomic Data
metagenome → (shotgun sequencing) → reads → (assembly) → contigs
bin contigs into genomes (genome-centric metagenomics)
14 of 27
recovering genomes from metagenomic data
metagenome → (shotgun sequencing) → reads → (assembly) → contigs → (binning) → population genomes
binning: classify using coverage and k-mer profiles
population genomes: identify strain-specific SNPs
15 of 27
differential coverage signal
contigs with similar coverage profiles likely belong to the same genome!
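
One simple way to quantify "similar coverage profiles" is to correlate each contig's coverage across several samples. The sketch below uses Pearson correlation on made-up coverage values. The contig names, the numbers, and the choice of correlation as the similarity measure are assumptions, and the binning tools named later in these slides use their own models.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length coverage profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical mean coverage of three contigs in three samples.
coverage = {
    "contig_1": [10.0, 40.0, 5.0],
    "contig_2": [12.0, 44.0, 6.0],   # rises and falls with contig_1
    "contig_3": [30.0, 3.0, 25.0],   # opposite pattern
}
print(pearson(coverage["contig_1"], coverage["contig_2"]))
print(pearson(coverage["contig_1"], coverage["contig_3"]))
```

Contigs 1 and 2 track each other closely and would be candidates for the same bin, while contig 3 would not.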
16 of 27
K-mers and coverage: complementary signals
microbial community from coalbed methane well
[Figure: contigs plotted by coverage and tetranucleotide signature (PC1)]

Genome                      Comp. (%)  Cont. (%)  Length (Mbp)
Archaea
  Methanobacteriaceae 1     98.4       1.6        2.32
  Methanobacteriaceae 2     96.8       0.8        2.23
  Methanobacteriaceae 3     88.6       0.0        1.57
  Methanobacteriaceae 4     96.0       0.0        1.71
Bacteria
  Actinobacteria 1          95.0       0.9        2.56
  Actinobacteria 2          90.5       2.7        2.72
  Actinobacteria 3          88.4       2.7        2.48
  Clostridiales 1           92.6       9.4        2.91
  Clostridiales 2           80.2       0.0        2.74
  Elusimicrobia             95.7       2.2        2.03
  Thermodesulfovibrionaceae 83.9       0.0        2.66
  Syntrophus                92.9       0.8        2.31
  Rikenellaceae             86.7       2.3        2.72
  Candidate Phylum OP1      83.9       0.0        1.66
  Rhodocyclaceae            69.0       1.63       3.73
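
The tetranucleotide signal plotted on this slide starts from a 256-entry frequency vector per contig. A minimal sketch of computing that vector and comparing two contigs follows. The function names are assumptions, the projection onto PC1 is not shown, and reverse-complement merging is omitted for brevity.

```python
from itertools import product

# All 256 possible tetranucleotides in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_profile(seq):
    """Return the normalized frequency of each tetranucleotide in seq."""
    counts = dict.fromkeys(TETRAMERS, 0)
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:  # skip windows containing N or other symbols
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[t] / total if total else 0.0 for t in TETRAMERS]


def euclidean(p, q):
    """Euclidean distance between two profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


# Made-up sequences, far shorter than real contigs.
profile_a = tetranucleotide_profile("ACGTACGTACGTTTGACA")
profile_b = tetranucleotide_profile("GGGCCCGGGCCCGGGCCC")
print(euclidean(profile_a, profile_b))
```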
17 of 27
many ways to combine coverage + K-mer profiles
• GroopM: http://minillinim.github.io/GroopM/
• DBB: https://github.com/dparks1134/DBB
• CONCOCT: https://github.com/BinPro/CONCOCT
• MetaWatt: http://sourceforge.net/projects/metawatt/
• MetaBAT: https://bitbucket.org/berkeleylab/metabat
18 of 27
MetaBAT overview
Kang et al., bioRxiv, 2014
19 of 27
MetaBAT: statistical model of tetranucleotide signatures
• Empirical parameters from ~1500 reference genomes
• Posterior probability that two contigs are from different genomes:

$$P(\mathrm{inter} \mid D) = \frac{\alpha\, P(D \mid \mathrm{inter})}{\alpha\, P(D \mid \mathrm{inter}) + P(D \mid \mathrm{intra})}$$

[Figure: curves over tetranucleotide distance D, including the probability P(inter|D); contig size = 10 kb]
Kang et al., bioRxiv, 2014
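
The posterior on this slide is a direct application of Bayes' rule and is easy to evaluate once the two likelihoods are known. In the sketch below the function name and the example likelihood values are assumptions. Reading α as the prior odds of the inter-genome case over the intra-genome case is also an assumption, one that is consistent with the form of the formula but is not stated on the slide.

```python
def posterior_inter(p_d_given_inter, p_d_given_intra, alpha=1.0):
    """P(inter | D) = alpha * P(D|inter) / (alpha * P(D|inter) + P(D|intra))."""
    numerator = alpha * p_d_given_inter
    denominator = numerator + p_d_given_intra
    if denominator == 0:
        raise ValueError("both likelihoods are zero for this distance")
    return numerator / denominator


# Hypothetical likelihoods of one observed tetranucleotide distance D.
print(posterior_inter(0.2, 0.8))             # equal weighting
print(posterior_inter(0.5, 0.5, alpha=3.0))  # inter-genome case weighted 3x
```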
20 of 27
rapidly filling out tree of life
• 60 bacterial phyla
• >3000 population genomes
• 23 habitats
• 51 phyla with population genome representatives
21 of 27
take away points
• Population genomes can be recovered from metagenomic samples
• K-mer profiles complement differential coverage signal
• Rapidly expanding reference genomes
• Improve gene-centric metagenomics
Assessing and Refining
Population Genomes
23 of 27
estimating quality of population genomes
Additional markers refine quality estimates
Scaffolds: Gammaproteobacteria sp., 80% complete, 20% contaminated
• 105 bacterial marker genes, estimates: 92% comp., 17% cont.
• 281 clade-specific marker genes, estimates: 83% comp., 22% cont.
Estimates ± 5%
Parks et al., Genome Res., 2015
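
The basic logic behind marker-based quality estimates can be sketched as follows: expected single-copy markers that are missing lower completeness, and markers found more than once raise contamination. This is a deliberately simplified per-marker calculation under assumed inputs. The function name and toy counts are hypothetical, and the approach published by Parks et al. is more elaborate than this.

```python
def estimate_quality(marker_counts, n_markers):
    """Return (completeness %, contamination %) from single-copy marker counts.

    marker_counts maps each marker found in the bin to its copy number;
    n_markers is the size of the expected marker set.
    """
    present = sum(1 for c in marker_counts.values() if c >= 1)
    extra_copies = sum(c - 1 for c in marker_counts.values() if c > 1)
    completeness = 100.0 * present / n_markers
    contamination = 100.0 * extra_copies / n_markers
    return completeness, contamination


# Toy bin: 10 expected markers, 8 found, 2 of those found twice.
toy_counts = {f"marker_{i}": 1 for i in range(6)}
toy_counts.update({"marker_6": 2, "marker_7": 2})
print(estimate_quality(toy_counts, n_markers=10))
```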
24 of 27
varying quality of recovered genomes
microbial community from coalbed methane well
[Figure: contigs plotted by coverage and tetranucleotide signature (PC1)]

Genome                      Comp. (%)  Cont. (%)  Length (Mbp)
Archaea
  Methanobacteriaceae 1     98.4       1.6        2.32
  Methanobacteriaceae 2     96.8       0.8        2.23
  Methanobacteriaceae 3     88.6       0.0        1.57
  Methanobacteriaceae 4     96.0       0.0        1.71
Bacteria
  Actinobacteria 1          95.0       0.9        2.56
  Actinobacteria 2          90.5       2.7        2.72
  Actinobacteria 3          88.4       2.7        2.48
  Clostridiales 1           92.6       9.4        2.91
  Clostridiales 2           80.2       0.0        2.74
  Elusimicrobia             95.7       2.2        2.03
  Thermodesulfovibrionaceae 83.9       0.0        2.66
  Syntrophus                92.9       0.8        2.31
  Rikenellaceae             86.7       2.3        2.72
  Candidate Phylum OP1      83.9       0.0        1.66
  Rhodocyclaceae            69.0       1.63       3.73
25 of 27
identifying potential contamination
95th percentile
outliers… treat with caution
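
A percentile cutoff of this kind can be sketched as follows: take a reference distribution of distances, find its 95th percentile, and flag any scaffold whose distance exceeds it. The function names, the nearest-rank percentile definition, and all values below are assumptions for illustration, since the slide does not spell out how the cutoff is applied.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def flag_outliers(distances, reference_distances, q=95):
    """Return names of scaffolds whose distance exceeds the q-th percentile."""
    threshold = percentile(reference_distances, q)
    return [name for name, d in distances.items() if d > threshold]


# Hypothetical reference distribution and per-scaffold distances.
reference = [i / 100 for i in range(1, 101)]
scaffolds = {"scaffold_1": 0.50, "scaffold_2": 0.97}
print(flag_outliers(scaffolds, reference))
```

Flagged scaffolds are candidates for closer inspection rather than automatic removal, in line with the "treat with caution" note.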
26 of 27
K-mer modeling: impact of evolution
Bacteria vs. Archaea
(Intra-genome 95th percentile; K=4)
Classes of Proteobacteria
(Intra-genome 95th percentiles; K=4)
27 of 27
final thoughts
• K-mers widely used in gene- and genome-centric metagenomics
• Population genomes substantially improving diversity of available reference genomes
   - Big win for taxonomic attribution methods
   - And CheckM, and many other bioinformatic programs
• How best to exploit population genomes
   - Looking at 100,000+ reference genomes in next few years
   - Issues in terms of scalability
   - Using ‘noisy’ population genomes raises interesting questions
Thank you!

Editor's Notes

  1. Basic metagenomics workflow; gene- and genome-centric metagenomics.
  2. Goal: assign taxonomy to metagenomic reads. Challenge: reads are short (currently 100 to 300 bp); >>100 million reads; limited reference genomes (~2000 finished; ~25,000 draft). Uses: profiling of microbial communities; preprocessing for assembly.
  3-6. Show benefits of combining signals. Show results of alternative K values. Lots of approaches. Naïve Bayes vs. IMM.
  7. Ideally, contigs from the same genome would have the same coverage and genomic signature. Of course, there is variation that needs to be modelled, leading to an interesting unsupervised or semi-supervised clustering problem.
  8. All these methods are unsupervised clustering algorithms utilizing differential coverage, k-mer profiles, and occasionally GC as features.
  9. Ideally, contigs from the same genome would have the same coverage and genomic signature. Of course, there is variation that needs to be modelled, leading to an interesting unsupervised or semi-supervised clustering problem.