SlideShare a Scribd company logo
1 of 34
Next-Generation
Genomics
Using Spark and ADAM
Timothy Danford
Tamr Inc.
AMPLab
Next
Generation?
We come in
peace.
What even is
genomics?
Organism Cell Genome
One chromosome
One chromosome
per person
One chromosome
per person
defines a
reference
chromosome
One chromosome
per person
defines a
reference
chromosome
and
location
“… decoding the Book of Life”
Ortellius, 1570
Google, 2005
Lambert et al. “Meta-analysis of 74,046 individuals identifies 11 new susceptibility loci for Alzheimer's disease” (2013)
Down the
Long Slide,
To
Happiness
Endlessly
We often treat
‘bioinformatics’ as a
black box
Vials into Files
What’s
In
The Box?
My God, It’s Full of
Pipelines
My God, It’s Full of
Pipelines
A Tale of Three File Formats
BAM Files: Do You Read
Me?
Compressed text files & custom index formats
User-defined attributes
Multi-record structure
“Not wishing to be outdone
by Amazon, Sanger
Institute develops drone
deliver system for BAM
files.”
Open the Pod Bay Doors,
Pal
I Had a Dream
It Would End This Way
What to do, what to do?
Bioinformaticians
❤️
Probabilistic
Models
Our Data Scattered Back and
Forth
Across Space by this Gadget
Why Are We Still Defining
File Formats By Hand?
• Instead of defining custom file
formats for each data type and
access pattern…
• Parquet creates a
compressed format for each
Avro-defined data model.
• Improvement over existing
formats1
• 20-22% for BAM
• ~95% for VCF
1
compression % quoted from 1K Genomes
Spark + Genomics =
ADAM
• Hosted at Berkeley and the
AMPLab
• Apache 2 License
• Contributors from both
research and commercial
organizations
• Core spatial primitives,
variant calling
• Avro and Parquet for data
models and file formats
Core Genomics Primitives:
The Needs of the Many
The Terrible Trouble
with Existing Pipelines
Cibulskis et al. “Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples” (2013)
“I think you know what the
problem is, just as well as I
do.”
A single piece of a
filtering stage for a
somatic variant caller
“11-base-pair window
centered on a candidate
mutation” actually turns
out to be optimized for
a particular file format
and sort order
“Myths of Bioinformatics
Software”
1. Somebody will build on your code
2. You should have assembled a team to build your software
3. If you choose the right license, more people will use and build on your software.
4. Making software free for commercial use shows you are not against companies.
5. You should maintain your software indefinitely
6. Your “stable URL” can exist forever
7. You should make your software “idiot proof”
8. You used the right programming language for the task.
Lior Pachter
https://liorpachter.wordpress.com/2015/07/10/the-myths-of-bioinformatics-software/
W
e
Can
M
ake
O
ur
O
w
n
M
yths
Thanks to...
And thank you! Questions?

More Related Content

What's hot

Design for Scalability in ADAM
Design for Scalability in ADAMDesign for Scalability in ADAM
Design for Scalability in ADAMfnothaft
 
Scalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAMScalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAMfnothaft
 
Scalable Genome Analysis With ADAM
Scalable Genome Analysis With ADAMScalable Genome Analysis With ADAM
Scalable Genome Analysis With ADAMfnothaft
 
Scalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAMScalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAMfnothaft
 
Managing Genomes At Scale: What We Learned - StampedeCon 2014
Managing Genomes At Scale: What We Learned - StampedeCon 2014Managing Genomes At Scale: What We Learned - StampedeCon 2014
Managing Genomes At Scale: What We Learned - StampedeCon 2014StampedeCon
 
A Closer Look at the Changing Dynamics of DBpedia Mappings
A Closer Look at the Changing Dynamics of DBpedia MappingsA Closer Look at the Changing Dynamics of DBpedia Mappings
A Closer Look at the Changing Dynamics of DBpedia MappingsMaribel Acosta Deibe
 
Crowdsourcing the Quality of Knowledge Graphs: A DBpedia Study
Crowdsourcing the Quality of Knowledge Graphs:A DBpedia StudyCrowdsourcing the Quality of Knowledge Graphs:A DBpedia Study
Crowdsourcing the Quality of Knowledge Graphs: A DBpedia StudyMaribel Acosta Deibe
 
Challenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsChallenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsYasin Memari
 
Datat and donuts: how to write a data management plan
Datat and donuts: how to write a data management planDatat and donuts: how to write a data management plan
Datat and donuts: how to write a data management planC. Tobin Magle
 
Lisa Johnson at #ICG13: Re-assembly, quality evaluation, and annotation of 67...
Lisa Johnson at #ICG13: Re-assembly, quality evaluation, and annotation of 67...Lisa Johnson at #ICG13: Re-assembly, quality evaluation, and annotation of 67...
Lisa Johnson at #ICG13: Re-assembly, quality evaluation, and annotation of 67...GigaScience, BGI Hong Kong
 
Spark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van HamSpark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van HamSpark Summit
 
Getting Started with RNA-Seq Data Analysis
Getting Started with RNA-Seq Data AnalysisGetting Started with RNA-Seq Data Analysis
Getting Started with RNA-Seq Data AnalysisAndreas Wilm
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibilityc.titus.brown
 
Next-generation sequencing: Data mangement
Next-generation sequencing: Data mangementNext-generation sequencing: Data mangement
Next-generation sequencing: Data mangementGuy Coates
 
Future Architectures for genomics
Future Architectures for genomicsFuture Architectures for genomics
Future Architectures for genomicsGuy Coates
 
Introduction to r
Introduction to rIntroduction to r
Introduction to rgslicraf
 
Algorithms and Tools for Genomic Analysis on Spark: Spark Summit East talk by...
Algorithms and Tools for Genomic Analysis on Spark: Spark Summit East talk by...Algorithms and Tools for Genomic Analysis on Spark: Spark Summit East talk by...
Algorithms and Tools for Genomic Analysis on Spark: Spark Summit East talk by...Spark Summit
 

What's hot (20)

Design for Scalability in ADAM
Design for Scalability in ADAMDesign for Scalability in ADAM
Design for Scalability in ADAM
 
Scalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAMScalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAM
 
Scalable Genome Analysis With ADAM
Scalable Genome Analysis With ADAMScalable Genome Analysis With ADAM
Scalable Genome Analysis With ADAM
 
Scalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAMScalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAM
 
Managing Genomes At Scale: What We Learned - StampedeCon 2014
Managing Genomes At Scale: What We Learned - StampedeCon 2014Managing Genomes At Scale: What We Learned - StampedeCon 2014
Managing Genomes At Scale: What We Learned - StampedeCon 2014
 
A Closer Look at the Changing Dynamics of DBpedia Mappings
A Closer Look at the Changing Dynamics of DBpedia MappingsA Closer Look at the Changing Dynamics of DBpedia Mappings
A Closer Look at the Changing Dynamics of DBpedia Mappings
 
Crowdsourcing the Quality of Knowledge Graphs: A DBpedia Study
Crowdsourcing the Quality of Knowledge Graphs:A DBpedia StudyCrowdsourcing the Quality of Knowledge Graphs:A DBpedia Study
Crowdsourcing the Quality of Knowledge Graphs: A DBpedia Study
 
Challenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsChallenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data Genomics
 
Datat and donuts: how to write a data management plan
Datat and donuts: how to write a data management planDatat and donuts: how to write a data management plan
Datat and donuts: how to write a data management plan
 
Lisa Johnson at #ICG13: Re-assembly, quality evaluation, and annotation of 67...
Lisa Johnson at #ICG13: Re-assembly, quality evaluation, and annotation of 67...Lisa Johnson at #ICG13: Re-assembly, quality evaluation, and annotation of 67...
Lisa Johnson at #ICG13: Re-assembly, quality evaluation, and annotation of 67...
 
Spark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van HamSpark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van Ham
 
Lisp Machine Prunciples
Lisp Machine PrunciplesLisp Machine Prunciples
Lisp Machine Prunciples
 
Ngsp
NgspNgsp
Ngsp
 
Automating biostatistics workflows using R-based webtools
Automating biostatistics workflows using R-based webtoolsAutomating biostatistics workflows using R-based webtools
Automating biostatistics workflows using R-based webtools
 
Getting Started with RNA-Seq Data Analysis
Getting Started with RNA-Seq Data AnalysisGetting Started with RNA-Seq Data Analysis
Getting Started with RNA-Seq Data Analysis
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibility
 
Next-generation sequencing: Data mangement
Next-generation sequencing: Data mangementNext-generation sequencing: Data mangement
Next-generation sequencing: Data mangement
 
Future Architectures for genomics
Future Architectures for genomicsFuture Architectures for genomics
Future Architectures for genomics
 
Introduction to r
Introduction to rIntroduction to r
Introduction to r
 
Algorithms and Tools for Genomic Analysis on Spark: Spark Summit East talk by...
Algorithms and Tools for Genomic Analysis on Spark: Spark Summit East talk by...Algorithms and Tools for Genomic Analysis on Spark: Spark Summit East talk by...
Algorithms and Tools for Genomic Analysis on Spark: Spark Summit East talk by...
 

Viewers also liked

Processing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilProcessing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilSpark Summit
 
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGIHadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGIAllen Day, PhD
 
Distributed machine learning 101 using apache spark from the browser
Distributed machine learning 101 using apache spark from the browserDistributed machine learning 101 using apache spark from the browser
Distributed machine learning 101 using apache spark from the browserAndy Petrella
 
Genomics isn't Special
Genomics isn't SpecialGenomics isn't Special
Genomics isn't SpecialAllen Day, PhD
 
Building cloud-enabled genomics workflows with Luigi and Docker
Building cloud-enabled genomics workflows with Luigi and DockerBuilding cloud-enabled genomics workflows with Luigi and Docker
Building cloud-enabled genomics workflows with Luigi and DockerJacob Feala
 
Genome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMGenome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMAllen Day, PhD
 
Deep Advances in Generative Modeling
Deep Advances in Generative ModelingDeep Advances in Generative Modeling
Deep Advances in Generative Modelingindico data
 
Deep Learning And Business Models (VNITC 2015-09-13)
Deep Learning And Business Models (VNITC 2015-09-13)Deep Learning And Business Models (VNITC 2015-09-13)
Deep Learning And Business Models (VNITC 2015-09-13)Ha Phuong
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreUri Laserson
 

Viewers also liked (9)

Processing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilProcessing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And Toil
 
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGIHadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
 
Distributed machine learning 101 using apache spark from the browser
Distributed machine learning 101 using apache spark from the browserDistributed machine learning 101 using apache spark from the browser
Distributed machine learning 101 using apache spark from the browser
 
Genomics isn't Special
Genomics isn't SpecialGenomics isn't Special
Genomics isn't Special
 
Building cloud-enabled genomics workflows with Luigi and Docker
Building cloud-enabled genomics workflows with Luigi and DockerBuilding cloud-enabled genomics workflows with Luigi and Docker
Building cloud-enabled genomics workflows with Luigi and Docker
 
Genome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMGenome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAM
 
Deep Advances in Generative Modeling
Deep Advances in Generative ModelingDeep Advances in Generative Modeling
Deep Advances in Generative Modeling
 
Deep Learning And Business Models (VNITC 2015-09-13)
Deep Learning And Business Models (VNITC 2015-09-13)Deep Learning And Business Models (VNITC 2015-09-13)
Deep Learning And Business Models (VNITC 2015-09-13)
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant Store
 

Similar to Strata-Hadoop 2015 Presentation

ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014fnothaft
 
Closing the Gap in Time: From Raw Data to Real Science
Closing the Gap in Time: From Raw Data to Real ScienceClosing the Gap in Time: From Raw Data to Real Science
Closing the Gap in Time: From Raw Data to Real ScienceJustin Johnson
 
Standarization in Proteomics: From raw data to metadata files
Standarization in Proteomics: From raw data to metadata filesStandarization in Proteomics: From raw data to metadata files
Standarization in Proteomics: From raw data to metadata filesYasset Perez-Riverol
 
Data analysis & integration challenges in genomics
Data analysis & integration challenges in genomicsData analysis & integration challenges in genomics
Data analysis & integration challenges in genomicsmikaelhuss
 
B.sc biochem i bobi u 2 database
B.sc biochem i bobi u 2 databaseB.sc biochem i bobi u 2 database
B.sc biochem i bobi u 2 databaseRai University
 
The seven-deadly-sins-of-bioinformatics3960
The seven-deadly-sins-of-bioinformatics3960The seven-deadly-sins-of-bioinformatics3960
The seven-deadly-sins-of-bioinformatics3960mare34
 
The Seven Deadly Sins of Bioinformatics
The Seven Deadly Sins of BioinformaticsThe Seven Deadly Sins of Bioinformatics
The Seven Deadly Sins of BioinformaticsDuncan Hull
 
amrapali builders -- hacking the genome.pdf
amrapali builders -- hacking the genome.pdfamrapali builders -- hacking the genome.pdf
amrapali builders -- hacking the genome.pdfamrapalibuildersreviews
 
2012 hpcuserforum talk
2012 hpcuserforum talk2012 hpcuserforum talk
2012 hpcuserforum talkc.titus.brown
 
Finding knowledge, data and answers on the Semantic Web
Finding knowledge, data and answers on the Semantic WebFinding knowledge, data and answers on the Semantic Web
Finding knowledge, data and answers on the Semantic Webebiquity
 
Next generation sequencing requires next generation publishing: the Biodivers...
Next generation sequencing requires next generation publishing: the Biodivers...Next generation sequencing requires next generation publishing: the Biodivers...
Next generation sequencing requires next generation publishing: the Biodivers...Vince Smith
 
Ontologies for life sciences: examples from the gene ontology
Ontologies for life sciences: examples from the gene ontologyOntologies for life sciences: examples from the gene ontology
Ontologies for life sciences: examples from the gene ontologyMelanie Courtot
 
2013 caltech-edrn-talk
2013 caltech-edrn-talk2013 caltech-edrn-talk
2013 caltech-edrn-talkc.titus.brown
 
Bioinformatics Final Report
Bioinformatics Final ReportBioinformatics Final Report
Bioinformatics Final ReportShruthi Choudary
 
Brains, Data, and Machine Intelligence (2014 04 14 London Meetup)
Brains, Data, and Machine Intelligence (2014 04 14 London Meetup)Brains, Data, and Machine Intelligence (2014 04 14 London Meetup)
Brains, Data, and Machine Intelligence (2014 04 14 London Meetup)Numenta
 
An-Exploration-of-scientific-literature-using-Natural-Language-Processing
An-Exploration-of-scientific-literature-using-Natural-Language-ProcessingAn-Exploration-of-scientific-literature-using-Natural-Language-Processing
An-Exploration-of-scientific-literature-using-Natural-Language-ProcessingTheodore J. LaGrow
 
Thoughts on the feasibility of an Assemblathon 3 contest
Thoughts on the feasibility of an Assemblathon 3 contestThoughts on the feasibility of an Assemblathon 3 contest
Thoughts on the feasibility of an Assemblathon 3 contestKeith Bradnam
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and KnowledgeIan Foster
 

Similar to Strata-Hadoop 2015 Presentation (20)

ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014
 
2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
 
Closing the Gap in Time: From Raw Data to Real Science
Closing the Gap in Time: From Raw Data to Real ScienceClosing the Gap in Time: From Raw Data to Real Science
Closing the Gap in Time: From Raw Data to Real Science
 
Standarization in Proteomics: From raw data to metadata files
Standarization in Proteomics: From raw data to metadata filesStandarization in Proteomics: From raw data to metadata files
Standarization in Proteomics: From raw data to metadata files
 
Data analysis & integration challenges in genomics
Data analysis & integration challenges in genomicsData analysis & integration challenges in genomics
Data analysis & integration challenges in genomics
 
B.sc biochem i bobi u 2 database
B.sc biochem i bobi u 2 databaseB.sc biochem i bobi u 2 database
B.sc biochem i bobi u 2 database
 
The seven-deadly-sins-of-bioinformatics3960
The seven-deadly-sins-of-bioinformatics3960The seven-deadly-sins-of-bioinformatics3960
The seven-deadly-sins-of-bioinformatics3960
 
The Seven Deadly Sins of Bioinformatics
The Seven Deadly Sins of BioinformaticsThe Seven Deadly Sins of Bioinformatics
The Seven Deadly Sins of Bioinformatics
 
amrapali builders -- hacking the genome.pdf
amrapali builders -- hacking the genome.pdfamrapali builders -- hacking the genome.pdf
amrapali builders -- hacking the genome.pdf
 
2012 hpcuserforum talk
2012 hpcuserforum talk2012 hpcuserforum talk
2012 hpcuserforum talk
 
Finding knowledge, data and answers on the Semantic Web
Finding knowledge, data and answers on the Semantic WebFinding knowledge, data and answers on the Semantic Web
Finding knowledge, data and answers on the Semantic Web
 
Next generation sequencing requires next generation publishing: the Biodivers...
Next generation sequencing requires next generation publishing: the Biodivers...Next generation sequencing requires next generation publishing: the Biodivers...
Next generation sequencing requires next generation publishing: the Biodivers...
 
Ontologies for life sciences: examples from the gene ontology
Ontologies for life sciences: examples from the gene ontologyOntologies for life sciences: examples from the gene ontology
Ontologies for life sciences: examples from the gene ontology
 
2013 caltech-edrn-talk
2013 caltech-edrn-talk2013 caltech-edrn-talk
2013 caltech-edrn-talk
 
2014 ucl
2014 ucl2014 ucl
2014 ucl
 
Bioinformatics Final Report
Bioinformatics Final ReportBioinformatics Final Report
Bioinformatics Final Report
 
Brains, Data, and Machine Intelligence (2014 04 14 London Meetup)
Brains, Data, and Machine Intelligence (2014 04 14 London Meetup)Brains, Data, and Machine Intelligence (2014 04 14 London Meetup)
Brains, Data, and Machine Intelligence (2014 04 14 London Meetup)
 
An-Exploration-of-scientific-literature-using-Natural-Language-Processing
An-Exploration-of-scientific-literature-using-Natural-Language-ProcessingAn-Exploration-of-scientific-literature-using-Natural-Language-Processing
An-Exploration-of-scientific-literature-using-Natural-Language-Processing
 
Thoughts on the feasibility of an Assemblathon 3 contest
Thoughts on the feasibility of an Assemblathon 3 contestThoughts on the feasibility of an Assemblathon 3 contest
Thoughts on the feasibility of an Assemblathon 3 contest
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
 

Recently uploaded

SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...KarteekMane1
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Milind Agarwal
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingsocarem879
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 

Recently uploaded (20)

SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processing
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 

Strata-Hadoop 2015 Presentation

Editor's Notes

  1. I’m nervous, so I’ll be speaking fast. Before we dive in, let me ask a couple of questions: biologists? Spark experts? This entire presentation is a lie. There are always at least three different constituencies in the room: * biologists * programmers * someone thinking about how to build a business around this I am going to try and split the difference, but I won’t be able to satisfy everyone. In all the places where I have to skip over the truth, maybe there will be at least a breadcrumb back to the truth This isn’t a technical talk. Let’s talk about the title –
  2. Next generations? I didn’t realize that there was a *first* generation! Bioinformatics is a field with a long history, thirty or more years as a separate discipline. At the same time, the fundamental technology is changing. So if I talk about ‘problems’ today, it’s OK [animation] I come in peace! Bioinformatics software development has been *remarkalbly* effective, for decades. If there are problems to be solved, these are the result of new technologies, new conceptions of scale. So that’s “next generation,” but what about…
  3. Genomics? What even is genomics? Who here has heard the terms ‘chromosome’ and ‘gene’ before, and could explain the difference? So before we dive into the main part of the talk, I’m going to spend a few minutes discussing some of the basic biological concepts.
  4. Fundamentally, we’re interested in studying individuals (and populations of individuals) Each individual is *itself* a population: of cells But each of those cells has, ideally, an identical genome. The genome is a collection of 23 molecules. These are called ‘polymers,’ they’re built (like Legos) out of a small number of repeated interlocking parts – these are the A, T, G, and C you’ve probably heard about. The content of the genome is determined by the linear order in which these letters are arranged. (Linear is important!)
  5. Now, not only do all the cells in your body have identical genomes… [ANIMATE] But individual humans have genomes that are very similar to each other. So similar that I can define “the same” chromosome between individuals… and that means [ANIMATE] That we can define a ‘base’ or a ‘reference’ chromosome [ANIMATE] And a concept of ‘location’ across chromosomes. This is maybe the most important concept in genome informatics, the idea that DNA defines a common linear coordinate system. This means that we can talk about differences between individuals in terms of diffs to a common reference genome. But where does this reference genome come from?
  6. Now, not only do all the cells in your body have identical genomes… [ANIMATE] But individual humans have genomes that are very similar to each other. So similar that I can define “the same” chromosome between individuals… and that means [ANIMATE] That we can define a ‘base’ or a ‘reference’ chromosome [ANIMATE] And a concept of ‘location’ across chromosomes. This is maybe the most important concept in genome informatics, the idea that DNA defines a common linear coordinate system. This means that we can talk about differences between individuals in terms of diffs to a common reference genome. But where does this reference genome come from?
  7. Now, not only do all the cells in your body have identical genomes… [ANIMATE] But individual humans have genomes that are very similar to each other. So similar that I can define “the same” chromosome between individuals… and that means [ANIMATE] That we can define a ‘base’ or a ‘reference’ chromosome [ANIMATE] And a concept of ‘location’ across chromosomes. This is maybe the most important concept in genome informatics, the idea that DNA defines a common linear coordinate system. This means that we can talk about differences between individuals in terms of diffs to a common reference genome. But where does this reference genome come from?
  8. Now, not only do all the cells in your body have identical genomes… [ANIMATE] But individual humans have genomes that are very similar to each other. So similar that I can define “the same” chromosome between individuals… and that means [ANIMATE] That we can define a ‘base’ or a ‘reference’ chromosome [ANIMATE] And a concept of ‘location’ across chromosomes. This is maybe the most important concept in genome informatics, the idea that DNA defines a common linear coordinate system. This means that we can talk about differences between individuals in terms of diffs to a common reference genome. But where does this reference genome come from?
  9. Here is Bill Clinton (and Craig Venter and Francis Crick), announcing in June of 2000 the “rough draft” of the Human Genome – this is the Human Genome Project.
  10. 1570: Theatrum Orbis Terrarum “Theater of the world” First modern atlas. A direct byproduct of the first 100 years of PRINTING, and a tool for describing and exploring the world around us. It’s direct descendants are still with us, today!
  11. Google maps! But what does the genomic version of this look like?
  12. Mapmakers today focus on *annotation* of the maps themselves. The core technologies are 2D planar and spherical geometry, geometric operations composed out of latitudes and longitudes.
  13. This is a manhattan plot, of alzheimer’s related genes and sequence markers. Now let’s shift gears, and talk about how this was performed – through sequencers. Sequencers are microscopes that read the genome.
  14. If there’s one graph you should remember, in order to understand the last (and the next) ten years of bioinformatics and genomics, it’s this one The Human Genome Project was thousands of researchers, billions of dollars, spent over a decade, all to sequence on-the-order-of half a dozen individuals. Today, we’re close to the “thousand dollar genome” – and already we’re seeing prototype sequencers with the form factor of a USB stick. So sequencers will drive everything before it – but sequencers are only ever half the story.
  15. Bioinformatics is a computational reversal of the sequencing process. [ANIMATE] But to most
  16. So… what’s in the box?
  17. It’s a pipeline! (Makes sense, since I’m also name-checking Spark, right?) It’s never *one* pipeline, we do this once for every person Let me talk a little bit about the structure of one of these pipelines Each step is typically written as a standalone program – passing files from stage to stage – often using something like unix pipes These are written as part of a globally-distrbuted research program, by researchers and grad students around the world, who have to assume the lowest common denominator: command line and filesystem But of course, it’s never one pipeline [ANIMATE] It’s a pipeline per person [ANIMATE] But since each pipeline runs (essentially) serially, scaling up is easy: scale out! [ANIMATE]
  18. It’s a pipeline! (Makes sense, since I’m also name-checking Spark, right?) It’s never *one* pipeline, we do this once for every person Let me talk a little bit about the structure of one of these pipelines Each step is typically written as a standalone program – passing files from stage to stage – often using something like unix pipes These are written as part of a globally-distrbuted research program, by researchers and grad students around the world, who have to assume the lowest common denominator: command line and filesystem But of course, it’s never one pipeline [ANIMATE] It’s a pipeline per person [ANIMATE] But since each pipeline runs (essentially) serially, scaling up is easy: scale out! [ANIMATE]
  19. It’s a pipeline! (Makes sense, since I’m also name-checking Spark, right?) It’s never *one* pipeline, we do this once for every person Let me talk a little bit about the structure of one of these pipelines Each step is typically written as a standalone program – passing files from stage to stage – often using something like unix pipes These are written as part of a globally-distrbuted research program, by researchers and grad students around the world, who have to assume the lowest common denominator: command line and filesystem But of course, it’s never one pipeline [ANIMATE] It’s a pipeline per person [ANIMATE] But since each pipeline runs (essentially) serially, scaling up is easy: scale out! [ANIMATE]
  20. That was the data side, but let’s open up the computation as well. Take one of those boxes, that I drew earlier. Here’s alignment, but it could be… [ANIMATE] any bioinformatics tool. I assert that there are *two* things going on inside any bioinformatics tool – [ANIMATE] There is the method, and there is the implementation of that method. I think this is an important distinction to make… But even that is a lie, because there is a third thing… [ANIMATE] “Platform.” That’s why I’ve included this code snippet up above. So what’s the problem? Faster sequencers means we sequence more people, but we have tools that work and a natural path to parallelism! Why does there need to be a “next generation?” The answer, of course, is that when you have all that data, you want to *USE* all that data.
  21. When you want to *use* all the data, now your entire system will start to show cracks. This is an example, variant calling. But [ANIMATE] God help you if you want to combine statistical information at an earlier phase of the process. But this is by no means a unique problem. And what is one solution? You might have guessed it from the title to my talk….
  22. There’s more parallelism that we can extract from our pipelines.
  23. Spark. The reason this works is that Spark naturally handles pipelines, and automatically performs shuffles when appropriate, but also…