Lightning fast genomics 
With Spark and ADAM
Who are we? 
Andy 
@Noootsab 
@NextLab_be 
@Wajug co-driver 
@Devoxx4Kids organizer 
Maths & CS 
Data lover: geo, open, massive 
Fool 
Xavier 
@xtordoir 
SilicoCloud 
-> Physics 
-> Data analysis 
-> genomics 
-> scalable systems 
-> ...
Genomics 
What is genomics about? 
Medical Diagnostics 
Drug response 
Disease mechanisms
Genomics 
What is genomics about? 
- A human genome is a sequence of 3 billion nucleotides (“bases”) 
- About 1 base in 1000 varies across the human population 
- Genomes encode bio-molecules (tens of thousands) 
- These molecules interact together 
...and with environment 
→ Biological systems are very complex
Genomics 
State of the art 
- growing technological capacity 
- cost reduction 
- growing data
Genomics 
State of the art 
- I.T. becomes the bottleneck (cost and latency) 
- data is sacrificed through sampling or cut-offs 
ref: Andrea Sboner et al.
Genomics 
Blocking points 
- “legacy stack” not designed to scale (C, Perl, …) 
- HPC approach is not a good fit (the workload is data intensive)
Genomics 
Future of genomics 
- Personal genomes (e.g. 1,000,000 genomes for cancer 
research) 
- New sequencing technologies 
- Sequence “stuff” as needed (e.g. microbiome, 
diagnostics) 
- medicalCondition = f(genomics, environmentHistory)
Genomics 
Need for scalability → Scala & Spark 
Need for simplicity and clarity → ADAM
Parquet 101 
Columnar storage 
Row oriented 
Column oriented
Parquet 101 
Columnar storage 
> Homogeneous collocated data 
> Better range access 
> Better encoding
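A minimal sketch of the two layouts, in plain Python rather than Parquet itself. The sample records and the `run_length_encode` helper are illustrative assumptions; the point is only that collocating the values of one column makes them easy to encode compactly.

```python
from itertools import groupby

# Hypothetical records: (sample, chromosome, genotype code)
rows = [
    ("s1", "chr6", 0),
    ("s2", "chr6", 0),
    ("s3", "chr6", 1),
    ("s4", "chr6", 1),
]

# Row oriented: values of different types are interleaved.
row_layout = [value for row in rows for value in row]

# Column oriented: each column is stored contiguously.
columns = [list(column) for column in zip(*rows)]


def run_length_encode(values):
    """Collapse runs of equal values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


print(run_length_encode(columns[1]))  # [('chr6', 4)]
print(run_length_encode(columns[2]))  # [(0, 2), (1, 2)]
```

The same encoder applied to `row_layout` finds no runs at all, since neighbouring values belong to different columns.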
Parquet 101 
Efficient encoding of nested typed structures 
message Document { 
required int64 DocId; 
optional group Links { 
repeated int64 Backward; 
repeated int64 Forward; 
} 
repeated group Name { 
repeated group Language { 
required string Code; 
optional string Country; 
} 
optional string Url; 
} 
}
Nested structure → Tree 
Empty levels → Branch pruning 
Repetitions → Metadata (index) 
Types → Safe/Fast codec
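As a rough illustration of the "nested structure → tree" idea, the sketch below walks the Document schema and lists its leaves; each leaf corresponds to one stored column. The dict representation and the `leaf_columns` helper are assumptions made for this example, and the repetition and definition levels that Parquet stores alongside each column are left out.

```python
# The Document schema from the slide, as nested dicts (None marks a leaf field).
schema = {
    "DocId": None,
    "Links": {"Backward": None, "Forward": None},
    "Name": {
        "Language": {"Code": None, "Country": None},
        "Url": None,
    },
}


def leaf_columns(node, prefix=()):
    """Walk the schema tree and yield one dotted path per leaf."""
    for name, child in node.items():
        path = prefix + (name,)
        if child is None:
            yield ".".join(path)
        else:
            yield from leaf_columns(child, path)


columns = list(leaf_columns(schema))
print(columns)
```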
Parquet 101 
Efficient encoding of nested typed structures 
ref: https://blog.twitter.com/2013/dremel-made-simple-with-parquet
Parquet 101 
Optimized distributed storage (e.g. in HDFS) 
ref: http://grepalex.com/2014/05/13/parquet-file-format-and-object-model/
Parquet 101 
Efficient (schema based) serialization: AVRO 
JSON Schema: 
{ 
"namespace": "example.avro", 
"type": "record", 
"name": "User", 
"fields": [ 
{"name": "name", "type": "string"}, 
{"name": "favorite_number", "type": ["int", "null"]}, 
{"name": "favorite_color", "type": ["string", "null"]} 
] 
} 
IDL: 
record User { 
string name; 
union { null, int } favorite_number = null; 
union { null, string } favorite_color = null; 
}
Parquet 101 
Efficient (schema based) serialization: AVRO 
The JSON schema is part of the: 
● protocol 
● serialization 
→ less metadata 
Define: IDL → JSON 
Send: Binary → JSON
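To make "less metadata" concrete, here is a hedged sketch of how one User datum could be laid out in Avro's binary encoding, hand-written in plain Python instead of using an Avro library. The helper names and the sample user are assumptions for illustration. Field names never appear in the output because reader and writer share the schema.

```python
import json


def zigzag_varint(n):
    """Encode an integer the way Avro encodes int/long: zigzag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string(s):
    """A string is its byte length (as a long) followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data


def encode_user(name, favorite_number, favorite_color):
    """Encode a User; a union is the branch index followed by the value.

    Branch order follows the JSON schema: value type is branch 0, "null" is branch 1.
    """
    out = encode_string(name)
    if favorite_number is None:
        out += zigzag_varint(1)
    else:
        out += zigzag_varint(0) + zigzag_varint(favorite_number)
    if favorite_color is None:
        out += zigzag_varint(1)
    else:
        out += zigzag_varint(0) + encode_string(favorite_color)
    return out


# Hypothetical sample record.
user = {"name": "Alyssa", "favorite_number": 256, "favorite_color": None}
binary = encode_user(**user)
as_json = json.dumps(user).encode("utf-8")
print(binary.hex(), len(binary), len(as_json))
```

This covers only the encoding of a single datum; container file headers and schema resolution are out of scope here.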
ADAM 
Credits: AmpLab (UC Berkeley)
ADAM 
Overview (Sequencing) 
- DNA is a molecule 
…or a Seq[Char] 
(A, T, G, C) alphabet
ADAM 
Sequencing 
- Massively parallel sequencing of random reads of 100-150 
bases (20,000,000 reads per genome) 
- 30-60x coverage for quality 
- All this mess must be re-organised! 
→ ADAM
ADAM 
Variant Calling 
- From an organized set of reads (ADAM Pileup) 
- Detect variants (Variant Calling) 
→ AVOCADO
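The pileup idea can be sketched in a few lines. This is a deliberately naive majority-vote caller written for illustration, not the algorithm used by ADAM or AVOCADO; the reference string, the reads and the `min_depth` threshold are all made up.

```python
from collections import Counter, defaultdict

reference = "ACGTACGT"

# Hypothetical aligned reads: (0-based start position on the reference, bases)
reads = [
    (0, "ACGA"),
    (2, "GAAC"),
    (3, "AACG"),
    (4, "ACGT"),
]


def pileup(reads):
    """Group the bases of all reads by the reference position they cover."""
    columns = defaultdict(Counter)
    for start, bases in reads:
        for offset, base in enumerate(bases):
            columns[start + offset][base] += 1
    return columns


def call_variants(reference, columns, min_depth=2):
    """Report positions where the majority base differs from the reference."""
    variants = []
    for position in sorted(columns):
        counts = columns[position]
        base, support = counts.most_common(1)[0]
        depth = sum(counts.values())
        if depth >= min_depth and base != reference[position]:
            variants.append((position, reference[position], base, support, depth))
    return variants


# Each tuple: (position, reference base, observed base, supporting reads, depth)
print(call_variants(reference, pileup(reads)))  # [(3, 'T', 'A', 3, 3)]
```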
ADAM 
Genomics specifications 
- SAM, BAM, VCF 
- Indexable 
- libraries 
- ~ scalable: hadoop-bam
ADAM 
ADAM model 
- schema based (Avro), libraries are generated 
- no storage spec here!
ADAM 
ADAM model 
- Parquet storage 
- evenly distribute data 
- storage optimized for read/query 
- better compression
ADAM 
ADAM API 
- ADAMContext provides functions to read from HDFS
ADAM 
ADAM API 
- Scala classes generated from Avro 
- Data loaded as RDDs (Spark’s Resilient Distributed 
Datasets) 
- functions on RDDs (write to HDFS, genomic objects 
manipulations)
ADAM 
ADAM API 
- e.g. reading genotypes
ADAM 
ADAM Benchmark 
- It scales! 
- Data is more compact 
- Read perf is better 
- Code is simpler
Stratification using 1000Genomes 
As usual… let’s get some data. 
Genomes relate to health and are private. 
Still, there are options!
Stratification using 1000Genomes 
http://www.1000genomes.org/ 
(Nowadays targeting 2000 genomes) 
ref: http://upload.wikimedia.org/wikipedia/en/e/eb/Genetic_Variation.jpg
Stratification using 1000Genomes 
Study genetic variations in populations (needs 
more contextual data for healthcare). 
To validate the interest in ADAM, we’ll do some 
qualitative exploration of the data. 
Question: is it possible to predict which 
subpopulation a given genome belongs to?
Stratification using 1000Genomes 
We can run an unsupervised algorithm on a 
massive number of genomes. 
The idea is to find clusters that would match 
subpopulations. 
This matters because the clusters reflect 
population histories: gene flows, selection, ...
Stratification using 1000Genomes 
From the 200 TB of data, we’ll focus on 
chromosome 6, and only on its variants 
ref: http://en.wikipedia.org/wiki/Chromosome
Genome Data 
Data structure 
Panel: Map[SampleID, Population]
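A small sketch of building such a panel map. The tab-separated layout with `sample` and `pop` columns is an assumption about the input file, and the three rows are taken from the sample IDs shown later in the deck.

```python
import csv
import io

# Hypothetical excerpt in the tab-separated layout assumed for the panel file.
panel_text = (
    "sample\tpop\n"
    "HG00120\tGBR\n"
    "NA20332\tASW\n"
    "NA18560\tCHB\n"
)


def read_panel(handle, populations):
    """Return {sample_id: population}, keeping only the requested populations."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["sample"]: row["pop"]
        for row in reader
        if row["pop"] in populations
    }


panel = read_panel(io.StringIO(panel_text), {"GBR", "ASW", "CHB"})
print(panel.get("NA20332"))  # ASW
```

Looking a sample up with `get` returns `None` for unknown IDs, which mirrors the `Some(...)` wrapping visible in the prediction output.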
Genome Data 
Data structure 
Genotypes in VCF format 
Basically a text file. Ours were downloaded from S3. 
Converted to ADAM Genotypes
Machine Learning model 
Clustering: KMeans 
ref: http://en.wikipedia.org/wiki/K-means_clustering
Machine Learning model 
Clustering: KMeans 
PreProcess = {A,C,T,G}² → {0,1,2} 
Space = {0,1,2}¹⁷⁰⁰⁰⁰⁰⁰⁰ 
Distance = Euclidean (L2) ⁽*⁾ 
⁽*⁾ MLlib restriction, although here L2 ~ L1 
SPARK-3012 
ref: http://en.wikipedia.org/wiki/K-means_clustering
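One possible reading of the preprocessing step, sketched in plain Python: each genotype (a pair of alleles) is mapped to the number of alleles that differ from the reference base. The slide does not spell out the mapping, so this encoding, the reference bases and the two samples are assumptions.

```python
def encode_genotype(alleles, reference_base):
    """Map a pair of alleles to 0, 1 or 2: the number of non-reference alleles."""
    return sum(1 for allele in alleles if allele != reference_base)


def squared_l2(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


# Hypothetical reference bases at three variant sites, and two samples.
reference_bases = ["A", "C", "G"]
sample_a = [("A", "A"), ("C", "T"), ("G", "G")]
sample_b = [("A", "G"), ("C", "C"), ("G", "G")]

vector_a = [encode_genotype(g, r) for g, r in zip(sample_a, reference_bases)]
vector_b = [encode_genotype(g, r) for g, r in zip(sample_b, reference_bases)]

print(vector_a, vector_b)                                    # [0, 1, 0] [1, 0, 0]
print(squared_l2(vector_a, vector_b), l1(vector_a, vector_b))  # 2 2
```

Whenever two vectors differ by at most 1 in every coordinate, squared L2 and L1 coincide, which may be what the "L2 ~ L1" remark alludes to.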
Machine Learning model 
MLlib, KMeans 
MLlib: 
● Machine Learning Algorithms 
● Data structures (e.g. Vector)
Machine Learning model 
MLlib KMeans 
DataFrame Map: 
● key = Sample 
● value = Vector of Genotypes alleles (sorted by Variant)
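For intuition, here is a tiny stand-alone version of the clustering step using Lloyd's algorithm in plain Python. It is not MLlib's implementation; the sample names, the vectors and the fixed initial centroids are invented so the run is deterministic.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def closest(point, centroids):
    """Index of the centroid nearest to point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def mean(points):
    return [sum(column) / len(points) for column in zip(*points)]


def kmeans(vectors, centroids, iterations=10):
    """Lloyd's algorithm: assign vectors to the nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for vector in vectors:
            clusters[closest(vector, centroids)].append(vector)
        # Keep the previous centroid when a cluster ends up empty.
        centroids = [mean(c) if c else old for c, old in zip(clusters, centroids)]
    return centroids


# Hypothetical genotype vectors keyed by sample, as in the Map described above.
samples = {
    "s1": [0, 0, 2, 2],
    "s2": [0, 1, 2, 2],
    "s3": [2, 2, 0, 0],
    "s4": [2, 1, 0, 0],
}
vectors = list(samples.values())
centroids = kmeans(vectors, centroids=[vectors[0], vectors[2]])
assignment = {name: closest(vector, centroids) for name, vector in samples.items()}
print(assignment)  # {'s1': 0, 's2': 0, 's3': 1, 's4': 1}
```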
Mashup 
prediction 
Sample [NA20332] is in cluster #0 for population Some(ASW) 
Sample [NA20334] is in cluster #2 for population Some(ASW) 
Sample [HG00120] is in cluster #2 for population Some(GBR) 
Sample [NA18560] is in cluster #1 for population Some(CHB)
Mashup 
Population  #0  #1  #2 
GBR          0   0  89 
ASW         54   0   7 
CHB          0  97   0
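A table like this can be derived by counting (population, cluster) pairs. The sketch below uses only the four predictions printed on the prediction slide, so its counts are tiny and do not reproduce the full figures; the `table` helper is an assumption for illustration.

```python
from collections import Counter

# The four predictions from the prediction slide: sample -> (cluster, population)
predictions = {
    "NA20332": (0, "ASW"),
    "NA20334": (2, "ASW"),
    "HG00120": (2, "GBR"),
    "NA18560": (1, "CHB"),
}

counts = Counter((population, cluster) for cluster, population in predictions.values())


def table(counts, populations, clusters):
    """One row per population, one column per cluster."""
    return {
        population: [counts[(population, cluster)] for cluster in clusters]
        for population in populations
    }


result = table(counts, ["GBR", "ASW", "CHB"], [0, 1, 2])
for population, row in result.items():
    print(population, *row)
```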
Cluster 
4 m3.xlarge instances (ec2) 
16 cores + 60G
Cluster 
Performance
Cluster 
40 m3.xlarge 
160 cores + 600G
Conclusions and future work 
● ADAM and Spark provide tools to 
manipulate genomics data in a scalable way 
● Simple APIs in Scala 
● MLlib for machine learning 
→ implement less naïve algorithms 
→ cross medical and environmental data with 
genomes
Acknowledgements 
Scala.IO 
AmpLab 
Matt Massie, Frank Nothaft 
Vincent Botta
That’s all Folks 
Apparently, we’re supposed to stay on stage 
Waiting for questions 
Hoping for none 
Looking at the bar 
And the lunch 
Oh there are beers 
And candies 
who can read this?

Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 

Lightning fast genomics with Spark, Adam and Scala

  • 1. Lightning fast genomics With Spark and ADAM
  • 2. Who are we? Andy @Noootsab @NextLab_be @Wajug co-driver @Devoxx4Kids organizer Maths & CS Data lover: geo, open, massive Fool Xavier @xtordoir SilicoCloud -> Physics -> Data analysis -> genomics -> scalable systems -> ...
  • 3. Genomics What is genomics about? Medical diagnostics Drug response Disease mechanisms
  • 4. Genomics What is genomics about? - A human genome is a sequence of 3 billion nucleic acids (“bases”) - About 1 base in 1,000 varies across the human population - Genomes encode bio-molecules (tens of thousands) - These molecules interact with each other ...and with the environment → Biological systems are very complex
  • 5. Genomics State of the art - growing technological capacity - cost reduction - growing data
  • 6. Genomics State of the art - I.T. becomes the bottleneck (cost and latency) - data is sacrificed through sampling or cut-offs (Andrea Sboner et al.)
  • 7. Genomics Blocking points - the “legacy stack” was not designed to scale (C, Perl, …) - the HPC approach is not a fit (the workload is data intensive)
  • 8. Genomics Future of genomics - Personal genomes (e.g. 1,000,000 genomes for cancer research) - New sequencing technologies - Sequence “stuff” as needed (e.g. microbiome, diagnostics) - medicalCondition = f(genomics, environmentHistory)
  • 9. Genomics Needs of scalability → Scala & Spark Needs of simplicity, clarity → ADAM
  • 10. Parquet 101 Columnar storage Row oriented Column oriented
  • 11. Parquet 101 Columnar storage > Homogeneous collocated data > Better range access > Better encoding
  • 12. Parquet 101 Efficient encoding of nested typed structures message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } }
  • 13. Parquet 101 Efficient encoding of nested typed structures message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } } Nested structure →Tree Empty levels →Branch pruning Repetitions →Metadata (index) Types → Safe/Fast codec
  • 14. Parquet 101 Efficient encoding of nested typed structures ref: https://blog.twitter.com/2013/dremel-made-simple-with-parquet
  • 15. Parquet 101 Optimized distributed storage (f.i. in HDFS) ref: http://grepalex.com/2014/05/13/parquet-file-format-and-object-model/
  • 16. Parquet 101 Efficient (schema based) serialization: AVRO JSON Schema: { "namespace": "example.avro", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] } IDL: record User { string name; union { null, int } favorite_number = null; union { null, string } favorite_color = null; }
  • 17. Parquet 101 Efficient (schema based) serialization: AVRO JSON Schema: { "namespace": "example.avro", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] } Part of the: ● protocol ● serialization → less metadata Define: IDL → JSON Send: Binary → JSON
  • 18. ADAM Credits: AmpLab (UC Berkeley)
  • 19. ADAM Overview (Sequencing) - DNA is a molecule …or a Seq[Char] (A, T, G, C) alphabet
  • 20. ADAM Sequencing - Massively parallel sequencing of random reads of 100-150 bases (20,000,000 reads per genome) - 30-60x coverage for quality - All this mess must be re-organised! → ADAM
  • 21. ADAM Variants Calling - From an organized set of reads (ADAM Pileup) - Detect variants (Variant Calling) → AVOCADO
  • 22. ADAM Genomics specifications - SAM, BAM, VCF - Indexable - libraries - ~ scalable: hadoop-bam
  • 23. ADAM ADAM model - schema based (Avro), libraries are generated - no storage spec here!
  • 24. ADAM ADAM model - Parquet storage - evenly distribute data - storage optimized for read/query - better compression
  • 25. ADAM ADAM API - AdamContext provides functions to read from HDFS
  • 26. ADAM ADAM API - Scala classes generated from Avro - Data loaded as RDDs (Spark’s Resilient Distributed Datasets) - functions on RDDs (write to HDFS, genomic objects manipulations)
  • 27. ADAM ADAM API - e.g. reading genotypes
  • 28. ADAM ADAM Benchmark - It scales! - Data is more compact - Read perf is better - Code is simpler
  • 29. Stratification using 1000Genomes As usual… let’s get some data. Genomes relate to health and are private. Still, there are options!
  • 30. Stratification using 1000Genomes http://www.1000genomes.org/ (Nowadays targeting 2000 genomes) ref: http://upload.wikimedia.org/wikipedia/en/e/eb/Genetic_Variation.jpg
  • 33. Stratification using 1000Genomes Study genetic variations in populations (needs more contextual data for healthcare). To validate the interest in ADAM, we’ll do some qualitative exploration of the data. Question: is it possible to predict which subpopulation a given genome belongs to?
  • 34. Stratification using 1000Genomes We can run an unsupervised algorithm on a massive number of genomes. The idea is to find clusters that would match subpopulations. This matters because it reflects population histories: gene flows, selection, ...
  • 35. Stratification using 1000Genomes From the 200 TB of data, we’ll focus on the 6th chromosome, and only on its variants ref: http://en.wikipedia.org/wiki/Chromosome
  • 36. Genome Data Data structure
  • 37. Genome Data Data structure Panel: Map[SampleID, Population]
  • 38. Genome Data Data structure Genotypes in VCF format Basically a text file. Ours were downloaded from S3. Converted to ADAM Genotypes
  • 39. Machine Learning model Clustering: KMeans ref: http://en.wikipedia.org/wiki/K-means_clustering
  • 40. Machine Learning model Clustering: KMeans PreProcess = {A,C,T,G}² → {0,1,2} Space = {0,1,2}¹⁷⁰⁰⁰⁰⁰⁰⁰ Distance = Euclidean (L2) (*) (*) MLlib restriction, although here L2 ~ L1; see SPARK-3012 ref: http://en.wikipedia.org/wiki/K-means_clustering
  • 41. Machine Learning model MLlib, KMeans MLlib: ● Machine Learning Algorithms ● Data structures (e.g. Vector)
  • 42. Machine Learning model MLlib KMeans DataFrame Map: ● key = Sample ● value = Vector of Genotypes alleles (sorted by Variant)
  • 43. Mashup prediction Sample [NA20332] is in cluster #0 for population Some(ASW) Sample [NA20334] is in cluster #2 for population Some(ASW) Sample [HG00120] is in cluster #2 for population Some(GBR) Sample [NA18560] is in cluster #1 for population Some(CHB)
  • 44. Mashup Samples per cluster, by population: GBR: #0 = 0, #1 = 0, #2 = 89; ASW: #0 = 54, #1 = 0, #2 = 7; CHB: #0 = 0, #1 = 97, #2 = 0
  • 45. Cluster 4 m3.xlarge instances (ec2) 16 cores + 60G
  • 47. Cluster 40 m3.xlarge 160 cores + 600G
  • 48. Conclusions and future work ● ADAM and Spark provide tools to manipulate genomics data in a scalable way ● Simple APIs in Scala ● MLlib for machine learning → implement less naïve algorithms → cross medical and environmental data with genomes
  • 49. Acknowledgements Scala.IO AmpLab Matt Massie Frank Nothaft Vincent Botta
  • 50. That’s all Folks Apparently, we’re supposed to stay on stage Waiting for questions Hoping for none Looking at the bar And the lunch Oh there are beers And candies who can read this?
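
The preprocessing and mashup steps from slides 40 to 44 can be sketched in plain Scala, without Spark or ADAM. This is a minimal illustration, not the speakers' code: the object `GenotypeSketch` and its functions are hypothetical names, and the encoding assumes that {0,1,2} counts the alleles that differ from the reference base, which the slides do not spell out.

```scala
// Hypothetical helper object; none of these names come from ADAM or MLlib.
object GenotypeSketch {

  // Assumed encoding of a diploid genotype ({A,C,T,G}² → {0,1,2}):
  // the number of alleles that differ from the reference base.
  def encode(ref: Char, genotype: (Char, Char)): Int =
    Seq(genotype._1, genotype._2).count(_ != ref)

  // Squared Euclidean (L2) distance between two encoded samples.
  // Both vectors are expected to list the same variants in the same order.
  def squaredDistance(a: Vector[Int], b: Vector[Int]): Int =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Cross-tabulate cluster assignments against the panel (SampleID -> Population),
  // giving the number of samples for each (population, cluster) pair.
  // Samples missing from the panel are skipped.
  def crossTab(
      assignments: Seq[(String, Int)],
      panel: Map[String, String]
  ): Map[(String, Int), Int] =
    assignments
      .flatMap { case (sample, cluster) => panel.get(sample).map(pop => (pop, cluster)) }
      .groupBy(identity)
      .map { case (key, hits) => key -> hits.size }
}
```

Feeding `crossTab` the four assignments printed on slide 43, together with their panel entries, yields one sample in each of the (ASW, #0), (ASW, #2), (GBR, #2) and (CHB, #1) cells; the table on slide 44 is the same count taken over every sample.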