© 2015 MapR Technologies 1
Hadoop for Genomics: What you need to know
© 2015 MapR Technologies 2
Target Application: Alleviate / Prevent (Deterministic) Suffering
[Diagram labels: Patient; DNA Sequencer; Reads; Reference Genome; Variant Calling; Medical Records; Genotype/Phenotype/Individual Matrix; Cure & Prevent Disease]
© 2015 MapR Technologies 3
DNA Sequencing, pre-2004
[Chart over years: CPU (transistors/mm2), HDD (GB/mm2), DNA (bp/$, pre-2004)]
© 2015 MapR Technologies 4
DNA Sequencing, 2004 Disruption
[Chart over years: CPU (transistors/mm2), HDD (GB/mm2), DNA (bp/$, pre-2004), DNA (bp/$, post-2004)]
© 2015 MapR Technologies 5
DNA Sequencing, 2004 Disruption
[Chart over years: CPU (transistors/mm2), HDD (GB/mm2), DNA (bp/$, pre-2004), DNA (bp/$, post-2004)]
Similar disruption occurred for Internet traffic in the mid-1990s
© 2015 MapR Technologies 6
Effect: Many DNA-Based Apps Coming…
• 2014: US$ 2B, mostly research, mostly chemical costs
• 2020: US$ 20B, mostly clinical, mostly analytics costs
Macquarie Capital, 2014. Genomics 2.0: It’s just the beginning
[Bar chart: 2014 vs. 2020, split into Clinical and Non-Clinical; axis from 0 to 25]
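As a rough check on the two figures quoted on this slide (US$ 2B in 2014, US$ 20B in 2020), the implied compound annual growth rate can be worked out as below. This is only arithmetic on the slide's numbers; the function name is illustrative and not from the cited report.

```python
def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate that takes start_value to end_value."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures quoted on the slide: US$ 2B in 2014, US$ 20B in 2020.
growth = implied_cagr(2.0, 20.0, 2020 - 2014)
print(f"Implied annual growth: {growth:.1%}")  # about 47% per year
```

A tenfold increase over six years corresponds to the market growing by nearly half every year.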
© 2015 MapR Technologies 7
http://steamcommunity.com/app/203160/discussions/0/846956188647169800/
http://www.vox.com/2015/2/1/7955921/lara-croft-moores-law
What Does Moore’s Law Feel Like? #Dataviz:
Lara Croft 230=>40,000 Polygons (1996-2014)
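To put a number on the polygon example, the sketch below derives the doubling time implied by going from 230 to 40,000 polygons between 1996 and 2014. It assumes smooth exponential growth, which is a simplification; the function name is illustrative.

```python
import math


def doubling_time_years(start: float, end: float, years: float) -> float:
    """Years per doubling, assuming steady exponential growth."""
    doublings = math.log2(end / start)
    return years / doublings


# Slide figures: 230 polygons (1996) to 40,000 polygons (2014).
t = doubling_time_years(230, 40_000, 2014 - 1996)
print(f"Polygon count doubled roughly every {t:.1f} years")  # about 2.4 years
```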
© 2015 MapR Technologies 8
Application: Forensics
http://cgi.uconn.edu/stranger-visions-forensic-art-exhibit/
http://snapshot.parabon-nanolabs.com/
http://www.nature.com/news/mugshots-built-from-dna-data-1.14899
© 2015 MapR Technologies 9
Growth in Resource Capacity
© 2015 MapR Technologies 10
Disruption Circa 2000
NASDAQ
Composite
© 2015 MapR Technologies 11
What Happened?
What did winners do right to survive the dot-com recession?
[Chart: NASDAQ Composite]
© 2015 MapR Technologies 12
Early 1990s: Early eCommerce Vendor Setup
Storage
read/write
read/write
Website
Back Office
© 2015 MapR Technologies 13
Late 1990s: Workload became too big
Storage
read/write
read/write
Website Website Website Website
Back Office Back Office
© 2015 MapR Technologies 14
Google Publishes
• 2003: Google Filesystem (aka GFS)
– http://research.google.com/archive/gfs.html
• 2004: MapReduce
– http://research.google.com/archive/mapreduce.html
• 2006: BigTable
– http://research.google.com/archive/bigtable.html
© 2015 MapR Technologies 15
Scale-out with Google FS + MapReduce
read/write
read/write
Website Website Website Website
Storage + Compute Cluster
Back Office Back Office
© 2015 MapR Technologies 16
Apache Software Foundation: Fast Follower of Google
MapReduce => Hadoop
Google FS => Hadoop FS
BigTable => HBase
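For readers new to the programming model named above, here is a minimal single-process sketch of the map, shuffle, and reduce phases, applied to counting k-mers in sequencing reads. It is not Hadoop code and makes no use of Hadoop APIs; the function names and the two example reads are made up for illustration.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(read: str, k: int) -> Iterator[tuple[str, int]]:
    """Emit (k-mer, 1) for every k-mer in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group emitted values by key, as the framework would between phases."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Sum the counts for one k-mer."""
    return key, sum(values)


def kmer_count(reads: Iterable[str], k: int) -> dict[str, int]:
    pairs = (pair for read in reads for pair in map_phase(read, k))
    return dict(reduce_phase(key, vals) for key, vals in shuffle(pairs).items())


counts = kmer_count(["ACGT", "CGTA"], 3)
print(counts)  # {'ACG': 1, 'CGT': 2, 'GTA': 1}
```

In a real cluster the map and reduce calls run on different machines next to the data, which is what lets storage and compute scale out together.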
© 2015 MapR Technologies 17
DNA Sequencing, post-2004
[Chart: DNA Sequence vs. NASDAQ Composite]
© 2015 MapR Technologies 18
DNA Sequencing, pre-2004
Storage
write-only
read/write
High-Performance Compute Cluster
Coordinator /
Edge Node
Sequencer
© 2015 MapR Technologies 19
DNA Sequencing, post-2004
Storage
write-only
read/write
High-Performance Compute Cluster
Coordinator /
Edge Node
DNA Sequencer Cluster (e.g. Illumina X-Ten)
HPC bottleneck
Sequencer
back-pressure
© 2015 MapR Technologies 20
Solution: Implemented 2014 @ Sequencer Vendor (with MapR)
write-only
DNA Sequencer Cluster (e.g. Illumina X-Ten)
Storage + Compute Cluster
Decentralize I/O
Decentralize I/O
© 2015 MapR Technologies 21
Allows Secondary Analytics to Scale Out
[Diagram labels: Patient; DNA Sequencer; Reads; Reference Genome; Variant Calling; Medical Records; Genotype/Phenotype/Individual Matrix; Cure & Prevent Disease]
© 2015 MapR Technologies 22
Allows Secondary Analytics to Scale Out
GATK / HPC method: flat after chromosome split
Hadoop / Spark method
© 2015 MapR Technologies 23
Secondary Analytics: Acute Pain Point
FastQ Reads => Aligned Reads => Variants
ADAM + Avocado
Matrix rotation is very I/O intense
Local de novo is best… …only feasible with efficient rotations
Velvet: Algorithms for de novo short read assembly using de Bruijn graphs, Zerbino & Birney. 2008
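The de Bruijn graph idea behind the cited Velvet paper can be sketched in a few lines: every k-mer in a read becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. This is a toy construction only, not Velvet's implementation (no error correction, coverage tracking, or path simplification), and the example reads are invented.

```python
from collections import defaultdict
from typing import Iterable


def de_bruijn_graph(reads: Iterable[str], k: int) -> dict[str, set[str]]:
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph: dict[str, set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


graph = de_bruijn_graph(["ACGTC", "GTCA"], 3)
for node, successors in sorted(graph.items()):
    print(node, "->", sorted(successors))
```

Overlapping reads share nodes, so walking the graph reconstructs the longer sequence (here ACGTCA) from short fragments.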
© 2015 MapR Technologies 24
Columnar Storage => Efficient Rotations
Genome Data Format Definition
Record 1: (A 1 Z)
Record 2: (B 1 Z)
Record 3: (C 1 Z)
Row-based: A 1 Z B 1 Z C 1 Z
Column-based: A B C 1 1 1 Z Z Z
Sorting, Group, MLlib
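The row-to-column rotation on this slide can be reproduced with the same three records. The sketch below also run-length encodes each column to suggest why a columnar layout tends to compress well; that encoding step is an illustrative addition, not something the slide specifies.

```python
from itertools import groupby

# The three records shown on the slide.
records = [("A", 1, "Z"), ("B", 1, "Z"), ("C", 1, "Z")]


def to_columns(rows):
    """Rotate row-oriented records into one list per field."""
    return [list(column) for column in zip(*rows)]


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(records)
row_layout = [field for row in records for field in row]
col_layout = [field for column in columns for field in column]
encoded = [run_length_encode(column) for column in columns]

print(row_layout)  # ['A', 1, 'Z', 'B', 1, 'Z', 'C', 1, 'Z']
print(col_layout)  # ['A', 'B', 'C', 1, 1, 1, 'Z', 'Z', 'Z']
print(encoded)     # [[('A', 1), ('B', 1), ('C', 1)], [(1, 3)], [('Z', 3)]]
```

Once values of the same field sit next to each other, repeated values collapse into single runs, and a query that needs one field can skip the others entirely.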
© 2015 MapR Technologies 25
Avro & Parquet
• Apache fast followers of Google Protocol Buffers.
• Application data is abstracted from structure. Storage and versioning are handled efficiently internally.
• Read/write codecs auto-generated for many languages.
• Avro: row-based records.
• Parquet: columnar storage that can hold Avro-defined records. Improves compression and I/O profile.
• ADAM: genomics-specific formats in Parquet. Effectively BAM and VCF optimized for distributed computing.
© 2015 MapR Technologies 26
Downstream Analytics: GWAS/PheWAS
FastQ Reads => Aligned Reads => Variants => Function => Phenotypes
ADAM + Avocado
Scalable GWAS/PheWAS: “Green Field” Territory
© 2015 MapR Technologies 27
ADAM Pipeline
Data Workflow: FastQ, BAM, ADAM, ADAM-VCF, VCF
Steps: Aligner, ADAM, Avocado, ADAM
Compute Engine: Super Fast
• In-memory
• Scalable compute context
© 2015 MapR Technologies 28
Target Application: Alleviate / Prevent Suffering
[Diagram labels: Patient; DNA Sequencer; Reads; Reference Genome; Variant Calling; Medical Records; Genotype/Phenotype/Individual Matrix; Cure & Prevent Disease]
© 2015 MapR Technologies 29
GWAS Overview (Genome-wide Association Study)
• Which genome features are associated with phenotype X?
https://en.wikipedia.org/wiki/Genome-wide_association_study
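A minimal sketch of the question a GWAS asks: for each variant, test whether carriers show the phenotype more often than non-carriers. The example uses a plain 2x2 chi-square statistic on invented data; real studies typically use regression models, covariates, and multiple-testing correction, none of which are shown here. All names and values are hypothetical.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Chi-square statistic for the table [[a, b], [c, d]] (no continuity correction)."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def association_scan(genotypes: dict[str, list[int]], phenotype: list[int]) -> dict[str, float]:
    """Score every SNP against one phenotype; 1 = carrier / affected, 0 = not."""
    scores = {}
    for snp, carriers in genotypes.items():
        pairs = list(zip(carriers, phenotype))
        a = sum(1 for g, p in pairs if g and p)          # carrier, affected
        b = sum(1 for g, p in pairs if g and not p)      # carrier, unaffected
        c = sum(1 for g, p in pairs if not g and p)      # non-carrier, affected
        d = sum(1 for g, p in pairs if not g and not p)  # non-carrier, unaffected
        scores[snp] = chi_square_2x2(a, b, c, d)
    return scores


# Hypothetical cohort of eight individuals.
phenotype = [1, 1, 1, 1, 0, 0, 0, 0]
genotypes = {
    "snp_1": [1, 1, 1, 0, 0, 0, 0, 1],  # enriched among affected individuals
    "snp_2": [1, 0, 1, 0, 1, 0, 1, 0],  # evenly spread
}
scores = association_scan(genotypes, phenotype)
print(scores)  # {'snp_1': 2.0, 'snp_2': 0.0}
```

A PheWAS runs the same kind of test with the roles swapped: one variant fixed, many phenotypes scanned.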
© 2015 MapR Technologies 30
PheWAS Overview (Phenome-wide …)
• Which phenotypes are associated with genome variant X?
http://www.tcpinnovations.com/drugbaron/phewas-the-tool-thats-revolutionizing-drug-development-that-youve-likely-never-heard-of/
© 2015 MapR Technologies 31
Genome × Phenome Analysis
For given population, given SNP 𝛿, and given phenotype ϕ: count the number of occurrences as the value of the matrix.
[Matrix figure: SNP axis 𝛿1, 𝛿3, 𝛿5; phenotype axis ϕ1, ϕ3, ϕ5]
SPARSE: Billion+ Genotypes × Billion+ Phenotypes
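The counting rule above can be sketched with a dictionary keyed by (SNP, phenotype), which stores only the non-zero cells and so stays sparse. The population records and identifiers below are invented for illustration.

```python
from collections import Counter
from typing import Iterable


def cooccurrence_counts(individuals: Iterable[tuple[set[str], set[str]]]) -> Counter:
    """Count, per (SNP, phenotype) pair, how many individuals have both."""
    counts: Counter = Counter()
    for snps, phenotypes in individuals:
        for snp in snps:
            for phenotype in phenotypes:
                counts[(snp, phenotype)] += 1
    return counts


# Hypothetical population: (SNPs carried, phenotypes observed) per individual.
population = [
    ({"d1", "d3"}, {"p1"}),
    ({"d1"}, {"p1", "p5"}),
    ({"d5"}, set()),
]
matrix = cooccurrence_counts(population)
print(matrix[("d1", "p1")])  # 2
print(matrix[("d5", "p1")])  # 0 (cell never stored)
```

Only three cells are stored for this population, even though the full 3 x 3 grid has nine.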
© 2015 MapR Technologies 32
Disease Cause via Genome × Phenome Matrix Factorization
• Row Eigenvectors of X represent
– Sets of related phenotypes (by SNP)
• Column Eigenvectors of Y represent
– Sets of related SNPs (by phenotype)
[Matrix figure: SNP axis 𝛿1, 𝛿3, 𝛿5; phenotype axis ϕ1, ϕ3, ϕ5]
Principal Column Vector: Archetype Genotypes
Principal Row Vector: Archetype Phenotypes
Sparse Matrix Package is Actively Developed in Spark Community
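One way to extract a principal row and column vector from a sparse count matrix is power iteration for the leading singular vectors. The sketch below is a plain-Python stand-in chosen for illustration; it is not the Spark package the slide refers to, and it uses singular vectors (the usual choice for a rectangular matrix) rather than eigenvectors. The matrix values are invented.

```python
import math

# Sparse matrix stored as {(row, col): value}; absent cells are zero.
Sparse = dict[tuple[int, int], float]


def _normalize(vec: list[float]) -> list[float]:
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def top_singular_vectors(matrix: Sparse, n_rows: int, n_cols: int,
                         iterations: int = 100) -> tuple[list[float], list[float]]:
    """Power iteration: alternate u = M v and v = M^T u, normalizing each time."""
    v = _normalize([1.0] * n_cols)
    u = [0.0] * n_rows
    for _ in range(iterations):
        u = [0.0] * n_rows
        for (r, c), value in matrix.items():
            u[r] += value * v[c]
        u = _normalize(u)
        new_v = [0.0] * n_cols
        for (r, c), value in matrix.items():
            new_v[c] += value * u[r]
        v = _normalize(new_v)
    return u, v


# Hypothetical counts: SNPs 0 and 1 co-occur with phenotypes 0 and 1; SNP 2 stands apart.
counts: Sparse = {(0, 0): 5.0, (0, 1): 4.0, (1, 0): 4.0, (1, 1): 5.0, (2, 2): 1.0}
row_vec, col_vec = top_singular_vectors(counts, 3, 3)
print([round(x, 3) for x in row_vec])  # weight concentrated on SNPs 0 and 1
print([round(x, 3) for x in col_vec])  # weight concentrated on phenotypes 0 and 1
```

The leading vectors pick out the block of SNPs and phenotypes that vary together, which is the sense in which they act as archetypes.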
© 2015 MapR Technologies 33
Generalized Approach: Genome × Phenome Tensor
• Maintain individual identity
• Aggregating individuals gives up statistical power
• Leverage pedigrees – Individuals are not independent observations
[Figure axes: Variants, Phenotypes]
© 2015 MapR Technologies 34
Scalable Variant Store => Root out Disease Causes
Model P ~ F(G)
Fortunately, this has already been done…
Genotypes (G); Med Record Phenotypes (P), e.g. disease risk, drug response
© 2015 MapR Technologies 35
Largest Biometric Database in the World
1.2B PEOPLE
© 2015 MapR Technologies 36
Why Create Aadhaar?
• India: 1.2 billion residents
– 640,000 villages, ~60% live on under $2/day
– ~75% literacy, <3% pay income tax, <20% have bank accounts
– ~800 million mobile, ~200-300 million migrant workers
• Govt. spends about $25-40 billion on direct subsidies
– Residents have no standard identity document
– Most programs plagued with ghost and multiple identities, causing leakage of 30-40%
Standardize identity => Stop leakage
© 2015 MapR Technologies 37
Aadhaar Biometric Capture & Index
Raw
Digital
Fingerprint
© 2015 MapR Technologies 38
Aadhaar Biometric ID Creation
F(x): unique features
G(x): uncommon features
H(x): other features
• 900MM people loaded in 4 years
• In production
– 1MM registrations/day
– 200+ trillion lookups/day
• All built on MapR-DB (HBase)
Low Entropy + Unique
Low Entropy + Infrequent
© 2015 MapR Technologies 39
How Does this Relate to Genomics?
F⁻¹(x): common features
F(x): unique features
G(x): uncommon features
H(x): other features
Same data shape and size
• Aadhaar: 1B humans, 5MB minutiae
• Genome: 7B humans, ~3M variants
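A back-of-envelope comparison of the two datasets, using only the figures on this slide. The bytes-per-variant value is a placeholder assumption (the slide gives a variant count, not a storage size), so the genome total is indicative only.

```python
AADHAAR_PEOPLE = 1_000_000_000
AADHAAR_BYTES_PER_PERSON = 5 * 10**6  # 5 MB of minutiae per person (slide figure)

GENOME_PEOPLE = 7_000_000_000
VARIANTS_PER_PERSON = 3_000_000       # ~3M variants per person (slide figure)
BYTES_PER_VARIANT = 1                 # hypothetical: not stated on the slide


def petabytes(n_bytes: int) -> float:
    """Convert bytes to decimal petabytes."""
    return n_bytes / 10**15


aadhaar_pb = petabytes(AADHAAR_PEOPLE * AADHAAR_BYTES_PER_PERSON)
genome_entries = GENOME_PEOPLE * VARIANTS_PER_PERSON
genome_pb = petabytes(genome_entries * BYTES_PER_VARIANT)

print(f"Aadhaar: {aadhaar_pb:.0f} PB")
print(f"Genome:  {genome_entries:.1e} variant entries, {genome_pb:.0f} PB at 1 byte each")
```

Under that assumption both land in the petabyte range, with individuals on one axis and a few million features per individual on the other.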
© 2015 MapR Technologies 40
How Does this Relate to Genomics?
F⁻¹(x): common features
F(x): unique features
G(x): uncommon features
H(x): other features
Phenotype: healthy or sick?
Phenotype Partition => Low Entropy
© 2015 MapR Technologies 41
Takeaway 1: Don’t reinvent the wheel
individuals × fingerprint minutiae ≈ medical records × genetic variants
Find rare minutiae to uniquely identify
Find shared variants to get disease root cause
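The two searches compared on this slide are set operations over the same individuals-by-features shape. The sketch below shows both on invented data: features seen in exactly one individual (identification), and variants present in every affected individual but no unaffected one (candidate root cause). Function names and identifiers are hypothetical, and real analyses would need statistical tests rather than strict set logic.

```python
from collections import Counter


def rare_features(profiles: dict[str, set[str]]) -> dict[str, set[str]]:
    """Per individual, the features that no other individual has."""
    frequency = Counter(f for features in profiles.values() for f in features)
    return {
        person: {f for f in features if frequency[f] == 1}
        for person, features in profiles.items()
    }


def shared_variants(cases: list[set[str]], controls: list[set[str]]) -> set[str]:
    """Variants carried by every case and by no control."""
    in_all_cases = set.intersection(*cases) if cases else set()
    in_any_control = set().union(*controls)
    return in_all_cases - in_any_control


fingerprints = {"person_1": {"m1", "m2"}, "person_2": {"m2", "m3"}}
print(rare_features(fingerprints))  # {'person_1': {'m1'}, 'person_2': {'m3'}}

sick = [{"v1", "v2"}, {"v1", "v3"}]
healthy = [{"v2"}, {"v3"}]
print(shared_variants(sick, healthy))  # {'v1'}
```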
© 2015 MapR Technologies 42
Takeaway 2: Evolution, not Revolution
DNA Sequence
NASDAQ
Composite
© 2015 MapR Technologies 43
Thank You
@allenday // @mapr
Now a few slides about MapR’s product…
…and proposed next actions
© 2015 MapR Technologies 44
The MapR Advantage
• Scale Reliability Across the Enterprise
– Advanced multi-tenancy
– Business continuity – HA, DR
• Speed
– 2-7x faster than other Hadoop distros
– Ultra-fast data ingest (100M data points per sec)
– NFS & R/W file system
• Real-time & Self-Service Data Exploration
– On-the-fly SQL without up-front schema
– Fast lookups and queries
Best Hadoop Platform for Data Warehouse Optimization & Analytics
[Platform diagram labels]
Integrated commercial engines and tools: Security; Streaming; NoSQL & Search; Provisioning & coordination; ML, Graph; Workflow & Data Governance; Batch; SQL
Compute engines: Batch; Interactive; Real-time; Online; Others
Management: Operations; Governance; Audits; Security
MapR Data Platform: MapR-FS; MapR-DB
© 2015 MapR Technologies 45
Genome Sequencing Quick Start Solution
© 2015 MapR Technologies 46
Quick Start Solutions: Speeding Time-to-Value
SOLUTION TEMPLATE
KNOWLEDGE TRANSFER
DEPLOYMENT ARCHITECTURE
Data Warehouse Optimization and Analytics
Security Log Analytics
Recommendation Engine
Genome Sequencing
© 2015 MapR Technologies 47
What’s in the Genome Sequencing Quick Start Solution?
6 nodes of MapR software
3-4 week engagement
3 Hadoop Professional Certifications
© 2015 MapR Technologies 48
Service Offering 1 – Resequencing with Hadoop
Reduces Storage Hardware Requirements
Accelerates Data Processing Time
Minimal impact to existing data pipelines
Service Offering 2 – Variant Analysis with NoSQL
Present data for exploration
Operationalize complex workflows
Web-scale performance
© 2015 MapR Technologies 49
Quick Start Service Engagement
Engagement includes:
1. Identification of data sources, transformations and reporting engines
2. Access and use of the solution template including source code
3. Training on customizing the solution template to the organization’s requirements
4. Deployment architecture document that enables a production deployment plan for the specific solution
SOLUTION TEMPLATE
KNOWLEDGE TRANSFER
DEPLOYMENT ARCHITECTURE

More Related Content

Viewers also liked

You're the New CDO, Now What?
You're the New CDO, Now What?You're the New CDO, Now What?
You're the New CDO, Now What?Caserta
 
IBM Software Day 2013. Smarter analytics and big data. building the next gene...
IBM Software Day 2013. Smarter analytics and big data. building the next gene...IBM Software Day 2013. Smarter analytics and big data. building the next gene...
IBM Software Day 2013. Smarter analytics and big data. building the next gene...IBM (Middle East and Africa)
 
Disruptive Data Science - How Data Science and Big Data are Transforming Busi...
Disruptive Data Science - How Data Science and Big Data are Transforming Busi...Disruptive Data Science - How Data Science and Big Data are Transforming Busi...
Disruptive Data Science - How Data Science and Big Data are Transforming Busi...EMC
 
Giovanni Lanzani GoDataDriven
Giovanni Lanzani GoDataDrivenGiovanni Lanzani GoDataDriven
Giovanni Lanzani GoDataDrivenBigDataExpo
 
Fracture du pied chez l'enfant
Fracture du pied chez l'enfantFracture du pied chez l'enfant
Fracture du pied chez l'enfantAyoub EL KADDOURI
 
Four Graphics credentials
Four Graphics credentialsFour Graphics credentials
Four Graphics credentialsEmile Melki
 
Lambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big dataLambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big dataTrieu Nguyen
 
Build_Buy_StreamAnalytix_WhitePaper
Build_Buy_StreamAnalytix_WhitePaperBuild_Buy_StreamAnalytix_WhitePaper
Build_Buy_StreamAnalytix_WhitePaperJane Roberts
 
Building an ai with raspberry pi
Building an ai with raspberry piBuilding an ai with raspberry pi
Building an ai with raspberry piHaesung Lee
 
Praktiline pilvekonverents - IT haldust hõlbustavad uuendused
Praktiline pilvekonverents - IT haldust hõlbustavad uuendusedPraktiline pilvekonverents - IT haldust hõlbustavad uuendused
Praktiline pilvekonverents - IT haldust hõlbustavad uuendusedPrimend
 
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...Caserta
 
MapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community EditionMapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community EditionMapR Technologies
 
Azure Large Scale Deployments - Tales from the Trenches
Azure Large Scale Deployments - Tales from the TrenchesAzure Large Scale Deployments - Tales from the Trenches
Azure Large Scale Deployments - Tales from the TrenchesAaron Saikovski
 
The Disruption of Big Data - AWS India Summit 2012
The Disruption of Big Data - AWS India Summit 2012The Disruption of Big Data - AWS India Summit 2012
The Disruption of Big Data - AWS India Summit 2012Amazon Web Services
 
1524 how ibm's big data solution can help you gain insight into your data cen...
1524 how ibm's big data solution can help you gain insight into your data cen...1524 how ibm's big data solution can help you gain insight into your data cen...
1524 how ibm's big data solution can help you gain insight into your data cen...IBM
 
How Verizon Innovates Through AI-Driven DevOps with Dynatrace
How Verizon Innovates Through AI-Driven DevOps with DynatraceHow Verizon Innovates Through AI-Driven DevOps with Dynatrace
How Verizon Innovates Through AI-Driven DevOps with DynatraceAmazon Web Services
 
Play Framework - Toulouse JUG - nov 2011
Play Framework - Toulouse JUG - nov 2011Play Framework - Toulouse JUG - nov 2011
Play Framework - Toulouse JUG - nov 2011Sylvain Wallez
 

Viewers also liked (20)

You're the New CDO, Now What?
You're the New CDO, Now What?You're the New CDO, Now What?
You're the New CDO, Now What?
 
IBM Software Day 2013. Smarter analytics and big data. building the next gene...
IBM Software Day 2013. Smarter analytics and big data. building the next gene...IBM Software Day 2013. Smarter analytics and big data. building the next gene...
IBM Software Day 2013. Smarter analytics and big data. building the next gene...
 
Disruptive Data Science - How Data Science and Big Data are Transforming Busi...
Disruptive Data Science - How Data Science and Big Data are Transforming Busi...Disruptive Data Science - How Data Science and Big Data are Transforming Busi...
Disruptive Data Science - How Data Science and Big Data are Transforming Busi...
 
Giovanni Lanzani GoDataDriven
Giovanni Lanzani GoDataDrivenGiovanni Lanzani GoDataDriven
Giovanni Lanzani GoDataDriven
 
Fracture du pied chez l'enfant
Fracture du pied chez l'enfantFracture du pied chez l'enfant
Fracture du pied chez l'enfant
 
Four Graphics credentials
Four Graphics credentialsFour Graphics credentials
Four Graphics credentials
 
Lambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big dataLambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big data
 
Build_Buy_StreamAnalytix_WhitePaper
Build_Buy_StreamAnalytix_WhitePaperBuild_Buy_StreamAnalytix_WhitePaper
Build_Buy_StreamAnalytix_WhitePaper
 
Building an ai with raspberry pi
Building an ai with raspberry piBuilding an ai with raspberry pi
Building an ai with raspberry pi
 
Praktiline pilvekonverents - IT haldust hõlbustavad uuendused
Praktiline pilvekonverents - IT haldust hõlbustavad uuendusedPraktiline pilvekonverents - IT haldust hõlbustavad uuendused
Praktiline pilvekonverents - IT haldust hõlbustavad uuendused
 
Click or clunk
Click or clunkClick or clunk
Click or clunk
 
Pesla
PeslaPesla
Pesla
 
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
 
MapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community EditionMapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community Edition
 
Azure Large Scale Deployments - Tales from the Trenches
Azure Large Scale Deployments - Tales from the TrenchesAzure Large Scale Deployments - Tales from the Trenches
Azure Large Scale Deployments - Tales from the Trenches
 
The Disruption of Big Data - AWS India Summit 2012
The Disruption of Big Data - AWS India Summit 2012The Disruption of Big Data - AWS India Summit 2012
The Disruption of Big Data - AWS India Summit 2012
 
1524 how ibm's big data solution can help you gain insight into your data cen...
1524 how ibm's big data solution can help you gain insight into your data cen...1524 how ibm's big data solution can help you gain insight into your data cen...
1524 how ibm's big data solution can help you gain insight into your data cen...
 
How Verizon Innovates Through AI-Driven DevOps with Dynatrace
How Verizon Innovates Through AI-Driven DevOps with DynatraceHow Verizon Innovates Through AI-Driven DevOps with Dynatrace
How Verizon Innovates Through AI-Driven DevOps with Dynatrace
 
iOS and Android apps automation
iOS and Android apps automationiOS and Android apps automation
iOS and Android apps automation
 
Play Framework - Toulouse JUG - nov 2011
Play Framework - Toulouse JUG - nov 2011Play Framework - Toulouse JUG - nov 2011
Play Framework - Toulouse JUG - nov 2011
 

Similar to Hadoop for Genomics: What you need to know

2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...Allen Day, PhD
 
Genomics Crash Course for Data Engineers
Genomics Crash Course for Data EngineersGenomics Crash Course for Data Engineers
Genomics Crash Course for Data EngineersAllen Day, PhD
 
Human Genetics & Big Data [sans Ethics]
Human Genetics & Big Data [sans Ethics]Human Genetics & Big Data [sans Ethics]
Human Genetics & Big Data [sans Ethics]Allen Day, PhD
 
Hadoop as a Platform for Genomics
Hadoop as a Platform for GenomicsHadoop as a Platform for Genomics
Hadoop as a Platform for GenomicsMapR Technologies
 
Trends in Annotation of Genomic Data
Trends in Annotation of Genomic DataTrends in Annotation of Genomic Data
Trends in Annotation of Genomic Databiobase
 
VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...Denis C. Bauer
 
Spark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van HamSpark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van HamSpark Summit
 
Dmla0910 – Hoeck– Presentation
Dmla0910 – Hoeck– PresentationDmla0910 – Hoeck– Presentation
Dmla0910 – Hoeck– PresentationWolfgang G. Hoeck
 
Using Public Access Clinical Databases to Interpret NGS Variants
Using Public Access Clinical Databases to Interpret NGS VariantsUsing Public Access Clinical Databases to Interpret NGS Variants
Using Public Access Clinical Databases to Interpret NGS VariantsGolden Helix Inc
 
2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpre...
2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpre...2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpre...
2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpre...Gabe Rudy
 
2017 amp benchmarking_poster_justin
2017 amp benchmarking_poster_justin2017 amp benchmarking_poster_justin
2017 amp benchmarking_poster_justinGenomeInABottle
 
GIAB update for GRC GIAB workshop 191015
GIAB update for GRC GIAB workshop 191015GIAB update for GRC GIAB workshop 191015
GIAB update for GRC GIAB workshop 191015GenomeInABottle
 
Bioinformatics tools for the diagnostic laboratory - T.Seemann - Antimicrobi...
Bioinformatics tools for the diagnostic laboratory -  T.Seemann - Antimicrobi...Bioinformatics tools for the diagnostic laboratory -  T.Seemann - Antimicrobi...
Bioinformatics tools for the diagnostic laboratory - T.Seemann - Antimicrobi...Torsten Seemann
 
Forum on Personalized Medicine: Challenges for the next decade
Forum on Personalized Medicine: Challenges for the next decadeForum on Personalized Medicine: Challenges for the next decade
Forum on Personalized Medicine: Challenges for the next decadeJoaquin Dopazo
 
Wheat rusttoolbox early_warning_jgh2
Wheat rusttoolbox early_warning_jgh2Wheat rusttoolbox early_warning_jgh2
Wheat rusttoolbox early_warning_jgh2ICARDA
 
171017 giab for giab grc workshop
171017 giab for giab grc workshop171017 giab for giab grc workshop
171017 giab for giab grc workshopGenomeInABottle
 
Genome in a bottle for ashg grc giab workshop 181016
Genome in a bottle for ashg grc giab workshop 181016Genome in a bottle for ashg grc giab workshop 181016
Genome in a bottle for ashg grc giab workshop 181016GenomeInABottle
 
WEBINAR: The Yosemite Project PART 6 -- Data-Driven Biomedical Research with ...
WEBINAR: The Yosemite Project PART 6 -- Data-Driven Biomedical Research with ...WEBINAR: The Yosemite Project PART 6 -- Data-Driven Biomedical Research with ...
WEBINAR: The Yosemite Project PART 6 -- Data-Driven Biomedical Research with ...DATAVERSITY
 

Similar to Hadoop for Genomics: What you need to know (20)

2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...
 
Genomics Crash Course for Data Engineers
Genomics Crash Course for Data EngineersGenomics Crash Course for Data Engineers
Genomics Crash Course for Data Engineers
 
Human Genetics & Big Data [sans Ethics]
Human Genetics & Big Data [sans Ethics]Human Genetics & Big Data [sans Ethics]
Human Genetics & Big Data [sans Ethics]
 
Hadoop as a Platform for Genomics
Hadoop as a Platform for GenomicsHadoop as a Platform for Genomics
Hadoop as a Platform for Genomics
 
Trends in Annotation of Genomic Data
Trends in Annotation of Genomic DataTrends in Annotation of Genomic Data
Trends in Annotation of Genomic Data
 
VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...
 
Spark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van HamSpark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van Ham
 
Dmla0910 – Hoeck– Presentation
Dmla0910 – Hoeck– PresentationDmla0910 – Hoeck– Presentation
Dmla0910 – Hoeck– Presentation
 
transformers_multimodal_ehr.pdf
transformers_multimodal_ehr.pdftransformers_multimodal_ehr.pdf
transformers_multimodal_ehr.pdf
 
Using Public Access Clinical Databases to Interpret NGS Variants
Using Public Access Clinical Databases to Interpret NGS VariantsUsing Public Access Clinical Databases to Interpret NGS Variants
Using Public Access Clinical Databases to Interpret NGS Variants
 
171017 giab for giab grc workshop
171017 giab for giab grc workshop171017 giab for giab grc workshop
171017 giab for giab grc workshop
 
2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpre...
2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpre...2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpre...
2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpre...
 
2017 amp benchmarking_poster_justin
2017 amp benchmarking_poster_justin2017 amp benchmarking_poster_justin
2017 amp benchmarking_poster_justin
 
GIAB update for GRC GIAB workshop 191015
GIAB update for GRC GIAB workshop 191015GIAB update for GRC GIAB workshop 191015
GIAB update for GRC GIAB workshop 191015
 
Bioinformatics tools for the diagnostic laboratory - T.Seemann - Antimicrobi...
Bioinformatics tools for the diagnostic laboratory -  T.Seemann - Antimicrobi...Bioinformatics tools for the diagnostic laboratory -  T.Seemann - Antimicrobi...
Bioinformatics tools for the diagnostic laboratory - T.Seemann - Antimicrobi...
 
Forum on Personalized Medicine: Challenges for the next decade
Forum on Personalized Medicine: Challenges for the next decadeForum on Personalized Medicine: Challenges for the next decade
Forum on Personalized Medicine: Challenges for the next decade
 
Wheat rusttoolbox early_warning_jgh2
Wheat rusttoolbox early_warning_jgh2Wheat rusttoolbox early_warning_jgh2
Wheat rusttoolbox early_warning_jgh2
 
171017 giab for giab grc workshop
171017 giab for giab grc workshop171017 giab for giab grc workshop
171017 giab for giab grc workshop
 
Genome in a bottle for ashg grc giab workshop 181016
Genome in a bottle for ashg grc giab workshop 181016Genome in a bottle for ashg grc giab workshop 181016
Genome in a bottle for ashg grc giab workshop 181016
 
WEBINAR: The Yosemite Project PART 6 -- Data-Driven Biomedical Research with ...
WEBINAR: The Yosemite Project PART 6 -- Data-Driven Biomedical Research with ...WEBINAR: The Yosemite Project PART 6 -- Data-Driven Biomedical Research with ...
WEBINAR: The Yosemite Project PART 6 -- Data-Driven Biomedical Research with ...
 

More from Allen Day, PhD

Deep learning in medicine: An introduction and applications to next-generatio...
Deep learning in medicine: An introduction and applications to next-generatio...Deep learning in medicine: An introduction and applications to next-generatio...
Deep learning in medicine: An introduction and applications to next-generatio...Allen Day, PhD
 
20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...
20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...
20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...Allen Day, PhD
 
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...Allen Day, PhD
 
20170424 - Big Data in Biology - Vancouver - Simon Fraser University
20170424 - Big Data in Biology - Vancouver - Simon Fraser University20170424 - Big Data in Biology - Vancouver - Simon Fraser University
20170424 - Big Data in Biology - Vancouver - Simon Fraser UniversityAllen Day, PhD
 
20170406 Genomics@Google - KeyGene - Wageningen
20170406 Genomics@Google - KeyGene - Wageningen20170406 Genomics@Google - KeyGene - Wageningen
20170406 Genomics@Google - KeyGene - WageningenAllen Day, PhD
 
20170402 Crop Innovation and Business - Amsterdam
20170402 Crop Innovation and Business - Amsterdam20170402 Crop Innovation and Business - Amsterdam
20170402 Crop Innovation and Business - AmsterdamAllen Day, PhD
 
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
20170315 Cloud Accelerated Genomics - Tel Aviv / PhoenixAllen Day, PhD
 
Genome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMGenome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMAllen Day, PhD
 
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen ChinaAllen Day, PhD
 
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San JoseR + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San JoseAllen Day, PhD
 
Building Data Science Teams, Abbreviated
Building Data Science Teams, AbbreviatedBuilding Data Science Teams, Abbreviated
Building Data Science Teams, AbbreviatedAllen Day, PhD
 
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production SuccessAllen Day, PhD
 
20131212 - Sydney - Garvan Institute - Human Genetics and Big Data
20131212 - Sydney - Garvan Institute - Human Genetics and Big Data20131212 - Sydney - Garvan Institute - Human Genetics and Big Data
20131212 - Sydney - Garvan Institute - Human Genetics and Big DataAllen Day, PhD
 
2013.12.12 - Sydney - Big Data Analytics
2013.12.12 - Sydney - Big Data Analytics2013.12.12 - Sydney - Big Data Analytics
2013.12.12 - Sydney - Big Data AnalyticsAllen Day, PhD
 
20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design Patterns20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design PatternsAllen Day, PhD
 
20131111 - Santa Monica - BigDataCamp - Big Data Design Patterns
20131111 - Santa Monica - BigDataCamp - Big Data Design Patterns20131111 - Santa Monica - BigDataCamp - Big Data Design Patterns
20131111 - Santa Monica - BigDataCamp - Big Data Design PatternsAllen Day, PhD
 

Hadoop for Genomics: What you need to know

  • 1. © 2015 MapR Technologies. Hadoop for Genomics: What you need to know
  • 2. Target Application: Alleviate / Prevent (Deterministic) Suffering. [Diagram labels: Variant Calling; DNA Sequencer; Reads; Reference Genome; Genotype/Phenotype/Individual Matrix; Cure & Prevent Disease; Medical Records; Patient]
  • 3. DNA Sequencing, pre-2004. [Chart over years: CPU transistors/mm2; HDD GB/mm2; DNA bp/$, pre-2004]
  • 4. DNA Sequencing, 2004 Disruption. [Chart over years: CPU transistors/mm2; HDD GB/mm2; DNA bp/$, pre-2004; DNA bp/$, post-2004]
  • 5. DNA Sequencing, 2004 Disruption. [Chart over years: CPU transistors/mm2; HDD GB/mm2; DNA bp/$, pre-2004; DNA bp/$, post-2004] Similar disruption occurred for Internet traffic in the mid-1990s.
  • 6. Effect: Many DNA-Based Apps Coming… 2014: US$ 2B, mostly research, mostly chemical costs. 2020: US$ 20B, mostly clinical, mostly analytics costs. Source: Macquarie Capital, 2014. Genomics 2.0: It’s just the beginning. [Bar chart, 2014 vs. 2020, axis 0 to 25: Clinical; Non-Clinical]
  • 7. What Does Moore’s Law Feel Like? #Dataviz: Lara Croft, 230 => 40,000 polygons (1996-2014). http://steamcommunity.com/app/203160/discussions/0/846956188647169800/ http://www.vox.com/2015/2/1/7955921/lara-croft-moores-law
  • 8. Application: Forensics. http://cgi.uconn.edu/stranger-visions-forensic-art-exhibit/ http://snapshot.parabon-nanolabs.com/ http://www.nature.com/news/mugshots-built-from-dna-data-1.14899
  • 9. Growth in Resource Capacity
  • 10. Disruption Circa 2000. [Chart: NASDAQ Composite]
  • 11. What Happened? What did winners do right to survive the .com recession? [Chart: NASDAQ Composite]
  • 12. Early 1990s: Early eCommerce Vendor Setup. [Diagram labels: Storage; read/write (×2); Website; Back Office]
  • 13. Late 1990s: Workload became too big. [Diagram labels: Storage; read/write (×2); Website (×4); Back Office (×2)]
  • 14. Google Publishes. 2003: Google Filesystem (aka GFS), http://research.google.com/archive/gfs.html. 2004: MapReduce, http://research.google.com/archive/mapreduce.html. 2006: BigTable, http://research.google.com/archive/bigtable.html.
  • 15. Scale-out with Google FS + MapReduce. [Diagram labels: read/write (×2); Website (×4); Storage + Compute Cluster; Back Office (×2)]
  • 16. Apache Software Foundation: Fast Follower of Google. [Table, Google => Apache: MapReduce => Hadoop; Google FS => Hadoop FS; BigTable => HBase]
  • 17. DNA Sequencing, post-2004. [Chart: DNA Sequence; NASDAQ Composite]
  • 18. DNA Sequencing, pre-2004. [Diagram labels: Storage; write-only; read/write; High-Performance Compute Cluster; Coordinator / Edge Node; Sequencer]
  • 19. DNA Sequencing, post-2004. [Diagram labels: Storage; write-only; read/write; High-Performance Compute Cluster; Coordinator / Edge Node; DNA Sequencer Cluster (e.g. Illumina X-Ten); HPC bottleneck; Sequencer back-pressure]
  • 20. Solution: Implemented 2014 @ Sequencer Vendor (with MapR). [Diagram labels: write-only; DNA Sequencer Cluster (e.g. Illumina X-Ten); Storage + Compute Cluster; Decentralize I/O (×2)]
  • 21. Allows Secondary Analytics to Scale Out. [Diagram labels: Variant Calling; DNA Sequencer; Reads; Reference Genome; Genotype/Phenotype/Individual Matrix; Cure & Prevent Disease; Medical Records; Patient]
  • 22. Allows Secondary Analytics to Scale Out. [Chart: GATK / HPC method: flat after chromosome split; Hadoop / Spark method]
  • 23. Secondary Analytics: Acute Pain Point. [Pipeline labels: FastQ; Reads; Aligned Reads; Variants; ADAM + Avocado] Matrix rotation is very I/O intense. Local de novo is best… …only feasible with efficient rotations. Reference: Zerbino & Birney, 2008. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs.
  • 24. Columnar Storage => Efficient Rotations. [Diagram: Genome Data Format Definition. Record 1: (A 1 Z); Record 2: (B 1 Z); Record 3: (C 1 Z). Row-based: A 1 Z B 1 Z C 1 Z. Column-based: A B C 1 1 1 Z Z Z. Other labels: Sorting; Group; MLlib]
  • 25. Avro & Parquet. Apache fast followers of Google Protocol Buffers. Application data is abstracted from structure; storage and versioning are handled efficiently internally. Read/write codecs are auto-generated for any language. Avro: row-based records. Parquet: columnar Avro; improves compression and I/O profile. ADAM: genomics-specific formats in Parquet; effectively, BAM and VCF optimized for distributed computing.
  • 26. Downstream Analytics: GWAS/PheWAS. [Pipeline labels: FastQ; Reads; Aligned Reads; Variants; Function; Phenotypes; ADAM + Avocado] Scalable GWAS/PheWAS: “Green Field” Territory.
  • 27. [Diagram labels: Compute Engine; Data Workflow; ADAM Pipeline; FastQ; BAM; ADAM; ADAM-VCF; VCF; Avocado; ADAM; ADAM; Aligner; Super Fast] In-memory; scalable compute context.
  • 28. Target Application: Alleviate / Prevent Suffering. [Diagram labels: Variant Calling; DNA Sequencer; Reads; Reference Genome; Genotype/Phenotype/Individual Matrix; Cure & Prevent Disease; Medical Records; Patient]
  • 29. GWAS Overview (Genome-wide Association Study). Which genome features are associated with phenotype X? https://en.wikipedia.org/wiki/Genome-wide_association_study
  • 30. PheWAS Overview (Phenome-wide …). Which phenotypes are associated with genome variant X? http://www.tcpinnovations.com/drugbaron/phewas-the-tool-thats-revolutionizing-drug-development-that-youve-likely-never-heard-of/
  • 31. Genome × Phenome Analysis. For a given population, a given SNP 𝛿, and a given phenotype ϕ: count the number of occurrences as the value of the matrix. [Matrix diagram labels: 𝛿1, 𝛿3, 𝛿5; ϕ1, ϕ3, ϕ5; sparse; billion+ phenotypes; billion+ genotypes]
  • 32. Disease Cause via Genome × Phenome Matrix Factorization. Row eigenvectors of X represent sets of related phenotypes (by SNP). Column eigenvectors of Y represent sets of related SNPs (by phenotype). [Diagram labels: 𝛿1, 𝛿3, 𝛿5; ϕ1, ϕ3, ϕ5; Principal Column Vector; Archetype Genotypes; Archetype Phenotypes; Principal Row Vector] A sparse matrix package is actively developed in the Spark community.
  • 33. Generalized Approach: Genome × Phenome Tensor. Maintain individual identity: aggregating individuals gives up statistical power. Leverage pedigrees: individuals are not independent observations. [Diagram labels: Variants (×2); Phenotypes (×2)]
  • 34. Scalable Variant Store => Root out Disease Causes. Model P ~ F(G). Fortunately, this has already been done… [Diagram labels: Genotypes; Med Record; Phenotypes, e.g. disease risk, drug response]
  • 35. Largest Biometric Database in the World. 1.2B people.
  • 36. Why Create Aadhaar? India has 1.2 billion residents: 640,000 villages, ~60% live under $2/day; ~75% literacy, <3% pay income tax, <20% have bank accounts; ~800 million mobile, ~200-300 million migrant workers. Govt. spends about $25-40 billion on direct subsidies: residents have no standard identity document; most programs are plagued with ghost and multiple identities, causing leakage of 30-40%. Standardize identity => Stop leakage.
  • 37. Aadhaar Biometric Capture & Index. [Diagram label: Raw Digital Fingerprint]
  • 38. Aadhaar Biometric ID Creation. F(x): unique features; G(x): uncommon features; H(x): other features. 900MM people loaded in 4 years. In production: 1MM registrations/day; 200+ trillion lookups/day. All built on MapR-DB (HBase). [Diagram labels: Low Entropy + Unique; Low Entropy + Infrequent]
  • 39. How Does this Relate to Genomics? F⁻¹(x): common features; F(x): unique features; G(x): uncommon features; H(x): other features. Same data shape and size. Aadhaar: 1B humans, 5MB minutiae. Genome: 7B humans, ~3M variants.
  • 40. How Does this Relate to Genomics? F⁻¹(x): common features; F(x): unique features; G(x): uncommon features; H(x): other features. Phenotype: healthy or sick? Phenotype Partition => Low Entropy.
  • 41. Takeaway 1: Don’t reinvent the wheel. [Two matrices compared (≈). Individuals and fingerprint minutiae: find rare minutiae to uniquely identify. Medical records and genetic variants: find shared variants to get disease root cause.]
  • 42. Takeaway 2: Evolution, not Revolution. [Chart: DNA Sequence; NASDAQ Composite]
  • 43. Thank You. @allenday // @mapr. Now a few slides about MapR’s product… …and proposed next actions.
  • 44. The MapR Advantage: Best Hadoop Platform for Data Warehouse Optimization & Analytics. Scale Reliability Across the Enterprise: advanced multi-tenancy; business continuity; HA, DR. Speed: 2-7x faster than other Hadoop distros; ultra-fast data ingest (100M data points per sec); NFS & R/W file system. Real-time & Self-Service Data Exploration: on-the-fly SQL without up-front schema; fast lookups and queries. [Platform diagram labels: Security; Streaming; NoSQL & Search; Provisioning & coordination; ML, Graph; Workflow & Data Governance; Batch; SQL; INTEGRATED COMMERCIAL ENGINES; TOOLS; COMPUTE ENGINES; Batch; Interactive; Real-time; Online; Others; Management; Operations; Governance; Audits; Security; MapR-FS; MapR-DB; MapR Data Platform]
  • 45. Genome Sequencing Quick Start Solution
  • 46. Quick Start Solutions: Speeding Time-to-Value. [Labels: SOLUTION TEMPLATE; KNOWLEDGE TRANSFER; DEPLOYMENT ARCHITECTURE] Solutions: Data Warehouse Optimization and Analytics; Security Log Analytics; Recommendation Engine; Genome Sequencing.
  • 47. What’s in the Genome Sequencing Quick Start Solution? 6 nodes of MapR software; 3-4 week engagement; 3 Hadoop Professional Certifications.
  • 48. Service Offering 1 – Resequencing with Hadoop: reduces storage hardware requirements; accelerates data processing time; minimal impact to existing data pipelines. Service Offering 2 – Variant Analysis with NoSQL: present data for exploration; operationalize complex workflows; web-scale performance.
  • 49. Quick Start Service Engagement. Engagement includes: 1. Identification of data sources, transformations and reporting engines. 2. Access and use of the solution template, including source code. 3. Training on customizing the solution template to the organization’s requirements. 4. Deployment architecture document that enables a production deployment plan for the specific solution. [Labels: SOLUTION TEMPLATE; KNOWLEDGE TRANSFER; DEPLOYMENT ARCHITECTURE]
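
The row-based versus column-based layouts on slide 24 can be illustrated with a minimal sketch in plain Python. The three records (A 1 Z), (B 1 Z), (C 1 Z) are taken from the slide; the function names are hypothetical, and the run-length encoding is only a stand-in for real compression. This is not Parquet's actual on-disk format.

```python
from itertools import groupby

# The three example records from slide 24.
records = [("A", 1, "Z"), ("B", 1, "Z"), ("C", 1, "Z")]


def to_row_layout(rows):
    """Store records one after another: A 1 Z B 1 Z C 1 Z."""
    return [value for row in rows for value in row]


def to_column_layout(rows):
    """Rotate (transpose) the records so each field is contiguous: A B C 1 1 1 Z Z Z."""
    return [value for column in zip(*rows) for value in column]


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


row_layout = to_row_layout(records)
column_layout = to_column_layout(records)
row_runs = run_length_encode(row_layout)
column_runs = run_length_encode(column_layout)
```

In this toy case the column layout puts equal values next to each other, so the run-length encoding needs 5 runs instead of 9. That is one plausible reading of the "improves compression and I/O profile" claim on slide 25, shown here under the stated assumptions only.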

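The genome × phenome count matrix described on slide 31 can be sketched as a sparse mapping from (SNP, phenotype) pairs to counts. The sketch below is an assumption-laden illustration in plain Python: the observation triples, identifiers, and function names are all hypothetical, and raw co-occurrence counts are not a statistical association test.

```python
from collections import Counter

# Hypothetical observations: one (individual, SNP, phenotype) triple per co-occurrence.
observations = [
    ("ind1", "snp1", "pheno1"),
    ("ind2", "snp1", "pheno1"),
    ("ind2", "snp3", "pheno5"),
    ("ind3", "snp5", "pheno3"),
    ("ind4", "snp1", "pheno3"),
]


def build_count_matrix(triples):
    """Count occurrences per (SNP, phenotype) cell; absent cells are implicitly zero."""
    return Counter((snp, pheno) for _individual, snp, pheno in triples)


def phenotypes_for_snp(matrix, snp):
    """PheWAS-style lookup: phenotype counts observed together with one SNP."""
    return {pheno: n for (s, pheno), n in matrix.items() if s == snp}


def snps_for_phenotype(matrix, pheno):
    """GWAS-style lookup: SNP counts observed together with one phenotype."""
    return {snp: n for (snp, p), n in matrix.items() if p == pheno}


matrix = build_count_matrix(observations)
snp1_phenotypes = phenotypes_for_snp(matrix, "snp1")
pheno3_snps = snps_for_phenotype(matrix, "pheno3")
```

Only non-zero cells are stored, which matches the "sparse" label on the slide. The two lookups correspond to the questions on slides 29 and 30: fixing a phenotype and scanning variants, or fixing a variant and scanning phenotypes.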
Editor's Notes

  1. clinical
  2. 35
  3. Increase GDP by 2%
  4. BOOM LSH
  5. Why is MapR the best Hadoop platform for data warehouse optimization? For business-critical applications you must have data protection and security (availability, data protection, and recovery), high performance (with a random read-write system), multi-tenancy (to support multiple business units, isolate applications or user data, …), good resource and workload management to support multiple applications, and open standards to integrate with the rest of the IT ecosystem. You also need a platform that is capable of super-fast data ingestion from multiple sources and able to make critical analytics and decisions at speed (in milliseconds) and at scale. Examples include breach detection based on information from multiple sources, fraud detection on millions of transactions based on individual patterns, and fleet management and routing that take current conditions into account. This requires a Hadoop platform that can go beyond batch and support streaming writes, so data can be constantly written to the system while analysis is being conducted. It requires high performance to meet business needs and, for real-time operations, the ability to perform online database operations, so you can react to the business situation and impact business as it happens rather than report on it one week, month, or quarter later. Data agility is needed for business agility. Drill provides instant ANSI SQL for Hadoop & NoSQL. You can explore data in its native format without expensive and time-consuming transformation. You can analyze evolving and semi-structured/nested data from NoSQL databases, find what is of value, and THEN model this in your DW schema for downstream ad-hoc reporting by 100s or 1000s of concurrent users.
  6. MapR Quick Start Solutions are a set of purpose-built solutions for the most critical and valuable use cases for Hadoop. These solutions, which include pre-built templates for each of the areas listed below, let you quickly get started with Hadoop and achieve faster time-to-value. We currently have offers around DWO, security log analytics, and recommendation engines with more planned for 2015.