1. © 2015 MapR Technologies 1
Hadoop for Genomics: What you need to know
Allen Day, PhD
2. © 2015 MapR Technologies 2
Target Application: Alleviate / Prevent (Deterministic) Suffering
[Diagram labels: Patient; DNA Sequencer; Reads; Reference Genome; Variant Calling; Medical Records; Genotype/Phenotype/Individual Matrix; Cure & Prevent Disease]
3. © 2015 MapR Technologies 3
DNA Sequencing, pre-2004
[Chart over years: CPU (transistors/mm2); HDD (GB/mm2); DNA (bp/$, pre-2004)]
4. © 2015 MapR Technologies 4
DNA Sequencing, 2004 Disruption
[Chart over years: CPU (transistors/mm2); HDD (GB/mm2); DNA (bp/$, pre-2004); DNA (bp/$, post-2004)]
5. © 2015 MapR Technologies 5
DNA Sequencing, 2004 Disruption
[Chart over years: CPU (transistors/mm2); HDD (GB/mm2); DNA (bp/$, pre-2004); DNA (bp/$, post-2004)]
Similar disruption occurred for Internet traffic in the mid-1990s
6. © 2015 MapR Technologies 6
Effect: Many DNA-Based Apps Coming…
• 2014: US$ 2B, mostly research, mostly chemical costs
• 2020: US$ 20B, mostly clinical, mostly analytics costs
Macquarie Capital, 2014. Genomics 2.0: It’s just the beginning
[Bar chart: 2014 vs. 2020, split into Clinical and Non-Clinical; vertical axis 0 to 25]
7. © 2015 MapR Technologies 7
http://steamcommunity.com/app/203160/discussions/0/846956188647169800/
http://www.vox.com/2015/2/1/7955921/lara-croft-moores-law
What Does Moore’s Law Feel Like? #Dataviz:
Lara Croft 230=>40,000 Polygons (1996-2014)
8. © 2015 MapR Technologies 8
Application: Forensics
http://cgi.uconn.edu/stranger-visions-forensic-art-exhibit/
http://snapshot.parabon-nanolabs.com/
http://www.nature.com/news/mugshots-built-from-dna-data-1.14899
9. © 2015 MapR Technologies 9
Growth in Resource Capacity
10. © 2015 MapR Technologies 10
Disruption Circa 2000
[Chart: NASDAQ Composite]
11. © 2015 MapR Technologies 11
What Happened?
What did winners do right to survive the dot-com recession?
[Chart: NASDAQ Composite]
12. © 2015 MapR Technologies 12
Early 1990s: Early eCommerce Vendor Setup
[Diagram: one Website and one Back Office, each with read/write access to a single Storage]
13. © 2015 MapR Technologies 13
Late 1990s: Workload became too big
[Diagram: four Website instances and two Back Office instances, all with read/write access to a single Storage]
14. © 2015 MapR Technologies 14
Google Publishes
• 2003: Google Filesystem (aka GFS)
– http://research.google.com/archive/gfs.html
• 2004: MapReduce
– http://research.google.com/archive/mapreduce.html
• 2006: BigTable
– http://research.google.com/archive/bigtable.html
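The MapReduce model from the 2004 paper can be sketched in a few lines. This is a single-process illustration only, not Hadoop's or Google's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `kmer_counts`) and the k-mer counting task are assumptions chosen to fit the genomics theme.

```python
from collections import defaultdict
from itertools import chain


def map_phase(read, k=3):
    """Map step: emit (k-mer, 1) for every k-length substring of one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one k-mer."""
    return key, sum(values)


def kmer_counts(reads, k=3):
    """Run map, shuffle, and reduce in sequence over a list of reads."""
    mapped = chain.from_iterable(map_phase(read, k) for read in reads)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


reads = ["ACGTAC", "GTACGT"]  # hypothetical toy reads
counts = kmer_counts(reads)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; the structure of the computation stays the same.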
15. © 2015 MapR Technologies 15
Scale-out with Google FS + MapReduce
[Diagram: four Website instances and two Back Office instances, all with read/write access to a Storage + Compute Cluster]
16. © 2015 MapR Technologies 16
Apache Software Foundation: Fast Follower of Google
• MapReduce => Hadoop
• Google FS => Hadoop FS
• BigTable => HBase
17. © 2015 MapR Technologies 17
DNA Sequencing, post-2004
[Chart labels: DNA Sequence; NASDAQ Composite]
18. © 2015 MapR Technologies 18
DNA Sequencing, pre-2004
[Diagram labels: Sequencer (write-only) to Storage; High-Performance Compute Cluster (read/write) to Storage; Coordinator / Edge Node]
19. © 2015 MapR Technologies 19
DNA Sequencing, post-2004
[Diagram labels: DNA Sequencer Cluster (e.g. Illumina X-Ten) (write-only) to Storage; High-Performance Compute Cluster (read/write) to Storage; Coordinator / Edge Node]
Annotations: HPC bottleneck; Sequencer back-pressure
20. © 2015 MapR Technologies 20
Solution: Implemented 2014 @ Sequencer Vendor
(with MapR)
[Diagram labels: DNA Sequencer Cluster (e.g. Illumina X-Ten) (write-only) to Storage + Compute Cluster]
Annotation: Decentralize I/O
21. © 2015 MapR Technologies 21
Allows Secondary Analytics to Scale Out
[Diagram labels: Patient; DNA Sequencer; Reads; Reference Genome; Variant Calling; Medical Records; Genotype/Phenotype/Individual Matrix; Cure & Prevent Disease]
22. © 2015 MapR Technologies 22
Allows Secondary Analytics to Scale Out
[Chart labels: GATK / HPC method: flat after chromosome split; Hadoop / Spark method]
23. © 2015 MapR Technologies 23
Secondary Analytics: Acute Pain Point
[Pipeline diagram labels: FastQ Reads; Aligned Reads; Variants; ADAM + Avocado]
Matrix rotation is very I/O intensive
Local de novo is best… …only feasible with efficient rotations
Velvet: Algorithms for de novo short read assembly using de Bruijn graphs, Zerbino & Birney. 2008
24. © 2015 MapR Technologies 24
Columnar Storage => Efficient Rotations
Genome Data Format Definition
Record 1: (A 1 Z)
Record 2: (B 1 Z)
Record 3: (C 1 Z)
Row-based: A 1 Z B 1 Z C 1 Z
Column-based: A B C 1 1 1 Z Z Z
[Diagram labels: Sorting; Group; MLlib]
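The row-based and column-based layouts on this slide can be reproduced directly. The sketch below uses the slide's three records; the helper names (`to_row_layout`, `to_column_layout`, `read_column`) are illustrative assumptions, not part of any storage library.

```python
records = [("A", 1, "Z"), ("B", 1, "Z"), ("C", 1, "Z")]


def to_row_layout(records):
    """Store each record's fields next to each other."""
    return [field for record in records for field in record]


def to_column_layout(records):
    """Store each column's values next to each other (the 'rotation')."""
    return [field for column in zip(*records) for field in column]


def read_column(col_layout, n_records, column_index):
    """In a columnar layout one column is a single contiguous slice."""
    start = column_index * n_records
    return col_layout[start:start + n_records]


row_layout = to_row_layout(records)
col_layout = to_column_layout(records)
second_column = read_column(col_layout, len(records), 1)
```

Reading one field for every record touches every record in the row layout, but only one contiguous slice in the column layout, which is why column-wise scans cost less I/O.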
25. © 2015 MapR Technologies 25
Avro & Parquet
• Avro is Apache's counterpart to Google Protocol Buffers; Parquet is a columnar format that builds on the record-shredding scheme from Google's Dremel paper.
• Application data is abstracted from structure. Storage and versioning are handled efficiently internally.
• Read/write codecs can be auto-generated for many languages.
• Avro: row-based records.
• Parquet: columnar storage that can use Avro schemas. Improves compression and I/O profile.
• ADAM: genomics-specific formats in Parquet. Effectively BAM and VCF optimized for distributed computing.
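One reason a columnar layout compresses better is that similar values end up adjacent. The toy run-length encoder below illustrates the effect; it is not Parquet's actual encoder, and the record fields (`chr1`, a position, `PASS`) are hypothetical.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


# Four hypothetical variant records: (chromosome, position, filter status).
records = [("chr1", 100 + i, "PASS") for i in range(4)]

row_layout = [field for record in records for field in record]
col_layout = [field for column in zip(*records) for field in column]

row_runs = run_length_encode(row_layout)  # no two neighbours are equal
col_runs = run_length_encode(col_layout)  # repeated columns collapse
```

Here the row layout yields one run per value, while the column layout collapses the repeated chromosome and filter columns into a single run each.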
26. © 2015 MapR Technologies 26
Downstream Analytics: GWAS/PheWAS
[Pipeline diagram labels: FastQ Reads; Aligned Reads; Variants; Function; Phenotypes; ADAM + Avocado]
Scalable GWAS/PheWAS: “Green Field” Territory
27. © 2015 MapR Technologies 27
Compute Engine
Data Workflow
ADAM Pipeline
[Pipeline diagram labels: FastQ; Aligner; BAM; ADAM; Avocado; ADAM-VCF; VCF]
Super Fast
• In-memory
• Scalable
compute context
28. © 2015 MapR Technologies 28
Target Application: Alleviate / Prevent Suffering
[Diagram labels: Patient; DNA Sequencer; Reads; Reference Genome; Variant Calling; Medical Records; Genotype/Phenotype/Individual Matrix; Cure & Prevent Disease]
29. © 2015 MapR Technologies 29
GWAS Overview (Genome-wide Association Study)
• Which genome features are associated with phenotype X?
https://en.wikipedia.org/wiki/Genome-wide_association_study
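A GWAS scan can be pictured as one association test per variant against a fixed phenotype. The sketch below uses a plain Pearson chi-square on a 2x2 carrier-by-case table; that test, the function names, and the toy data are assumptions for illustration. Real studies add multiple-testing correction, covariates, and population-structure controls.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a margin is empty, so the table carries no signal
    return n * (a * d - b * c) ** 2 / denom


def gwas_scan(carriers_by_variant, cases, population):
    """Score every variant against one phenotype; highest score first.

    Assumes every carrier set and the case set are subsets of population.
    """
    scores = {}
    for variant, carriers in carriers_by_variant.items():
        a = len(carriers & cases)               # carrier, affected
        b = len(carriers - cases)               # carrier, unaffected
        c = len(cases - carriers)               # non-carrier, affected
        d = len(population - carriers - cases)  # non-carrier, unaffected
        scores[variant] = chi_square_2x2(a, b, c, d)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


population = set(range(8))          # hypothetical individual IDs
cases = {0, 1, 2, 3}                # individuals with phenotype X
carriers_by_variant = {
    "snp_a": {0, 1, 2, 3},          # carried by exactly the cases
    "snp_b": {0, 1, 4, 5},          # spread evenly across both groups
}
ranked = gwas_scan(carriers_by_variant, cases, population)
```

A PheWAS swaps the roles: the variant is held fixed and the loop runs over phenotypes instead.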
30. © 2015 MapR Technologies 30
PheWAS Overview (Phenome-wide …)
• Which phenotypes are associated with genome variant X?
http://www.tcpinnovations.com/drugbaron/phewas-the-tool-thats-revolutionizing-drug-development-that-youve-likely-never-heard-of/
31. © 2015 MapR Technologies 31
Genome × Phenome Analysis
For a given population, a given SNP 𝛿, and a given phenotype ϕ: count the number of occurrences and use it as the value of the matrix entry.
[Matrix diagram: one axis 𝛿1, 𝛿3, 𝛿5 (sparse, billion+ genotypes); other axis ϕ1, ϕ3, ϕ5 (sparse, billion+ phenotypes)]
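The counting rule on this slide maps naturally onto a sparse dictionary keyed by (SNP, phenotype), where only observed pairs are stored. A minimal sketch, assuming each individual is represented as a pair of sets; the identifiers `d1`, `p1`, and so on are placeholders.

```python
from collections import Counter
from itertools import product


def genome_phenome_counts(individuals):
    """Count, for each (SNP, phenotype) pair, how many individuals have both.

    individuals: iterable of (set_of_snps, set_of_phenotypes) pairs.
    Pairs that never co-occur are simply absent, which keeps the matrix sparse.
    """
    counts = Counter()
    for snps, phenotypes in individuals:
        counts.update(product(snps, phenotypes))
    return counts


population = [
    ({"d1", "d3"}, {"p1"}),
    ({"d1"}, {"p1", "p5"}),
    ({"d5"}, {"p3"}),
]
matrix = genome_phenome_counts(population)
```

Looking up a pair that was never observed returns 0 without storing an entry, so memory grows with the number of non-zero cells rather than with the full billion-by-billion shape.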
32. © 2015 MapR Technologies 32
Disease Cause via Genome × Phenome Matrix Factorization
• Row Eigenvectors of X represent
– Sets of related phenotypes (by SNP)
• Column Eigenvectors of Y represent
– Sets of related SNPs (by phenotype)
[Matrix diagram: 𝛿1, 𝛿3, 𝛿5 against ϕ1, ϕ3, ϕ5; Principal Column Vector: Archetype Genotypes; Principal Row Vector: Archetype Phenotypes]
Sparse Matrix Package is Actively Developed in Spark Community
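One way to picture the principal vectors is power iteration, which finds the leading pair of singular vectors of a count matrix. This is a swapped-in textbook technique, not the Spark package the slide refers to, and the small dense matrix (rows as SNPs, columns as phenotypes) is a made-up example.

```python
import math


def normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def principal_vectors(matrix, iterations=100):
    """Power iteration for the leading left and right singular vectors.

    Returns (snp_weights, phenotype_weights) for a matrix whose rows are
    SNPs and whose columns are phenotypes.
    """
    matrix_t = transpose(matrix)
    phenotype_weights = normalize([1.0] * len(matrix[0]))
    for _ in range(iterations):
        snp_weights = normalize(mat_vec(matrix, phenotype_weights))
        phenotype_weights = normalize(mat_vec(matrix_t, snp_weights))
    snp_weights = normalize(mat_vec(matrix, phenotype_weights))
    return snp_weights, phenotype_weights


# Hypothetical counts: the first two SNPs co-occur with the first two phenotypes.
counts = [
    [2.0, 2.0, 0.0],
    [2.0, 2.0, 0.0],
    [0.0, 0.0, 1.0],
]
snp_weights, phenotype_weights = principal_vectors(counts)
```

The large weights in each vector pick out a group of SNPs and a group of phenotypes that vary together; at real scale this would need a sparse, distributed implementation.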
33. © 2015 MapR Technologies 33
Generalized Approach: Genome × Phenome Tensor
• Maintain individual identity
• Aggregating individuals gives up statistical power
• Leverage pedigrees – Individuals are not independent observations
[Diagram: two panels, each with axes Variants and Phenotypes]
34. © 2015 MapR Technologies 34
Scalable Variant Store => Root out Disease Causes
Model P ~ F(G)
Fortunately, this has already been done…
[Diagram labels: Genotypes; Med Record Phenotypes, e.g. disease risk, drug response]
35. © 2015 MapR Technologies 35
Largest Biometric Database in the World
1.2B PEOPLE
36. © 2015 MapR Technologies 36
Why Create Aadhaar?
• India: 1.2 billion residents
– 640,000 villages, ~60% live under $2/day
– ~75% literacy, <3% pay income tax, <20% have bank accounts
– ~800 million mobile, ~200-300 million migrant workers
• Govt. spends about $25-40 billion on direct subsidies
– Residents have no standard identity document
– Most programs plagued with ghost and multiple identities causing leakage of 30-40%
Standardize identity => Stop leakage
37. © 2015 MapR Technologies 37
Aadhaar Biometric Capture & Index
Raw Digital Fingerprint
38. © 2015 MapR Technologies 38
Aadhaar Biometric ID Creation
F(x): unique features
G(x): uncommon features
H(x): other features
• 900MM people loaded in 4 years
• In production
– 1MM registrations/day
– 200+ trillion lookups/day
• All built on MapR-DB (HBase)
[Diagram labels: Low Entropy + Unique; Low Entropy + Infrequent]
39. © 2015 MapR Technologies 39
How Does this Relate to Genomics?
F⁻¹(x): common features
F(x): unique features
G(x): uncommon features
H(x): other features
Same data shape and size
• Aadhaar: 1B humans, 5MB minutiae
• Genome: 7B humans, ~3M variants
40. © 2015 MapR Technologies 40
How Does this Relate to Genomics?
F⁻¹(x): common features
F(x): unique features
G(x): uncommon features
H(x): other features
Phenotype: healthy or sick?
Phenotype Partition => Low Entropy
41. © 2015 MapR Technologies 41
[Diagram: matrix of individuals by fingerprint minutiae ≈ matrix of medical records by genetic variants]
• Fingerprints: find rare minutiae to uniquely identify
• Genomics: find shared variants to get disease root cause
Takeaway 1: Don’t reinvent the wheel
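The parallel between the two problems can be made concrete: both operate on a table of rows (people or medical records) by features (minutiae or variants), and differ only in whether rare or shared features are sought. A minimal sketch with made-up feature IDs; the function names are illustrative and not drawn from Aadhaar or any genomics toolkit.

```python
from collections import Counter


def feature_frequencies(rows):
    """Count in how many rows each feature appears."""
    return Counter(feature for features in rows.values() for feature in features)


def identifying_features(rows):
    """Biometric question: which features occur in exactly one row?"""
    freq = feature_frequencies(rows)
    return {
        name: {feature for feature in features if freq[feature] == 1}
        for name, features in rows.items()
    }


def shared_features(rows, group):
    """Genomics question: which features do all members of a non-empty group share?"""
    return set.intersection(*(set(rows[name]) for name in group))


rows = {
    "p1": {"m1", "m2", "m9"},
    "p2": {"m1", "m3"},
    "p3": {"m1", "m2", "m4"},
}
unique_by_row = identifying_features(rows)
common_to_sick = shared_features(rows, ["p1", "p3"])  # hypothetical "sick" group
```

The same frequency table answers both questions, which is the sense in which the infrastructure built for one problem carries over to the other.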
42. © 2015 MapR Technologies 42
Takeaway 2: Evolution, not Revolution
[Chart labels: DNA Sequence; NASDAQ Composite]
43. © 2015 MapR Technologies 43
Thank You
@allenday // @mapr
Now a few slides about MapR’s product…
…and proposed next actions
44. © 2015 MapR Technologies 44
The MapR Advantage
• Scale Reliability Across the Enterprise
– Advanced multi-tenancy
– Business continuity – HA, DR
• Speed
– 2-7x faster than other Hadoop distros
– Ultra-fast data ingest (100M data points per sec)
– NFS & R/W file system
• Real-time & Self-Service Data Exploration
– On-the-fly SQL without up-front schema
– Fast lookups and queries
Best Hadoop Platform for Data Warehouse Optimization & Analytics
[Platform diagram labels: MapR Data Platform (MapR-FS, MapR-DB); Integrated Commercial Engines; Tools; Compute Engines; Batch; SQL; Streaming; NoSQL & Search; ML, Graph; Security; Provisioning & coordination; Workflow & Data Governance; workloads: Batch, Interactive, Real-time, Online, Others; Management; Operations; Governance; Audits; Security]
45. © 2015 MapR Technologies 45
Genome Sequencing Quick Start Solution
46. © 2015 MapR Technologies 46
Quick Start Solutions: Speeding Time-to-Value
SOLUTION TEMPLATE
KNOWLEDGE TRANSFER
DEPLOYMENT ARCHITECTURE
• Data Warehouse Optimization and Analytics
• Security Log Analytics
• Recommendation Engine
• Genome Sequencing
47. © 2015 MapR Technologies 47
What’s in the Genome Sequencing Quick Start Solution?
• 6 nodes of MapR software
• 3-4 week engagement
• 3 Hadoop Professional Certifications
48. © 2015 MapR Technologies 48
Service Offering 1 – Resequencing with Hadoop
• Reduces Storage Hardware Requirements
• Accelerates Data Processing Time
• Minimal impact to existing data pipelines
Service Offering 2 – Variant Analysis with NoSQL
• Present data for exploration
• Operationalize complex workflows
• Web-scale performance
49. © 2015 MapR Technologies 49
Quick Start Service Engagement
Engagement includes:
1. Identification of data sources, transformations and reporting engines
2. Access and use of the solution template including source code
3. Training on customizing the solution template to the organization’s requirements
4. Deployment architecture document that enables a production deployment plan for the specific solution
SOLUTION TEMPLATE
KNOWLEDGE TRANSFER
DEPLOYMENT ARCHITECTURE
Editor's Notes
clinical
35
Increase GDP by 2%
BOOM
LSH
Why is MapR the best Hadoop platform for data warehouse optimization? Business-critical applications need data protection and security (availability, data protection, and recovery), high performance (with a random read-write file system), multi-tenancy (to support multiple business units and isolate applications or user data), good resource and workload management to support multiple applications, and open standards to integrate with the rest of the IT ecosystem.
You also need a platform capable of very fast data ingestion from multiple sources that can run critical analytics and make decisions at speed (in milliseconds) and at scale. Examples include breach detection based on information from multiple sources, fraud detection on millions of transactions based on individual patterns, and fleet management and routing that take current conditions into account. This requires a Hadoop platform that goes beyond batch and supports streaming writes, so data can be written continuously while analysis runs. It also requires high performance to meet business needs, and real-time operations: the ability to perform online database operations so you can react to a business situation and affect it as it happens, rather than report on it a week, month, or quarter later.
Data agility is needed for business agility. Drill provides instant ANSI SQL for Hadoop & NoSQL. You can explore data in its native format without expensive and time-consuming transformation. You can analyze evolving and semi-structured/nested data from NoSQL databases, find what is of value, and THEN model it in your DW schema for downstream ad hoc reporting by hundreds or thousands of concurrent users.
MapR Quick Start Solutions are a set of purpose-built solutions for the most critical and valuable use cases for Hadoop. These solutions, which include pre-built templates for each of the areas listed below, let you quickly get started with Hadoop and achieve faster time-to-value. We currently have offers around DWO, security log analytics, and recommendation engines with more planned for 2015.