© 2014 MapR Technologies 1
Hadoop for Genomics: What you need to know
© 2014 MapR Technologies 2
DNA Sequencing, pre-2004
years
CPU
transistors/mm2
HDD
GB/mm2
DNA
bp/$, pre-2004
© 2014 MapR Technologies 3
DNA Sequencing, 2004 Disruption
years
CPU
transistors/mm2
HDD
GB/mm2
DNA
bp/$, post-2004
DNA
bp/$, pre-2004
© 2014 MapR Technologies 4
DNA Sequencing, 2004 Disruption
years
CPU
transistors/mm2
HDD
GB/mm2
DNA
bp/$, post-2004
DNA
bp/$, pre-2004
Similar disruption occurred for
Internet traffic in mid-1990s
© 2014 MapR Technologies 5
Effect: Many DNA-Based Apps Coming…
• 2014: US$ 2B, mostly
research, mostly
chemical costs
• 2020: US$ 20B,
mostly clinical, mostly
analytics costs
Macquarie Capital, 2014. Genomics 2.0: It’s just the beginning
[Chart: Clinical vs Non-Clinical, 2014 and 2020; axis 0 to 25]
© 2014 MapR Technologies 6
Genomics Value Chain
Order Test
from Clinic
Extract
Biosample
BioBank
Biosample
DNA
Extraction
Sequence
Biosample
Secondary
Analytics
Tertiary
Analytics
Reporting
to Clinic
Academic R&D
Pharma R&D
Clinic Therapy
Increased scale requirement
Increased feature set requirement
© 2014 MapR Technologies 7
Genomics Value Chain
                  Sequence Biosample    Secondary Analytics   Tertiary Analytics
Academic R&D      OK, e.g. ILMN XTen    OK (GATK)             Not OK (manual)
Pharma R&D        OK, e.g. ILMN XTen    Not OK (GATK)         Missing, manual
Clinic Therapy    OK, e.g. ILMN XTen    Missing               Missing
Increased scale requirement
Increased feature set requirement
Requirements
• Data Intense
• Batch
• High utilization
• Low COGS
Requirements
• Data Intense
• Interactive
• Easy to integrate
• Expressive
© 2014 MapR Technologies 8
Target Application: Alleviate / Prevent (Deterministic) Suffering
Variant
Calling
DNA
Sequencer
Reads
Reference
Genome
Genotype/
Phenotype/
Individual
Matrix
Cure &
Prevent
Disease
Medical
Records
Patient
© 2014 MapR Technologies 9
http://steamcommunity.com/app/203160/discussions/0/846956188647169800/
http://www.vox.com/2015/2/1/7955921/lara-croft-moores-law
What Does Moore’s Law Feel Like? #Dataviz:
Lara Croft 230=>40,000 Polygons (1996-2014)
© 2014 MapR Technologies 10
Application: Forensics
http://cgi.uconn.edu/stranger-visions-forensic-art-exhibit/
http://snapshot.parabon-nanolabs.com/
http://www.nature.com/news/mugshots-built-from-dna-data-1.14899
© 2014 MapR Technologies 11
Growth in Resource Capacity
© 2014 MapR Technologies 12
Disruption Circa 2000
NASDAQ
Composite
© 2014 MapR Technologies 13
What Happened?
What did winners
do right to survive
the .com recession?
NASDAQ
Composite
© 2014 MapR Technologies 14
Early 1990s: Early eCommerce Vendor Setup
Storage
read/write
read/write
Website
Back Office
© 2014 MapR Technologies 15
Early 1990s: Early eCommerce Vendor Setup
Storage
read/write
read/write
Website
Back Office
<= SAN & NAS, Oracle
<= HPC
© 2014 MapR Technologies 16
Late 1990s: Workload became too big
Storage
read/write
read/write
Website Website Website Website
Back Office Back Office
© 2014 MapR Technologies 17
Survivor Strategy Revealed: Google Publishes
• 2003: Google Filesystem (aka GFS)
– http://research.google.com/archive/gfs.html
• 2004: MapReduce
– http://research.google.com/archive/mapreduce.html
• 2006: BigTable
– http://research.google.com/archive/bigtable.html
© 2014 MapR Technologies 18
Scale-out with Google FS + MapReduce
read/write
read/write
Website Website Website Website
Storage + Compute Cluster
Back Office Back Office
© 2014 MapR Technologies 19
Genomics: Internet Boom Déjà Vu
© 2014 MapR Technologies 20
DNA Sequencing, post-2004 DNA Sequence
NASDAQ
Composite
© 2014 MapR Technologies 21
DNA Sequencing, pre-2004
Storage
write-only
read/write
High-Performance Compute Cluster
Coordinator /
Edge Node
Sequencer
SAN & NAS =>
HPC =>
© 2014 MapR Technologies 22
DNA Sequencing, post-2004
Storage
write-only
read/write
High-Performance Compute Cluster
Coordinator /
Edge Node
DNA Sequencer Cluster (e.g. Illumina X-Ten)
© 2014 MapR Technologies 23
DNA Sequencing, post-2004
Storage
write-only
read/write
High-Performance Compute Cluster
Coordinator /
Edge Node
DNA Sequencer Cluster (e.g. Illumina X-Ten)
HPC bottleneck
Sequencer
back-pressure
© 2014 MapR Technologies 24
DNA Sequencing, post-2004
Storage
write-only
read/write
High-Performance Compute Cluster
Coordinator /
Edge Node
DNA Sequencer Cluster (e.g. Illumina X-Ten)
HPC bottleneck
Sequencer
back-pressure
NAS doesn’t look like a
great solution anymore…
© 2014 MapR Technologies 25
Solution: Implemented 2014 @ Complete Genomics
with MapR
write-only
DNA Sequencer Cluster (e.g. Illumina X-Ten)
Storage + Compute Cluster
Decentralize I/O
Decentralize I/O
© 2014 MapR Technologies 26
Application Server
mapr-nfsserver
Linux NFS Client
MapR client API
Loopback Mount:
localhost:/mapr /mapr
mapr-fileserver
S1
mapr-fileserver
S2
mapr-fileserver
S3
mapr-fileserver
S4
mapr-fileserver
S5
Translate NFS into API Calls
MapR Inline Compression
[Diagram: a file is split into Chunk 1 to Chunk 5, 256MB each; each chunk number appears three times across the fileservers S1 to S5]
MapR Data Platform
Network Security :
MapR RPC Full Wire Encryption
Client -> Server Communication
Server -> Server Communication
Supported Compression algorithms
( per Directory )
LZ4, LZF, ZLIB
Network Traffic will be
compressed automatically
MapR NFS Gateway on Application Servers
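Because the gateway exposes the cluster through a loopback NFS mount (localhost:/mapr mounted at /mapr), an application can use ordinary file I/O and leave the translation into API calls to mapr-nfsserver. A minimal sketch in Python, under stated assumptions: the function name, the runs/<run_id> directory layout and the file name are hypothetical and do not come from the slides, and a temporary directory stands in for /mapr so the example runs on any machine.

```python
import tempfile
from pathlib import Path


def write_run_output(mount_root, run_id, filename, payload):
    """Write sequencer output with plain file I/O under an NFS mount point.

    mount_root is the loopback mount point ("/mapr" on the slide).
    The runs/<run_id>/ layout is a hypothetical convention for this sketch.
    """
    target_dir = Path(mount_root) / "runs" / run_id
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)
    return target


if __name__ == "__main__":
    # A temporary directory stands in for /mapr (assumption for portability).
    with tempfile.TemporaryDirectory() as fake_mount:
        record = b"@read1\nTTGGAG\n+\nABCDEF\n"
        path = write_run_output(fake_mount, "run001", "lane1.fastq", record)
        print(path.read_bytes().decode())
```

Nothing in the application code is specific to the storage layer, which is the point of the slide: the sequencer-side software keeps writing files, and the I/O is spread over the fileservers behind the mount.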
© 2014 MapR Technologies 27
[WHITEBOARD BREAK]
© 2014 MapR Technologies 28
[REDACTED]
© 2014 MapR Technologies 29
Allows Secondary Analytics to Scale Out
Variant
Calling
DNA
Sequencer
Reads
Reference
Genome
Genotype/
Phenotype/
Individual
Matrix
Cure &
Prevent
Disease
Medical
Records
Patient
© 2014 MapR Technologies 30
Secondary Analytics: Acute Pain Point
FastQ
Reads
Aligned
Reads
Variants
ADAM + Avocado
Matrix rotation
is very I/O
intense
Velvet: Algorithms for de novo short read assembly
using de Bruijn graphs, Zerbino & Birney. 2008
Local de novo
is best…
…only feasible
with efficient
rotations
© 2014 MapR Technologies 31
Apache Parquet
© 2014 MapR Technologies 32
Row-Oriented Format
ID      Reference  Position  Next ID  Sequence  Quality
read1   chr1       10000     read2    TTGGAG    ABCDEF
read2   chr1       20000     -        TCGTAA    ABCDEF
read3   chr2       5000      -        GGGAAC    ABCDEF
read4   chr3       1000000   read6    CCCTAC    ABCDEF
read5   chr4       900000    -        TTTAAG    ABCDEF
[Figure labels: 0, 5, 20, 40, 57]
© 2014 MapR Technologies 33
Row-Oriented Splitting
© 2014 MapR Technologies 34
Column-Oriented Format
ID:         read1   read2   read3   read4    read5
Reference:  chr1    chr1    chr2    chr3     chr4
Position:   10000   20000   5000    1000000  900000
Next ID:    read2   -       -       read6    -
Sequence:   TTGGAG  TCGTAA  GGGAAC  CCCTAC   TTTAAG
Quality:    ABCDEF  ABCDEF  ABCDEF  ABCDEF   ABCDEF
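The difference between the two layouts can be sketched in a few lines of Python. The five reads are the ones shown on the row-oriented slide; the helper names are hypothetical, and run-length encoding is used only as a simple stand-in to show why values of one field stored together tend to compress well.

```python
FIELDS = ["id", "reference", "position", "next_id", "sequence", "quality"]

# Row-oriented: one tuple per read ("-" on the slide is None here).
ROWS = [
    ("read1", "chr1", 10000, "read2", "TTGGAG", "ABCDEF"),
    ("read2", "chr1", 20000, None, "TCGTAA", "ABCDEF"),
    ("read3", "chr2", 5000, None, "GGGAAC", "ABCDEF"),
    ("read4", "chr3", 1000000, "read6", "CCCTAC", "ABCDEF"),
    ("read5", "chr4", 900000, None, "TTTAAG", "ABCDEF"),
]


def to_columns(rows, fields):
    """Pivot row tuples into one list per field (column-oriented layout)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(fields)}


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


columns = to_columns(ROWS, FIELDS)

# A query on two fields touches two columns, not every full record.
positions_on_chr1 = [
    pos
    for ref, pos in zip(columns["reference"], columns["position"])
    if ref == "chr1"
]
print(positions_on_chr1)                         # [10000, 20000]
print(run_length_encode(columns["reference"]))   # chr1 collapses to one run of 2
print(run_length_encode(columns["quality"]))     # [['ABCDEF', 5]]
```

In the row layout the same query would have to read sequence and quality strings for every read, even though it never uses them.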
© 2014 MapR Technologies 35
Column-Oriented Format Partitioning
ID:         read1   read2   read3   read4    read5
Reference:  chr1    chr1    chr2    chr3     chr4
Position:   10000   20000   5000    1000000  900000
Next ID:    read2   -       -       read6    -
Sequence:   TTGGAG  TCGTAA  TTGGAG  GGGAAC   TTTAAG
Quality:    ABCDEF  ABCDEF  ABCDEF  ABCDEF   ABCDEF
© 2014 MapR Technologies 36
Column-Oriented Format Splitting
© 2014 MapR Technologies 37
Apache Parquet
© 2014 MapR Technologies 38
Apache Parquet
http://grepalex.com/2014/05/13/parquet-file-format-and-object-model/
© 2014 MapR Technologies 39
Allows Secondary Analytics to Scale Out
GATK / HPC
method: flat after
chromosome split
Hadoop / Spark
method
© 2014 MapR Technologies 40
Tertiary Analytics
© 2014 MapR Technologies 41
Downstream Analytics: GWAS/PheWAS
FastQ
Reads
Aligned
Reads
Variants
Function
Phenotypes
Scalable GWAS/PheWAS:
“Green Field” Territory
ADAM + Avocado
© 2014 MapR Technologies 42
Target Application: Alleviate / Prevent Suffering
Variant
Calling
DNA
Sequencer
Reads
Reference
Genome
Genotype/
Phenotype/
Individual
Matrix
Cure &
Prevent
Disease
Medical
Records
Patient
© 2014 MapR Technologies 43
GWAS Overview (Genome-wide Association Study)
• Which genome features are associated with phenotype X?
https://en.wikipedia.org/wiki/Genome-wide_association_study
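The question on this slide can be made concrete with a toy association test. The sketch below scores each variant with a Pearson chi-square statistic on a 2x2 table of carriers vs non-carriers and cases vs controls. The counts and variant names are made up for illustration, and a real GWAS would typically add covariates and a multiple-testing correction, none of which is shown here.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]].

    Rows: carriers / non-carriers of a variant.
    Columns: cases / controls for phenotype X.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Made-up counts, for illustration only.
variants = {
    "snp_a": (30, 10, 20, 40),  # carriers are enriched among cases
    "snp_b": (25, 25, 25, 25),  # no association
}

scores = {name: chi_square_2x2(*counts) for name, counts in variants.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(name, round(score, 2))
```

Ranking variants by such a score is the simplest form of "which genome features are associated with phenotype X"; each variant is scored independently, so the work splits cleanly by variant.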
© 2014 MapR Technologies 44
PheWAS Overview (Phenome-wide …)
• Which phenotypes are associated with genome variant X?
http://www.tcpinnovations.com/drugbaron/phewas-the-tool-thats-revolutionizing-drug-development-that-youve-likely-never-heard-of/
© 2014 MapR Technologies 45
Genome × Phenome Analysis
For given population,
given SNP 𝛿, and
given phenotype ϕ:
Count the number
of occurrences as the
value of the matrix
[Figure: sparse matrix with rows 𝛿1, 𝛿3, 𝛿5 and columns ϕ1, ϕ3, ϕ5]
Sparse: billion+ phenotypes
Sparse: billion+ genotypes
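The counting rule on this slide can be sketched as a dictionary of non-zero cells. The records below are hypothetical, with "d1" and "p1" standing in for 𝛿1 and ϕ1; only (SNP, phenotype) pairs that actually occur are stored, which is what makes a billion-by-billion matrix tractable when it is sparse.

```python
from collections import Counter

# Hypothetical population: each individual has a set of SNPs and phenotypes.
population = [
    {"snps": {"d1", "d3"}, "phenotypes": {"p1"}},
    {"snps": {"d1"}, "phenotypes": {"p1", "p3"}},
    {"snps": {"d5"}, "phenotypes": {"p5"}},
]


def count_matrix(individuals):
    """Count individuals carrying each (SNP, phenotype) pair.

    Only non-zero cells are stored, as in a sparse matrix.
    """
    counts = Counter()
    for person in individuals:
        for snp in person["snps"]:
            for phenotype in person["phenotypes"]:
                counts[(snp, phenotype)] += 1
    return counts


matrix = count_matrix(population)
print(matrix[("d1", "p1")])  # 2: two individuals carry d1 and show p1
print(matrix[("d5", "p1")])  # 0: absent cells read as zero
print(len(matrix))           # 4 non-zero cells are stored
```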
© 2014 MapR Technologies 46
Disease Cause via Genome × Phenome Matrix Factorization
• Row Eigenvectors of X represent
– Sets of related phenotypes (by SNP)
• Column Eigenvectors of Y represent
– Sets of related SNPs (by phenotype)
[Figure: matrix with rows 𝛿1, 𝛿3, 𝛿5 and columns ϕ1, ϕ3, ϕ5; labels: Principal Column Vector, Archetype Genotypes, Archetype Phenotypes, Principal Row Vector]
Sparse Matrix
Package is Actively
Developed in Spark
Community
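One way to read the factorization on this slide is as a singular value decomposition: the leading left and right singular vectors are eigenvectors of M·Mᵀ and Mᵀ·M, and they weight the rows (SNPs) and columns (phenotypes) that vary together. The sketch below uses plain-Python power iteration on a tiny dense matrix with made-up counts. It is a stand-in for illustration only, not the Spark sparse matrix package mentioned on the slide, and the function name is hypothetical.

```python
import math


def leading_singular_vectors(matrix, iterations=100):
    """Power iteration for the leading left/right singular vectors.

    matrix is a dense list of rows (SNPs) by columns (phenotypes).
    Returns (u, sigma, v): u weights the rows, v weights the columns.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    u = [0.0] * n_rows
    sigma = 0.0
    for _ in range(iterations):
        # u = M v, normalized to unit length
        u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        norm_u = math.sqrt(sum(x * x for x in u))
        if norm_u == 0.0:
            raise ValueError("matrix has no signal along the start vector")
        u = [x / norm_u for x in u]
        # v = M^T u; its length estimates the leading singular value
        v = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        sigma = math.sqrt(sum(x * x for x in v))
        v = [x / sigma for x in v]
    return u, sigma, v


# Made-up counts: rows are SNPs d1, d3, d5; columns are phenotypes p1, p3, p5.
counts = [
    [4, 3, 0],
    [2, 2, 0],
    [0, 0, 1],
]
u, sigma, v = leading_singular_vectors(counts)
print([round(x, 3) for x in u])
print(round(sigma, 3))
print([round(x, 3) for x in v])
```

In this toy matrix the leading vectors put almost all their weight on d1/d3 and p1/p3, which is the "archetype" grouping the slide describes; d5 and p5 form a separate, weaker component.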
© 2014 MapR Technologies 47
Generalized Approach: Genome × Phenome Tensor
• Maintain individual identity
• Aggregating individuals gives up statistical power
• Leverage pedigrees – Individuals are not independent observations
Variants
Phenotypes
Variants
Phenotypes
© 2014 MapR Technologies 48
Scalable Variant Store => Root out Disease Causes
Model P ~ F(G)
Fortunately, this has already been done…
Genotypes Med Record Phenotypes, e.g.
disease risk, drug response
© 2014 MapR Technologies 49
Largest Biometric Database in the World
1.2B PEOPLE
© 2014 MapR Technologies 50
Why Create Aadhaar?
• India: 1.2 billion residents
– 640,000 villages, ~60% live under $2/day
– ~75% literacy, <3% pay income tax, <20% have bank accounts
– ~800 million mobile, ~200-300 million migrant workers
• Govt. spends about $25-40 billion on direct subsidies
– Residents have no standard identity document
– Most programs plagued with ghost and multiple identities causing
leakage of 30-40%
Standardize identity => Stop leakage
© 2014 MapR Technologies 51
Aadhaar Biometric Capture & Index
Raw
Digital
Fingerprint
© 2014 MapR Technologies 52
Aadhaar Biometric ID Creation
F(x): unique features
G(x): uncommon features
H(x): other features
• 900MM people loaded in 4
years
• In production
– 1MM registrations/day
– 200+ trillion lookups/day
• All built on MapR-DB (HBase)
Low Entropy +
Unique
Low Entropy +
Infrequent
© 2014 MapR Technologies 53
Consistent, Low Latency
--- M7 Read Latency --- Others Read Latency
© 2014 MapR Technologies 54
How Does this Relate to Genomics?
F^-1(x): common features
F(x): unique features
G(x): uncommon features
H(x): other features
Same data shape and size
• Aadhaar: 1B humans, 5MB minutiae
• Genome: 7B humans, ~3M variants
© 2014 MapR Technologies 55
How Does this Relate to Genomics?
F^-1(x): common features
F(x): unique features
G(x): uncommon features
H(x): other features
Phenotype:
healthy or sick?
Phenotype Partition
=>
Low Entropy
© 2014 MapR Technologies 56
≈
individuals
fingerprint minutiae
Find rare minutiae to
uniquely identify
medical records
genetic variants
Find shared variants
to get disease root
cause
Takeaway 1: Don’t reinvent the wheel
© 2014 MapR Technologies 57
Takeaway 2: Evolution, not Revolution
DNA Sequence
NASDAQ
Composite
© 2014 MapR Technologies 58
Thank You
@allenday // @mapr
Now a few slides about MapR’s product…
…and proposed next actions
© 2014 MapR Technologies 59
“Quick Start” Package
Engagement includes:
1. Identification of data sources, transformations and reporting engines
2. Access and use of the solution template including source code
3. Training on customizing the solution template to the organization’s requirements
4. Deployment architecture document that enables a production deployment plan for the specific solution
SOLUTION
TEMPLATE
KNOWLEDGE
TRANSFER
DEPLOYMENT
ARCHITECTURE
© 2014 MapR Technologies 60
“Quick Start” 1 – Resequencing with Hadoop
Reduces Storage
Hardware
Requirements
Accelerates Data
Processing Time
Minimal impact to
existing data
pipelines
“Quick Start” 2 – Variant Analysis with NoSQL
Present data for
exploration
Operationalize
complex workflows
Web-scale
performance
© 2014 MapR Technologies 62
Genomics Value Chain
                  Sequence Biosample    Secondary Analytics   Tertiary Analytics
Academic R&D      OK, e.g. ILMN XTen    OK (GATK)             Not OK
Pharma R&D        OK, e.g. ILMN XTen    Not OK                Missing
Clinic Therapy    OK, e.g. ILMN XTen    Missing               Missing
© 2014 MapR Technologies 63
Genomics Value Chain
                  Sequence Biosample    Secondary Analytics   Tertiary Analytics
Academic R&D      OK, e.g. ILMN XTen    OK (GATK)             Not OK
Pharma R&D        OK, e.g. ILMN XTen    Not OK                Missing
Clinic Therapy    OK, e.g. ILMN XTen    Missing               Missing
Addressed by
Quick Start 1
Addressed by
Quick Start 2
© 2014 MapR Technologies 64
BONUS ROUND
© 2014 MapR Technologies 65
Genealogy Company
Slides credit: Bill Yetman, Hadoop Summit 2014
http://slidesha.re/1vRh3kY
© 2014 MapR Technologies 66
GERMLINE is…
• …an algorithm that finds hidden relationships within a pool of
DNA
• …the reference implementation of that algorithm written in C++.
• You can find it here:
http://www1.cs.columbia.edu/~gusev/germline/
© 2014 MapR Technologies 67
Projected GERMLINE run times (in hours)
[Chart: GERMLINE run times and projected GERMLINE run times; y-axis Hours (0 to 700), x-axis Samples (2,500 to 122,500)]
700 hours = 29+ days
EXPONENTIAL COMPLEXITY
© 2014 MapR Technologies 68
GERMLINE: What’s the Problem?
• GERMLINE (the implementation) was not meant to be used in
an industrial setting
– Stateless, single threaded, prone to swapping (heavy memory usage)
– GERMLINE performs poorly on large data sets
• Our metrics predicted exactly where the process would slow to
a crawl
• Put simply: GERMLINE couldn't scale
© 2014 MapR Technologies 69
Run times for matching (in hours)
[Chart: GERMLINE run times, Jermline run times and projected GERMLINE run times; y-axis Hours (0 to 180), x-axis Samples; annotations: EXPONENTIAL, LINEAR, HBase Refactor]
© 2014 MapR Technologies 70
• Paper submitted describing the implementation
• Releasing as an Open Source project soon
• [HBase Schema/Algorithm Slides]
© 2014 MapR Technologies 71
Further Growth & Optimization
© 2014 MapR Technologies 72
Underdog (Strand Phasing) performance
– Went from 12 hours to process 1,000 samples
to under 25 minutes with a MapReduce
implementation
With improved accuracy!
Underdog
replaces
Beagle
[Chart: Total Run Size and Total Beagle-Underdog Duration; axis 0 to 80,000]
© 2014 MapR Technologies 73
Pipeline steps and incremental change…
– Incremental change over time
– Supporting the business in a “just in time” Agile way
[Chart of pipeline steps; axis ticks 0 to 250,000 and 500 to 395,432. Legend: Beagle-Underdog Phasing, Pipeline Finalize, Relationship Processing, Germline-Jermline Results Processing, Germline-Jermline Processing, Beagle Post Phasing, Admixture, Plink Prep, Pipeline Initialization]
Jermline replaces
Germline
Ethnicity V2 Release
Underdog Replaces
Beagle
AdMixture on
Hadoop
© 2014 MapR Technologies 74
…while the business continues to grow rapidly
[Chart: DNA Database Size (# of processed samples), 0 to 450,000, Jan-12 to Apr-14]

More Related Content

What's hot

OpenPOWER Academia and Research team's webinar - Presentations from Oak Ridg...
OpenPOWER Academia and Research team's webinar  - Presentations from Oak Ridg...OpenPOWER Academia and Research team's webinar  - Presentations from Oak Ridg...
OpenPOWER Academia and Research team's webinar - Presentations from Oak Ridg...Ganesan Narayanasamy
 
Cloud Accelerated Genomics
Cloud Accelerated GenomicsCloud Accelerated Genomics
Cloud Accelerated GenomicsIdan Tohami
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleAndy Petrella
 
Content Framework for Operational Environmental Remote Sensing Data Sets: NPO...
Content Framework for Operational Environmental Remote Sensing Data Sets: NPO...Content Framework for Operational Environmental Remote Sensing Data Sets: NPO...
Content Framework for Operational Environmental Remote Sensing Data Sets: NPO...The HDF-EOS Tools and Information Center
 
Big Data Analytics with R
Big Data Analytics with RBig Data Analytics with R
Big Data Analytics with RGreat Wide Open
 
Opportunities for HPC in pharma R&D - main deck
Opportunities for HPC in pharma R&D - main deckOpportunities for HPC in pharma R&D - main deck
Opportunities for HPC in pharma R&D - main deckPistoia Alliance
 
Grid'5000: Running a Large Instrument for Parallel and Distributed Computing ...
Grid'5000: Running a Large Instrument for Parallel and Distributed Computing ...Grid'5000: Running a Large Instrument for Parallel and Distributed Computing ...
Grid'5000: Running a Large Instrument for Parallel and Distributed Computing ...Frederic Desprez
 
The Gordon Data-intensive Supercomputer. Enabling Scientific Discovery
The Gordon Data-intensive Supercomputer. Enabling Scientific DiscoveryThe Gordon Data-intensive Supercomputer. Enabling Scientific Discovery
The Gordon Data-intensive Supercomputer. Enabling Scientific DiscoveryIntel IT Center
 
Using the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchUsing the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchRobert Grossman
 
Carpenter - Wolfram Data Summit ResourceSync
Carpenter - Wolfram Data Summit ResourceSyncCarpenter - Wolfram Data Summit ResourceSync
Carpenter - Wolfram Data Summit ResourceSyncnisohq
 
Power of Python with Big Data
Power of Python with Big DataPower of Python with Big Data
Power of Python with Big DataEdureka!
 
The Pacific Research Platform
The Pacific Research PlatformThe Pacific Research Platform
The Pacific Research PlatformLarry Smarr
 
EDF2012 Peter Boncz - LOD benchmarking SRbench
EDF2012   Peter Boncz - LOD benchmarking SRbenchEDF2012   Peter Boncz - LOD benchmarking SRbench
EDF2012 Peter Boncz - LOD benchmarking SRbenchEuropean Data Forum
 
Hadoop, Pig, and Python (PyData NYC 2012)
Hadoop, Pig, and Python (PyData NYC 2012)Hadoop, Pig, and Python (PyData NYC 2012)
Hadoop, Pig, and Python (PyData NYC 2012)mortardata
 
Big Data Analytics made easy using Apache Hive to R Connector - StampedeCon 2014
Big Data Analytics made easy using Apache Hive to R Connector - StampedeCon 2014Big Data Analytics made easy using Apache Hive to R Connector - StampedeCon 2014
Big Data Analytics made easy using Apache Hive to R Connector - StampedeCon 2014StampedeCon
 
Advances at the Argonne Leadership Computing Center
Advances at the Argonne Leadership Computing CenterAdvances at the Argonne Leadership Computing Center
Advances at the Argonne Leadership Computing Centerdavidemartin
 

What's hot (20)

OpenPOWER Academia and Research team's webinar - Presentations from Oak Ridg...
OpenPOWER Academia and Research team's webinar  - Presentations from Oak Ridg...OpenPOWER Academia and Research team's webinar  - Presentations from Oak Ridg...
OpenPOWER Academia and Research team's webinar - Presentations from Oak Ridg...
 
Cloud Accelerated Genomics
Cloud Accelerated GenomicsCloud Accelerated Genomics
Cloud Accelerated Genomics
 
Big data analytics using R
Big data analytics using RBig data analytics using R
Big data analytics using R
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scale
 
Big data business case
Big data   business caseBig data   business case
Big data business case
 
Content Framework for Operational Environmental Remote Sensing Data Sets: NPO...
Content Framework for Operational Environmental Remote Sensing Data Sets: NPO...Content Framework for Operational Environmental Remote Sensing Data Sets: NPO...
Content Framework for Operational Environmental Remote Sensing Data Sets: NPO...
 
Big Data Analytics with R
Big Data Analytics with RBig Data Analytics with R
Big Data Analytics with R
 
R for data analytics
R for data analyticsR for data analytics
R for data analytics
 
Opportunities for HPC in pharma R&D - main deck
Opportunities for HPC in pharma R&D - main deckOpportunities for HPC in pharma R&D - main deck
Opportunities for HPC in pharma R&D - main deck
 
Grid'5000: Running a Large Instrument for Parallel and Distributed Computing ...
Grid'5000: Running a Large Instrument for Parallel and Distributed Computing ...Grid'5000: Running a Large Instrument for Parallel and Distributed Computing ...
Grid'5000: Running a Large Instrument for Parallel and Distributed Computing ...
 
The Gordon Data-intensive Supercomputer. Enabling Scientific Discovery
The Gordon Data-intensive Supercomputer. Enabling Scientific DiscoveryThe Gordon Data-intensive Supercomputer. Enabling Scientific Discovery
The Gordon Data-intensive Supercomputer. Enabling Scientific Discovery
 
Using the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchUsing the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science Research
 
Carpenter - Wolfram Data Summit ResourceSync
Carpenter - Wolfram Data Summit ResourceSyncCarpenter - Wolfram Data Summit ResourceSync
Carpenter - Wolfram Data Summit ResourceSync
 
Power of Python with Big Data
Power of Python with Big DataPower of Python with Big Data
Power of Python with Big Data
 
The Pacific Research Platform
The Pacific Research PlatformThe Pacific Research Platform
The Pacific Research Platform
 
HDF5 for NPOESS Data Products
HDF5 for NPOESS Data ProductsHDF5 for NPOESS Data Products
HDF5 for NPOESS Data Products
 
EDF2012 Peter Boncz - LOD benchmarking SRbench
EDF2012   Peter Boncz - LOD benchmarking SRbenchEDF2012   Peter Boncz - LOD benchmarking SRbench
EDF2012 Peter Boncz - LOD benchmarking SRbench
 
Hadoop, Pig, and Python (PyData NYC 2012)
Hadoop, Pig, and Python (PyData NYC 2012)Hadoop, Pig, and Python (PyData NYC 2012)
Hadoop, Pig, and Python (PyData NYC 2012)
 
Big Data Analytics made easy using Apache Hive to R Connector - StampedeCon 2014
Big Data Analytics made easy using Apache Hive to R Connector - StampedeCon 2014Big Data Analytics made easy using Apache Hive to R Connector - StampedeCon 2014
Big Data Analytics made easy using Apache Hive to R Connector - StampedeCon 2014
 
Advances at the Argonne Leadership Computing Center
Advances at the Argonne Leadership Computing CenterAdvances at the Argonne Leadership Computing Center
Advances at the Argonne Leadership Computing Center
 

Viewers also liked

Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaAndy Petrella
 
Scaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMScaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMfnothaft
 
Strata Big Data Science Talk on ADAM
Strata Big Data Science Talk on ADAMStrata Big Data Science Talk on ADAM
Strata Big Data Science Talk on ADAMMatt Massie
 
Genome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMGenome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMAllen Day, PhD
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreUri Laserson
 
Spark meetup london share and analyse genomic data at scale with spark, adam...
Spark meetup london  share and analyse genomic data at scale with spark, adam...Spark meetup london  share and analyse genomic data at scale with spark, adam...
Spark meetup london share and analyse genomic data at scale with spark, adam...Andy Petrella
 
Strata-Hadoop 2015 Presentation
Strata-Hadoop 2015 PresentationStrata-Hadoop 2015 Presentation
Strata-Hadoop 2015 PresentationTimothy Danford
 
Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?Timothy Danford
 
Processing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilProcessing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilSpark Summit
 
Building cloud-enabled genomics workflows with Luigi and Docker
Building cloud-enabled genomics workflows with Luigi and DockerBuilding cloud-enabled genomics workflows with Luigi and Docker
Building cloud-enabled genomics workflows with Luigi and DockerJacob Feala
 
Developing openEHR EHRs - core functionalities
Developing openEHR EHRs - core functionalitiesDeveloping openEHR EHRs - core functionalities
Developing openEHR EHRs - core functionalitiesPablo Pazos
 
2016 AWS Life Sciences Days | Boston, MA – May 17, 2016
2016 AWS Life Sciences Days | Boston, MA – May 17, 20162016 AWS Life Sciences Days | Boston, MA – May 17, 2016
2016 AWS Life Sciences Days | Boston, MA – May 17, 2016Amazon Web Services
 
Intel precision medicine apr 2015
Intel precision medicine apr 2015Intel precision medicine apr 2015
Intel precision medicine apr 2015Ketan Paranjape
 

Viewers also liked (15)

Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and Scala
 
Scaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMScaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAM
 
Strata Big Data Science Talk on ADAM
Strata Big Data Science Talk on ADAMStrata Big Data Science Talk on ADAM
Strata Big Data Science Talk on ADAM
 
Genome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMGenome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAM
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant Store
 
Spark meetup london share and analyse genomic data at scale with spark, adam...
Spark meetup london  share and analyse genomic data at scale with spark, adam...Spark meetup london  share and analyse genomic data at scale with spark, adam...
Spark meetup london share and analyse genomic data at scale with spark, adam...
 
Strata-Hadoop 2015 Presentation
Strata-Hadoop 2015 PresentationStrata-Hadoop 2015 Presentation
Strata-Hadoop 2015 Presentation
 
openEHR sll-2015final
openEHR sll-2015finalopenEHR sll-2015final
openEHR sll-2015final
 
Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?
 
Processing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilProcessing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And Toil
 
Building cloud-enabled genomics workflows with Luigi and Docker
Building cloud-enabled genomics workflows with Luigi and DockerBuilding cloud-enabled genomics workflows with Luigi and Docker
Building cloud-enabled genomics workflows with Luigi and Docker
 
Developing openEHR EHRs - core functionalities
Developing openEHR EHRs - core functionalitiesDeveloping openEHR EHRs - core functionalities
Developing openEHR EHRs - core functionalities
 
2016 AWS Life Sciences Days | Boston, MA – May 17, 2016
2016 AWS Life Sciences Days | Boston, MA – May 17, 20162016 AWS Life Sciences Days | Boston, MA – May 17, 2016
2016 AWS Life Sciences Days | Boston, MA – May 17, 2016
 
Personal Genomics: Business Model for 23andMe
Personal Genomics: Business Model for 23andMePersonal Genomics: Business Model for 23andMe
Personal Genomics: Business Model for 23andMe
 
Intel precision medicine apr 2015
Intel precision medicine apr 2015Intel precision medicine apr 2015
Intel precision medicine apr 2015
 

Similar to Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI

Hadoop as a Platform for Genomics
Hadoop as a Platform for GenomicsHadoop as a Platform for Genomics
Hadoop as a Platform for GenomicsMapR Technologies
 
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen ChinaAllen Day, PhD
 
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...Allen Day, PhD
 
Human Genetics & Big Data [sans Ethics]
Human Genetics & Big Data [sans Ethics]Human Genetics & Big Data [sans Ethics]
Human Genetics & Big Data [sans Ethics]Allen Day, PhD
 
Genome Analysis Pipelines, Big Data Style
Genome Analysis Pipelines, Big Data StyleGenome Analysis Pipelines, Big Data Style
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI

  • 1. © 2014 MapR Technologies. Hadoop for Genomics: What you need to know
  • 2. DNA Sequencing, pre-2004. [Chart over years: CPU (transistors/mm²); HDD (GB/mm²); DNA (bp/$, pre-2004).]
  • 3. DNA Sequencing, 2004 Disruption. [Chart over years: CPU (transistors/mm²); HDD (GB/mm²); DNA (bp/$, pre-2004); DNA (bp/$, post-2004).]
  • 4. DNA Sequencing, 2004 Disruption. [Chart over years: CPU (transistors/mm²); HDD (GB/mm²); DNA (bp/$, pre-2004); DNA (bp/$, post-2004).] A similar disruption occurred for Internet traffic in the mid-1990s.
  • 5. Effect: Many DNA-Based Apps Coming… 2014: US$2B, mostly research, mostly chemical costs. 2020: US$20B, mostly clinical, mostly analytics costs. Source: Macquarie Capital, 2014, "Genomics 2.0: It's just the beginning". [Bar chart: Clinical and Non-Clinical for 2014 and 2020; axis from 0 to 25.]
  • 6. Genomics Value Chain. Steps: Order Test from Clinic; Extract Biosample; BioBank Biosample; DNA Extraction; Sequence Biosample; Secondary Analytics; Tertiary Analytics; Reporting to Clinic. Segments: Academic R&D; Pharma R&D; Clinic Therapy. Axes: increased scale requirement; increased feature set requirement.
  • 7. Genomics Value Chain. Academic R&D: Sequence Biosample OK (e.g. ILMN XTen); Secondary Analytics OK (GATK); Tertiary Analytics not OK (manual). Pharma R&D: Sequence Biosample OK (e.g. ILMN XTen); Secondary Analytics not OK (GATK); Tertiary Analytics missing, manual. Clinic Therapy: Sequence Biosample OK (e.g. ILMN XTen); Secondary Analytics missing; Tertiary Analytics missing. Axes: increased scale requirement; increased feature set requirement. Requirements: data intense, batch, high utilization, low COGS. Requirements: data intense, interactive, easy to integrate, expressive.
  • 8. Target Application: Alleviate / Prevent (Deterministic) Suffering. [Diagram labels: Variant Calling; DNA Sequencer; Reads; Reference Genome; Genotype/Phenotype/Individual Matrix; Cure & Prevent Disease; Medical Records; Patient.]
  • 9. What Does Moore's Law Feel Like? #Dataviz: Lara Croft, 230 => 40,000 polygons (1996-2014). Sources: http://steamcommunity.com/app/203160/discussions/0/846956188647169800/ and http://www.vox.com/2015/2/1/7955921/lara-croft-moores-law
  • 10. Application: Forensics. Sources: http://cgi.uconn.edu/stranger-visions-forensic-art-exhibit/ ; http://snapshot.parabon-nanolabs.com/ ; http://www.nature.com/news/mugshots-built-from-dna-data-1.14899
  • 11. Growth in Resource Capacity
  • 12. Disruption Circa 2000. [Chart: NASDAQ Composite.]
  • 13. What Happened? What did winners do right to survive the .com recession? [Chart: NASDAQ Composite.]
  • 14. Early 1990s: Early eCommerce Vendor Setup. [Diagram labels: Storage; read/write; read/write; Website; Back Office.]
  • 15. Early 1990s: Early eCommerce Vendor Setup. [Diagram labels: Storage; read/write; read/write; Website; Back Office. Annotations: SAN & NAS, Oracle; HPC.]
  • 16. Late 1990s: Workload became too big. [Diagram labels: Storage; read/write; read/write; Website (×4); Back Office (×2).]
  • 17. Survivor Strategy Revealed: Google Publishes. 2003: Google File System (aka GFS), http://research.google.com/archive/gfs.html. 2004: MapReduce, http://research.google.com/archive/mapreduce.html. 2006: BigTable, http://research.google.com/archive/bigtable.html
  • 18. Scale-out with Google FS + MapReduce. [Diagram labels: read/write; read/write; Website (×4); Storage + Compute Cluster; Back Office (×2).]
  • 19. Genomics: Internet Boom Déjà Vu
  • 20. DNA Sequencing, post-2004. [Chart labels: DNA Sequence; NASDAQ Composite.]
  • 21. DNA Sequencing, pre-2004. [Diagram labels: Storage; write-only; read/write; High-Performance Compute Cluster; Coordinator / Edge Node; Sequencer. Annotations: SAN & NAS; HPC.]
  • 22. DNA Sequencing, post-2004. [Diagram labels: Storage; write-only; read/write; High-Performance Compute Cluster; Coordinator / Edge Node; DNA Sequencer Cluster (e.g. Illumina X-Ten).]
  • 23. DNA Sequencing, post-2004. [Diagram labels: Storage; write-only; read/write; High-Performance Compute Cluster; Coordinator / Edge Node; DNA Sequencer Cluster (e.g. Illumina X-Ten). Annotations: HPC bottleneck; sequencer back-pressure.]
  • 24. DNA Sequencing, post-2004. [Diagram labels: Storage; write-only; read/write; High-Performance Compute Cluster; Coordinator / Edge Node; DNA Sequencer Cluster (e.g. Illumina X-Ten). Annotations: HPC bottleneck; sequencer back-pressure.] NAS doesn't look like a great solution anymore…
  • 25. Solution: Implemented 2014 @ Complete Genomics with MapR. [Diagram labels: write-only; DNA Sequencer Cluster (e.g. Illumina X-Ten); Storage + Compute Cluster; Decentralize I/O (×2).]
  • 26. MapR NFS Gateway on Application Servers. Application Server: Linux NFS client, mapr-nfsserver, MapR client API; loopback mount of localhost:/mapr at /mapr; NFS is translated into API calls. MapR Data Platform: mapr-fileserver S1 to S5; data is split into 256 MB chunks (Chunk 1 to Chunk 5), and each chunk number appears three times across the fileservers. MapR inline compression: supported algorithms (per directory) are LZ4, LZF, ZLIB; network traffic is compressed automatically. Network security: MapR RPC full wire encryption for client -> server and server -> server communication.
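
The chunk layout on slide 26 can be sketched roughly as follows. This is a minimal illustration, not MapR's placement logic: the 256 MB chunk size, the server labels S1 to S5, and the three copies per chunk are read off the slide, while the round-robin placement policy and the function names `chunk_count` and `place_chunks` are assumptions made for the example.

```python
CHUNK_SIZE = 256 * 1024 * 1024            # 256 MB, the chunk size shown on the slide
SERVERS = ["S1", "S2", "S3", "S4", "S5"]  # fileserver labels from the slide
REPLICAS = 3                              # each chunk number appears three times on the slide


def chunk_count(file_size, chunk_size=CHUNK_SIZE):
    """Number of chunks needed to hold file_size bytes (ceiling division)."""
    return max(1, -(-file_size // chunk_size))


def place_chunks(file_size, servers=SERVERS, replicas=REPLICAS):
    """Assign every chunk to `replicas` distinct servers.

    Round-robin placement is a hypothetical policy chosen for illustration;
    the slide does not say how the platform picks servers.
    """
    placement = {}
    for index in range(chunk_count(file_size)):
        placement[index + 1] = [
            servers[(index + r) % len(servers)] for r in range(replicas)
        ]
    return placement


layout = place_chunks(5 * CHUNK_SIZE)
for chunk, hosts in layout.items():
    print(f"chunk {chunk}: {hosts}")
```

With five chunks and five servers, this toy policy puts three chunk copies on every server, so reads and writes for one file are spread over all nodes instead of one storage head.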
  • 27. [WHITEBOARD BREAK]
  • 28. [REDACTED]
  • 29. Allows Secondary Analytics to Scale Out. [Diagram labels: Variant Calling; DNA Sequencer; Reads; Reference Genome; Genotype/Phenotype/Individual Matrix; Cure & Prevent Disease; Medical Records; Patient.]
  • 30. Secondary Analytics: Acute Pain Point. Pipeline: FastQ Reads; Aligned Reads; Variants (ADAM + Avocado). Matrix rotation is very I/O intense. Local de novo is best… …only feasible with efficient rotations. Reference: Zerbino & Birney, 2008, "Velvet: Algorithms for de novo short read assembly using de Bruijn graphs".
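
Slide 30 cites de Bruijn graphs for de novo short read assembly. The sketch below shows only the basic graph construction, as a rough illustration of the idea; it is not Velvet's implementation. The function name `de_bruijn`, the choice of k = 4, and the second read are assumptions for the example (the first read is taken from slide 32).

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph: each k-mer adds an edge prefix -> suffix.

    Nodes are (k-1)-mers; repeated edges are kept, so the list length
    records how often an overlap was observed.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# "TTGGAG" appears on slide 32; "TGGAGC" is a hypothetical overlapping read.
reads = ["TTGGAG", "TGGAGC"]
graph = de_bruijn(reads, k=4)
for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

Walking the edges from TTG through AGC spells out TTGGAGC, the sequence both toy reads were drawn from.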
  • 31. Apache Parquet
  • 32. Row-Oriented Format. Columns: ID, Reference, Position, Next ID, Sequence, Quality. Offset markers on the slide: 0, 5, 20, 40, 57.
       read1  chr1  10000    read2  TTGGAG  ABCDEF
       read2  chr1  20000    -      TCGTAA  ABCDEF
       read3  chr2  5000     -      GGGAAC  ABCDEF
       read4  chr3  1000000  read6  CCCTAC  ABCDEF
       read5  chr4  900000   -      TTTAAG  ABCDEF
  • 33. Row-Oriented Splitting
  • 34. Column-Oriented Format.
       ID:         read1   read2   read3   read4    read5
       Reference:  chr1    chr1    chr2    chr3     chr4
       Position:   10000   20000   5000    1000000  900000
       Next ID:    read2   -       -       read6    -
       Sequence:   TTGGAG  TCGTAA  GGGAAC  CCCTAC   TTTAAG
       Quality:    ABCDEF  ABCDEF  ABCDEF  ABCDEF   ABCDEF
  • 35. Column-Oriented Format Partitioning.
       ID:         read1   read2   read3   read4    read5
       Reference:  chr1    chr1    chr2    chr3     chr4
       Position:   10000   20000   5000    1000000  900000
       Next ID:    read2   -       -       read6    -
       Sequence:   TTGGAG  TCGTAA  TTGGAG  GGGAAC   TTTAAG
       Quality:    ABCDEF  ABCDEF  ABCDEF  ABCDEF   ABCDEF
  • 36. Column-Oriented Format Splitting
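
The row-oriented and column-oriented layouts on slides 32 to 36 can be compared with a small sketch. It uses the five reads from slide 32; the helper names `to_columns` and `run_length_encode` are assumptions, and run-length encoding stands in here as one example of an encoding that a columnar layout makes easy, not as a description of what any particular Parquet writer does.

```python
from collections import Counter

COLUMNS = ["id", "reference", "position", "next_id", "sequence", "quality"]

# Row-oriented: one tuple per read, as on slide 32.
ROWS = [
    ("read1", "chr1", 10000, "read2", "TTGGAG", "ABCDEF"),
    ("read2", "chr1", 20000, None, "TCGTAA", "ABCDEF"),
    ("read3", "chr2", 5000, None, "GGGAAC", "ABCDEF"),
    ("read4", "chr3", 1000000, "read6", "CCCTAC", "ABCDEF"),
    ("read5", "chr4", 900000, None, "TTTAAG", "ABCDEF"),
]


def to_columns(rows, names):
    """Pivot rows into one list per column (the layout on slide 34)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1] = (value, encoded[-1][1] + 1)
        else:
            encoded.append((value, 1))
    return encoded


columns = to_columns(ROWS, COLUMNS)

# A per-chromosome count touches only the "reference" column.
reads_per_reference = Counter(columns["reference"])

# Values of one type sit together, so repeats compress well.
encoded_reference = run_length_encode(columns["reference"])
encoded_quality = run_length_encode(columns["quality"])
print(reads_per_reference)
print(encoded_reference)
print(encoded_quality)
```

In the row layout, the same count would have to read every field of every read; in the column layout it scans one list, and the repeated quality string collapses to a single pair.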
  • 37. Apache Parquet
  • 38. Apache Parquet. Source: http://grepalex.com/2014/05/13/parquet-file-format-and-object-model/
  • 39. Allows Secondary Analytics to Scale Out. [Chart labels: GATK / HPC method, flat after chromosome split; Hadoop / Spark method.]
  • 40. Tertiary Analytics
  • 41. Downstream Analytics: GWAS/PheWAS. Pipeline: FastQ Reads; Aligned Reads; Variants (ADAM + Avocado); Function; Phenotypes. Scalable GWAS/PheWAS: "Green Field" territory.
  • 42. Target Application: Alleviate / Prevent Suffering. [Diagram labels: Variant Calling; DNA Sequencer; Reads; Reference Genome; Genotype/Phenotype/Individual Matrix; Cure & Prevent Disease; Medical Records; Patient.]
  • 43. GWAS Overview (Genome-wide Association Study). Which genome features are associated with phenotype X? Source: https://en.wikipedia.org/wiki/Genome-wide_association_study
  • 44. PheWAS Overview (Phenome-wide …). Which phenotypes are associated with genome variant X? Source: http://www.tcpinnovations.com/drugbaron/phewas-the-tool-thats-revolutionizing-drug-development-that-youve-likely-never-heard-of/
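
The two questions on slides 43 and 44 are the same association test scanned along different axes. The sketch below makes that concrete under loud assumptions: the toy cohort, the SNP and disease names, and the use of a plain Pearson chi-square statistic on a 2×2 table are all illustrative choices, and real studies typically use regression models with covariates and multiple-testing correction.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def association(people, variant, phenotype):
    """Score how strongly carrying `variant` goes with having `phenotype`."""
    a = b = c = d = 0
    for variants, phenotypes in people:
        has_variant = variant in variants
        has_phenotype = phenotype in phenotypes
        if has_variant and has_phenotype:
            a += 1
        elif has_variant:
            b += 1
        elif has_phenotype:
            c += 1
        else:
            d += 1
    return chi_square_2x2(a, b, c, d)


def gwas(people, phenotype, variants):
    """Fix one phenotype, scan all variants."""
    return {v: association(people, v, phenotype) for v in variants}


def phewas(people, variant, phenotypes):
    """Fix one variant, scan all phenotypes."""
    return {p: association(people, variant, p) for p in phenotypes}


# Hypothetical cohort: (variants carried, phenotypes recorded) per person.
people = [
    ({"snp1"}, {"diseaseA"}),
    ({"snp1"}, {"diseaseA"}),
    ({"snp1", "snp2"}, {"diseaseA"}),
    ({"snp2"}, set()),
    (set(), set()),
    (set(), {"diseaseB"}),
]

gwas_scores = gwas(people, "diseaseA", ["snp1", "snp2"])
phewas_scores = phewas(people, "snp1", ["diseaseA", "diseaseB"])
print(gwas_scores)
print(phewas_scores)
```

In this toy cohort, snp1 tracks diseaseA perfectly while snp2 is unrelated to it, so the GWAS scan ranks snp1 first and the PheWAS scan for snp1 ranks diseaseA first.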
  • 45. Genome × Phenome Analysis. For a given population, given SNP 𝛿, and given phenotype ϕ: count the number of occurrences as the value of the matrix. [Matrix axes: 𝛿1, 𝛿3, 𝛿5 (sparse, billion+ genotypes); ϕ1, ϕ3, ϕ5 (sparse, billion+ phenotypes).]
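
The counting rule on slide 45 can be sketched as a sparse matrix keyed by (SNP, phenotype) pairs. This is a minimal single-machine illustration; the cohort, the names, and the function `genome_phenome_counts` are assumptions, and at the scale the slide mentions the same counting would have to run as a distributed job.

```python
from collections import Counter


def genome_phenome_counts(people):
    """Count co-occurrences of each (SNP, phenotype) pair across a population.

    Only pairs that actually occur are stored, which is what keeps a
    billion-by-billion matrix tractable when it is sparse.
    """
    counts = Counter()
    for variants, phenotypes in people:
        for variant in variants:
            for phenotype in phenotypes:
                counts[(variant, phenotype)] += 1
    return counts


# Hypothetical cohort: (variants carried, phenotypes recorded) per person.
people = [
    ({"snp1"}, {"diseaseA"}),
    ({"snp1"}, {"diseaseA"}),
    ({"snp1", "snp2"}, {"diseaseA"}),
    ({"snp2"}, set()),
    (set(), {"diseaseB"}),
]

counts = genome_phenome_counts(people)
print(dict(counts))
```

Looking up a pair that never occurred, such as `counts[("snp2", "diseaseB")]`, returns 0 without storing an entry.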
  • 46. Disease Cause via Genome × Phenome Matrix Factorization. Row eigenvectors of X represent sets of related phenotypes (by SNP). Column eigenvectors of Y represent sets of related SNPs (by phenotype). [Diagram labels: 𝛿1, 𝛿3, 𝛿5; ϕ1, ϕ3, ϕ5; Principal Column Vector; Archetype Genotypes; Archetype Phenotypes; Principal Row Vector.] A sparse matrix package is actively developed in the Spark community.
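
One way to read the factorization on slide 46 is as a search for the leading pair of singular vectors of the count matrix, one over SNPs and one over phenotypes. That reading is an interpretation, since the slide speaks of eigenvectors. The sketch below uses plain power iteration on a sparse dictionary; the function names, the toy counts, and the iteration count are assumptions, and it is not the Spark package the slide refers to.

```python
import math


def normalize(vector):
    """Scale a {key: value} vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector.values()))
    if norm == 0.0:
        return vector
    return {key: x / norm for key, x in vector.items()}


def leading_singular_pair(counts, iterations=100):
    """Power iteration on a sparse {(row, col): value} matrix.

    Returns (row_vector, col_vector): the dominant group of rows (SNPs)
    and the dominant group of columns (phenotypes) that vary together.
    """
    rows = sorted({r for r, _ in counts})
    cols = sorted({c for _, c in counts})
    v = normalize({c: 1.0 for c in cols})
    u = {r: 0.0 for r in rows}
    for _ in range(iterations):
        u = {r: 0.0 for r in rows}
        for (r, c), x in counts.items():
            u[r] += x * v[c]          # u = A v
        u = normalize(u)
        v = {c: 0.0 for c in cols}
        for (r, c), x in counts.items():
            v[c] += x * u[r]          # v = A^T u
        v = normalize(v)
    return u, v


# Hypothetical counts: snpA and snpB co-occur with d1 and d2; snpC only with d3.
counts = {
    ("snpA", "d1"): 4, ("snpA", "d2"): 4,
    ("snpB", "d1"): 4, ("snpB", "d2"): 4,
    ("snpC", "d3"): 1,
}

snp_vector, phenotype_vector = leading_singular_pair(counts)
print(snp_vector)
print(phenotype_vector)
```

On this toy input the leading vectors put all their weight on snpA and snpB and on d1 and d2, which is the kind of grouping the slide labels archetype genotypes and archetype phenotypes.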
  • 47. © 2014 MapR Technologies 47 Generalized Approach: Genome × Phenome Tensor • Maintain individual identity • Aggregating individuals gives up statistical power • Leverage pedigrees – Individuals are not independent observations Variants Phenotypes Variants Phenotypes
  • 48. © 2014 MapR Technologies 48 Scalable Variant Store => Root out Disease Causes Model P ~ F(G) Fortunately, this has already been done… Genotypes Med Record Phenotypes, e.g. disease risk, drug response
  • 49. © 2014 MapR Technologies 49 Largest Biometric Database in the World PEOPLE 1.2B PEOPLE
  • 50. © 2014 MapR Technologies 50 Why Create Aadhaar? • India: 1.2 billion residents – 640,000 villages, ~60% lives under $2/day – ~75% literacy, <3% pay income tax, <20% have bank accounts – ~800 million mobile, ~200-300 million migrant workers • Govt. spends about $25-40 billion on direct subsidies – Residents have no standard identity document – Most programs plagued with ghost and multiple identities causing leakage of 30-40% Standardize identity => Stop leakage
  • 51. © 2014 MapR Technologies 51 Aadhaar Biometric Capture & Index Raw Digital Fingerprint
  • 52. © 2014 MapR Technologies 52 Aadhaar Biometric ID Creation F(x): unique features G(x): uncommon features H(x): other features • 900MM people loaded in 4 years • In production – 1MM registrations/day – 200+ trillion lookups/day • All built on MapR-DB (HBase) Low Entropy + Unique Low Entropy + Infrequent
  • 53. © 2014 MapR Technologies 53 Consistent, Low Latency --- M7 Read Latency --- Others Read Latency
  • 54. How Does this Relate to Genomics? • F⁻¹(x): common features • F(x): unique features • G(x): uncommon features • H(x): other features • Same data shape and size – Aadhaar: 1B humans, 5MB minutiae – Genome: 7B humans, ~3M variants
  • 55. How Does this Relate to Genomics? • F⁻¹(x): common features • F(x): unique features • G(x): uncommon features • H(x): other features • Phenotype: healthy or sick? • Phenotype Partition => Low Entropy
  • 56. Takeaway 1: Don’t reinvent the wheel • Individuals and fingerprint minutiae: find rare minutiae to uniquely identify • ≈ • Medical records and genetic variants: find shared variants to get disease root cause
  • 57. Takeaway 2: Evolution, not Revolution [Figure panels: DNA Sequence; NASDAQ Composite]
  • 58. Thank You @allenday // @mapr Now a few slides about MapR’s product… …and proposed next actions
  • 59. “Quick Start” Package. Engagement includes: 1. Identification of data sources, transformations and reporting engines 2. Access and use of the solution template including source code 3. Training on customizing the solution template to the organization’s requirements 4. Deployment architecture document that enables a production deployment plan for the specific solution. Components: Solution Template, Knowledge Transfer, Deployment Architecture
  • 60. “Quick Start” 1 – Resequencing with Hadoop • Reduces storage hardware requirements • Accelerates data processing time • Minimal impact to existing data pipelines “Quick Start” 2 – Variant Analysis with NoSQL • Present data for exploration • Operationalize complex workflows • Web-scale performance
  • 61. Genomics Value Chain (Sequence Biosample / Secondary Analytics / Tertiary Analytics) • Academic R&D: OK, e.g. ILMN XTen / OK (GATK) / Not OK • Pharma R&D: OK, e.g. ILMN XTen / Not OK / Missing • Clinic Therapy: OK, e.g. ILMN XTen / Missing / Missing
  • 62. Genomics Value Chain (Sequence Biosample / Secondary Analytics / Tertiary Analytics) • Academic R&D: OK, e.g. ILMN XTen / OK (GATK) / Not OK • Pharma R&D: OK, e.g. ILMN XTen / Not OK / Missing • Clinic Therapy: OK, e.g. ILMN XTen / Missing / Missing [Figure annotations: Addressed by Quick Start 1; Addressed by Quick Start 2]
  • 63. BONUS ROUND
  • 64. Genealogy Company. Slides credit: Bill Yetman, Hadoop Summit 2014 http://slidesha.re/1vRh3kY
  • 65. GERMLINE is… • …an algorithm that finds hidden relationships within a pool of DNA • …the reference implementation of that algorithm, written in C++ • You can find it here: http://www1.cs.columbia.edu/~gusev/germline/
  • 66. Projected GERMLINE run times (in hours) • 700 hours = 29+ days • EXPONENTIAL COMPLEXITY [Figure: hours (0 to 700) vs. samples (2,500 to 122,500); series: GERMLINE run times, Projected GERMLINE run times]
  • 67. GERMLINE: What’s the Problem? • GERMLINE (the implementation) was not meant to be used in an industrial setting – Stateless, single threaded, prone to swapping (heavy memory usage) – GERMLINE performs poorly on large data sets • Our metrics predicted exactly where the process would slow to a crawl • Put simply: GERMLINE couldn't scale
  • 68. Run times for matching (in hours) [Figure: hours (0 to 180) vs. samples; series: GERMLINE run times, Jermline run times, Projected GERMLINE run times; annotations: EXPONENTIAL, LINEAR, HBase Refactor]
  • 69. • Paper submitted describing the implementation • Releasing as an Open Source project soon • [HBase Schema/Algorithm Slides]
  • 70. Further Growth & Optimization
  • 71. Underdog (Strand Phasing) performance – Went from 12 hours to process 1,000 samples to under 25 minutes with a MapReduce implementation – With improved accuracy! – Underdog replaces Beagle [Figure: Total Run Size and Total Beagle-Underdog Duration; axis ticks 0 to 80,000]
  • 72. Pipeline steps and incremental change… – Incremental change over time – Supporting the business in a “just in time” Agile way [Figure: duration of each pipeline step per run; steps: Beagle-Underdog Phasing, Pipeline Finalize, Relationship Processing, Germline-Jermline Results Processing, Germline-Jermline Processing, Beagle Post Phasing, Admixture, Plink Prep, Pipeline Initialization; annotations: Jermline replaces Germline, Ethnicity V2 Release, Underdog Replaces Beagle, AdMixture on Hadoop]
  • 73. …while the business continues to grow rapidly [Figure: DNA Database Size (# of processed samples, 0 to 450,000), Jan-12 to Apr-14]
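
Slides 45 and 46 describe counting SNP/phenotype co-occurrences into a sparse matrix and then factorizing it to find archetype genotypes and phenotypes. The following is a minimal sketch of that idea in standard-library Python, not the speaker's implementation. The function names `count_matrix`, `normalize` and `principal_vectors`, the dict-of-keys storage, and the toy observations are all assumptions made for illustration. Power iteration stands in for whatever factorization the talk intended (the slide points to the Spark sparse matrix package), and the slide's "eigenvectors" are read here as the leading singular vectors of the count matrix.

```python
from collections import Counter
from math import sqrt


def count_matrix(observations):
    """Sparse SNP x phenotype count matrix (hypothetical helper).

    observations: iterable of (snp_id, phenotype_id) pairs, one pair per
    individual who carries the SNP and shows the phenotype.
    Only non-zero cells are stored; a missing key reads as 0.
    """
    return Counter(observations)


def normalize(vec):
    """Scale a dict-vector to unit Euclidean length, in place."""
    norm = sqrt(sum(x * x for x in vec.values()))
    if norm > 0:
        for key in vec:
            vec[key] /= norm


def principal_vectors(matrix, iterations=100):
    """Leading singular vector pair of a sparse matrix via power iteration.

    Returns (u, v): u weights SNPs, v weights phenotypes.
    """
    snps = sorted({s for s, _ in matrix})
    phenos = sorted({p for _, p in matrix})
    u = dict.fromkeys(snps, 0.0)
    v = dict.fromkeys(phenos, 1.0)
    for _ in range(iterations):
        # u = M v, touching only the stored (non-zero) cells
        u = dict.fromkeys(snps, 0.0)
        for (s, p), count in matrix.items():
            u[s] += count * v[p]
        normalize(u)
        # v = M^T u
        v = dict.fromkeys(phenos, 0.0)
        for (s, p), count in matrix.items():
            v[p] += count * u[s]
        normalize(v)
    return u, v


# Toy data, made up for illustration: SNPs d1 and d3 co-occur with
# phenotypes p1 and p3; SNP d5 co-occurs only with phenotype p5.
observations = (
    [("d1", "p1")] * 4 + [("d1", "p3")] * 3
    + [("d3", "p1")] * 3 + [("d3", "p3")] * 4
    + [("d5", "p5")] * 2
)
matrix = count_matrix(observations)
u, v = principal_vectors(matrix)
```

In this toy case the dominant block is {d1, d3} × {p1, p3}, so `u` puts its weight on d1 and d3 and `v` on p1 and p3, while d5 and p5 fall to near zero. At the billion-row scale the slides mention, a distributed sparse matrix library would replace the in-memory dictionary.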

Editor's Notes

  1. clinical
  2. 49
  3. Increase GDP by 2%
  4. BOOM LSH
  5. This chart shows that MapR-DB (the database in the MapR Enterprise Database Edition, formerly known as M7), shown in blue, consistently reads data quickly with no spikes. Other distributions suffer from periodic “housekeeping” tasks like compactions (defragmentation) and garbage collection, leading to sharp spikes in read delays.
  6. Graph of each step in the pipeline for every run. This graph shows how important it is to measure everything. Some steps have been greatly reduced or eliminated. Light blue is the matching step. You can see it going quadratic and then the change when ‘J’ Jermline was released.