SlideShare a Scribd company logo
1 of 16
© 2015 MapR Technologies 1© 2015 MapR Technologies
© 2015 MapR Technologies 2
Let’s kick of a build while we talk…
• Follow the instructions at
• https://github.com/allenday/spark-genome-alignment-demo
• While that’s cooking…
• What’s this all about?
© 2015 MapR Technologies 3
Alignment
Reference
Sequences
Aligned
Reads Downstream
Applications…
DNA Reads
© 2015 MapR Technologies 4
Alignment
Reference
Sequences
DNA Reads
Aligned
Reads Downstream
Applications…
Align()
© 2015 MapR Technologies 5
Possible Align() Outcomes
Unaligned
DNA Reads
Reference
Sequences
Single
Location
Reads
Multiple
Location
Reads
Unlocatable
Reads
Align()
© 2015 MapR Technologies 6
Many-to-Many Relationship Between Reads and Locations
• Read1
• Read2
• Read3
• Read4
• NULL
• LocationA
• LocationB
• LocationC
• LocationD
• LocationA
• NULL
• LocationE
© 2015 MapR Technologies 7
Parallelizing Alignment
Unaligned
DNA Reads
Locations
Locations
Locations
Part1Part2Part3
Aligned
DNA Reads
Align() Concat() Sort() Etc…Split()
© 2015 MapR Technologies 8
Using HPC+SAN has Bottlenecks (SGE, PBS, Condor, Etc)
Part1Part2Part3
Volume Read
Bottleneck
Volume Write
Bottleneck
Read & Write
Bottleneck
© 2015 MapR Technologies 9
Using Spark Eliminates Bottlenecks
Align() Concat() Sort()Split()
© 2015 MapR Technologies 10
How to Do it? FastQ => BAM
Prepare Environment
• Get Input Reference Sequences in FastA format (one time step)
• Index Reference Sequences with Bowtie (one time step)
Get Input Data
• Get Input Unaligned Reads in FastQ format (per input data set)
Do the Work
• Convert Reads to ADAM Format with Spark (per input data set)
• Use Spark Pipe to Align the Reads (per input data fragment)
• Write the Alignments to SAM or BAM format (per input data,
optional)
© 2015 MapR Technologies 11
Let’s look at some code…
• Let’s check on the build we kicked off earlier.
• And the driver script:
• cat $DEMO/bin/bowtie_pipe_single.scala
• I said before that it’s possible to be 100% compatible with legacy
tools by writing out to SAM or BAM.
• You might not want to do this…
© 2015 MapR Technologies 12
Benefits of Keeping Data in Spark / ADAM
• ADAM file sizes are about 20% smaller than BAM.
• ADAM files are faster for common operations (filters, range
queries)
• ADAM files sort more quickly than `samtools sort …`. This can
be a huge time savings
© 2015 MapR Technologies 13
© 2015 MapR Technologies 14
• Some other commands built-in, take a look:
• $ADAM_HOME/bin/adam-submit
• $ADAM_HOME/bin/adam-submit count_kmers
$DEMO/build/data/reads.sam /tmp/count_kmers 3
• head /tmp/count_kmers/part-00000
• All of this is easier without leaving the Spark shell
• And it’s faster, too…
© 2015 MapR Technologies 15
And it’s faster, too… Using Spark Eliminates Bottlenecks
Align() Concat() Sort()Split()
© 2015 MapR Technologies 16
Thanks! Questions?
@allenday, @mapr
aday@mapr.com
linkedin.com/in/allenday

More Related Content

What's hot

SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...
SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...
SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...
Databricks
 

What's hot (20)

May 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflowMay 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
 
#GeodeSummit: Easy Ways to Become a Contributor to Apache Geode
#GeodeSummit: Easy Ways to Become a Contributor to Apache Geode#GeodeSummit: Easy Ways to Become a Contributor to Apache Geode
#GeodeSummit: Easy Ways to Become a Contributor to Apache Geode
 
Incredible ODI tips to work with Hyperion tools that you ever wanted to know
Incredible ODI tips to work with Hyperion tools that you ever wanted to knowIncredible ODI tips to work with Hyperion tools that you ever wanted to know
Incredible ODI tips to work with Hyperion tools that you ever wanted to know
 
Breaking data
Breaking dataBreaking data
Breaking data
 
Data science lifecycle with Apache Zeppelin
Data science lifecycle with Apache ZeppelinData science lifecycle with Apache Zeppelin
Data science lifecycle with Apache Zeppelin
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 
Dagster @ R&S MNT
Dagster @ R&S MNTDagster @ R&S MNT
Dagster @ R&S MNT
 
Spark Summit EU talk by Yiannis Gkoufas
Spark Summit EU talk by Yiannis GkoufasSpark Summit EU talk by Yiannis Gkoufas
Spark Summit EU talk by Yiannis Gkoufas
 
Wrapper formulas saltconf 2016
Wrapper formulas saltconf 2016Wrapper formulas saltconf 2016
Wrapper formulas saltconf 2016
 
SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...
SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...
SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...
 
Components of openEHR based EHRs
Components of openEHR based EHRsComponents of openEHR based EHRs
Components of openEHR based EHRs
 
Interactive Analytics using Apache Spark
Interactive Analytics using Apache SparkInteractive Analytics using Apache Spark
Interactive Analytics using Apache Spark
 
Apache edgent
Apache edgentApache edgent
Apache edgent
 
SparkR + Zeppelin
SparkR + ZeppelinSparkR + Zeppelin
SparkR + Zeppelin
 
When the Cloud is a Rockin: High Availability in Apache CloudStack
When the Cloud is a Rockin: High Availability in Apache CloudStackWhen the Cloud is a Rockin: High Availability in Apache CloudStack
When the Cloud is a Rockin: High Availability in Apache CloudStack
 
AtlasCamp 2015: Get your add-on in shape for Data Center
AtlasCamp 2015: Get your add-on in shape for Data CenterAtlasCamp 2015: Get your add-on in shape for Data Center
AtlasCamp 2015: Get your add-on in shape for Data Center
 
Getting Ready to Use Redis with Apache Spark with Tague Griffith
Getting Ready to Use Redis with Apache Spark with Tague GriffithGetting Ready to Use Redis with Apache Spark with Tague Griffith
Getting Ready to Use Redis with Apache Spark with Tague Griffith
 
Fast and Reliable Apache Spark SQL Engine
Fast and Reliable Apache Spark SQL EngineFast and Reliable Apache Spark SQL Engine
Fast and Reliable Apache Spark SQL Engine
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streaming
 
Dr. Elephant: Achieving Quicker, Easier, and Cost-Effective Big Data Analytic...
Dr. Elephant: Achieving Quicker, Easier, and Cost-Effective Big Data Analytic...Dr. Elephant: Achieving Quicker, Easier, and Cost-Effective Big Data Analytic...
Dr. Elephant: Achieving Quicker, Easier, and Cost-Effective Big Data Analytic...
 

Viewers also liked

ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014
fnothaft
 

Viewers also liked (7)

Scaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMScaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAM
 
ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014
 
Genomics isn't Special
Genomics isn't SpecialGenomics isn't Special
Genomics isn't Special
 
Genomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive BiologyGenomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive Biology
 
Strata Big Data Science Talk on ADAM
Strata Big Data Science Talk on ADAMStrata Big Data Science Talk on ADAM
Strata Big Data Science Talk on ADAM
 
Genome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMGenome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAM
 
Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and Scala
 

Similar to Free Code Friday: Genome Resequencing with Spark, Part 1

Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream Data
DataWorks Summit
 
IBMHadoopofferingTechline-Systems2015
IBMHadoopofferingTechline-Systems2015IBMHadoopofferingTechline-Systems2015
IBMHadoopofferingTechline-Systems2015
Daniela Zuppini
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
DataStax Academy
 

Similar to Free Code Friday: Genome Resequencing with Spark, Part 1 (20)

Large Scale Data Analytics with Spark and Cassandra on the DSE Platform
Large Scale Data Analytics with Spark and Cassandra on the DSE PlatformLarge Scale Data Analytics with Spark and Cassandra on the DSE Platform
Large Scale Data Analytics with Spark and Cassandra on the DSE Platform
 
20151015 zagreb spark_notebooks
20151015 zagreb spark_notebooks20151015 zagreb spark_notebooks
20151015 zagreb spark_notebooks
 
Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19
 
Take a peek at Dell's smart EPM global environment
Take a peek at Dell's smart EPM global environmentTake a peek at Dell's smart EPM global environment
Take a peek at Dell's smart EPM global environment
 
Rust is for "Big Data"
Rust is for "Big Data"Rust is for "Big Data"
Rust is for "Big Data"
 
Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream Data
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended Cut
 
IBMHadoopofferingTechline-Systems2015
IBMHadoopofferingTechline-Systems2015IBMHadoopofferingTechline-Systems2015
IBMHadoopofferingTechline-Systems2015
 
20160524 ibm fast data meetup
20160524 ibm fast data meetup20160524 ibm fast data meetup
20160524 ibm fast data meetup
 
Introduction to MySQL Cluster
Introduction to MySQL ClusterIntroduction to MySQL Cluster
Introduction to MySQL Cluster
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best Practices
 
MySQL Cluster
MySQL ClusterMySQL Cluster
MySQL Cluster
 
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
 
Stream Processing Everywhere - What to use?
Stream Processing Everywhere - What to use?Stream Processing Everywhere - What to use?
Stream Processing Everywhere - What to use?
 
Monitoring Attack Surface to Secure DevOps Pipelines
Monitoring Attack Surface to Secure DevOps PipelinesMonitoring Attack Surface to Secure DevOps Pipelines
Monitoring Attack Surface to Secure DevOps Pipelines
 
12 Factor Scala
12 Factor Scala12 Factor Scala
12 Factor Scala
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different Rules
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
 

More from MapR Technologies

More from MapR Technologies (20)

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscape
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & Evaluation
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your Data
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning Logistics
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model Management
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIs
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn Prediction
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data Platform
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in Healthcare
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and Analytics
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQL
 

Recently uploaded

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 

Recently uploaded (20)

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 

Free Code Friday: Genome Resequencing with Spark, Part 1

  • 1. © 2015 MapR Technologies 1© 2015 MapR Technologies
  • 2. © 2015 MapR Technologies 2 Let’s kick of a build while we talk… • Follow the instructions at • https://github.com/allenday/spark-genome-alignment-demo • While that’s cooking… • What’s this all about?
  • 3. © 2015 MapR Technologies 3 Alignment Reference Sequences Aligned Reads Downstream Applications… DNA Reads
  • 4. © 2015 MapR Technologies 4 Alignment Reference Sequences DNA Reads Aligned Reads Downstream Applications… Align()
  • 5. © 2015 MapR Technologies 5 Possible Align() Outcomes Unaligned DNA Reads Reference Sequences Single Location Reads Multiple Location Reads Unlocatable Reads Align()
  • 6. © 2015 MapR Technologies 6 Many-to-Many Relationship Between Reads and Locations • Read1 • Read2 • Read3 • Read4 • NULL • LocationA • LocationB • LocationC • LocationD • LocationA • NULL • LocationE
  • 7. © 2015 MapR Technologies 7 Parallelizing Alignment Unaligned DNA Reads Locations Locations Locations Part1Part2Part3 Aligned DNA Reads Align() Concat() Sort() Etc…Split()
  • 8. © 2015 MapR Technologies 8 Using HPC+SAN has Bottlenecks (SGE, PBS, Condor, Etc) Part1Part2Part3 Volume Read Bottleneck Volume Write Bottleneck Read & Write Bottleneck
  • 9. © 2015 MapR Technologies 9 Using Spark Eliminates Bottlenecks Align() Concat() Sort()Split()
  • 10. © 2015 MapR Technologies 10 How to Do it? FastQ => BAM Prepare Environment • Get Input Reference Sequences in FastA format (one time step) • Index Reference Sequences with Bowtie (one time step) Get Input Data • Get Input Unaligned Reads in FastQ format (per input data set) Do the Work • Convert Reads to ADAM Format with Spark (per input data set) • Use Spark Pipe to Align the Reads (per input data fragment) • Write the Alignments to SAM or BAM format (per input data, optional)
  • 11. © 2015 MapR Technologies 11 Let’s look at some code… • Let’s check on the build we kicked off earlier. • And the driver script: • cat $DEMO/bin/bowtie_pipe_single.scala • I said before that it’s possible to be 100% compatible with legacy tools by writing out to SAM or BAM. • You might not want to do this…
  • 12. © 2015 MapR Technologies 12 Benefits of Keeping Data in Spark / ADAM • ADAM file sizes are about 20% smaller than BAM. • ADAM files are faster for common operations (filters, range queries) • ADAM files sort more quickly than `samtools sort …`. This can be a huge time savings
  • 13. © 2015 MapR Technologies 13
  • 14. © 2015 MapR Technologies 14 • Some other commands built-in, take a look: • $ADAM_HOME/bin/adam-submit • $ADAM_HOME/bin/adam-submit count_kmers $DEMO/build/data/reads.sam /tmp/count_kmers 3 • head /tmp/count_kmers/part-00000 • All of this is easier without leaving the Spark shell • And it’s faster, too…
  • 15. © 2015 MapR Technologies 15 And it’s faster, too… Using Spark Eliminates Bottlenecks Align() Concat() Sort()Split()
  • 16. © 2015 MapR Technologies 16 Thanks! Questions? @allenday, @mapr aday@mapr.com linkedin.com/in/allenday