SlideShare a Scribd company logo
1 of 18
2 October 2017
Scalable and reproducible
workflows with Pachyderm
Jon Ander Novella de Miguel
Pharmaceutical Bioinformatics research group
Uppsala, Sweden
2 October 2017
APPROACHESTO TACKLE BIOLOGICAL COMPUTATIONS
Data growth in biomedicine Scalable methods for Big Data Analytics
enabled by Cloud Computing
2 October 2017
• Mass Spectrometry can offer high metabolite coverage
METABOLITE DATA
2 October 2017
• Workflow definitions
• Isolation of scientific software
• Reproducibility
• Parallelisation
CHALLENGES
2 October 2017
• Stitching many different software tools is
tedious
• Time-intensive and parameter heavy steps
involved
• Examples:Taverna, Nextflow, SciPipe
WORKFLOW DEFINITIONS
2 October 2017
• Workflow definitions
• Isolation of scientific software
• Reproducibility
• Parallelisation
CHALLENGES
2 October 2017
• Containers wrap an app with its own
operating environment
• Portability and environmental consistency
• Useful in science
• Is Vagrant already old-fashioned?
ISOLATION OF SCIENTIFIC SOFTWARE
2 October 2017
• Deployment, scaling and management of containers in a
cluster
• Kubernetes: big and active community
• Automatic healing and machine decoupling
[1] https://www.kubernetes.io
[1]
CONTAINER ORCHESTRATION TOOLS
2 October 2017
• Workflow definitions
• Isolation of scientific software
• Reproducibility
• Parallelisation
CHALLENGES
2 October 2017
• Workflow-system based on Kubernetes
• A distributed data processing tool based
on containers
• Enables reproducibility, provenance,
parallelization and isolation
“You can focus on being productive, while
Pachyderm will scale up and analyze for you”
[2] https://www.pachyderm.io
[2]
WHAT IS PACHYDERM?
2 October 2017
The main primitives are:
• Repositories: versioned collections of data
• Commits: new data
• Files: data storage primitives
[3] https://www.pachyderm.io/pfs.html
[3]
PFS offers version control for data:
PACHYDERM FILE SYSTEM (PFS)
2 October 2017
• Tasks executed by Kubernetes pods
• Parallelization: spreading data
• Incrementality and glob patterns
• Directed Acyclic Graph
[4] https://www.pachyderm.io/pps.html
[4]
PACHYDERM PIPELINE SYSTEM (PPS)
2 October 2017
• Reproducing a metabolomics workflow with Pachyderm
• Learn how to distribute processing using containers
• Feeling the power of data versioning
• Learn how we can use containers in a cloud-like distributed processing
environment
GOALS OF THE DAY
2 October 2017
• OpenMS: software for metabolite and proteome data
analysis and management
• Detection of mass traces and their aggregation into
features
• Four pre-processing steps
AN OPENMS BASED WORKFLOW
X
CSV
File Filter
Feature Finder
Feature Linker
Text Exporter
2 October 2017
• Kubernetes cluster backed by a Vagrant box (VM)
• https://github.com/CARAMBA-Clinic/COST-
CHARME/blob/master/README.md
• Execution of workflow-engine in Cloud-Like environment via
Jupyter
• Downstream analysis on RStudio
METHODS
2 October 2017
• Four interconnected tasks/processes
• Intermediate data handled by repositories
• Results stored also in a repository
WORKFLOW IN PACHYDERM
2 October 2017
• Thanks to Pachyderm, we can enable a reproducible and scalable data processing
platform
• Can you write your own container and distribute its computation?
REPRODUCIBLE RESULT
2 October 2017
THANKS! ANY QUESTIONS?
“Provenance and reproducibility enable a rigorous and
efficient data science”
Jon Ander Novella de Miguel
Department of Pharmaceutical Biosciences
Jon.Novella@farmbio.uu.se

More Related Content

What's hot

Monitoring and scaling postgres at datadog
Monitoring and scaling postgres at datadogMonitoring and scaling postgres at datadog
Monitoring and scaling postgres at datadogSeth Rosenblum
 
Story of migrating event pipeline from batch to streaming
Story of migrating event pipeline from batch to streamingStory of migrating event pipeline from batch to streaming
Story of migrating event pipeline from batch to streaminglohitvijayarenu
 
Twitter's Data Replicator for Google Cloud Storage
Twitter's Data Replicator for Google Cloud StorageTwitter's Data Replicator for Google Cloud Storage
Twitter's Data Replicator for Google Cloud Storagelohitvijayarenu
 
Analytics over Terabytes of Data at Twitter
Analytics over Terabytes of Data at TwitterAnalytics over Terabytes of Data at Twitter
Analytics over Terabytes of Data at TwitterImply
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDogRedis Labs
 
Exploring Alluxio for Daily Tasks at Robinhood
Exploring Alluxio for Daily Tasks at RobinhoodExploring Alluxio for Daily Tasks at Robinhood
Exploring Alluxio for Daily Tasks at RobinhoodAlluxio, Inc.
 
Strip your TEXT fields - Exeter Web Feb/2016
Strip your TEXT fields - Exeter Web Feb/2016Strip your TEXT fields - Exeter Web Feb/2016
Strip your TEXT fields - Exeter Web Feb/2016Gabriela Ferrara
 
What does Netflix, NTT and Rubicon Project have in common? Apache Druid.
What does Netflix, NTT and Rubicon Project have in common? Apache Druid.What does Netflix, NTT and Rubicon Project have in common? Apache Druid.
What does Netflix, NTT and Rubicon Project have in common? Apache Druid.Rommel Garcia
 
How @twitterhadoop chose google cloud
How @twitterhadoop chose google cloudHow @twitterhadoop chose google cloud
How @twitterhadoop chose google cloudlohitvijayarenu
 
Meetup Kubernetes Rhein-Necker
Meetup Kubernetes Rhein-NeckerMeetup Kubernetes Rhein-Necker
Meetup Kubernetes Rhein-Neckerinovex GmbH
 
MongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and VisualizationMongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and VisualizationMongoDB
 
Mongo presentation conf
Mongo presentation confMongo presentation conf
Mongo presentation confShridhar Joshi
 
Realtime Analytics with Druid
Realtime Analytics with DruidRealtime Analytics with Druid
Realtime Analytics with DruidSeungWoo Han
 
Why Your MongoDB Needs Redis
Why Your MongoDB Needs RedisWhy Your MongoDB Needs Redis
Why Your MongoDB Needs RedisItamar Haber
 
Enabling Presto Caching at Uber with Alluxio
Enabling Presto Caching at Uber with AlluxioEnabling Presto Caching at Uber with Alluxio
Enabling Presto Caching at Uber with AlluxioAlluxio, Inc.
 
What’s new in Alluxio 2: from seamless operations to structured data management
What’s new in Alluxio 2: from seamless operations to structured data managementWhat’s new in Alluxio 2: from seamless operations to structured data management
What’s new in Alluxio 2: from seamless operations to structured data managementAlluxio, Inc.
 

What's hot (20)

Monitoring and scaling postgres at datadog
Monitoring and scaling postgres at datadogMonitoring and scaling postgres at datadog
Monitoring and scaling postgres at datadog
 
Story of migrating event pipeline from batch to streaming
Story of migrating event pipeline from batch to streamingStory of migrating event pipeline from batch to streaming
Story of migrating event pipeline from batch to streaming
 
Twitter's Data Replicator for Google Cloud Storage
Twitter's Data Replicator for Google Cloud StorageTwitter's Data Replicator for Google Cloud Storage
Twitter's Data Replicator for Google Cloud Storage
 
RubiX
RubiXRubiX
RubiX
 
Analytics over Terabytes of Data at Twitter
Analytics over Terabytes of Data at TwitterAnalytics over Terabytes of Data at Twitter
Analytics over Terabytes of Data at Twitter
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 
Exploring Alluxio for Daily Tasks at Robinhood
Exploring Alluxio for Daily Tasks at RobinhoodExploring Alluxio for Daily Tasks at Robinhood
Exploring Alluxio for Daily Tasks at Robinhood
 
Strip your TEXT fields - Exeter Web Feb/2016
Strip your TEXT fields - Exeter Web Feb/2016Strip your TEXT fields - Exeter Web Feb/2016
Strip your TEXT fields - Exeter Web Feb/2016
 
What does Netflix, NTT and Rubicon Project have in common? Apache Druid.
What does Netflix, NTT and Rubicon Project have in common? Apache Druid.What does Netflix, NTT and Rubicon Project have in common? Apache Druid.
What does Netflix, NTT and Rubicon Project have in common? Apache Druid.
 
How @twitterhadoop chose google cloud
How @twitterhadoop chose google cloudHow @twitterhadoop chose google cloud
How @twitterhadoop chose google cloud
 
Meetup Kubernetes Rhein-Necker
Meetup Kubernetes Rhein-NeckerMeetup Kubernetes Rhein-Necker
Meetup Kubernetes Rhein-Necker
 
Strip your TEXT fields
Strip your TEXT fieldsStrip your TEXT fields
Strip your TEXT fields
 
MongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and VisualizationMongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and Visualization
 
Mongo presentation conf
Mongo presentation confMongo presentation conf
Mongo presentation conf
 
Advanced git
Advanced gitAdvanced git
Advanced git
 
Realtime Analytics with Druid
Realtime Analytics with DruidRealtime Analytics with Druid
Realtime Analytics with Druid
 
Why Your MongoDB Needs Redis
Why Your MongoDB Needs RedisWhy Your MongoDB Needs Redis
Why Your MongoDB Needs Redis
 
Enabling Presto Caching at Uber with Alluxio
Enabling Presto Caching at Uber with AlluxioEnabling Presto Caching at Uber with Alluxio
Enabling Presto Caching at Uber with Alluxio
 
What’s new in Alluxio 2: from seamless operations to structured data management
What’s new in Alluxio 2: from seamless operations to structured data managementWhat’s new in Alluxio 2: from seamless operations to structured data management
What’s new in Alluxio 2: from seamless operations to structured data management
 

Similar to Scalable and reproducible workflows with Pachyderm

The New CyREST: Economical Delivery of Complex, Reproducible Network Biology ...
The New CyREST: Economical Delivery of Complex, Reproducible Network Biology ...The New CyREST: Economical Delivery of Complex, Reproducible Network Biology ...
The New CyREST: Economical Delivery of Complex, Reproducible Network Biology ...bdemchak
 
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...Juan Antonio Vizcaino
 
Open Science Data Repository - the platform for materials research
Open Science Data Repository - the platform for materials researchOpen Science Data Repository - the platform for materials research
Open Science Data Repository - the platform for materials researchValery Tkachenko
 
Cluster Management _ kubernetes MADIHA HARIFI
Cluster Management _ kubernetes MADIHA HARIFICluster Management _ kubernetes MADIHA HARIFI
Cluster Management _ kubernetes MADIHA HARIFIHarifi Madiha
 
Project update: A collaborative approach to "filling the digital preservation...
Project update: A collaborative approach to "filling the digital preservation...Project update: A collaborative approach to "filling the digital preservation...
Project update: A collaborative approach to "filling the digital preservation...Jenny Mitcham
 
Bonazzi commons bd2 k ahm 2016 v2
Bonazzi commons bd2 k ahm 2016 v2Bonazzi commons bd2 k ahm 2016 v2
Bonazzi commons bd2 k ahm 2016 v2Vivien Bonazzi
 
The ProteomeXchange Consoritum: 2017 update
The ProteomeXchange Consoritum: 2017 updateThe ProteomeXchange Consoritum: 2017 update
The ProteomeXchange Consoritum: 2017 updateJuan Antonio Vizcaino
 
Whats happening with ReDBoX - Gavin Kennedy
Whats happening with ReDBoX - Gavin KennedyWhats happening with ReDBoX - Gavin Kennedy
Whats happening with ReDBoX - Gavin KennedyARDC
 
HNSciCloud update @ the World LHC Computing Grid deployment board
HNSciCloud update @ the World LHC Computing Grid deployment board  HNSciCloud update @ the World LHC Computing Grid deployment board
HNSciCloud update @ the World LHC Computing Grid deployment board Helix Nebula The Science Cloud
 
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...Ilkay Altintas, Ph.D.
 
The Future of Semantics on the Web
The Future of Semantics on the WebThe Future of Semantics on the Web
The Future of Semantics on the WebJohn Domingue
 
Novo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNovo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNeo4j
 
Packaging computational biology tools for broad distribution and ease-of-reuse
Packaging computational biology tools for broad distribution and ease-of-reusePackaging computational biology tools for broad distribution and ease-of-reuse
Packaging computational biology tools for broad distribution and ease-of-reuseMatthew Vaughn
 
Paralyzing Bioinformatics Applications Using Conducive Hadoop Cluster
Paralyzing Bioinformatics Applications Using Conducive Hadoop ClusterParalyzing Bioinformatics Applications Using Conducive Hadoop Cluster
Paralyzing Bioinformatics Applications Using Conducive Hadoop ClusterIOSR Journals
 
PRISM Project Update
PRISM Project UpdatePRISM Project Update
PRISM Project Updateimgcommcall
 
Research data spring: a consortial approach to RDM within SaS
Research data spring: a consortial approach to RDM within SaSResearch data spring: a consortial approach to RDM within SaS
Research data spring: a consortial approach to RDM within SaSJisc RDM
 
Panel: Building the NRP Ecosystem with the Regional Networks on their Campuses;
Panel: Building the NRP Ecosystem with the Regional Networks on their Campuses;Panel: Building the NRP Ecosystem with the Regional Networks on their Campuses;
Panel: Building the NRP Ecosystem with the Regional Networks on their Campuses;Larry Smarr
 
HPC I/O for Computational Scientists
HPC I/O for Computational ScientistsHPC I/O for Computational Scientists
HPC I/O for Computational Scientistsinside-BigData.com
 

Similar to Scalable and reproducible workflows with Pachyderm (20)

The New CyREST: Economical Delivery of Complex, Reproducible Network Biology ...
The New CyREST: Economical Delivery of Complex, Reproducible Network Biology ...The New CyREST: Economical Delivery of Complex, Reproducible Network Biology ...
The New CyREST: Economical Delivery of Complex, Reproducible Network Biology ...
 
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
 
Open Science Data Repository - the platform for materials research
Open Science Data Repository - the platform for materials researchOpen Science Data Repository - the platform for materials research
Open Science Data Repository - the platform for materials research
 
Cluster Management _ kubernetes MADIHA HARIFI
Cluster Management _ kubernetes MADIHA HARIFICluster Management _ kubernetes MADIHA HARIFI
Cluster Management _ kubernetes MADIHA HARIFI
 
Project update: A collaborative approach to "filling the digital preservation...
Project update: A collaborative approach to "filling the digital preservation...Project update: A collaborative approach to "filling the digital preservation...
Project update: A collaborative approach to "filling the digital preservation...
 
Bonazzi commons bd2 k ahm 2016 v2
Bonazzi commons bd2 k ahm 2016 v2Bonazzi commons bd2 k ahm 2016 v2
Bonazzi commons bd2 k ahm 2016 v2
 
The ProteomeXchange Consoritum: 2017 update
The ProteomeXchange Consoritum: 2017 updateThe ProteomeXchange Consoritum: 2017 update
The ProteomeXchange Consoritum: 2017 update
 
Whats happening with ReDBoX - Gavin Kennedy
Whats happening with ReDBoX - Gavin KennedyWhats happening with ReDBoX - Gavin Kennedy
Whats happening with ReDBoX - Gavin Kennedy
 
HNSciCloud update @ the World LHC Computing Grid deployment board
HNSciCloud update @ the World LHC Computing Grid deployment board  HNSciCloud update @ the World LHC Computing Grid deployment board
HNSciCloud update @ the World LHC Computing Grid deployment board
 
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
 
The Future of Semantics on the Web
The Future of Semantics on the WebThe Future of Semantics on the Web
The Future of Semantics on the Web
 
Scholze liber 2015-06-25_final
Scholze liber 2015-06-25_finalScholze liber 2015-06-25_final
Scholze liber 2015-06-25_final
 
Novo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNovo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4j
 
Packaging computational biology tools for broad distribution and ease-of-reuse
Packaging computational biology tools for broad distribution and ease-of-reusePackaging computational biology tools for broad distribution and ease-of-reuse
Packaging computational biology tools for broad distribution and ease-of-reuse
 
Archivematica Community Update - SAA 2016
Archivematica Community Update - SAA 2016Archivematica Community Update - SAA 2016
Archivematica Community Update - SAA 2016
 
Paralyzing Bioinformatics Applications Using Conducive Hadoop Cluster
Paralyzing Bioinformatics Applications Using Conducive Hadoop ClusterParalyzing Bioinformatics Applications Using Conducive Hadoop Cluster
Paralyzing Bioinformatics Applications Using Conducive Hadoop Cluster
 
PRISM Project Update
PRISM Project UpdatePRISM Project Update
PRISM Project Update
 
Research data spring: a consortial approach to RDM within SaS
Research data spring: a consortial approach to RDM within SaSResearch data spring: a consortial approach to RDM within SaS
Research data spring: a consortial approach to RDM within SaS
 
Panel: Building the NRP Ecosystem with the Regional Networks on their Campuses;
Panel: Building the NRP Ecosystem with the Regional Networks on their Campuses;Panel: Building the NRP Ecosystem with the Regional Networks on their Campuses;
Panel: Building the NRP Ecosystem with the Regional Networks on their Campuses;
 
HPC I/O for Computational Scientists
HPC I/O for Computational ScientistsHPC I/O for Computational Scientists
HPC I/O for Computational Scientists
 

Recently uploaded

%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...masabamasaba
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park masabamasaba
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...Shane Coughlan
 
%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Hararemasabamasaba
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...Nitya salvi
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension AidPhilip Schwarz
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrainmasabamasaba
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfproinshot.com
 
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburgmasabamasaba
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is insideshinachiaurasa2
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdfPearlKirahMaeRagusta1
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrandmasabamasaba
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisamasabamasaba
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfonteinmasabamasaba
 
Generic or specific? Making sensible software design decisions
Generic or specific? Making sensible software design decisionsGeneric or specific? Making sensible software design decisions
Generic or specific? Making sensible software design decisionsBert Jan Schrijver
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 

Recently uploaded (20)

%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
Generic or specific? Making sensible software design decisions
Generic or specific? Making sensible software design decisionsGeneric or specific? Making sensible software design decisions
Generic or specific? Making sensible software design decisions
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 

Scalable and reproducible workflows with Pachyderm

  • 1. 2 October 2017 Scalable and reproducible workflows with Pachyderm Jon Ander Novella de Miguel Pharmaceutical Bioinformatics research group Uppsala, Sweden
  • 2. 2 October 2017 APPROACHESTO TACKLE BIOLOGICAL COMPUTATIONS Data growth in biomedicine Scalable methods for Big Data Analytics enabled by Cloud Computing
  • 3. 2 October 2017 • Mass Spectrometry can offer high metabolite coverage METABOLITE DATA
  • 4. 2 October 2017 • Workflow definitions • Isolation of scientific software • Reproducibility • Parallelisation CHALLENGES
  • 5. 2 October 2017 • Stitching many different software tools is tedious • Time-intensive and parameter heavy steps involved • Examples:Taverna, Nextflow, SciPipe WORKFLOW DEFINITIONS
  • 6. 2 October 2017 • Workflow definitions • Isolation of scientific software • Reproducibility • Parallelisation CHALLENGES
  • 7. 2 October 2017 • Containers wrap an app with its own operating environment • Portability and environmental consistency • Useful in science • Is Vagrant already old-fashioned? ISOLATION OF SCIENTIFIC SOFTWARE
  • 8. 2 October 2017 • Deployment, scaling and management of containers in a cluster • Kubernetes: big and active community • Automatic healing and machine decoupling [1] https://www.kubernetes.io [1] CONTAINER ORCHESTRATION TOOLS
  • 9. 2 October 2017 • Workflow definitions • Isolation of scientific software • Reproducibility • Parallelisation CHALLENGES
  • 10. 2 October 2017 • Workflow-system based on Kubernetes • A distributed data processing tool based on containers • Enables reproducibility, provenance, parallelization and isolation “You can focus on being productive, while Pachyderm will scale up and analyze for you” [2] https://www.pachyderm.io [2] WHAT IS PACHYDERM?
  • 11. 2 October 2017 The main primitives are: • Repositories: versioned collections of data • Commits: new data • Files: data storage primitives [3] https://www.pachyderm.io/pfs.html [3] PFS offers version control for data: PACHYDERM FILE SYSTEM (PFS)
  • 12. 2 October 2017 • Tasks executed by Kubernetes pods • Parallelization: spreading data • Incrementality and glob patterns • Directed Acyclic Graph [4] https://www.pachyderm.io/pps.html [4] PACHYDERM PIPELINE SYSTEM (PPS)
  • 13. 2 October 2017 • Reproducing a metabolomics workflow with Pachyderm • Learn how to distribute processing using containers • Feeling the power of data versioning • Learn how we can use containers in a cloud-like distributed processing environment GOALS OF THE DAY
  • 14. 2 October 2017 • OpenMS: software for metabolite and proteome data analysis and management • Detection of mass traces and their aggregation into features • Four pre-processing steps AN OPENMS BASED WORKFLOW X CSV File Filter Feature Finder Feature Linker Text Exporter
  • 15. 2 October 2017 • Kubernetes cluster backed by a Vagrant box (VM) • https://github.com/CARAMBA-Clinic/COST- CHARME/blob/master/README.md • Execution of workflow-engine in Cloud-Like environment via Jupyter • Downstream analysis on RStudio METHODS
  • 16. 2 October 2017 • Four interconnected tasks/processes • Intermediate data handled by repositories • Results stored also in a repository WORKFLOW IN PACHYDERM
  • 17. 2 October 2017 • Thanks to Pachyderm, we can enable a reproducible and scalable data processing platform • Can you write your own container and distribute its computation? REPRODUCIBLE RESULT
  • 18. 2 October 2017 THANKS! ANY QUESTIONS? “Provenance and reproducibility enable a rigorous and efficient data science” Jon Ander Novella de Miguel Department of Pharmaceutical Biosciences Jon.Novella@farmbio.uu.se