Scalable and reproducible workflows with Pachyderm

This presentation contains an introduction to using Pachyderm as a tool to enable scalable and reproducible workflows in the life sciences. Pachyderm is an open-source workflow-engine and distributed data processing tool that leverages the container ecosystem.

2 October 2017
Scalable and reproducible workflows with Pachyderm
Jon Ander Novella de Miguel
Pharmaceutical Bioinformatics research group
Uppsala, Sweden

APPROACHES TO TACKLE BIOLOGICAL COMPUTATIONS
• Data growth in biomedicine
• Scalable methods for Big Data Analytics enabled by Cloud Computing

METABOLITE DATA
• Mass Spectrometry can offer high metabolite coverage

CHALLENGES
• Workflow definitions
• Isolation of scientific software
• Reproducibility
• Parallelisation

WORKFLOW DEFINITIONS
• Stitching many different software tools together is tedious
• Time-intensive and parameter-heavy steps are involved
• Examples of workflow tools: Taverna, Nextflow, SciPipe

ISOLATION OF SCIENTIFIC SOFTWARE
• Containers wrap an app with its own operating environment
• Portability and environmental consistency
• Useful in science
• Is Vagrant already old-fashioned?

CONTAINER ORCHESTRATION TOOLS [1]
• Deployment, scaling and management of containers in a cluster
• Kubernetes: big and active community
• Automatic healing and machine decoupling
[1] https://www.kubernetes.io

WHAT IS PACHYDERM? [2]
• Workflow system based on Kubernetes
• A distributed data processing tool based on containers
• Enables reproducibility, provenance, parallelization and isolation
“You can focus on being productive, while Pachyderm will scale up and analyze for you”
[2] https://www.pachyderm.io

PACHYDERM FILE SYSTEM (PFS) [3]
PFS offers version control for data. The main primitives are:
• Repositories: versioned collections of data
• Commits: new data added to a repository
• Files: data storage primitives
[3] https://www.pachyderm.io/pfs.html
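
The three primitives above can be pictured with a small in-memory model. This is only an illustrative sketch, not Pachyderm's API: the `Repo` and `Commit` classes, the repository name and the file names are all hypothetical, and real PFS stores data in a distributed object store rather than in Python dictionaries.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Commit:
    """An immutable snapshot of every file in a repository (toy model)."""
    id: int
    parent: Optional[int]
    files: Dict[str, bytes]


@dataclass
class Repo:
    """A versioned collection of data: an append-only list of commits."""
    name: str
    commits: List[Commit] = field(default_factory=list)

    def commit(self, changes: Dict[str, bytes]) -> Commit:
        """Record new data on top of the latest snapshot."""
        parent = self.commits[-1] if self.commits else None
        files = dict(parent.files) if parent else {}
        files.update(changes)
        new = Commit(len(self.commits), parent.id if parent else None, files)
        self.commits.append(new)
        return new

    def read(self, path: str, commit_id: int = -1) -> bytes:
        """Read a file as it existed at a given commit (default: latest)."""
        return self.commits[commit_id].files[path]


# Hypothetical repository and file names, for illustration only.
repo = Repo("raw-spectra")
first = repo.commit({"sample1.mzML": b"v1"})
second = repo.commit({"sample1.mzML": b"v2", "sample2.mzML": b"v1"})
```

Because every commit keeps its own snapshot, `repo.read("sample1.mzML", first.id)` still returns the original content after the file has been overwritten, which is the property that makes earlier results reproducible.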

PACHYDERM PIPELINE SYSTEM (PPS) [4]
• Tasks executed by Kubernetes pods
• Parallelization: spreading data across workers
• Incrementality and glob patterns
• Directed Acyclic Graph
[4] https://www.pachyderm.io/pps.html
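
A rough way to picture glob patterns, data spreading and incrementality is the sketch below. It is a simplification under stated assumptions: the helper names `datums`, `spread` and `pending` are hypothetical, matching uses Python's `fnmatchcase` rather than Pachyderm's own glob rules, and the round-robin assignment stands in for whatever scheduling Pachyderm and Kubernetes actually perform.

```python
from fnmatch import fnmatchcase
from typing import Dict, List, Set


def datums(paths: List[str], glob: str) -> List[str]:
    """Select the paths matching the glob; each match is one unit of work."""
    return sorted(p for p in paths if fnmatchcase(p, glob))


def spread(units: List[str], workers: int) -> Dict[int, List[str]]:
    """Assign the units of work to a fixed number of workers, round-robin."""
    plan: Dict[int, List[str]] = {w: [] for w in range(workers)}
    for i, unit in enumerate(units):
        plan[i % workers].append(unit)
    return plan


def pending(units: List[str], done: Set[str]) -> List[str]:
    """Incrementality: skip units that were already processed."""
    return [u for u in units if u not in done]


# Hypothetical input files, for illustration only.
files = ["/s1.mzML", "/s2.mzML", "/s3.mzML", "/notes.txt"]
units = datums(files, "/*.mzML")
plan = spread(units, workers=2)
todo = pending(units, done={"/s1.mzML"})
```

Here the glob selects the three spectra files, the plan splits them between two workers, and only the two files not yet processed remain to be done when new data arrives.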

GOALS OF THE DAY
• Reproduce a metabolomics workflow with Pachyderm
• Learn how to distribute processing using containers
• Feel the power of data versioning
• Learn how we can use containers in a cloud-like distributed processing environment

AN OPENMS-BASED WORKFLOW
• OpenMS: software for metabolite and proteome data analysis and management
• Detection of mass traces and their aggregation into features
• Four pre-processing steps
[Figure: workflow diagram: File Filter → Feature Finder → Feature Linker → Text Exporter → CSV]

METHODS
• Kubernetes cluster backed by a Vagrant box (VM)
• https://github.com/CARAMBA-Clinic/COST-CHARME/blob/master/README.md
• Execution of the workflow engine in a cloud-like environment via Jupyter
• Downstream analysis in RStudio

WORKFLOW IN PACHYDERM
• Four interconnected tasks/processes
• Intermediate data handled by repositories
• Results also stored in a repository
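
One way to sketch the chain of four tasks is to generate one pipeline definition per step, each reading from the repository written by the previous step. Everything specific here is an assumption: the step names, the `raw-data` repository, the container images and the command are hypothetical, and the field names (`pipeline`, `transform`, `input`, `pfs`, `glob`) follow one version of the Pachyderm pipeline specification and may differ between releases.

```python
import json
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical pipeline names for the four pre-processing steps.
STEPS = ["file-filter", "feature-finder", "feature-linker", "text-exporter"]


def pipeline_spec(name: str, input_repo: str) -> dict:
    """Build a pipeline definition that reads one repository (assumed schema)."""
    return {
        "pipeline": {"name": name},
        "transform": {
            "image": f"example/{name}:latest",  # hypothetical container image
            "cmd": ["/bin/run.sh"],             # hypothetical entry point
        },
        "input": {"pfs": {"repo": input_repo, "glob": "/*"}},
    }


specs = []
upstream = "raw-data"  # hypothetical repository holding the input spectra
for step in STEPS:
    specs.append(pipeline_spec(step, upstream))
    upstream = step  # a pipeline writes to a repository named after itself

# Map each pipeline to the repository it depends on and order the graph.
graph = {s["pipeline"]["name"]: {s["input"]["pfs"]["repo"]} for s in specs}
order = list(TopologicalSorter(graph).static_order())

print(json.dumps(specs[0], indent=2))
print(" -> ".join(order))
```

The topological order makes the directed acyclic graph explicit: intermediate data flows through the three upstream repositories, and the results end up in the repository of the last step.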

REPRODUCIBLE RESULT
• Thanks to Pachyderm, we can enable a reproducible and scalable data processing platform
• Can you write your own container and distribute its computation?

THANKS! ANY QUESTIONS?
“Provenance and reproducibility enable a rigorous and efficient data science”
Jon Ander Novella de Miguel
Department of Pharmaceutical Biosciences
Jon.Novella@farmbio.uu.se
