This presentation introduces Pachyderm as a tool for building scalable and reproducible workflows in the life sciences. Pachyderm is an open-source workflow engine and distributed data processing tool that leverages the container ecosystem.
Scalable and reproducible workflows with Pachyderm
1. 2 October 2017
Scalable and reproducible
workflows with Pachyderm
Jon Ander Novella de Miguel
Pharmaceutical Bioinformatics research group
Uppsala, Sweden
2. 2 October 2017
APPROACHES TO TACKLE BIOLOGICAL COMPUTATIONS
• Data growth in biomedicine
• Scalable methods for Big Data analytics enabled by cloud computing
3. 2 October 2017
• Mass Spectrometry can offer high metabolite coverage
METABOLITE DATA
4. 2 October 2017
• Workflow definitions
• Isolation of scientific software
• Reproducibility
• Parallelisation
CHALLENGES
5. 2 October 2017
• Stitching many different software tools together is tedious
• Time-intensive and parameter-heavy steps are involved
• Example workflow tools: Taverna, Nextflow, SciPipe
WORKFLOW DEFINITIONS
7. 2 October 2017
• Containers wrap an app with its own operating environment
• Portability and environmental consistency
• Useful in science
• Is Vagrant already old-fashioned?
ISOLATION OF SCIENTIFIC SOFTWARE
8. 2 October 2017
• Deployment, scaling and management of containers in a cluster
• Kubernetes: big and active community
• Automatic healing and machine decoupling
[1] https://www.kubernetes.io
CONTAINER ORCHESTRATION TOOLS
10. 2 October 2017
• Workflow system built on Kubernetes
• A distributed data processing tool based on containers
• Enables reproducibility, provenance, parallelization and isolation
“You can focus on being productive, while
Pachyderm will scale up and analyze for you”
[2] https://www.pachyderm.io
WHAT IS PACHYDERM?
11. 2 October 2017
PFS offers version control for data. The main primitives are:
• Repositories: versioned collections of data
• Commits: immutable snapshots that record new or changed data
• Files: data storage primitives
[3] https://www.pachyderm.io/pfs.html
PACHYDERM FILE SYSTEM (PFS)
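The three PFS primitives can be illustrated with a small toy model. This is only a sketch of the idea of versioned data, not Pachyderm's implementation or API: the class names `Repo` and `Commit`, the `diff` helper and the file names are all hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Commit:
    """An immutable snapshot: maps file paths to content hashes."""
    commit_id: str
    parent: "str | None"
    files: dict


@dataclass
class Repo:
    """A toy versioned collection of data (hypothetical, not the PFS API)."""
    name: str
    commits: list = field(default_factory=list)

    def commit(self, changes: dict) -> Commit:
        """Record new or changed files on top of the latest commit."""
        parent = self.commits[-1] if self.commits else None
        files = dict(parent.files) if parent else {}  # copy, never mutate history
        for path, content in changes.items():
            files[path] = hashlib.sha256(content).hexdigest()
        parent_id = parent.commit_id if parent else ""
        payload = repr(sorted(files.items())) + parent_id
        commit_id = hashlib.sha256(payload.encode()).hexdigest()[:12]
        new = Commit(commit_id, parent.commit_id if parent else None, files)
        self.commits.append(new)
        return new

    @staticmethod
    def diff(old: Commit, new: Commit) -> list:
        """Paths that were added or changed between two commits."""
        return sorted(p for p, h in new.files.items() if old.files.get(p) != h)


repo = Repo("raw-spectra")
first = repo.commit({"/sample1.mzML": b"scan-a"})
second = repo.commit({"/sample2.mzML": b"scan-b"})
print(Repo.diff(first, second))  # ['/sample2.mzML']
```

Because every commit keeps a reference to its parent and earlier snapshots are never modified, any past state of the repository can be inspected again, which is the property that makes results traceable to the exact input data.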
12. 2 October 2017
• Tasks executed by Kubernetes pods
• Parallelization: spreading data across workers
• Incrementality and glob patterns
• Directed Acyclic Graph
[4] https://www.pachyderm.io/pps.html
PACHYDERM PIPELINE SYSTEM (PPS)
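How a glob pattern and incrementality interact can be sketched in plain Python. The assumption here is that a glob pattern splits the input repository into independent units of work (called datums below), and that a unit whose content has already been processed is skipped on the next run. The functions `datums` and `run_pipeline` are hypothetical helpers written for illustration, not part of Pachyderm.

```python
import hashlib
from fnmatch import fnmatch


def datums(paths, glob):
    """Group file paths into units of work, one per distinct match of `glob`."""
    if glob == "/":
        return {"/": sorted(paths)}  # the whole repository is a single unit
    pattern_parts = glob.strip("/").split("/")
    groups = {}
    for path in paths:
        prefix = path.strip("/").split("/")[: len(pattern_parts)]
        if len(prefix) == len(pattern_parts) and all(
            fnmatch(part, pat) for part, pat in zip(prefix, pattern_parts)
        ):
            groups.setdefault("/" + "/".join(prefix), []).append(path)
    return groups


def run_pipeline(files, glob, transform, cache):
    """Process only the units whose content is not yet in `cache`."""
    processed, results = [], {}
    for key, members in sorted(datums(files, glob).items()):
        digest = hashlib.sha256()
        for member in sorted(members):
            digest.update(member.encode())
            digest.update(files[member])
        fingerprint = digest.hexdigest()
        if fingerprint not in cache:  # new or changed data: run the task
            cache[fingerprint] = transform({m: files[m] for m in members})
            processed.append(key)
        results[key] = cache[fingerprint]
    return results, processed


def count_bytes(unit):
    """Stand-in for a containerised analysis step."""
    return sum(len(content) for content in unit.values())


files = {"/s1.mzML": b"aaa", "/s2.mzML": b"bb"}
cache = {}
_, first_run = run_pipeline(files, "/*", count_bytes, cache)
files["/s3.mzML"] = b"c"  # a new commit adds one file
_, second_run = run_pipeline(files, "/*", count_bytes, cache)
print(first_run)   # ['/s1.mzML', '/s2.mzML']
print(second_run)  # ['/s3.mzML']
```

With the pattern `/*` each top-level file is its own unit, so the units could be handed to different workers and only the newly added file is processed on the second run. With `/` the whole repository is one unit, so any change triggers a full re-run.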
13. 2 October 2017
• Reproduce a metabolomics workflow with Pachyderm
• Learn how to distribute processing using containers
• Experience the power of data versioning
• Learn how to use containers in a cloud-like distributed processing environment
GOALS OF THE DAY
14. 2 October 2017
• OpenMS: software for metabolite and proteome data analysis and management
• Detection of mass traces and their aggregation into features
• Four pre-processing steps
AN OPENMS BASED WORKFLOW
Workflow diagram: File Filter, Feature Finder, Feature Linker, Text Exporter (CSV output)
15. 2 October 2017
• Kubernetes cluster backed by a Vagrant box (VM)
• https://github.com/CARAMBA-Clinic/COST-CHARME/blob/master/README.md
• Execution of the workflow engine in a cloud-like environment via Jupyter
• Downstream analysis in RStudio
METHODS
16. 2 October 2017
• Four interconnected tasks/processes
• Intermediate data handled by repositories
• Results are also stored in a repository
WORKFLOW IN PACHYDERM
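The four interconnected tasks form a small directed acyclic graph in which each task reads from the repository written by the previous one. The sketch below shows that wiring with Python's standard `graphlib` module (Python 3.9+). The repository and task names are hypothetical, the step functions are placeholders rather than real OpenMS calls, and it is assumed that each task writes to a repository named after itself.

```python
from graphlib import TopologicalSorter

# Each task maps to the repository it reads from (names are hypothetical).
pipelines = {
    "file-filter": "raw-data",
    "feature-finder": "file-filter",
    "feature-linker": "feature-finder",
    "text-exporter": "feature-linker",
}

# Placeholder transforms standing in for the containerised tools.
steps = {
    "file-filter": lambda data: data + "|filtered",
    "feature-finder": lambda data: data + "|features",
    "feature-linker": lambda data: data + "|linked",
    "text-exporter": lambda data: data + "|csv",
}

# TopologicalSorter expects a mapping of node -> set of predecessors.
graph = {task: {source} for task, source in pipelines.items()}
order = list(TopologicalSorter(graph).static_order())
print(order)
# ['raw-data', 'file-filter', 'feature-finder', 'feature-linker', 'text-exporter']

# Run the tasks in dependency order; intermediate data lives in repositories.
repos = {"raw-data": "spectra"}
for node in order:
    if node in pipelines:
        repos[node] = steps[node](repos[pipelines[node]])

print(repos["text-exporter"])  # spectra|filtered|features|linked|csv
```

Keeping every intermediate result in its own repository means each stage can be inspected or re-run on its own, and the final result can be traced back through the chain to the raw input.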
17. 2 October 2017
• With Pachyderm, we obtain a reproducible and scalable data processing platform
• Can you write your own container and distribute its computation?
REPRODUCIBLE RESULT
18. 2 October 2017
THANKS! ANY QUESTIONS?
“Provenance and reproducibility enable a rigorous and
efficient data science”
Jon Ander Novella de Miguel
Department of Pharmaceutical Biosciences
Jon.Novella@farmbio.uu.se