1. 2016-09-04 BioExcel SIG, ECCB, Amsterdam
Advances in Scientific
Workflow Environments
Carole Goble, Stian Soiland-Reyes
The University of Manchester
carole.goble@manchester.ac.uk
http://esciencelab.org.uk/
2. What is a Workflow?
• Orchestrating multiple
computational tasks
• Managing the control and
data flow between them
• In a world that is
homogeneous or
heterogeneous
• Tasks
– Local / remote
– Local / third party
– White, grey or black boxes
– Reliable / fragile
– Reserved / dynamic
– Various underpinning
infrastructure
– Various access controls
BioExcel: Biomolecular recognition
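As a minimal sketch of the two ideas on this slide (orchestrating tasks, and managing the data flow between them), the following runs three toy tasks in dependency order. The task names and the `WORKFLOW` table are hypothetical and stand in for real tools; `graphlib` is in the Python standard library from 3.9.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each receives a dict of upstream results and returns a value.
def fetch(_inputs):
    return [3, 1, 2]

def clean(inputs):
    return sorted(inputs["fetch"])

def analyse(inputs):
    data = inputs["clean"]
    return sum(data) / len(data)

# Task name -> (callable, names of the tasks whose outputs it consumes).
WORKFLOW = {
    "fetch": (fetch, []),
    "clean": (clean, ["fetch"]),
    "analyse": (analyse, ["clean"]),
}

def run(workflow):
    """Run every task once, after all of its upstream tasks have finished."""
    order = TopologicalSorter({name: deps for name, (_, deps) in workflow.items()})
    results = {}
    for name in order.static_order():
        func, deps = workflow[name]
        # Data flow: hand each task only the outputs it declared a dependency on.
        results[name] = func({dep: results[dep] for dep in deps})
    return results

results = run(WORKFLOW)
print(results["analyse"])  # 2.0
```

A real workflow system adds what this sketch leaves out: remote execution, failure handling and access control.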
3. What is a Workflow?
Automation
– Automate computational aspects
– Repetitive pipelines, sweep campaigns
Scaling – compute cycles
– Make use of computational infrastructure
& handle large data
Abstraction – people cycles
– Shield complexity and incompatibilities
– Report, re-use, evolve, share, compare
– Repeat – Tweak – Repeat
– First class commodities
Provenance – reporting
– Capture, report and utilize log and data
lineage; auto-documentation
– Traceable evolution, audit, transparency
– Compare
With thanks to Bertram Ludascher: WORKS 2015 Keynote
Findable
Accessible
Interoperable
Reusable
(Reproducible)
12. Workflow Patterns, templates
Data
wrangling
& analytics
Simulations
Instrument
pipelines
+
+
http://tpeterka.github.io/maui-project/
The Future of Scientific Workflows, Report of DOE Workshop 2015,
http://science.energy.gov/~/media/ascr/pdf/programdocuments/docs/workflows_final_report.pd
13. Workflow Patterns, templates
Garijo et al., Common Motifs in Scientific Workflows: An Empirical Analysis, FGCS, 36, July 2014, 338–351
14. Workflow Patterns, templates
• Long running and complex code
• Tunable parameters and input sets
• Simulation sweeps / iterations
• Ensembles, comparisons
• Tricky set-ups, human-in-the-loop
interaction
• Computational steering
• In situ workflows – multiple tasks, same
box, within fixed time
– data locality.
– human-in-the-loop.
– capture provenance.
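The "tunable parameters" and "simulation sweeps" bullets can be sketched as a loop over every combination of parameter values. This is an illustration only: `simulate`, the parameter names and their values are hypothetical stand-ins for a long-running code.

```python
from itertools import product

def simulate(temperature, timestep):
    """Stand-in for a long-running simulation; returns a made-up score."""
    return round(temperature * timestep, 3)

# Tunable parameters, each with the values to explore (hypothetical).
SWEEP = {"temperature": [280, 300, 320], "timestep": [0.001, 0.002]}

def sweep(func, grid):
    """Run func once per parameter combination, keeping the inputs with each result."""
    names = list(grid)
    runs = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        runs.append({"params": params, "result": func(**params)})
    return runs

runs = sweep(simulate, SWEEP)
best = max(runs, key=lambda run: run["result"])
print(len(runs), best["params"])
```

Storing the parameters next to each result is what later makes ensembles and comparisons possible.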
15. Traction + Examples
Reuse behaviours
Exploratory vs Production
Different kinds of user / deployment
Developer – User Ratios
Biologist / Developer / Computational Scientist
19. “Multi-scale” WFMS
• Workflow
Management
System
– Its design and reporting
environment
– Its execution
environment
• The tasks
– tools, codes and services
and their execution
environments
• Stack layer
– App level, infrastructure
level
20. Component making
Tasks loosely coupled through files,
• execute on geographically distributed
clusters, clouds, grids across systems
• execute on multiple facilities
• call host services (web / grid services)
DAIC
Distributed Area/Instrument
Computing
“Multi-scale” WFMS
Tasks tightly coupled
• exchanging info over memory/storage
• network of supercomputers
• In situ workflows – multiple tasks, same
box, within fixed time
HPC
Interoperability
Portability
Granularity
Maintenance
22. Copernicus workflow engine for
parallel adaptive molecular dynamics
• Peer-to-peer distributed
computing platform
– high-level parallelization of
statistical sampling problems
• Consolidation of heterogeneous
compute resources
• Automatic resource matching of
jobs against compute resources
• Automatic fault tolerance of
distributed work
• Workflow execution engine to
define a problem (reporting) and
trace its results live (provenance)
• Flexible plugin facilities
– programs to be integrated to the
workflow execution engine
Free Energy
Workflow using
GROMACS
http://copernicus-computing.org/
23. COMPSs/PyCOMPSs:
Programmer Productivity
framework
• Sequential programming
– Parallelisation and
distribution heavy-lifting
– Dependency detection
• Infrastructure unaware
– Abstract application from
underlying infrastructure
– Portability
• Standard Programming
Languages
– Java, Python, C/C++
• No (or few!) APIs
– Standard Java
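The "sequential programming with dependency detection" idea can be illustrated with the Python standard library alone. This sketch does not use the API of the framework named on this slide; the `task` decorator below is a hypothetical helper built on `concurrent.futures`, and it only shows the principle that code written sequentially can be run as a task graph.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from functools import wraps

_pool = ThreadPoolExecutor(max_workers=4)

def task(func):
    """Turn a plain function into an asynchronous task that returns a Future."""
    @wraps(func)
    def submit(*args):
        def call():
            # Dependency detection: a Future argument is an upstream task's output,
            # so wait for it before running this task.
            resolved = [a.result() if isinstance(a, Future) else a for a in args]
            return func(*resolved)
        return _pool.submit(call)
    return submit

@task
def square(x):
    return x * x

@task
def add(a, b):
    return a + b

# Reads like sequential code; the two square() calls are free to run in parallel.
left = square(3)
right = square(4)
total = add(left, right)
print(total.result())  # 25
```

The calling code never mentions threads or hosts, which is the "infrastructure unaware" point: swapping the thread pool for a cluster back end would not change the three lines that define the computation.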
25. Stop Press!
GUIs not essential!
• Canvas, drag-drop blocks, arrows,
run button
• Command-line & embedding in
developer or user applications
Scripts can be workflows!
• WMS<->Scripts
• Script vs Workflows/ASAP:
– Automation: *****
– Scaling: **
– Abstraction: *
– Provenance: **
26. Stop Press!
Work close to a problem-specific
ad-hoc data model
Domain Specific Language
"programming-lite" scripts
• wire with declarative
"makefile"-like DAG
Plus
• procedural scripting and
expressions in languages
like Javascript and Python
Nextflow, Snakemake,
Common Workflow Language
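A declarative, "makefile"-like DAG can be sketched as rules that only state which files they read and write; the execution order is then derived rather than written down. The rule and file names below are hypothetical, and this is not the syntax of any of the tools listed above.

```python
from graphlib import TopologicalSorter

# Hypothetical rules: rule name -> files consumed and files produced.
RULES = {
    "align": {"inputs": ["reads.fastq", "genome.fa"], "outputs": ["aligned.bam"]},
    "call": {"inputs": ["aligned.bam"], "outputs": ["variants.vcf"]},
    "report": {"inputs": ["variants.vcf", "aligned.bam"], "outputs": ["report.html"]},
}

def plan(rules, target):
    """Return the rules needed to build target, ordered so inputs exist before use."""
    producer = {out: name for name, rule in rules.items() for out in rule["outputs"]}
    graph = {}
    pending = [producer[target]]
    while pending:
        name = pending.pop()
        if name in graph:
            continue
        # A rule depends on the rules that produce its inputs;
        # inputs with no producer are treated as source files.
        graph[name] = {producer[f] for f in rules[name]["inputs"] if f in producer}
        pending.extend(graph[name])
    return list(TopologicalSorter(graph).static_order())

print(plan(RULES, "report.html"))  # ['align', 'call', 'report']
```

Asking for an intermediate file, for example `plan(RULES, "aligned.bam")`, returns only the rules that file needs.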
29. Provenance: the link between computation and results
W3C PROV model standard
record for reporting
compare diffs/discrepancies
provenance analytics
track changes, adapt
partial repeat/reproduce
carry attributions
compute credits
compute data quality/trust
select data to keep/release
optimisation and debugging
Metadata propagation – where was the
physical sample collected, and who
should be attributed?
Task-based abstractions: simplifying
provenance using motifs and tool
annotations
“Free energy calculation” rather than 5
steps including preparation of PDB files
and GROMACS execution
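To make "data lineage" concrete, here is a minimal sketch that records two relations named after W3C PROV terms (`used` and `wasGeneratedBy`) as plain tuples and walks them backwards from a result. It is not a PROV serialisation, and the step and file names are hypothetical.

```python
# Provenance as two PROV-style relations, stored as plain tuples.
used = []              # (activity, entity): the activity read the entity
was_generated_by = []  # (entity, activity): the entity was written by the activity

def record(activity, inputs, outputs):
    """Log what one workflow step consumed and produced."""
    used.extend((activity, entity) for entity in inputs)
    was_generated_by.extend((entity, activity) for entity in outputs)

# Hypothetical steps of a free energy calculation.
record("prepare_structure", ["protein.pdb"], ["system.gro"])
record("run_md", ["system.gro", "params.mdp"], ["traj.xtc"])
record("estimate_free_energy", ["traj.xtc"], ["dG.txt"])

def lineage(entity):
    """Return every entity the given entity was derived from, directly or indirectly."""
    ancestors = set()
    frontier = [entity]
    while frontier:
        current = frontier.pop()
        for produced, activity in was_generated_by:
            if produced != current:
                continue
            for act, source in used:
                if act == activity and source not in ancestors:
                    ancestors.add(source)
                    frontier.append(source)
    return ancestors

print(sorted(lineage("dG.txt")))
```

The same records answer the attribution question on this slide: whoever supplied `protein.pdb` appears in the lineage of every downstream result.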
30. Provenance: the link between workflow variants,
workflow reuse and repurposing
W3C PROV model standard?
record for reporting
compare diffs/discrepancies
provenance analytics
track changes, adapt
carry attributions
compute design credits
versioning, forking, cloning
Nested workflows
functions by stealth
Copy and paste fragmentation
Designing for reuse
Find and Go
Software practices
Systematic reuse
Guidelines for persistently identifying
software using DataCite
https://epubs.stfc.ac.uk/work/24058274
https://www.force11.org/software-citation-principles
31. ASAP Wfms for FAIR Science
Automate: workflows,
programs and services folks
already use or want to use
Scale: Enable computational
productivity
Abstract: Enable human
productivity
Provenance: Record and use
Usability
Workflow Plugged in Code
Reporting Comparison
Thanks to Bertram Ludascher
33. Application Building Blocks
BioExcel Virtualised Software Library
“transversal workflow units”, higher level operations
● Task-specific “mini-workflow”
fragments
– e.g. using GROMACS, CPMD,
HADDOCK
● Packaged
– EGI VM images and Docker
containers
● Backed by existing registries
– ELIXIR’s bio.tools and EGI App DB
● Instantiated as cloud instances
– private (OpenNebula, OpenStack)
– public (e.g. Amazon AWS)
34. BioExcel Use cases
● Genomics
● Ensembl Molecular
simulations
● Free Energy simulations
● Multiscale modelling of
molecular basis for odor
and taste
● Biomolecular recognition
● Pharmacological queries
● Virtual Screening
35. Finding valid pathways through free-energy
landscapes: implementation of the “string of
swarms” method using Copernicus as a
workflow manager, and GROMACS as a
compute engine.
36. Workflow Interoperability.
• Common format for bioinformatics tool &
workflow execution
• Community-based standards effort
• Designed for clusters & clouds
• Supports the use of containers (e.g. Docker)
• Specify data dependencies between steps
• Scatter/gather on steps
• Nest workflows in steps
• Develop your pipeline on your local computer
(optionally with Docker)
• Execute on your research cluster or in the cloud
• Deliver to users via workbenches
• EDAM ontology (ELIXIR-DK) to specify file
formats and reason about them: “FASTQ
Sanger” encoding is a type of FASTQ file
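The last bullet, reasoning about file formats, amounts to a subsumption check: a step that accepts FASTQ should also accept anything that is a kind of FASTQ. A minimal sketch follows, assuming a small hand-written hierarchy; the `PARENT` table uses plain labels for illustration and does not reproduce real EDAM identifiers or its actual class tree.

```python
# Illustrative "is a kind of" links between format labels (child -> parent).
# These labels are assumptions for the example, not real EDAM identifiers.
PARENT = {
    "FASTQ Sanger": "FASTQ",
    "FASTQ Illumina": "FASTQ",
    "FASTQ": "Sequence format",
}

def is_a(fmt, expected):
    """True if fmt equals expected or is a (transitive) specialisation of it."""
    while fmt is not None:
        if fmt == expected:
            return True
        fmt = PARENT.get(fmt)
    return False

def can_connect(output_format, accepted_format):
    """An output may feed a step input when its format is subsumed by the accepted one."""
    return is_a(output_format, accepted_format)

print(can_connect("FASTQ Sanger", "FASTQ"))  # True
print(can_connect("FASTQ", "FASTQ Sanger"))  # False
```

The check is deliberately one-directional: a generic FASTQ file is not guaranteed to use the Sanger encoding.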
37. Workflow Research Object Bundle
researchobject.org
Belhajjame et al. (2015) Using a suite of ontologies for preserving workflow-centric research objects,
J. Web Semantics, doi:10.1016/j.websem.2015.01.003
application/vnd.wf4ever.robundle+zip
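A hedged sketch of how such a bundle might be assembled, assuming the layout described in the Research Object Bundle specification: a ZIP archive whose first entry is an uncompressed `mimetype` file holding the media type above, plus a manifest at `.ro/manifest.json`. The manifest fields shown are a simplified assumption and the bundled file names are hypothetical; check the specification at researchobject.org before relying on them.

```python
import io
import json
import zipfile

MEDIA_TYPE = "application/vnd.wf4ever.robundle+zip"

def make_bundle(files):
    """Pack {path: bytes} into an in-memory ZIP laid out like a research object bundle."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as bundle:
        # The media type goes first and uncompressed so tools can recognise the file.
        bundle.writestr("mimetype", MEDIA_TYPE, compress_type=zipfile.ZIP_STORED)
        # Simplified manifest (assumed field names) listing the aggregated resources.
        manifest = {
            "@context": ["https://w3id.org/bundle/context"],
            "id": "/",
            "aggregates": [{"uri": "/" + path} for path in sorted(files)],
        }
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
        for path in sorted(files):
            bundle.writestr(path, files[path])
    return buffer.getvalue()

# Hypothetical contents: a workflow definition and one output file.
data = make_bundle({
    "workflow.cwl": b"placeholder workflow\n",
    "outputs/result.txt": b"placeholder result\n",
})
print(len(data), "bytes")
```

Keeping the workflow, its outputs and the manifest in one archive is what lets the bundle travel as a single citable unit.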
38. Z. Zhao et al., “Workflow bus for e-Science”, in IEEE e-Science 2006, Amsterdam
40. http://bioexcel.eu/events/bioexcel-workflow-training-for-computational-biomolecular-research/
Adam Hospital (IRB), Anna Montras (IRB), Stian Soiland-Reyes (UNIMAN), Alexandre Bonvin
(UU), Adrien Melquiond (UU), Josep Lluís Gelpí (BSC), Daniele Lezzi (BSC), Steven Newhouse
(EBI), Jose A. Dianes (EBI), Mark Abraham (KTH), Rossen Apostolov (KTH), Emiliano Ippoliti
(Jülich), Adam Carter (UEDIN), Darren J. White (UEDIN)
Slides: Bertram Ludascher, Ewa Deelman, Vasa Curcin, Paolo Missier, Pinar Alper, Susheel
Varma, Rob Finn, Michael Crusoe, Rizos Sakellariou
Sign up
ASAP!