Workflows have proved to be an excellent medium for representing scientific methods in general, and in the life sciences and chemistry in particular.
In the last 20 years, quite a few mature workflow engines and workflow editors with diverse foci and strengths have been developed to support communities in managing workflows. More recent enhancements of workflow systems address the usability, re-usability and sustainability of workflows, as well as the optimisation of their efficiency.
The talk goes into detail on three methods and concepts concerned with these topics. The first consists of meta-workflows for quantum-chemical applications, applied in the project MoSGrid (Molecular Simulation Grid) and its science gateway developed on top of WS-PGRADE. The second addresses scaling up bioinformatics workflows with dynamic job expansion, with a case study using Galaxy and Makeflow. The talk concludes with a model for balancing thread-level and task-level parallelism for data-intensive workloads on clusters and clouds, using aligners for next-generation sequencing data as the example.
Increasing the Efficiency of Workflows: Use Cases in the Life Sciences
1. Increasing the Efficiency of Workflows: Use Cases in the Life Sciences
Sandra Gesing
Center for Research Computing
sandra.gesing@nd.edu
19 August 2016
2. University of Notre Dame
• In the middle of nowhere of northern Indiana (1.5 h from Chicago)
• 4 undergraduate colleges
• ~35 research institutes and centers
• ~12,000 students
3. Center for Research Computing
• Software development and profiling
• Cyberinfrastructure/science gateway development
• Computational Scientist support
• Collaborative research/grant development
• System administration/prototype architectures
• Computational resources: 25,000 cores+
• Storage resources: 3 PB
• National resources (e.g., XSEDE)
• ~40 researchers, research programmers, HPC specialists
CRC and OIT building
http://crc.nd.edu
CRC HPC Center (old Union Station)
4. Life Sciences
• Genomics
• Proteomics
• Metabolomics
• Immunomics
• Systems biology
• Molecular simulations
• Docking
• Epidemiology
• …
Black Swallowtail – larvae and butterfly
5. The Genomics Boom
February 16, 2001: biotech company Celera
February 15, 2001: The Human Genome Project
6. The Genomics Boom
Craig Venter (left) and Francis Collins (right)
7. Big Data
• Explosion in the quantity, variety and complexity of data
• Questions can be answered that were impossible to even ask about 10 years ago
• Costs far reduced (e.g., Human Genome Project: 15 years, ~$2 billion; today: ~3 days, $1000)
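The cost comparison on this slide can be turned into rough reduction factors. The following is a minimal sketch that uses only the approximate figures quoted above; the conversion of 365.25 days per year is an assumption made for the arithmetic.

```python
# Approximate figures quoted on the slide.
hgp_cost_usd = 2_000_000_000        # ~$2 billion
hgp_duration_days = 15 * 365.25     # 15 years (assumption: 365.25 days per year)
today_cost_usd = 1_000              # $1000
today_duration_days = 3             # ~3 days

# Ratio of the older figure to the newer one.
cost_factor = hgp_cost_usd / today_cost_usd
time_factor = hgp_duration_days / today_duration_days

print(f"Cost reduced by a factor of about {cost_factor:,.0f}")
print(f"Time reduced by a factor of about {time_factor:,.0f}")
```

Both factors are only as precise as the rounded inputs, so they indicate orders of magnitude rather than exact values.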
8. Big Data
http://www.genome.gov/images/content/cost_per_genome_oct2015.jpg
9. Analysis of Data
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa
Slide copied from: Stuart Owen, "Workflows with Taverna"
A sequence of connected steps in a defined order based on their control and data dependencies
10. Analysis of data
A sequence of connected steps in a defined order based on their control and data dependencies
Workflows!
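The definition above, steps run in an order determined by their dependencies, can be illustrated with a small dependency graph. This is a minimal sketch using Python's standard `graphlib` module; the step names are hypothetical and are not taken from the talk or from any particular workflow engine.

```python
from graphlib import TopologicalSorter

# Hypothetical steps of a small sequence-analysis workflow.
# Each key depends on the steps listed in its value set.
dependencies = {
    "quality_check": {"load_reads"},
    "align": {"quality_check", "load_reference"},
    "call_variants": {"align"},
    "report": {"call_variants"},
}

def execution_order(deps):
    """Return one valid order in which all steps can run."""
    return list(TopologicalSorter(deps).static_order())

order = execution_order(dependencies)
print(order)
```

Steps without a dependency between them (here `load_reads` and `load_reference`) may appear in either order, which is exactly the freedom a workflow engine can exploit to run them in parallel.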
14. State of the Art
• Data and compute-intensive problems
• High-speed networks
• Users generally not IT specialists
• Tools and workflow engines
• Web-based agile frameworks
• Distributed data and computing infrastructures
15. Challenge for Developers
• Data and compute-intensive problems
• High-speed networks
• Tools and workflow engines
• Web-based agile frameworks
• Distributed data and computing infrastructures
• Users generally not IT specialists
Need for intuitive and efficient workflows!
16. Challenge for Developers
• Data and compute-intensive problems
• High-speed networks
• Tools and workflow engines
• Web-based agile frameworks
• Distributed data and computing infrastructures
• Users generally not IT specialists
17. Usability
“After all, usability really just means making sure that something works well: that a person … can use the thing - whether it's a Web site, a fighter jet, or a revolving door - for its intended purpose without getting hopelessly frustrated.”
(Steve Krug in “Don't make me think!: A Common Sense Approach to Web Usability”, 2005)
19. Workflow Enhancements
• Logical level: Meta-workflows
Herres-Pawlis, S., Hoffmann, A., Rösener, T., Krüger, J., Grunzke, R., and Gesing, S. “Multi-layer Meta-metaworkflows for the Evaluation of Solvent and Dispersion Effects in Transition Metal Systems Using the MoSGrid Science Gateways”, Science Gateways (IWSG), 2015 7th International Workshop on, pp. 47-52, 3-5 June 2015, IEEE Xplore, doi: 10.1109/IWSG.2015.13
• System level: Combination of strengths of workflow systems
Hazekamp, N., Sarro, J., Choudhury, O., Gesing, S., Emrich, S., and Thain, D. “Scaling Up Bioinformatics Workflows with Dynamic Job Expansion: A Case Study Using Galaxy and Makeflow”, e-Science (e-Science), 2015 IEEE 11th International Conference on, pp. 332-341, Aug. 31 2015-Sept. 4 2015
• Prediction: Model for optimization of tasks and threads
Choudhury, O., Rajan, D., Hazekamp, N., Gesing, S., Thain, D., and Emrich, S. “Balancing Thread-level and Task-level Parallelism for Data-Intensive Workloads on Clusters and Clouds”, Cluster Computing (CLUSTER), 2015 IEEE International Conference on, pp. 390-393, 8-11 Sept. 2015, doi: 10.1109/CLUSTER.2015.60
20. MoSGrid Science Gateway
Molecular Simulation Grid
• Science gateway integrated with underlying compute and data management infrastructure
• Distributed workflow management
• Data repository
• Metadata management
33. Scaling Up Workflows
Simple Workflow in Galaxy
Problem: As Size increases so does Time
34. Scaling Up Workflows
Workflow with Parallelism added in Galaxy
Problem: Tools must be updated with every change in parallelism; relies on the scientist
41. Scaling Up Workflows
Job Sandbox – Log file creation for cleanup
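The idea of logging created files so that a sandbox can be cleaned up afterwards can be sketched as follows. This is an illustrative sketch only, not the Galaxy or Makeflow implementation; the class `JobSandbox`, the log file name `intermediates.log` and the file names are all hypothetical.

```python
import os
import tempfile

class JobSandbox:
    """Records every intermediate file so it can be removed after the job."""

    def __init__(self, directory):
        self.directory = directory
        self.log_path = os.path.join(directory, "intermediates.log")

    def create_intermediate(self, name, content):
        path = os.path.join(self.directory, name)
        with open(path, "w") as handle:
            handle.write(content)
        # Append the path to the log so cleanup knows what to delete.
        with open(self.log_path, "a") as log:
            log.write(path + "\n")
        return path

    def cleanup(self):
        """Delete logged intermediates and the log; return how many were removed."""
        removed = 0
        if os.path.exists(self.log_path):
            with open(self.log_path) as log:
                for line in log:
                    path = line.strip()
                    if path and os.path.exists(path):
                        os.remove(path)
                        removed += 1
            os.remove(self.log_path)
        return removed

with tempfile.TemporaryDirectory() as workdir:
    sandbox = JobSandbox(workdir)
    sandbox.create_intermediate("chunk_0.tmp", "ACGT")
    sandbox.create_intermediate("chunk_1.tmp", "TTGA")
    # The final result is written outside the log, so cleanup leaves it alone.
    final_result = os.path.join(workdir, "result.txt")
    with open(final_result, "w") as handle:
        handle.write("merged output")
    removed_count = sandbox.cleanup()
    remaining = sorted(os.listdir(workdir))
```

Only files that were registered in the log are deleted, so the final output survives while the intermediates do not.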
42. Scaling Up Workflows
Dynamic Job Expansion
• Work Queue: we utilized 100s of cores from a Condor Pool
• Cleaning Sandbox using knowledge of intermediates and logging
• Explored methods to transmit needed environments such as executables and Java
61.5X speed-up on 32 GB dataset utilizing these methods
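The general pattern behind expanding one job into many can be sketched as split, process in parallel, and merge, with the number of sub-jobs derived from the input size at run time. The sketch below is an assumption-laden illustration: it uses a local thread pool as a stand-in for distributed workers, does not use the Makeflow or Work Queue APIs, and `process_chunk` is a hypothetical placeholder for a real tool.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(records, chunk_size):
    """Split the input into chunks; the job count follows from the input size."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

def process_chunk(chunk):
    """Hypothetical per-chunk tool: counts G and C bases in each read."""
    return [sum(base in "GC" for base in read) for read in chunk]

def run_expanded(records, chunk_size, max_workers=4):
    """Run process_chunk on every chunk in parallel and merge the results."""
    chunks = split_into_chunks(records, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves chunk order, so merging is a simple concatenation.
        partial_results = list(pool.map(process_chunk, chunks))
    merged = [value for part in partial_results for value in part]
    return merged, len(chunks)

reads = ["ACGT", "GGCC", "ATAT", "CGCG", "TTAA"]
gc_counts, job_count = run_expanded(reads, chunk_size=2)
print(gc_counts, job_count)
```

Because the chunking happens inside the runner, the tool itself does not need to change when the degree of parallelism changes.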
43. Thread-level and Task-level Parallelism
• Develop predictive performance models for an application domain
• Achieve acceptable performance the first time
• Optimize resource utilization
• Execution time
• Memory usage
44. Thread-level and Task-level Parallelism
• WorkQueue master-worker framework
• Sun Grid Engine (SGE) batch system
45. Thread-level and Task-level Parallelism
1. Application-level model for time:
$T(R, Q, N) = \beta_1 \frac{R Q}{N} + \beta_2$
2. Application-level model for memory:
$M(R, N) = \gamma_1 R + \gamma_2 N$
3. System-level model for time:
$T_{Total} = \eta_1 \frac{Q K}{D} + \eta_2 \left( \frac{Q}{B} + \frac{R K N}{B C} \right) + \eta_3 \, T\left(R, \frac{Q}{K}, N\right) \cdot \frac{K N}{M C} + \eta_4 \frac{O}{B} + \eta_5 \frac{O K}{D}$
4. System-level model for memory:
$M_{Master}(R, Q) = \phi_1 R + \phi_2 Q$
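The two application-level models on this slide can be evaluated directly to compare candidate configurations. The sketch below is illustrative only: the slide does not define the symbols, so reading R and Q as input sizes and N as the number of threads is an assumption, the selection helper `best_thread_count` is hypothetical, and the coefficients are placeholders rather than fitted values from the talk.

```python
def app_time(R, Q, N, beta1, beta2):
    """Application-level time model: T(R, Q, N) = beta1 * R * Q / N + beta2."""
    return beta1 * R * Q / N + beta2

def app_memory(R, N, gamma1, gamma2):
    """Application-level memory model: M(R, N) = gamma1 * R + gamma2 * N."""
    return gamma1 * R + gamma2 * N

def best_thread_count(R, Q, candidates, memory_limit, coeffs):
    """Pick the candidate N with the lowest predicted time that fits in memory."""
    feasible = [
        n for n in candidates
        if app_memory(R, n, coeffs["gamma1"], coeffs["gamma2"]) <= memory_limit
    ]
    if not feasible:
        raise ValueError("no thread count satisfies the memory limit")
    return min(
        feasible,
        key=lambda n: app_time(R, Q, n, coeffs["beta1"], coeffs["beta2"]),
    )

# Placeholder coefficients, NOT fitted values from the talk.
coeffs = {"beta1": 0.5, "beta2": 10.0, "gamma1": 2.0, "gamma2": 0.25}
chosen = best_thread_count(
    R=4.0, Q=8.0, candidates=[1, 2, 4, 8, 16], memory_limit=10.0, coeffs=coeffs
)
print(chosen, app_time(4.0, 8.0, chosen, coeffs["beta1"], coeffs["beta2"]))
```

With these placeholder values, predicted time falls as N grows while predicted memory rises, so the memory limit is what caps the thread count.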
46. Thread-level and Task-level Parallelism
[Figure: model-building pipeline. Data Collection: 7 data points (R), 7 data points (Q), 7 data points (N), 343 data points. Training: training data, regression model, regression coefficients. Testing: testing data, accuracy test, MAPE.]
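The pipeline on this slide (collect a 7 x 7 x 7 grid of measurements, fit a regression on training data, check accuracy with MAPE on testing data) can be sketched as below. Everything numeric here is an assumption for illustration: the grid levels, the synthetic "measurements", the coefficients used to generate them and the train/test split are made up and are not results from the talk. The fitted form is the time model T = beta1 * R * Q / N + beta2.

```python
import itertools
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

# 7 values each for R, Q and N give 7 * 7 * 7 = 343 configurations.
levels = [1, 2, 4, 8, 16, 32, 64]
grid = list(itertools.product(levels, levels, levels))

# Synthetic "measurements" from made-up coefficients plus noise (illustration only).
rng = random.Random(0)
TRUE_BETA1, TRUE_BETA2 = 0.5, 10.0
features = [r * q / n for r, q, n in grid]
times = [(TRUE_BETA1 * x + TRUE_BETA2) * rng.uniform(0.98, 1.02) for x in features]

# Hold out every fifth point for testing; train on the rest.
indices = list(range(len(grid)))
test_idx = [i for i in indices if i % 5 == 0]
train_idx = [i for i in indices if i % 5 != 0]

beta1, beta2 = fit_line([features[i] for i in train_idx], [times[i] for i in train_idx])
predictions = [beta1 * features[i] + beta2 for i in test_idx]
test_mape = mape([times[i] for i in test_idx], predictions)
print(f"beta1={beta1:.3f} beta2={beta2:.3f} test MAPE={test_mape:.2f}%")
```

Keeping the test points out of the fit is what makes the reported MAPE an estimate of prediction error rather than of fit quality.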
52. Acknowledgements
Logical level
• Richard Grunzke
• Sonja Herres-Pawlis
• Alexander Hoffmann
• Jens Krüger
• MTA SZTAKI
• University of Westminster
System level and prediction
• Nicholas Hazekamp
• Olivia Choudhury
• Douglas Thain
• Scott Emrich
• Notre Dame Bioinformatics Lab
• The Cooperative Computing Lab, University of Notre Dame
53. Science Gateways Community Institute
01 August 2016 – 31 July 2021
1. Incubator: shared expertise in business and sustainability planning, cybersecurity, user interface design, and software engineering practices
2. Extended Developer Support: expert developers for up to one year
3. Scientific Software Collaborative: open-source, extensible framework for gateway design, integration, and services
4. Community Engagement and Exchange: forum for communication and shared experiences
5. Workforce Development: training programs and helping universities form gateway support groups
http://sciencegateways.org/