Argonne’s Discovery Engines for Big Data project is working to enable new research modalities based on the integration of advanced computing with experiments at facilities such as the Advanced Photon Source (APS). I review science drivers and initial results in diffuse scattering, high-energy diffraction microscopy, tomography, and ptychography. I also describe the computational methods and infrastructure that we leverage to support such applications, including the Petrel online data store, ALCF supercomputers, Globus research data management services, and Swift parallel scripting. This work points to a future in which tight integration of DOE’s experimental and computational facilities enables both new science and more efficient, rapid discovery.
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
1. Discovery Engines for Big Data
Accelerating Discovery in Basic Energy Sciences
Ian Foster
Argonne National Laboratory
Joint work with Ray Osborn, Guy Jennings, Jon Almer,
Hemant Sharma, Mike Wilde, Justin Wozniak,
Rachana Ananthakrishnan, Ben Blaiszik, and many others
Work supported by Argonne LDRD
2. Motivating example: Disordered structures
“Most of materials science is bottlenecked by disordered structures”
Atomic disorder plays an important role in
controlling the bulk properties of complex
materials, for example:
Colossal magnetoresistance
Unconventional superconductivity
Ferroelectric relaxor behavior
Fast-ion conduction
And many, many others!
We want a systematic understanding
of the relationships between material
composition, temperature, structure,
and other properties
3. A role for both experiment and simulation
Experiment: Observe (indirect) properties of real structures, e.g., single-crystal diffuse scattering at the Advanced Photon Source.
Simulation: Compute properties of potential structures, e.g., DISCUS-simulated diffuse scattering; molecular dynamics for candidate structures.
[Figure: material composition (La 60%, Sr 40%) → simulated structure → simulated scattering, compared against experimental scattering from the sample.]
4. Opportunity: Integrate experiment & simulation
Experiments can explain and guide simulations
– E.g., guide simulations via evolutionary optimization
Simulations can explain and guide experiments
– E.g., identify temperature regimes in which more data is needed
[Figure: closed loop between experimental scattering from the sample and simulated scattering from a simulated structure (material composition: La 60%, Sr 40%). Loop stages: detect errors (secs–mins); simulations driven by experiments (mins–days); contribute to a knowledge base of past experiments, simulations, literature, and expert knowledge; knowledge-driven decision making and evolutionary optimization select experiments (mins–hours).]
5. Opportunity: Link experiment, simulation, and data
analytics to create a discovery engine
[Figure: the same experiment–simulation loop as above, now closed by data analytics and the knowledge base.]
6. Opportunities for discovery acceleration in energy
sciences are numerous and span DOE facilities
A virtuous cycle: more data → new analysis methods → new science processes.
Beamline examples:
Grazing-incidence small-angle x-ray scattering (8-ID): directed self-assembly (Nealey, Ferrier, De Pablo, et al.)
Single-crystal diffuse scattering (6-ID): defect structure in disordered materials (Osborn et al.)
High-energy x-ray diffraction microscopy (1-ID): microstructure in bulk materials (Almer, Sharma, et al.)
Common themes:
Large amounts of data
New mathematical and numerical methods
Statistical and machine learning methods
Rapid reconstruction and analysis
Large-scale parallel computation
End-to-end automation
Data management and provenance
(Examples follow.)
7. Parallel pipeline enables real-time analysis of
diffuse scattering data, plus offline DIFFEV fitting
DIFFEV step: use simulation and an evolutionary algorithm to determine a crystal configuration that reproduces the observed scattering image.
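Since DIFFEV is named for differential evolution, a minimal sketch of such a fit loop may help: candidate parameter vectors are recombined and kept only when their simulated scattering matches the observed pattern better. The toy Gaussian forward model and all names here (simulate_scattering, diffev_fit) are illustrative stand-ins, not the DISCUS/DIFFEV code.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_scattering(params, q):
    """Toy forward model standing in for a DISCUS simulation:
    a sum of Gaussian diffuse peaks parameterized by `params`."""
    centers, widths, amps = np.split(params, 3)
    widths = np.abs(widths) + 1e-3          # keep peak widths positive
    return sum(a * np.exp(-((q - c) / w) ** 2)
               for c, w, a in zip(centers, widths, amps))

def chi_square(params, q, observed):
    return np.sum((simulate_scattering(params, q) - observed) ** 2)

def diffev_fit(q, observed, n_params=9, pop_size=40, generations=300,
               F=0.8, CR=0.9):
    """Classic differential evolution (rand/1/bin)."""
    pop = rng.normal(size=(pop_size, n_params))
    cost = np.array([chi_square(p, q, observed) for p in pop])
    for _ in range(generations):
        for i in range(pop_size):
            a, b, c = pop[rng.choice(pop_size, 3, replace=False)]
            mutant = a + F * (b - c)               # differential mutation
            cross = rng.random(n_params) < CR      # binomial crossover
            cross[rng.integers(n_params)] = True   # at least one gene crosses
            trial = np.where(cross, mutant, pop[i])
            t_cost = chi_square(trial, q, observed)
            if t_cost < cost[i]:                   # greedy selection
                pop[i], cost[i] = trial, t_cost
    return pop[np.argmin(cost)]

# Usage: recover the parameters of a synthetic "experimental" pattern.
q = np.linspace(-3, 3, 200)
truth = np.array([-1.0, 0.3, 1.5, 0.4, 0.2, 0.5, 1.0, 0.8, 0.6])
observed = simulate_scattering(truth, q)
best = diffev_fit(q, observed)
print("final chi^2:", chi_square(best, q, observed))
```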
8. Accelerating mapping of materials microstructure
with high energy diffraction microscopy (HEDM)
Top: grains in a 0.79 mm³ volume of a copper wire.
Bottom: the same wire under tensile deformation. (J. Almer)
9. Parallel pipeline enables immediate assessment of
alignment quality in high-energy diffraction microscopy
Workflow (a single workflow, driven by a manual Bash control script; data moves between Orthros, where all data is in NFS, and the Blue Gene/Q via ssh and Globus (GO) transfer; the Globus Catalog records scientific metadata and workflow progress):
Dataset from detector: 360 files, 4 GB total.
1: Median calculation (MedianImage.c, Swift/K): 75 s, ~90% I/O.
2: Peak search (ImageProcessing.c, Swift/K): 15 s per file; produces a reduced dataset (360 files, 5 MB total) and rapid feedback to the experiment.
3: Generate parameters (FOP.c, Swift/K): 50 tasks, 25 s/task, ~0.25 CPU hours; also convert files to network-endian format (~2 min for all files).
4: Analysis pass (FitOrientation.c, Swift/T): 60 s/task on PC or BG/Q, ~1,667 CPU hours; up to 2.2 M CPU hours per week.
(Before/after images show the improvement in alignment quality.)
Hemant Sharma, Justin Wozniak, Mike Wilde, Jon Almer
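The production pipeline is orchestrated with Swift/K and Swift/T; the following Python sketch reproduces only the dataflow shape (one median pass, a parallel per-file peak search, then parallel orientation fitting). The stage functions are hypothetical placeholders for the C executables named above.

```python
from concurrent.futures import ProcessPoolExecutor

# Hypothetical stand-ins for the C executables in the diagram.
def median_image(frames):          # stage 1: one pass over all frames
    return "dark.median"

def peak_search(frame, median):    # stage 2: per-frame, embarrassingly parallel
    return f"{frame}.peaks"

def generate_parameters(peaks):    # stage 3: emit the 50 fitting tasks
    return [f"param-{i}" for i in range(50)]

def fit_orientation(task):         # stage 4: per-task, runs on the big machine
    return f"{task}.orientation"

def pipeline(frames):
    median = median_image(frames)
    with ProcessPoolExecutor() as pool:
        # Fan out the per-frame peak search (15 s/file in production).
        peaks = list(pool.map(peak_search, frames, [median] * len(frames)))
        tasks = generate_parameters(peaks)
        # Fan out orientation fitting (60 s/task, ~1,667 CPU hours at scale).
        return list(pool.map(fit_orientation, tasks))

if __name__ == "__main__":
    results = pipeline([f"frame{i:03d}.tif" for i in range(360)])
    print(len(results), "orientations")
```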
10. Big data staging with MPI-IO enables interactive analysis on an IBM BG/Q supercomputer
Justin Wozniak
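A minimal sketch of the staging pattern with mpi4py: every rank reads its slice of a shared file in one collective MPI-IO call, letting the I/O layer coalesce requests into large, aligned reads. The file name and flat float32 layout are assumptions for illustration, not the production code.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

fh = MPI.File.Open(comm, "frames.bin", MPI.MODE_RDONLY)
count = fh.Get_size() // 4 // size          # float32 values per rank
buf = np.empty(count, dtype=np.float32)     # (any remainder is ignored)

# All ranks read their slice in one collective call.
fh.Read_at_all(rank * count * 4, buf)
fh.Close()

total = comm.allreduce(buf.sum(), op=MPI.SUM)
if rank == 0:
    print("checksum:", total)
```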
11. New data, computational capabilities, and
methods create opportunities and challenges
Integrate data movement, management, workflow, and
computation to accelerate data-driven applications
Integrate statistics and machine learning to assess many
models and calibrate them against “all” relevant data
New computing facilities enable on-demand computing
and high-speed analysis of large quantities of data
(Stack: applications, algorithms, environments, infrastructure, facilities)
12. Towards a lab-wide (and DOE-wide) data
architecture and facility
Users: researchers, system administrators, collaborators, students, …
Interfaces: web interfaces, REST APIs, command line interfaces
Services: domain portals (PDACS, kBase, eMatter, FACE-IT); registries of metadata and attributes; component & workflow repositories; workflow execution; data transfer, sync, and sharing; data publication & discovery
Integration layer: remote access protocols, authentication, authorization
Resources: utility compute system (“cloud”); HPC compute; parallel file system; DISC system; experimental facility; visualization system
18. The Petrel research data service
High-speed, high-capacity data store
Seamless integration with data fabric
Project-focused, self-managed
32 I/O nodes with GridFTP; 1.7 PB GPFS store
100 TB allocations; user-managed access via globus.org
Connects to other sites, facilities, and colleagues
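As a sketch of how a detector dataset might land on Petrel, here is a transfer request via the Globus Python SDK (globus_sdk); the endpoint UUIDs, paths, label, and token are placeholders, not real identifiers.

```python
import globus_sdk

APS_ENDPOINT = "aps-endpoint-uuid"          # placeholder
PETREL_ENDPOINT = "petrel-endpoint-uuid"    # placeholder

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer("TRANSFER_TOKEN"))

tdata = globus_sdk.TransferData(
    tc, APS_ENDPOINT, PETREL_ENDPOINT,
    label="HEDM scan 42", sync_level="checksum")
tdata.add_item("/data/scan42/", "/projects/hedm/scan42/", recursive=True)

task = tc.submit_transfer(tdata)            # Globus retries and verifies
print("transfer task:", task["task_id"])
```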
19. Managing the research data lifecycle with Globus services
1: Globus transfers files reliably, securely, and rapidly from the experimental facility to the compute facility; the PI initiates the transfer request, or it is requested automatically by a script or science gateway.
2: The PI selects files to share, selects a user or group, and sets access permissions. Globus controls access to shared files on existing storage; there is no need to move files to cloud storage.
3: A researcher logs in to Globus and accesses the shared files; no local account is required; download via Globus.
4: The researcher assembles a data set and describes it using metadata (Dublin Core and domain-specific).
5: A curator reviews and approves; the data set is published on a campus or other publication repository.
6: Peers and collaborators search for and discover datasets, then transfer and share them using Globus from their own computers.
SaaS: only a web browser is required; access using your campus credentials; Globus monitors and notifies throughout.
(Transfer, sharing, publication, discovery: www.globus.org)
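Step 2 (sharing) can likewise be scripted. A sketch with globus_sdk, again with placeholder identifiers: the rule grants one collaborator read-only access to a folder on a shared endpoint, without moving any data.

```python
import globus_sdk

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer("TRANSFER_TOKEN"))

rule = {
    "DATA_TYPE": "access",
    "principal_type": "identity",           # or "group"
    "principal": "collaborator-identity-uuid",
    "path": "/projects/hedm/scan42/",
    "permissions": "r",                     # read-only
}
result = tc.add_endpoint_acl_rule("shared-endpoint-uuid", rule)
print("access rule:", result["access_id"])
```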
24. Tying it all together: A basic energy sciences
cyberinfrastructure
Components: storage locations; compute facilities; collaboration catalogs (provenance, files & metadata); script libraries.
31. Tying it all together: A basic energy sciences
cyberinfrastructure
0: Develop or reuse a script (script libraries)
1: Run script (EL1.layer)
2: Look up file in the collaboration catalogs (name=EL1.layer, user=Anton, type=reconstruction)
3: Transfer inputs from storage locations
4: Run app at a compute facility
5: Transfer results
6: Update catalogs (provenance, files & metadata)
Researchers and external collaborators then find and use the results via the catalogs.
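A hypothetical Python sketch of steps 0–6; the catalog, transfer, and run_app stand-ins mimic what Swift plus Globus services provide, and none of these names are a real API.

```python
from dataclasses import dataclass, field

@dataclass
class Catalog:
    entries: dict = field(default_factory=dict)
    def lookup(self, name):
        return self.entries[name]
    def register(self, name, location, provenance):
        self.entries[name] = {"location": location, "provenance": provenance}

def transfer(src, dst):                     # stand-in for a Globus transfer
    print(f"transfer {src} -> {dst}")
    return dst

def run_app(script, inputs):                # stand-in for a Swift-driven run
    print(f"run {script} on {inputs}")
    return "/compute/out/EL1.recon", {"script": script, "inputs": inputs}

catalog = Catalog({"EL1.layer": {"location": "/storage/EL1.layer"}})

def run_workflow(script, dataset):                         # 1: run script
    record = catalog.lookup(dataset)                       # 2: look up file
    staged = transfer(record["location"], "/compute/in/")  # 3: transfer inputs
    out, prov = run_app(script, staged)                    # 4: run app
    final = transfer(out, "/storage/results/")             # 5: transfer results
    catalog.register("EL1.recon", final, prov)             # 6: update catalogs
    return final

run_workflow("reconstruct.swift", "EL1.layer")
```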
32. Towards a science of workflow performance
Develop, evaluate, and refine component and end-to-end models: models from the literature; fluid models for network flows; the SKOPE modeling system.
Develop and apply data-driven estimation methods: differential regression; surrogate models; other methods from the literature.
Run automated experiments to test models and build a database: experiment design; testbeds.
Develop easy-to-use tools that give end users actionable advice: a runtime advisor integrated with Globus services.
(“Robust Analytical Modeling for Science at Extreme Scales”)
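As a toy instance of the data-driven estimation idea, here is a fit of transfer time as a startup cost plus size over bandwidth, followed by a prediction for a larger transfer; the numbers are invented for illustration.

```python
import numpy as np

gb = np.array([1, 4, 16, 64, 256])      # transfer sizes (GB), invented
secs = np.array([9, 14, 45, 150, 570])  # observed times (s), invented

slope, intercept = np.polyfit(gb, secs, 1)   # secs ≈ intercept + slope * GB
print(f"startup ~{intercept:.0f} s, bandwidth ~{1.0 / slope:.2f} GB/s")
print(f"predicted time for 1 TB: {intercept + slope * 1024:.0f} s")
```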
33. Discovery engines for energy science
Science automation services: scripting, security, storage, cataloging, transfer.
Simulation (characterize, predict): precompute material databases; assimilate data; steer data acquisition.
Data analysis: reconstruct images, detect features, auto-correlate, compute particle distributions, …; workloads range from batch to immediate, and from ~0.001 to 100+ PFlops.
Integration (optimize, fit, …): configure, check, guide.
Data rates today (×10 expected in 5 years): ~0.001–0.5 GB/s per flow; ~2 GB/s total burst; ~200 TB/month; ~10 concurrent flows.
Scientific opportunities: probe material structure and function at unprecedented scales.
Technical challenges: many experimental modalities; data rates and computation needs vary widely and are increasing; knowledge management, integration, and synthesis; new methods demand rapid access to large amounts of data and computing.
34. Next steps
From six beamlines to 60 beamlines
From 60 facility users to 6000 facility users
From one lab to all labs
From data management and analysis to knowledge
management, integration, and analysis
From per-user to per-discipline (and trans-discipline)
data repositories, publication, and discovery
From terabytes to petabytes
From three months to three hours to build pipelines
From intuitive to analytical understanding of systems
Editor's Notes
Atomic disorder, both in the form of point defects and in the nanoscale self-organization that often accompanies them, plays an important role in controlling the bulk properties of complex materials.
Use experiments to constrain models of material structure, and vice versa
Experiments: Single crystal diffuse scattering of, e.g., bilayer manganites, yielding pair distribution functions
Simulations: Molecular dynamics for candidate structures, yielding simulated scattering and simulated pair distribution functions
Experiment genome!
Simulation genome!
Add CNM
Add Tao beamlines
This diagram shows the big picture.
New types of computer systems enable high-speed data access, high-speed analysis, and on-demand computing
Integrated networking, data transfer, and security solutions enable ultra-rapid, secure communication among resources
New data and workflow services enable automation and provenance tracking for data-driven applications
Simple APIs enable rapid development of domain-specific tools, applications, and portals
We consider Petrel here to be “within APS”
A uniform data fabric across the lab
… with seamless access to large data…
… for use in computation, collaboration and distribution …
… that is project focused and self managed …
… and is described and discoverable
This diagram shows the CMTS project’s vision for a cyberinfrastructure that we believe will further enhance our scientific productivity. It shows the major components of the CMTS cyberinfrastructure we are integrating, how CMTS will use it, and how it will help them.
0. Develop script: A CMTS researcher will go to the project script library to locate existing scripts to run, or to find components from which to compose or adapt a new script. The library will point to script codes managed by revision control systems like Git or Subversion, and the many public and private servers that host these services. The Swift parallel scripting language is central to our approach, because it imparts a uniform, high-level interface to script components, and it is implicitly parallel.
1. Run script: When Swift runs your script, it automatically handles several previously difficult aspects of script development and execution for the scientist: 1) it manages parallel execution (dataflow, throttling, etc.); 2) it abstracts the interfaces to diverse and distributed clusters; 3) it automates data transfer; 4) it records provenance; 5) it retries failing application runs. All of these would otherwise have to be programmed manually, if they were done at all.
2. Locate input files via the dataset catalog: CMTS scripts will locate major datasets through a networked catalog that enables scripts to be written with no dependencies on where datasets are located or replicated.
3. Transfer inputs: Swift will automatically transport input datasets to the selected computational resource for an application run (if needed).
4. Run app: Swift will then run the application, retrying failures (if requested) and recording a “provenance log” that traces where the app ran, with what runtime and memory usage, and with what arguments and environment settings.
5. Transfer results: Swift will automatically transport output datasets to the selected archival or temporary storage resource (if needed).
6. Update catalogs: It will then update CMTS collaboration catalogs with new dataset locations, derived metadata annotations on those datasets, and the provenance of the data and runs.
Collaborate! (2 clicks): All of this facilitates collaboration, both by project team members and by external collaborators, whether across the hall or across the world.
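The implicit parallelism described above can be sketched with Python futures: independent calls run concurrently, and downstream work starts only when its inputs resolve, which is the dataflow behavior Swift provides automatically. The functions here are illustrative, not Swift itself.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def analyze(sample):
    time.sleep(1)                  # stand-in for an application run
    return f"{sample}.result"

def combine(a, b):
    return f"combined({a}, {b})"

with ThreadPoolExecutor() as pool:
    f1 = pool.submit(analyze, "sampleA")   # both submitted immediately;
    f2 = pool.submit(analyze, "sampleB")   # execution order is dataflow order
    print(combine(f1.result(), f2.result()))
```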