Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
Ian Foster 
Argonne National Laboratory 
Joint work with Ray Osborn, Guy Jennings, Jon Almer, 
Hemant Sharma, Mike Wilde, Justin Wozniak, 
Rachana Ananthakrishnan, Ben Blaiszik, and many others 
Work supported by Argonne LDRD
Motivating example: Disordered structures 
“Most of materials science is bottlenecked by disordered structures” 
Atomic disorder plays an important role in 
controlling the bulk properties of complex 
materials, for example: 
• Colossal magnetoresistance
• Unconventional superconductivity
• Ferroelectric relaxor behavior
• Fast-ion conduction
• And many, many others!
We want a systematic understanding of the relationships between material composition, temperature, structure, and other properties.
A role for both experiment and simulation 
Experiment: Observe (indirect) properties of real structures
– E.g., single crystal diffuse scattering at the Advanced Photon Source
Simulation: Compute properties of potential structures
– E.g., DISCUS simulated diffuse scattering; molecular dynamics for structures
[Figure: material composition (La 60%, Sr 40%) → simulated structure → simulated scattering, compared with experimental scattering measured from the sample]
Opportunity: Integrate experiment & simulation 
Experiments can explain and guide simulations
– E.g., guide simulations via evolutionary optimization
Simulations can explain and guide experiments
– E.g., identify temperature regimes in which more data is needed
[Figure: a closed loop between experiment and simulation: material composition (La 60%, Sr 40%) → simulated structure → simulated scattering, compared against experimental scattering from the sample]
Opportunity: Link experiment, simulation, and data 
analytics to create a discovery engine 
[Figure: the experiment–simulation loop augmented with a knowledge base (past experiments; simulations; literature; expert knowledge) that supports knowledge-driven decision making: detect errors (secs–mins), select experiments (mins–hours), run simulations driven by experiments (mins–days) with evolutionary optimization, and contribute results back to the knowledge base]
Opportunities for discovery acceleration in energy 
sciences are numerous and span DOE facilities 
The drivers: more data, new science processes, and new analysis methods. Examples:
• Grazing incidence small-angle x-ray scattering (8-ID): directed self-assembly (Nealey, Ferrier, De Pablo, et al.)
• Single crystal diffuse scattering (6-ID): defect structure in disordered materials (Osborn et al.)
• High-energy x-ray diffraction microscopy (1-ID): microstructure in bulk materials (Almer, Sharma, et al.)
Common themes:
• Large amounts of data
• New mathematical and numerical methods
• Statistical and machine learning methods
• Rapid reconstruction and analysis
• Large-scale parallel computation
• End-to-end automation
• Data management and provenance
Parallel pipeline enables real-time analysis of 
diffuse scattering data, plus offline DIFFEV fitting 
DIFFEV step: use simulation and an evolutionary algorithm to determine a crystal configuration that reproduces the observed scattering image.
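To make the DIFFEV step concrete, here is a minimal sketch of this style of evolutionary fit, using SciPy's differential evolution in place of the actual DIFFEV code. The forward model below is a toy stand-in for the DISCUS simulation, and the parameters and bounds are illustrative only.

```python
# Sketch only: fit structural parameters so that a simulated scattering image
# matches a measured one. simulate_scattering() is a toy stand-in for the
# DISCUS forward simulation; DIFFEV's actual model and parameter set differ.
import numpy as np
from scipy.optimize import differential_evolution

y, x = np.mgrid[0:64, 0:64]

def simulate_scattering(params):
    """Toy forward model: two-parameter Gaussian 'scattering' image."""
    cx, width = params
    return np.exp(-((x - cx) ** 2 + (y - 32) ** 2) / (2 * width ** 2))

measured = simulate_scattering([40.0, 6.0])   # stand-in for detector data

def misfit(params):
    """Chi-squared-style distance between simulated and measured images."""
    return float(np.sum((simulate_scattering(params) - measured) ** 2))

# Evolutionary search over parameter bounds; population evaluated in parallel
result = differential_evolution(misfit, bounds=[(0, 64), (1, 20)],
                                popsize=20, maxiter=100, workers=-1)
print("best-fit parameters:", result.x)       # recovers ~[40, 6]
```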
Accelerating mapping of materials microstructure 
with high energy diffraction microscopy (HEDM) 
Top: grains in a 0.79 mm³ volume of a copper wire.
Bottom: tensile deformation of the copper wire as it is pulled. (J. Almer)
Parallel pipeline enables immediate assessment of 
alignment quality in high-energy diffraction microscopy 
[Workflow diagram: detector data flows through the Orthros analysis cluster and an IBM Blue Gene/Q, with all data in NFS; files move via Globus (GO) transfer and ssh. Up to 2.2 M CPU hours per week.]
Dataset from the detector: 360 files, 4 GB total.
1: Median calculation (MedianImage.c, Swift/K): 75 s, 90% I/O.
2: Peak search (ImageProcessing.c, Swift/K): 15 s per file; produces a reduced dataset (360 files, 5 MB total) and feedback to the experiment.
3a: Convert binary files to network-endian format: 2 min for all files.
3b: Generate parameters (FOP.c, Swift/K): 50 tasks, 25 s/task, ¼ CPU hour total.
4: Analysis pass (FitOrientation.c, Swift/T): 60 s/task on either a PC cluster or the BG/Q; ~1667 CPU hours.
Scientific metadata and workflow progress are recorded in the Globus Catalog; a Bash control script replaces previously manual steps. This is a single workflow; before/after images show the improvement in alignment quality.
(Hemant Sharma, Justin Wozniak, Mike Wilde, Jon Almer)
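The Swift/K and Swift/T scripts themselves are not shown here; the following Python sketch illustrates only the fan-out pattern the pipeline relies on. The file names and per-file functions are hypothetical stand-ins for the C programs named above.

```python
# Sketch of the pipeline's fan-out pattern, not the actual Swift script:
# process each detector file independently, then run fitting tasks in parallel.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def peak_search(path: Path) -> Path:
    """Hypothetical stand-in for ImageProcessing.c: reduce one detector
    frame to a small peak list (the real step takes ~15 s per file)."""
    reduced = path.with_suffix(".peaks")
    # ... read frame, subtract median, find peaks, write reduced file ...
    return reduced

def fit_orientation(task_id: int, peaks: list) -> float:
    """Hypothetical stand-in for FitOrientation.c (~60 s per task)."""
    # ... optimize grain orientation against the reduced dataset ...
    return 0.0

frames = sorted(Path("dataset").glob("*.bin"))       # 360 files, 4 GB total
with ProcessPoolExecutor() as pool:
    reduced = list(pool.map(peak_search, frames))    # stage 2: per-file
    fits = list(pool.map(fit_orientation,
                         range(50), [reduced] * 50)) # stage 4: per-task
```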
Big data staging with MPI-IO enables interactive analysis on an IBM BG/Q supercomputer (Justin Wozniak)
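A minimal sketch of the staging pattern using mpi4py's MPI-IO bindings: each rank reads its slice of a large input file in one collective call, so the dataset lands in distributed memory ready for interactive analysis. The file name, dtype, and even-split assumption are mine, not details of the actual implementation.

```python
# Sketch of collective MPI-IO staging: every rank reads its slice of a large
# file in a single collective operation. File name and dtype are assumptions.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

fh = MPI.File.Open(comm, "staged_dataset.bin", MPI.MODE_RDONLY)
total_bytes = fh.Get_size()
chunk = total_bytes // size                  # bytes per rank (assume even split)
buf = np.empty(chunk, dtype=np.uint8)
fh.Read_at_all(rank * chunk, buf)            # collective read of this rank's slice
fh.Close()

# ... analyze buf in place; results can be reduced across ranks ...
local_sum = buf.astype(np.int64).sum()
global_sum = comm.reduce(local_sum, op=MPI.SUM, root=0)
if rank == 0:
    print("checksum:", global_sum)
```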
New data, computational capabilities, and 
methods create opportunities and challenges 
• Integrate data movement, management, workflow, and computation to accelerate data-driven applications
• Integrate statistics and machine learning to assess many models and calibrate them against 'all' relevant data
• New computer facilities enable on-demand computing and high-speed analysis of large quantities of data
[Diagram: applications, algorithms, environments, infrastructure, and facilities as layers]
Towards a lab-wide (and DOE-wide) data 
architecture and facility 
[Architecture diagram:
Users: researchers, system administrators, collaborators, students, …
Access: web interfaces, REST APIs, command line interfaces
Domain portals: PDACS, kBase, eMatter, FACE-IT
Services: registries (metadata, attributes); component & workflow repositories; workflow execution; data transfer, sync, and sharing; data publication & discovery
Integration layer: remote access protocols, authentication, authorization
Resources: utility compute system ("cloud"); HPC compute; parallel file system; DISC system; experimental facility; visualization system]
Architecture realization for APS experiments 
[Diagram: realization of the architecture for APS experiments, using external compute resources]
The Petrel research data service 
• High-speed, high-capacity data store: 1.7 PB GPFS store, served by 32 I/O nodes with GridFTP
• Seamless integration with the data fabric: reachable from other sites, facilities, and colleagues via globus.org
• Project-focused and self-managed: 100 TB allocations with user-managed access
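To illustrate how data might move to or from a Petrel allocation programmatically, here is a sketch using the Globus Python SDK. The TransferClient/TransferData calls are the SDK's real interface, but the endpoint UUIDs, paths, and token acquisition are placeholders.

```python
# Sketch: move a dataset from a beamline endpoint to a Petrel allocation with
# the Globus Python SDK. Endpoint UUIDs, paths, and the token are placeholders;
# obtaining the token via a Globus OAuth2 login flow is omitted.
import globus_sdk

TRANSFER_TOKEN = "..."                       # from a Globus OAuth2 login flow
SRC = "UUID-OF-BEAMLINE-ENDPOINT"            # placeholder endpoint IDs
DST = "UUID-OF-PETREL-ENDPOINT"

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN))

tdata = globus_sdk.TransferData(tc, SRC, DST,
                                label="APS scan -> Petrel",
                                sync_level="checksum")   # verify integrity
tdata.add_item("/data/scan_0421/", "/project/scan_0421/", recursive=True)

task = tc.submit_transfer(tdata)             # Globus retries and notifies
print("task id:", task["task_id"])
```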
Managing the research data lifecycle with Globus services (www.globus.org)
1: PI initiates a transfer request at the experimental facility, or the request is made automatically by a script or science gateway.
2: Globus transfers the files reliably, securely, and rapidly to a compute facility.
3: PI selects files to share, selects a user or group, and sets access permissions. Globus controls access to the shared files on existing storage; there is no need to move files to cloud storage.
4: A researcher logs in to Globus and accesses the shared files; no local account is required; download via Globus to a personal computer.
5: The researcher assembles a data set and describes it using metadata (Dublin Core and domain-specific).
6: A curator reviews and approves; the data set is published on a campus or other system (publication repository).
7: Peers and collaborators search for and discover datasets, then transfer and share them using Globus.
• SaaS: only a web browser is required
• Access using your campus credentials
• Globus monitors and notifies throughout
[Diagram: the steps span transfer, sharing, publication, and discovery.]
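Step 3 (sharing) can likewise be scripted. Below is a sketch that reuses the TransferClient tc from the previous example; add_endpoint_acl_rule is a real TransferClient method, but the endpoint and identity UUIDs are placeholders.

```python
# Sketch: grant a collaborator read access to a shared folder on a Globus
# endpoint. Endpoint and identity UUIDs are placeholders; `tc` is the
# TransferClient constructed in the previous example.
rule = {
    "DATA_TYPE": "access",
    "principal_type": "identity",
    "principal": "IDENTITY-UUID-OF-COLLABORATOR",
    "path": "/project/scan_0421/",
    "permissions": "r",                      # read-only
}
result = tc.add_endpoint_acl_rule("UUID-OF-SHARED-ENDPOINT", rule)
print("access rule id:", result["access_id"])
```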
[Map: transfers from a single APS storage system to 119 destinations]
Tying it all together: a basic energy sciences cyberinfrastructure
[Diagram: researchers and external collaborators interact with storage locations, compute facilities, script libraries, and collaboration catalogs (provenance, files & metadata)]
0: Develop or reuse a script
1: Run the script (EL1.layer)
2: Look up the file (name=EL1.layer, user=Anton, type=reconstruction)
3: Transfer inputs
4: Run the app
5: Transfer results
6: Update the catalogs
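Schematically, steps 1–6 might be scripted as below. The catalog and compute clients are hypothetical placeholders (no specific library is implied); only the Globus transfer calls mirror the real SDK shown earlier.

```python
# Schematic of steps 1-6 above, not a real implementation: `catalog` and
# `compute` are hypothetical client objects, and COMPUTE_EP is a placeholder
# endpoint UUID. Only the Globus transfer calls mirror the real SDK.
import globus_sdk

COMPUTE_EP = "UUID-OF-COMPUTE-FACILITY-ENDPOINT"

def run_pipeline(tc, catalog, compute):
    # 2: look up the input file in the collaboration catalog
    entry = catalog.lookup(name="EL1.layer", user="Anton",
                           type="reconstruction")           # hypothetical API

    # 3: transfer inputs from their storage location to the compute facility
    tdata = globus_sdk.TransferData(tc, entry["endpoint"], COMPUTE_EP)
    tdata.add_item(entry["path"], "/scratch/EL1.layer")
    tc.submit_transfer(tdata)

    # 4: run the app on the compute facility
    job = compute.submit(app="reconstruct",
                         inputs=["/scratch/EL1.layer"])     # hypothetical API
    job.wait()

    # 5: transfer results back (symmetric to step 3, omitted)

    # 6: update the catalogs with outputs and provenance
    catalog.record(outputs=job.outputs, provenance=job.id)  # hypothetical API
```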
Towards a science of workflow performance 
Project: "Robust Analytical Modeling for Science at Extreme Scales"
• Develop, evaluate, and refine component and end-to-end models: models from the literature; fluid models for network flows; the SKOPE modeling system
• Develop and apply data-driven estimation methods: differential regression; surrogate models; other methods from the literature
• Run automated experiments to test models and build a database: experiment design; testbeds
• Develop easy-to-use tools that give end users actionable advice: a runtime advisor integrated with Globus services
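As a toy illustration of the data-driven estimation idea (not the project's differential-regression method), one could fit a surrogate model to historical transfer logs and query it for predictions; the features and training data below are invented.

```python
# Toy surrogate model for workflow performance: fit a regressor to invented
# "transfer log" data and predict new transfer times. Purely illustrative;
# the actual estimation methods (differential regression, etc.) differ.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
size_gb = rng.uniform(1, 500, 200)            # invented training data
streams = rng.integers(1, 16, 200)
observed_s = size_gb * 8 / (0.8 * streams) + rng.normal(0, 5, 200)

X = np.column_stack([size_gb, streams])
model = GradientBoostingRegressor().fit(X, observed_s)

print("predicted seconds for 100 GB over 8 streams:",
      model.predict([[100, 8]])[0])
```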
Discovery engines for energy science 
Scientific opportunities:
• Probe material structure and function at unprecedented scales
Technical challenges:
• Many experimental modalities
• Data rates and computation needs vary widely and are increasing
• Knowledge management, integration, and synthesis
New methods demand rapid access to large amounts of data and computing, spanning batch to immediate analysis and 0.001 to 100+ PFlops: precomputing material databases, reconstructing images, auto-correlation, feature detection.
[Diagram: science automation services (scripting, security, storage, cataloging, transfer) connect data analysis (reconstruct, detect features, auto-correlate, compute particle distributions, …), simulation (characterize, predict, assimilate, steer data acquisition), and integration (optimize, fit, configure, check, guide). Today's flows: ~0.001–0.5 GB/s per flow, ~2 GB/s total burst, ~200 TB/month, ~10 concurrent flows; expected to grow 10x in five years.]
Next steps 
• From six beamlines to 60 beamlines
• From 60 facility users to 6000 facility users
• From one lab to all labs
• From data management and analysis to knowledge management, integration, and analysis
• From per-user to per-discipline (and trans-discipline) data repositories, publication, and discovery
• From terabytes to petabytes
• From three months to three hours to build pipelines
• From intuitive to analytical understanding of systems

More Related Content

What's hot

Virtual Science in the Cloud
Virtual Science in the CloudVirtual Science in the Cloud
Virtual Science in the Cloudthetfoot
 
CHASE-CI: A Distributed Big Data Machine Learning Platform
CHASE-CI: A Distributed Big Data Machine Learning PlatformCHASE-CI: A Distributed Big Data Machine Learning Platform
CHASE-CI: A Distributed Big Data Machine Learning PlatformLarry Smarr
 
Cloud com foster december 2010
Cloud com foster december 2010Cloud com foster december 2010
Cloud com foster december 2010Ian Foster
 
Data Automation at Light Sources
Data Automation at Light SourcesData Automation at Light Sources
Data Automation at Light SourcesIan Foster
 
Sgg crest-presentation-final
Sgg crest-presentation-finalSgg crest-presentation-final
Sgg crest-presentation-finalmarpierc
 
How HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceHow HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceinside-BigData.com
 
PNNL April 2011 ogce
PNNL April 2011 ogcePNNL April 2011 ogce
PNNL April 2011 ogcemarpierc
 
Introduction to Biological Network Analysis and Visualization with Cytoscape ...
Introduction to Biological Network Analysis and Visualization with Cytoscape ...Introduction to Biological Network Analysis and Visualization with Cytoscape ...
Introduction to Biological Network Analysis and Visualization with Cytoscape ...Keiichiro Ono
 
Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013Ian Foster
 
Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...Anubhav Jain
 
Automating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomateAutomating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomateAnubhav Jain
 
Workshop: Introduction to Cytoscape at UT-KBRIN Bioinformatics Summit 2014 (4...
Workshop: Introduction to Cytoscape at UT-KBRIN Bioinformatics Summit 2014 (4...Workshop: Introduction to Cytoscape at UT-KBRIN Bioinformatics Summit 2014 (4...
Workshop: Introduction to Cytoscape at UT-KBRIN Bioinformatics Summit 2014 (4...Keiichiro Ono
 
Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Maki...
Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Maki...Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Maki...
Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Maki...Eran Chinthaka Withana
 
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV Data
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV DataThe DuraMat Data Hub and Analytics Capability: A Resource for Solar PV Data
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV DataAnubhav Jain
 
User Inspired Management of Scientific Jobs in Grids and Clouds
User Inspired Management of Scientific Jobs in Grids and CloudsUser Inspired Management of Scientific Jobs in Grids and Clouds
User Inspired Management of Scientific Jobs in Grids and CloudsEran Chinthaka Withana
 
Big Process for Big Data @ NASA
Big Process for Big Data @ NASABig Process for Big Data @ NASA
Big Process for Big Data @ NASAIan Foster
 
Toward a National Research Platform
Toward a National Research PlatformToward a National Research Platform
Toward a National Research PlatformLarry Smarr
 
Overview of the W3C Semantic Sensor Network (SSN) ontology
Overview of the W3C Semantic Sensor Network (SSN) ontologyOverview of the W3C Semantic Sensor Network (SSN) ontology
Overview of the W3C Semantic Sensor Network (SSN) ontologyRaúl García Castro
 
SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...
SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...
SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...aceas13tern
 

What's hot (20)

Virtual Science in the Cloud
Virtual Science in the CloudVirtual Science in the Cloud
Virtual Science in the Cloud
 
CHASE-CI: A Distributed Big Data Machine Learning Platform
CHASE-CI: A Distributed Big Data Machine Learning PlatformCHASE-CI: A Distributed Big Data Machine Learning Platform
CHASE-CI: A Distributed Big Data Machine Learning Platform
 
Cloud com foster december 2010
Cloud com foster december 2010Cloud com foster december 2010
Cloud com foster december 2010
 
Data Automation at Light Sources
Data Automation at Light SourcesData Automation at Light Sources
Data Automation at Light Sources
 
Sgg crest-presentation-final
Sgg crest-presentation-finalSgg crest-presentation-final
Sgg crest-presentation-final
 
How HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceHow HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental science
 
PNNL April 2011 ogce
PNNL April 2011 ogcePNNL April 2011 ogce
PNNL April 2011 ogce
 
Introduction to Biological Network Analysis and Visualization with Cytoscape ...
Introduction to Biological Network Analysis and Visualization with Cytoscape ...Introduction to Biological Network Analysis and Visualization with Cytoscape ...
Introduction to Biological Network Analysis and Visualization with Cytoscape ...
 
Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013
 
Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...
 
Automating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomateAutomating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomate
 
Workshop: Introduction to Cytoscape at UT-KBRIN Bioinformatics Summit 2014 (4...
Workshop: Introduction to Cytoscape at UT-KBRIN Bioinformatics Summit 2014 (4...Workshop: Introduction to Cytoscape at UT-KBRIN Bioinformatics Summit 2014 (4...
Workshop: Introduction to Cytoscape at UT-KBRIN Bioinformatics Summit 2014 (4...
 
Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Maki...
Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Maki...Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Maki...
Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Maki...
 
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV Data
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV DataThe DuraMat Data Hub and Analytics Capability: A Resource for Solar PV Data
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV Data
 
Research Objects in Wf4Ever
Research Objects in Wf4EverResearch Objects in Wf4Ever
Research Objects in Wf4Ever
 
User Inspired Management of Scientific Jobs in Grids and Clouds
User Inspired Management of Scientific Jobs in Grids and CloudsUser Inspired Management of Scientific Jobs in Grids and Clouds
User Inspired Management of Scientific Jobs in Grids and Clouds
 
Big Process for Big Data @ NASA
Big Process for Big Data @ NASABig Process for Big Data @ NASA
Big Process for Big Data @ NASA
 
Toward a National Research Platform
Toward a National Research PlatformToward a National Research Platform
Toward a National Research Platform
 
Overview of the W3C Semantic Sensor Network (SSN) ontology
Overview of the W3C Semantic Sensor Network (SSN) ontologyOverview of the W3C Semantic Sensor Network (SSN) ontology
Overview of the W3C Semantic Sensor Network (SSN) ontology
 
SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...
SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...
SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...
 

Similar to Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences

eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodDuncan Hull
 
Scientific
Scientific Scientific
Scientific marpierc
 
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...Databricks
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and KnowledgeIan Foster
 
Accelerating data-intensive science by outsourcing the mundane
Accelerating data-intensive science by outsourcing the mundaneAccelerating data-intensive science by outsourcing the mundane
Accelerating data-intensive science by outsourcing the mundaneIan Foster
 
RAMSES: Robust Analytic Models for Science at Extreme Scales
RAMSES: Robust Analytic Models for Science at Extreme ScalesRAMSES: Robust Analytic Models for Science at Extreme Scales
RAMSES: Robust Analytic Models for Science at Extreme ScalesIan Foster
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for ScienceIan Foster
 
Integrated research data management in the Structural Sciences
Integrated research data management in the Structural SciencesIntegrated research data management in the Structural Sciences
Integrated research data management in the Structural SciencesManjulaPatel
 
Integrating scientific laboratories into the cloud
Integrating scientific laboratories into the cloudIntegrating scientific laboratories into the cloud
Integrating scientific laboratories into the cloudData Finder
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22marpierc
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsGaignard Alban
 
Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009Ian Foster
 
Linking Scientific Instruments and Computation
Linking Scientific Instruments and ComputationLinking Scientific Instruments and Computation
Linking Scientific Instruments and ComputationIan Foster
 
Services For Science April 2009
Services For Science April 2009Services For Science April 2009
Services For Science April 2009Ian Foster
 
247th ACS Meeting: The Eureka Research Workbench
247th ACS Meeting: The Eureka Research Workbench247th ACS Meeting: The Eureka Research Workbench
247th ACS Meeting: The Eureka Research WorkbenchStuart Chalk
 
Many Task Applications for Grids and Supercomputers
Many Task Applications for Grids and SupercomputersMany Task Applications for Grids and Supercomputers
Many Task Applications for Grids and SupercomputersIan Foster
 
GlobusWorld 2019 Opening Keynote
GlobusWorld 2019 Opening KeynoteGlobusWorld 2019 Opening Keynote
GlobusWorld 2019 Opening KeynoteGlobus
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningAnubhav Jain
 
Empowering Transformational Science
Empowering Transformational ScienceEmpowering Transformational Science
Empowering Transformational ScienceChelle Gentemann
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
 

Similar to Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences (20)

eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
Scientific
Scientific Scientific
Scientific
 
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
 
Accelerating data-intensive science by outsourcing the mundane
Accelerating data-intensive science by outsourcing the mundaneAccelerating data-intensive science by outsourcing the mundane
Accelerating data-intensive science by outsourcing the mundane
 
RAMSES: Robust Analytic Models for Science at Extreme Scales
RAMSES: Robust Analytic Models for Science at Extreme ScalesRAMSES: Robust Analytic Models for Science at Extreme Scales
RAMSES: Robust Analytic Models for Science at Extreme Scales
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
 
Integrated research data management in the Structural Sciences
Integrated research data management in the Structural SciencesIntegrated research data management in the Structural Sciences
Integrated research data management in the Structural Sciences
 
Integrating scientific laboratories into the cloud
Integrating scientific laboratories into the cloudIntegrating scientific laboratories into the cloud
Integrating scientific laboratories into the cloud
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reports
 
Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009
 
Linking Scientific Instruments and Computation
Linking Scientific Instruments and ComputationLinking Scientific Instruments and Computation
Linking Scientific Instruments and Computation
 
Services For Science April 2009
Services For Science April 2009Services For Science April 2009
Services For Science April 2009
 
247th ACS Meeting: The Eureka Research Workbench
247th ACS Meeting: The Eureka Research Workbench247th ACS Meeting: The Eureka Research Workbench
247th ACS Meeting: The Eureka Research Workbench
 
Many Task Applications for Grids and Supercomputers
Many Task Applications for Grids and SupercomputersMany Task Applications for Grids and Supercomputers
Many Task Applications for Grids and Supercomputers
 
GlobusWorld 2019 Opening Keynote
GlobusWorld 2019 Opening KeynoteGlobusWorld 2019 Opening Keynote
GlobusWorld 2019 Opening Keynote
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data mining
 
Empowering Transformational Science
Empowering Transformational ScienceEmpowering Transformational Science
Empowering Transformational Science
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 

More from Ian Foster

Global Services for Global Science March 2023.pptx
Global Services for Global Science March 2023.pptxGlobal Services for Global Science March 2023.pptx
Global Services for Global Science March 2023.pptxIan Foster
 
The Earth System Grid Federation: Origins, Current State, Evolution
The Earth System Grid Federation: Origins, Current State, EvolutionThe Earth System Grid Federation: Origins, Current State, Evolution
The Earth System Grid Federation: Origins, Current State, EvolutionIan Foster
 
Better Information Faster: Programming the Continuum
Better Information Faster: Programming the ContinuumBetter Information Faster: Programming the Continuum
Better Information Faster: Programming the ContinuumIan Foster
 
ESnet6 and Smart Instruments
ESnet6 and Smart InstrumentsESnet6 and Smart Instruments
ESnet6 and Smart InstrumentsIan Foster
 
A Global Research Data Platform: How Globus Services Enable Scientific Discovery
A Global Research Data Platform: How Globus Services Enable Scientific DiscoveryA Global Research Data Platform: How Globus Services Enable Scientific Discovery
A Global Research Data Platform: How Globus Services Enable Scientific DiscoveryIan Foster
 
Foster CRA March 2022.pptx
Foster CRA March 2022.pptxFoster CRA March 2022.pptx
Foster CRA March 2022.pptxIan Foster
 
Big Data, Big Computing, AI, and Environmental Science
Big Data, Big Computing, AI, and Environmental ScienceBig Data, Big Computing, AI, and Environmental Science
Big Data, Big Computing, AI, and Environmental ScienceIan Foster
 
AI at Scale for Materials and Chemistry
AI at Scale for Materials and ChemistryAI at Scale for Materials and Chemistry
AI at Scale for Materials and ChemistryIan Foster
 
Coding the Continuum
Coding the ContinuumCoding the Continuum
Coding the ContinuumIan Foster
 
Data Tribology: Overcoming Data Friction with Cloud Automation
Data Tribology: Overcoming Data Friction with Cloud AutomationData Tribology: Overcoming Data Friction with Cloud Automation
Data Tribology: Overcoming Data Friction with Cloud AutomationIan Foster
 
Research Automation for Data-Driven Discovery
Research Automation for Data-Driven DiscoveryResearch Automation for Data-Driven Discovery
Research Automation for Data-Driven DiscoveryIan Foster
 
Scaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and JupyterScaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and JupyterIan Foster
 
Team Argon Summary
Team Argon SummaryTeam Argon Summary
Team Argon SummaryIan Foster
 
Thoughts on interoperability
Thoughts on interoperabilityThoughts on interoperability
Thoughts on interoperabilityIan Foster
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...Ian Foster
 
NIH Data Commons Architecture Ideas
NIH Data Commons Architecture IdeasNIH Data Commons Architecture Ideas
NIH Data Commons Architecture IdeasIan Foster
 
Going Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCFGoing Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCFIan Foster
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...Ian Foster
 
Software Infrastructure for a National Research Platform
Software Infrastructure for a National Research PlatformSoftware Infrastructure for a National Research Platform
Software Infrastructure for a National Research PlatformIan Foster
 
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...Ian Foster
 

More from Ian Foster (20)

Global Services for Global Science March 2023.pptx
Global Services for Global Science March 2023.pptxGlobal Services for Global Science March 2023.pptx
Global Services for Global Science March 2023.pptx
 
The Earth System Grid Federation: Origins, Current State, Evolution
The Earth System Grid Federation: Origins, Current State, EvolutionThe Earth System Grid Federation: Origins, Current State, Evolution
The Earth System Grid Federation: Origins, Current State, Evolution
 
Better Information Faster: Programming the Continuum
Better Information Faster: Programming the ContinuumBetter Information Faster: Programming the Continuum
Better Information Faster: Programming the Continuum
 
ESnet6 and Smart Instruments
ESnet6 and Smart InstrumentsESnet6 and Smart Instruments
ESnet6 and Smart Instruments
 
A Global Research Data Platform: How Globus Services Enable Scientific Discovery
A Global Research Data Platform: How Globus Services Enable Scientific DiscoveryA Global Research Data Platform: How Globus Services Enable Scientific Discovery
A Global Research Data Platform: How Globus Services Enable Scientific Discovery
 
Foster CRA March 2022.pptx
Foster CRA March 2022.pptxFoster CRA March 2022.pptx
Foster CRA March 2022.pptx
 
Big Data, Big Computing, AI, and Environmental Science
Big Data, Big Computing, AI, and Environmental ScienceBig Data, Big Computing, AI, and Environmental Science
Big Data, Big Computing, AI, and Environmental Science
 
AI at Scale for Materials and Chemistry
AI at Scale for Materials and ChemistryAI at Scale for Materials and Chemistry
AI at Scale for Materials and Chemistry
 
Coding the Continuum
Coding the ContinuumCoding the Continuum
Coding the Continuum
 
Data Tribology: Overcoming Data Friction with Cloud Automation
Data Tribology: Overcoming Data Friction with Cloud AutomationData Tribology: Overcoming Data Friction with Cloud Automation
Data Tribology: Overcoming Data Friction with Cloud Automation
 
Research Automation for Data-Driven Discovery
Research Automation for Data-Driven DiscoveryResearch Automation for Data-Driven Discovery
Research Automation for Data-Driven Discovery
 
Scaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and JupyterScaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and Jupyter
 
Team Argon Summary
Team Argon SummaryTeam Argon Summary
Team Argon Summary
 
Thoughts on interoperability
Thoughts on interoperabilityThoughts on interoperability
Thoughts on interoperability
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
 
NIH Data Commons Architecture Ideas
NIH Data Commons Architecture IdeasNIH Data Commons Architecture Ideas
NIH Data Commons Architecture Ideas
 
Going Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCFGoing Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCF
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
 
Software Infrastructure for a National Research Platform
Software Infrastructure for a National Research PlatformSoftware Infrastructure for a National Research Platform
Software Infrastructure for a National Research Platform
 
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
 

Recently uploaded

Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPirithiRaju
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Monika Rani
 
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedDelhi Call girls
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...chandars293
 
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑Damini Dixit
 
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)AkefAfaneh2
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bSérgio Sacani
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.Nitya salvi
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)Areesha Ahmad
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY1301aanya
 
Introduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptxIntroduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptxBhagirath Gogikar
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxSuji236384
 
Unit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 oUnit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 oManavSingh202607
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsSérgio Sacani
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptxAlMamun560346
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryAlex Henderson
 

Recently uploaded (20)

Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
 
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
 
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
 
Introduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptxIntroduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptx
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
 
Unit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 oUnit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 o
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 

Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences

  • 1. Discovery Engines for Big Data Accelerating Discovery in Basic Energy Sciences Ian Foster Argonne National Laboratory Joint work with Ray Osborn, Guy Jennings, Jon Almer, Hemant Sharma, Mike Wilde, Justin Wozniak, Rachana Ananthakrishnan, Ben Blaiszik, and many others Work supported by Argonne LDRD
  • 2. Motivating example: Disordered structures “Most of materials science is bottlenecked by disordered structures” Atomic disorder plays an important role in controlling the bulk properties of complex materials, for example:  Colossal magnetoresistance  Unconventional superconductivity  Ferroelectric relaxor behavior  Fast-ion conduction  And many many others! We want a systematic understanding of the relationships between material composition, temperature, structure, and other properties 2
  • 3. A role for both experiment and simulation Experiment: Observe (indirect) properties of real structures E.g., single crystal diffuse scattering at Advanced Photon Source Sample Experimental Simulation: Compute properties of potential structures E.g., DISCUS simulated diffuse scattering; molecular dynamics for structures 3 Material composition Simulated structure Simulated scattering La 60% Sr 40% scattering
  • 4. Opportunity: Integrate experiment & simulation Experiments can explain and guide simulations – E.g., guide experiments via evolutionary optimization Simulations can explain and guide experiments – E.g., identify temperature regimes in which more data is needed Experimental Sample scattering 4 Material composition Simulated structure Simulated scattering La 60% Sr 40% Experimental Sample sca ering Material composi on Simulated structure Simulated sca ering Detect errors (secs—mins) Knowledge base Past experiments; simula ons; literature; expert knowledge Select experiments (mins—hours) Contribute to knowledge base Simula ons driven by experiments (mins—days) Knowledge-driven decision making Evolu onary op miza on
  • 5. Opportunity: Link experiment, simulation, and data analytics to create a discovery engine Experimental Sample sca ering Material composi on Simulated structure Simulated sca ering La 60% Sr 40% Detect errors (secs—mins) Knowledge base Past experiments; simula ons; literature; expert knowledge Select experiments (mins—hours) Contribute to knowledge base Simula ons driven by experiments (mins—days) Knowledge-driven decision making Evolu onary op miza on
  • 6. Opportunities for discovery acceleration in energy sciences are numerous and span DOE facilities New science processes 6 Grazing incidence small angle x-ray scattering Directed self assembly (Nealey, Ferrier, De Pablo, et al.). 8 6-ID Single crystal diffuse scattering Defect structure in disordered materials (Osborn et al.) High-energy 1-ID x-ray diffraction microscopy Microstructure in bulk materials (Almer, Sharma, et al.) More data New analysis methods Common themes Large amounts of data New mathematical and numerical methods Statistical and machine learning methods Rapid reconstruction and analysis Large-scale parallel computation End-to-end automation Data management and provenance (Examples)
  • 7. Parallel pipeline enables real-time analysis of diffuse scattering data, plus offline DIFFEV fitting DIFFEV step Use simulation and evolutionary algorithm to determine crystal config that can produce scattering image
  • 8. Accelerating mapping of materials microstructure with high energy diffraction microscopy (HEDM) 8 Top: Grains in a 0.79 mm3 volume of a copper wire. Bottom: Tensile deformation of a copper wire when the wire is pulled. (J. Almer)
  • 9. Parallel pipeline enables immediate assessment of alignment quality in high-energy diffraction microscopy 9 Blue Gene/Q Orthros (All data in NFS) 3: Generate Parameters FOP.c 50 tasks 25s/task ¼ CPU hours Uses Swift/K Dataset 360 files 4 GB total 1: Median calc 75s (90% I/O) MedianImage.c Uses Swift/K 2: Peak Search 15s per file ImageProcessing.c Uses Swift/K Reduced Dataset 360 files 5 MB total feedback to experiment Detector 4: Analysis Pass FitOrientation.c 60s/task (PC) 1667 CPU hours 60s/task (BG/Q) 1667 CPU hours Uses Swift/T GO Transfer Up to 2.2 M CPU hours per week! ssh Globus Catalog Scientific Metadata Workflow Workflow Progress Control Script Bash Manual This is a single workflow 3: Convert bin L to N 2 min for all files, convert files to Network Endian format Before After Hemant Sharma, Justin Wozniak, Mike Wilde, Jon Almer
  • 10. Big data staging with MPI-IO enables interactive analysis of IBM BG/Q supercomputer Justin Wozniak
  • 11. New data, computational capabilities, and methods create opportunities and challenges Integrate data movement, management, workflow, and computation to accelerate data-driven applications 11 Integrate statistics/machine learning to assess many models and calibrate them against `all' relevant data New computer facilities enable on-demand computing and high-speed analysis of large quantities of data Applications Algorithms Environments Infrastructure Infrastructure Facilities
  • 12. Towards a lab-wide (and DOE-wide) data architecture and facility. Users: researchers, system administrators, collaborators, students. Interfaces: web interfaces, REST APIs, command-line interfaces. Services: domain portals (PDACS, kBase, eMatter, FACE-IT); workflow execution; data transfer, sync, and sharing; data publication and discovery; registries of metadata and attributes; component and workflow repositories. Integration layer: remote access protocols, authentication, authorization. Resources: utility compute system ("cloud"), HPC compute, parallel file system, DISC system, visualization system, experimental facility. (Built up across slides 12-16.)
  • 17. Architecture realization for APS experiments. [Diagram, showing links to external compute resources.]
  • 18. The Petrel research data service: a high-speed, high-capacity data store; seamless integration with the data fabric; project-focused and self-managed. Hardware: 32 I/O nodes with GridFTP and a 1.7 PB GPFS store. Projects receive 100 TB allocations with user-managed access via globus.org, serving other sites, facilities, and colleagues.
  • 19. Managing the research data lifecycle with Globus services (www.globus.org; booth 3649). Transfer: (1) the PI initiates a transfer request from the experimental facility, or it is requested automatically by a script or science gateway; (2) Globus transfers files reliably, securely, and rapidly to a compute facility. Sharing: (3) the PI selects files to share, selects a user or group, and sets access permissions; (4) Globus controls access to the shared files on existing storage, with no need to move files to cloud storage; (5) a researcher logs in to Globus and accesses the shared files from a personal computer, with no local account required, downloading via Globus. Publication: (6) the researcher assembles a data set and describes it with metadata (Dublin Core and domain-specific); (7) a curator reviews and approves, and the data set is published on a campus or other repository. Discovery: (8) peers and collaborators search for and discover datasets, then transfer and share them using Globus. SaaS: only a web browser is required; access uses your campus credentials; Globus monitors and notifies throughout. (Built up across slides 19-21.)
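A hedged sketch of steps 1-4 using the Globus Python SDK (globus_sdk). The endpoint UUIDs, paths, and token handling are placeholders, and exact calls may vary across globus_sdk versions.

```python
import globus_sdk

TOKEN = "..."                       # obtained via a Globus OAuth2 flow
APS_EP = "aps-endpoint-uuid"        # hypothetical experimental facility endpoint
HPC_EP = "hpc-endpoint-uuid"        # hypothetical compute facility endpoint

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TOKEN))

# Steps 1-2: initiate a reliable, monitored transfer of a scan directory.
tdata = globus_sdk.TransferData(tc, APS_EP, HPC_EP,
                                label="APS scan 42", sync_level="checksum")
tdata.add_item("/data/scan42/", "/project/scan42/", recursive=True)
task = tc.submit_transfer(tdata)
print("transfer task:", task["task_id"])

# Steps 3-4: grant a collaborator read access to the shared directory,
# without moving the files anywhere.
rule = {"DATA_TYPE": "access",
        "principal_type": "identity",
        "principal": "collaborator-identity-uuid",
        "path": "/project/scan42/",
        "permissions": "r"}
tc.add_endpoint_acl_rule(HPC_EP, rule)
```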
  • 22. Transfers from a single APS storage system (to 119 destinations).
  • 24-31. Tying it all together: a basic energy sciences cyberinfrastructure (shown as a progressive build across eight slides; a sketch of the full loop follows below). Components: storage locations, compute facilities, and collaboration catalogs (provenance, files & metadata, script libraries), used by researchers and external collaborators. Steps: (0) develop or reuse a script; (1) run the script (EL1.layer); (2) look up the file in the catalog (name=EL1.layer, user=Anton, type=reconstruction); (3) transfer inputs to a compute facility; (4) run the application; (5) transfer results back to storage; (6) update the catalogs, making results available to researchers and external collaborators.
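The sketch promised above: the six-step loop as plain Python control flow. Every function here (catalog_lookup, transfer, run_app, update_catalog) is a hypothetical stand-in for the real services (Globus Catalog, Globus transfer, Swift execution); this shows the control flow, not the system.

```python
def catalog_lookup(name, user, type_):
    # Step 2: resolve a logical name to a physical storage location.
    return {"endpoint": "storage-ep", "path": f"/data/{user}/{name}"}

def transfer(src, dst):
    # Steps 3 and 5: reliable file movement between sites.
    print(f"transfer {src} -> {dst}")

def run_app(inputs):
    # Step 4: execute the analysis application on a compute facility.
    print(f"run reconstruction on {inputs}")
    return "/scratch/EL1.out"

def update_catalog(entry):
    # Step 6: record outputs, metadata, and provenance for collaborators.
    print(f"catalog += {entry}")

# Steps 0-1: a researcher develops (or reuses) and runs the script.
rec = catalog_lookup("EL1.layer", user="Anton", type_="reconstruction")
transfer(f"{rec['endpoint']}:{rec['path']}", "compute-ep:/scratch/EL1.layer")
result = run_app("/scratch/EL1.layer")
transfer(f"compute-ep:{result}", "storage-ep:/data/Anton/EL1.out")
update_catalog({"name": "EL1.out", "derived_from": "EL1.layer"})
```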
  • 32. Towards a science of workflow performance ("Robust Analytical Modeling for Science at Extreme Scales"). Develop, evaluate, and refine component and end-to-end models: models from the literature, fluid models for network flows, the SKOPE modeling system. Develop and apply data-driven estimation methods: differential regression, surrogate models (see the sketch below), and other methods from the literature. Run automated experiments to test models and build a database: experiment design, testbeds. Develop easy-to-use tools that give end users actionable advice: a runtime advisor, integrated with Globus services.
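As one concrete instance of the surrogate-model idea referenced above, here is a hedged example that fits a Gaussian process to synthetic "log" data and predicts a runtime with an error bar; a real advisor would train on measurements from actual transfers and runs.

```python
# Fit a cheap statistical surrogate to logged workflow measurements and
# use it to answer "what if" questions with uncertainty estimates.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(1)

# Features: [data size (GB), concurrent flows]; target: observed minutes.
X = rng.uniform([1, 1], [500, 10], size=(200, 2))
y = 0.5 * X[:, 0] / X[:, 1] + rng.normal(0, 0.5, 200)   # synthetic "logs"

surrogate = GaussianProcessRegressor(normalize_y=True).fit(X, y)

# A runtime advisor could now predict: 200 GB over 4 concurrent flows.
pred, std = surrogate.predict([[200, 4]], return_std=True)
print(f"predicted {pred[0]:.1f} +/- {std[0]:.1f} minutes")
```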
  • 33. Discovery engines for energy science. Scientific opportunities: probe material structure and function at unprecedented scales. Technical challenges: many experimental modalities; data rates and computation needs that vary widely and are increasing; knowledge management, integration, and synthesis. New methods demand rapid access to large amounts of data and computing, spanning immediate and batch workloads from ~0.001 to 100+ PFlops (precompute material database, reconstruct images, auto-correlation, feature detection). Architecture: science automation services (scripting, security, storage, cataloging, transfer) link data analysis (reconstruct, detect features, auto-correlate, particle distributions, ...) with simulation (characterize, predict, assimilate, steer data acquisition) through an integration layer (optimize, fit, configure, check, guide). Scale today: ~0.001-0.5 GB/s per flow, ~2 GB/s total burst, ~200 TB/month, ~10 concurrent flows; expected to grow 10x in 5 years.
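A quick consistency check of these rates, assuming a 30-day month (the arithmetic is ours, not the slide's):

```latex
\[
\frac{200\,\text{TB/month}}{30 \times 86{,}400\,\text{s}}
  \approx 0.077\,\text{GB/s (sustained)},
\qquad
\frac{2\,\text{GB/s (burst)}}{0.077\,\text{GB/s (sustained)}}
  \approx 26\times
\]
```

That is, instruments write in bursts roughly 26x above the sustained average, which is why on-demand burst capacity matters more than raw monthly volume.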
  • 34. Next steps: from six beamlines to 60 beamlines; from 60 facility users to 6,000 facility users; from one lab to all labs; from data management and analysis to knowledge management, integration, and analysis; from per-user to per-discipline (and trans-discipline) data repositories, publication, and discovery; from terabytes to petabytes; from three months to three hours to build pipelines; from intuitive to analytical understanding of systems.

Editor's Notes

  1. Atomic disorder, both in the form of point defects and in the nanoscale self-organization that often accompanies them, plays an important role in controlling the bulk properties of complex materials.
  2. Use experiments to constrain models of material structure, and vice versa. Experiments: single-crystal diffuse scattering of, e.g., bilayer manganites, yielding pair distribution functions. Simulations: molecular dynamics for candidate structures, yielding simulated scattering and simulated pair distribution functions. Experiment genome! Simulation genome!
  3. Add CNM; add Tao beamlines.
  4. This slide shows the big picture.
  5. New types of computer systems enable high-speed data access, high-speed analysis, and on-demand computing. Integrated networking, data transfer, and security solutions enable ultra-rapid, secure communication among resources. New data and workflow services enable automation and provenance tracking for data-driven applications. Simple APIs enable rapid development of domain-specific tools, applications, and portals.
  10. We consider Petrel here to be “within APS”
  11. A uniform data fabric across the lab, with seamless access to large data, for use in computation, collaboration, and distribution; project-focused and self-managed; described and discoverable.
  12. This diagram shows the CMTS project's vision for a cyberinfrastructure that we believe will further enhance our scientific productivity, and the major components we are integrating. Here is how CMTS will use it and how it will help. (0) Develop script: a CMTS researcher goes to the project script library to locate existing scripts to run, or to find components from which to compose or adapt a new script. The library points to script codes managed by revision control systems like Git or Subversion, hosted on the many public and private servers that offer these services. The Swift parallel scripting language is central to our approach because it imparts a uniform, high-level interface to script components and is implicitly parallel. (1) Run script: when Swift runs a script, it automatically handles several previously difficult aspects of script development and execution: it manages parallel execution (dataflow, throttling, etc.); abstracts the interfaces to diverse and distributed clusters; automates data transfer; records provenance; and retries failing application runs. All of these would otherwise have to be programmed manually, if they were done at all. (2) Locate input files via the dataset catalog: CMTS scripts locate major datasets through a networked catalog, so scripts can be written with no dependencies on where datasets are located or replicated. (3) Transfer inputs: Swift automatically transports input datasets to the selected computational resource for an application run, if needed. (4) Run app: Swift then runs the application, retrying failures (if requested) and recording a "provenance log" that traces where the app ran, with what runtime and memory usage, and with what arguments and environment settings. (5) Transfer results: Swift automatically transports output datasets to the selected archival or temporary storage resource, if needed. (6) Update catalogs: Swift updates the CMTS collaboration catalogs with new dataset locations, derived metadata annotations on those datasets, and the provenance of the data and runs. Collaborate (two clicks): all of this facilitates collaboration, both by project team members and by external collaborators, whether across the hall or across the world.
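For illustration, a hedged sketch of what a single provenance-log record of the kind described in step (4) might contain; the field names are hypothetical illustrations, not Swift's actual log schema.

```python
# One hypothetical provenance record: where the app ran, with what
# runtime, memory usage, arguments, and environment settings.
import json, time

record = {
    "app": "reconstruct",
    "args": ["--layer", "EL1.layer", "--iterations", "200"],
    "site": "compute-facility-node",        # where the app ran (placeholder)
    "start": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    "runtime_s": 512.3,                     # observed wall-clock time
    "max_rss_mb": 1843,                     # peak memory usage
    "env": {"OMP_NUM_THREADS": "8"},        # environment settings
    "inputs": ["catalog://Anton/EL1.layer"],
    "outputs": ["catalog://Anton/EL1.out"],
    "exit_code": 0,
    "retries": 0,
}
print(json.dumps(record, indent=2))
```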
  13. This diagram show the CMTS project’s vision for a cyberinfrastructure that we believe will further enhance our scientific productivity. It shows the major components of the CMTS cyberinfrastructure we are integrating. Here’s how CMTS will use it and how it will help them. 0. Develop script A CMTS researcher will go to the project script library to locate existing scripts to run, or to find components from which to compose or adapt a new script. The library will point to script codes managed by revision control systems like Git or Subversion, and the many public and private servers that host these services. The Swift parallel scripting language is central to our approach, because it imparts a uniform, high level interface to script components, and its implicitly parallel. Run script When Swift runs your script, it automatically handles several previously difficult aspects of script development and execution for the scientist: 1) automatically manages parallel execution (dataflow, throttling, etc); 2) abstracts the interfaces to diverse and distributed clusters; 3) automates data transfer; 4) automatically records provenance; 5) retries failing application runs. All these would otherwise have to be programmed manually, if they were to be done at all. Locate input file locations via dataset catalog CMTS scripts will locate major datasets through a networked catalog that enables scripts to be written with no dependencies on where datasets are located or replicated. Transfer inputs Swift will automatically transport input datasets to the selected computational resource for an application run (if needed) Run app Swift will then run the application, retrying failures (if requested) and recording a “provenance log” that traces where the app ran, with what runtime and memory usage, and with what arguments and environmental settings. Transfer results Swift will automatically transport output datasets to the selected archival or temporary storage resource (if needed) Update catalogs …and it will update CMTS collaboration catalogs with new dataset locations, derived metadata annotations on those datasets, and the provenance of the data and runs. Collaborate! (2 clicks) All of this facilitates collaboration, both by project team members and by external collaborators – whether across the hall or across the world.
  14. This diagram show the CMTS project’s vision for a cyberinfrastructure that we believe will further enhance our scientific productivity. It shows the major components of the CMTS cyberinfrastructure we are integrating. Here’s how CMTS will use it and how it will help them. 0. Develop script A CMTS researcher will go to the project script library to locate existing scripts to run, or to find components from which to compose or adapt a new script. The library will point to script codes managed by revision control systems like Git or Subversion, and the many public and private servers that host these services. The Swift parallel scripting language is central to our approach, because it imparts a uniform, high level interface to script components, and its implicitly parallel. Run script When Swift runs your script, it automatically handles several previously difficult aspects of script development and execution for the scientist: 1) automatically manages parallel execution (dataflow, throttling, etc); 2) abstracts the interfaces to diverse and distributed clusters; 3) automates data transfer; 4) automatically records provenance; 5) retries failing application runs. All these would otherwise have to be programmed manually, if they were to be done at all. Locate input file locations via dataset catalog CMTS scripts will locate major datasets through a networked catalog that enables scripts to be written with no dependencies on where datasets are located or replicated. Transfer inputs Swift will automatically transport input datasets to the selected computational resource for an application run (if needed) Run app Swift will then run the application, retrying failures (if requested) and recording a “provenance log” that traces where the app ran, with what runtime and memory usage, and with what arguments and environmental settings. Transfer results Swift will automatically transport output datasets to the selected archival or temporary storage resource (if needed) Update catalogs …and it will update CMTS collaboration catalogs with new dataset locations, derived metadata annotations on those datasets, and the provenance of the data and runs. Collaborate! (2 clicks) All of this facilitates collaboration, both by project team members and by external collaborators – whether across the hall or across the world.
  15. This diagram show the CMTS project’s vision for a cyberinfrastructure that we believe will further enhance our scientific productivity. It shows the major components of the CMTS cyberinfrastructure we are integrating. Here’s how CMTS will use it and how it will help them. 0. Develop script A CMTS researcher will go to the project script library to locate existing scripts to run, or to find components from which to compose or adapt a new script. The library will point to script codes managed by revision control systems like Git or Subversion, and the many public and private servers that host these services. The Swift parallel scripting language is central to our approach, because it imparts a uniform, high level interface to script components, and its implicitly parallel. Run script When Swift runs your script, it automatically handles several previously difficult aspects of script development and execution for the scientist: 1) automatically manages parallel execution (dataflow, throttling, etc); 2) abstracts the interfaces to diverse and distributed clusters; 3) automates data transfer; 4) automatically records provenance; 5) retries failing application runs. All these would otherwise have to be programmed manually, if they were to be done at all. Locate input file locations via dataset catalog CMTS scripts will locate major datasets through a networked catalog that enables scripts to be written with no dependencies on where datasets are located or replicated. Transfer inputs Swift will automatically transport input datasets to the selected computational resource for an application run (if needed) Run app Swift will then run the application, retrying failures (if requested) and recording a “provenance log” that traces where the app ran, with what runtime and memory usage, and with what arguments and environmental settings. Transfer results Swift will automatically transport output datasets to the selected archival or temporary storage resource (if needed) Update catalogs …and it will update CMTS collaboration catalogs with new dataset locations, derived metadata annotations on those datasets, and the provenance of the data and runs. Collaborate! (2 clicks) All of this facilitates collaboration, both by project team members and by external collaborators – whether across the hall or across the world.
  16. This diagram show the CMTS project’s vision for a cyberinfrastructure that we believe will further enhance our scientific productivity. It shows the major components of the CMTS cyberinfrastructure we are integrating. Here’s how CMTS will use it and how it will help them. 0. Develop script A CMTS researcher will go to the project script library to locate existing scripts to run, or to find components from which to compose or adapt a new script. The library will point to script codes managed by revision control systems like Git or Subversion, and the many public and private servers that host these services. The Swift parallel scripting language is central to our approach, because it imparts a uniform, high level interface to script components, and its implicitly parallel. Run script When Swift runs your script, it automatically handles several previously difficult aspects of script development and execution for the scientist: 1) automatically manages parallel execution (dataflow, throttling, etc); 2) abstracts the interfaces to diverse and distributed clusters; 3) automates data transfer; 4) automatically records provenance; 5) retries failing application runs. All these would otherwise have to be programmed manually, if they were to be done at all. Locate input file locations via dataset catalog CMTS scripts will locate major datasets through a networked catalog that enables scripts to be written with no dependencies on where datasets are located or replicated. Transfer inputs Swift will automatically transport input datasets to the selected computational resource for an application run (if needed) Run app Swift will then run the application, retrying failures (if requested) and recording a “provenance log” that traces where the app ran, with what runtime and memory usage, and with what arguments and environmental settings. Transfer results Swift will automatically transport output datasets to the selected archival or temporary storage resource (if needed) Update catalogs …and it will update CMTS collaboration catalogs with new dataset locations, derived metadata annotations on those datasets, and the provenance of the data and runs. Collaborate! (2 clicks) All of this facilitates collaboration, both by project team members and by external collaborators – whether across the hall or across the world.
  17. This diagram show the CMTS project’s vision for a cyberinfrastructure that we believe will further enhance our scientific productivity. It shows the major components of the CMTS cyberinfrastructure we are integrating. Here’s how CMTS will use it and how it will help them. 0. Develop script A CMTS researcher will go to the project script library to locate existing scripts to run, or to find components from which to compose or adapt a new script. The library will point to script codes managed by revision control systems like Git or Subversion, and the many public and private servers that host these services. The Swift parallel scripting language is central to our approach, because it imparts a uniform, high level interface to script components, and its implicitly parallel. Run script When Swift runs your script, it automatically handles several previously difficult aspects of script development and execution for the scientist: 1) automatically manages parallel execution (dataflow, throttling, etc); 2) abstracts the interfaces to diverse and distributed clusters; 3) automates data transfer; 4) automatically records provenance; 5) retries failing application runs. All these would otherwise have to be programmed manually, if they were to be done at all. Locate input file locations via dataset catalog CMTS scripts will locate major datasets through a networked catalog that enables scripts to be written with no dependencies on where datasets are located or replicated. Transfer inputs Swift will automatically transport input datasets to the selected computational resource for an application run (if needed) Run app Swift will then run the application, retrying failures (if requested) and recording a “provenance log” that traces where the app ran, with what runtime and memory usage, and with what arguments and environmental settings. Transfer results Swift will automatically transport output datasets to the selected archival or temporary storage resource (if needed) Update catalogs …and it will update CMTS collaboration catalogs with new dataset locations, derived metadata annotations on those datasets, and the provenance of the data and runs. Collaborate! (2 clicks) All of this facilitates collaboration, both by project team members and by external collaborators – whether across the hall or across the world.
  18. This diagram show the CMTS project’s vision for a cyberinfrastructure that we believe will further enhance our scientific productivity. It shows the major components of the CMTS cyberinfrastructure we are integrating. Here’s how CMTS will use it and how it will help them. 0. Develop script A CMTS researcher will go to the project script library to locate existing scripts to run, or to find components from which to compose or adapt a new script. The library will point to script codes managed by revision control systems like Git or Subversion, and the many public and private servers that host these services. The Swift parallel scripting language is central to our approach, because it imparts a uniform, high level interface to script components, and its implicitly parallel. Run script When Swift runs your script, it automatically handles several previously difficult aspects of script development and execution for the scientist: 1) automatically manages parallel execution (dataflow, throttling, etc); 2) abstracts the interfaces to diverse and distributed clusters; 3) automates data transfer; 4) automatically records provenance; 5) retries failing application runs. All these would otherwise have to be programmed manually, if they were to be done at all. Locate input file locations via dataset catalog CMTS scripts will locate major datasets through a networked catalog that enables scripts to be written with no dependencies on where datasets are located or replicated. Transfer inputs Swift will automatically transport input datasets to the selected computational resource for an application run (if needed) Run app Swift will then run the application, retrying failures (if requested) and recording a “provenance log” that traces where the app ran, with what runtime and memory usage, and with what arguments and environmental settings. Transfer results Swift will automatically transport output datasets to the selected archival or temporary storage resource (if needed) Update catalogs …and it will update CMTS collaboration catalogs with new dataset locations, derived metadata annotations on those datasets, and the provenance of the data and runs. Collaborate! (2 clicks) All of this facilitates collaboration, both by project team members and by external collaborators – whether across the hall or across the world.
  19. This diagram show the CMTS project’s vision for a cyberinfrastructure that we believe will further enhance our scientific productivity. It shows the major components of the CMTS cyberinfrastructure we are integrating. Here’s how CMTS will use it and how it will help them. 0. Develop script A CMTS researcher will go to the project script library to locate existing scripts to run, or to find components from which to compose or adapt a new script. The library will point to script codes managed by revision control systems like Git or Subversion, and the many public and private servers that host these services. The Swift parallel scripting language is central to our approach, because it imparts a uniform, high level interface to script components, and its implicitly parallel. Run script When Swift runs your script, it automatically handles several previously difficult aspects of script development and execution for the scientist: 1) automatically manages parallel execution (dataflow, throttling, etc); 2) abstracts the interfaces to diverse and distributed clusters; 3) automates data transfer; 4) automatically records provenance; 5) retries failing application runs. All these would otherwise have to be programmed manually, if they were to be done at all. Locate input file locations via dataset catalog CMTS scripts will locate major datasets through a networked catalog that enables scripts to be written with no dependencies on where datasets are located or replicated. Transfer inputs Swift will automatically transport input datasets to the selected computational resource for an application run (if needed) Run app Swift will then run the application, retrying failures (if requested) and recording a “provenance log” that traces where the app ran, with what runtime and memory usage, and with what arguments and environmental settings. Transfer results Swift will automatically transport output datasets to the selected archival or temporary storage resource (if needed) Update catalogs …and it will update CMTS collaboration catalogs with new dataset locations, derived metadata annotations on those datasets, and the provenance of the data and runs. Collaborate! (2 clicks) All of this facilitates collaboration, both by project team members and by external collaborators – whether across the hall or across the world.