Argonne’s Discovery Engines for Big Data project is working to enable new research modalities based on the integration of advanced computing with experiments at facilities such as the Advanced Photon Source (APS). I review science drivers and initial results in diffuse scattering, high-energy diffraction microscopy, tomography, and ptychography. I also describe the computational methods and infrastructure that we leverage to support such applications, including the Petrel online data store, ALCF supercomputers, Globus research data management services, and Swift parallel scripting. This work points to a future in which tight integration of DOE’s experimental and computational facilities enables both new science and more efficient, rapid discovery.
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
1. Discovery Engines for Big Data
Accelerating Discovery in Basic Energy Sciences
Ian Foster
Argonne National Laboratory
Joint work with Ray Osborn, Guy Jennings, Jon Almer,
Hemant Sharma, Mike Wilde, Justin Wozniak,
Rachana Ananthakrishnan, Ben Blaiszik, and many others
Work supported by Argonne LDRD
2. Motivating example: Disordered structures
“Most of materials science is bottlenecked by disordered structures”
Atomic disorder plays an important role in
controlling the bulk properties of complex
materials, for example:
Colossal magnetoresistance
Unconventional superconductivity
Ferroelectric relaxor behavior
Fast-ion conduction
And many, many others!
We want a systematic understanding
of the relationships between material
composition, temperature, structure,
and other properties
3. A role for both experiment and simulation
Experiment: Observe (indirect) properties of real structures, e.g., single-crystal diffuse scattering at the Advanced Photon Source.
Simulation: Compute properties of potential structures, e.g., DISCUS-simulated diffuse scattering; molecular dynamics for candidate structures.
[Figure: material composition (La 60%, Sr 40%) → simulated structure → simulated scattering, compared against experimental scattering from the sample.]
4. Opportunity: Integrate experiment & simulation
Experiments can explain and guide simulations
– E.g., guide simulations via evolutionary optimization
Simulations can explain and guide experiments
– E.g., identify temperature regimes in which more data is needed
[Figure: closed loop between experimental scattering from the sample and simulated scattering from a simulated structure (material composition: La 60%, Sr 40%). Loop stages: detect errors (secs–mins); simulations driven by experiments (mins–days); contribute to a knowledge base of past experiments, simulations, literature, and expert knowledge; knowledge-driven decision making and evolutionary optimization select experiments (mins–hours).]
5. Opportunity: Link experiment, simulation, and data
analytics to create a discovery engine
[Figure: the same experiment–simulation loop as above, now closed by data analytics and the knowledge base.]
6. Opportunities for discovery acceleration in energy
sciences are numerous and span DOE facilities
A virtuous cycle: more data → new analysis methods → new science processes.
Beamline examples:
Grazing-incidence small-angle x-ray scattering (8-ID): directed self-assembly (Nealey, Ferrier, De Pablo, et al.)
Single-crystal diffuse scattering (6-ID): defect structure in disordered materials (Osborn et al.)
High-energy x-ray diffraction microscopy (1-ID): microstructure in bulk materials (Almer, Sharma, et al.)
Common themes:
Large amounts of data
New mathematical and numerical methods
Statistical and machine learning methods
Rapid reconstruction and analysis
Large-scale parallel computation
End-to-end automation
Data management and provenance
(Examples follow.)
7. Parallel pipeline enables real-time analysis of
diffuse scattering data, plus offline DIFFEV fitting
DIFFEV step: use simulation and an evolutionary algorithm to determine a crystal configuration that reproduces the observed scattering image.
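Since DIFFEV is named for differential evolution, a minimal sketch of such a fit loop may help: candidate parameter vectors are recombined and kept only when their simulated scattering matches the observed pattern better. The toy Gaussian forward model and all names here (simulate_scattering, diffev_fit) are illustrative stand-ins, not the DISCUS/DIFFEV code.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_scattering(params, q):
    """Toy forward model standing in for a DISCUS simulation:
    a sum of Gaussian diffuse peaks parameterized by `params`."""
    centers, widths, amps = np.split(params, 3)
    widths = np.abs(widths) + 1e-3          # keep peak widths positive
    return sum(a * np.exp(-((q - c) / w) ** 2)
               for c, w, a in zip(centers, widths, amps))

def chi_square(params, q, observed):
    return np.sum((simulate_scattering(params, q) - observed) ** 2)

def diffev_fit(q, observed, n_params=9, pop_size=40, generations=300,
               F=0.8, CR=0.9):
    """Classic differential evolution (rand/1/bin)."""
    pop = rng.normal(size=(pop_size, n_params))
    cost = np.array([chi_square(p, q, observed) for p in pop])
    for _ in range(generations):
        for i in range(pop_size):
            a, b, c = pop[rng.choice(pop_size, 3, replace=False)]
            mutant = a + F * (b - c)               # differential mutation
            cross = rng.random(n_params) < CR      # binomial crossover
            cross[rng.integers(n_params)] = True   # at least one gene crosses
            trial = np.where(cross, mutant, pop[i])
            t_cost = chi_square(trial, q, observed)
            if t_cost < cost[i]:                   # greedy selection
                pop[i], cost[i] = trial, t_cost
    return pop[np.argmin(cost)]

# Usage: recover the parameters of a synthetic "experimental" pattern.
q = np.linspace(-3, 3, 200)
truth = np.array([-1.0, 0.3, 1.5, 0.4, 0.2, 0.5, 1.0, 0.8, 0.6])
observed = simulate_scattering(truth, q)
best = diffev_fit(q, observed)
print("final chi^2:", chi_square(best, q, observed))
```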
8. Accelerating mapping of materials microstructure
with high energy diffraction microscopy (HEDM)
Top: grains in a 0.79 mm³ volume of a copper wire.
Bottom: the same wire under tensile deformation. (J. Almer)
9. Parallel pipeline enables immediate assessment of
alignment quality in high-energy diffraction microscopy
Workflow (a single workflow, driven by a manual Bash control script; data moves between Orthros, where all data is in NFS, and the Blue Gene/Q via ssh and Globus (GO) transfer; the Globus Catalog records scientific metadata and workflow progress):
Dataset from detector: 360 files, 4 GB total.
1: Median calculation (MedianImage.c, Swift/K): 75 s, ~90% I/O.
2: Peak search (ImageProcessing.c, Swift/K): 15 s per file; produces a reduced dataset (360 files, 5 MB total) and rapid feedback to the experiment.
3: Generate parameters (FOP.c, Swift/K): 50 tasks, 25 s/task, ~0.25 CPU hours; also convert files to network-endian format (~2 min for all files).
4: Analysis pass (FitOrientation.c, Swift/T): 60 s/task on PC or BG/Q, ~1,667 CPU hours; up to 2.2 M CPU hours per week.
(Before/after images show the improvement in alignment quality.)
Hemant Sharma, Justin Wozniak, Mike Wilde, Jon Almer
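The production pipeline is orchestrated with Swift/K and Swift/T; the following Python sketch reproduces only the dataflow shape (one median pass, a parallel per-file peak search, then parallel orientation fitting). The stage functions are hypothetical placeholders for the C executables named above.

```python
from concurrent.futures import ProcessPoolExecutor

# Hypothetical stand-ins for the C executables in the diagram.
def median_image(frames):          # stage 1: one pass over all frames
    return "dark.median"

def peak_search(frame, median):    # stage 2: per-frame, embarrassingly parallel
    return f"{frame}.peaks"

def generate_parameters(peaks):    # stage 3: emit the 50 fitting tasks
    return [f"param-{i}" for i in range(50)]

def fit_orientation(task):         # stage 4: per-task, runs on the big machine
    return f"{task}.orientation"

def pipeline(frames):
    median = median_image(frames)
    with ProcessPoolExecutor() as pool:
        # Fan out the per-frame peak search (15 s/file in production).
        peaks = list(pool.map(peak_search, frames, [median] * len(frames)))
        tasks = generate_parameters(peaks)
        # Fan out orientation fitting (60 s/task, ~1,667 CPU hours at scale).
        return list(pool.map(fit_orientation, tasks))

if __name__ == "__main__":
    results = pipeline([f"frame{i:03d}.tif" for i in range(360)])
    print(len(results), "orientations")
```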
10. Big data staging with MPI-IO enables interactive analysis on an IBM BG/Q supercomputer
Justin Wozniak
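A minimal sketch of the staging pattern with mpi4py: every rank reads its slice of a shared file in one collective MPI-IO call, letting the I/O layer coalesce requests into large, aligned reads. The file name and flat float32 layout are assumptions for illustration, not the production code.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

fh = MPI.File.Open(comm, "frames.bin", MPI.MODE_RDONLY)
count = fh.Get_size() // 4 // size          # float32 values per rank
buf = np.empty(count, dtype=np.float32)     # (any remainder is ignored)

# All ranks read their slice in one collective call.
fh.Read_at_all(rank * count * 4, buf)
fh.Close()

total = comm.allreduce(buf.sum(), op=MPI.SUM)
if rank == 0:
    print("checksum:", total)
```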
11. New data, computational capabilities, and
methods create opportunities and challenges
Integrate data movement, management, workflow, and
computation to accelerate data-driven applications
Integrate statistics and machine learning to assess many
models and calibrate them against “all” relevant data
New computing facilities enable on-demand computing
and high-speed analysis of large quantities of data
(Stack: applications, algorithms, environments, infrastructure, facilities)
12. Towards a lab-wide (and DOE-wide) data
architecture and facility
Users: researchers, system administrators, collaborators, students, …
Interfaces: web interfaces, REST APIs, command line interfaces
Services: domain portals (PDACS, kBase, eMatter, FACE-IT); registries of metadata and attributes; component & workflow repositories; workflow execution; data transfer, sync, and sharing; data publication & discovery
Integration layer: remote access protocols, authentication, authorization
Resources: utility compute system (“cloud”); HPC compute; parallel file system; DISC system; experimental facility; visualization system
18. The Petrel research data service
High-speed, high-capacity data store
Seamless integration with data fabric
Project-focused, self-managed
32 I/O nodes with GridFTP; 1.7 PB GPFS store
100 TB allocations; user-managed access via globus.org
Connects to other sites, facilities, and colleagues
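As a sketch of how a detector dataset might land on Petrel, here is a transfer request via the Globus Python SDK (globus_sdk); the endpoint UUIDs, paths, label, and token are placeholders, not real identifiers.

```python
import globus_sdk

APS_ENDPOINT = "aps-endpoint-uuid"          # placeholder
PETREL_ENDPOINT = "petrel-endpoint-uuid"    # placeholder

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer("TRANSFER_TOKEN"))

tdata = globus_sdk.TransferData(
    tc, APS_ENDPOINT, PETREL_ENDPOINT,
    label="HEDM scan 42", sync_level="checksum")
tdata.add_item("/data/scan42/", "/projects/hedm/scan42/", recursive=True)

task = tc.submit_transfer(tdata)            # Globus retries and verifies
print("transfer task:", task["task_id"])
```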
19. Managing the research data lifecycle with Globus services
1: Globus transfers files reliably, securely, and rapidly from the experimental facility to the compute facility; the PI initiates the transfer request, or it is requested automatically by a script or science gateway.
2: The PI selects files to share, selects a user or group, and sets access permissions. Globus controls access to shared files on existing storage; there is no need to move files to cloud storage.
3: A researcher logs in to Globus and accesses the shared files; no local account is required; download via Globus.
4: The researcher assembles a data set and describes it using metadata (Dublin Core and domain-specific).
5: A curator reviews and approves; the data set is published on a campus or other publication repository.
6: Peers and collaborators search for and discover datasets, then transfer and share them using Globus from their own computers.
SaaS: only a web browser is required; access using your campus credentials; Globus monitors and notifies throughout.
(Transfer, sharing, publication, discovery: www.globus.org)
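Step 2 (sharing) can likewise be scripted. A sketch with globus_sdk, again with placeholder identifiers: the rule grants one collaborator read-only access to a folder on a shared endpoint, without moving any data.

```python
import globus_sdk

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer("TRANSFER_TOKEN"))

rule = {
    "DATA_TYPE": "access",
    "principal_type": "identity",           # or "group"
    "principal": "collaborator-identity-uuid",
    "path": "/projects/hedm/scan42/",
    "permissions": "r",                     # read-only
}
result = tc.add_endpoint_acl_rule("shared-endpoint-uuid", rule)
print("access rule:", result["access_id"])
```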
24. Tying it all together: A basic energy sciences
cyberinfrastructure
Components: storage locations; compute facilities; collaboration catalogs (provenance, files & metadata); script libraries.
31. Tying it all together: A basic energy sciences
cyberinfrastructure
0: Develop or reuse a script (script libraries)
1: Run script (EL1.layer)
2: Look up file in the collaboration catalogs (name=EL1.layer, user=Anton, type=reconstruction)
3: Transfer inputs from storage locations
4: Run app at a compute facility
5: Transfer results
6: Update catalogs (provenance, files & metadata)
Researchers and external collaborators then find and use the results via the catalogs.
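A hypothetical Python sketch of steps 0–6; the catalog, transfer, and run_app stand-ins mimic what Swift plus Globus services provide, and none of these names are a real API.

```python
from dataclasses import dataclass, field

@dataclass
class Catalog:
    entries: dict = field(default_factory=dict)
    def lookup(self, name):
        return self.entries[name]
    def register(self, name, location, provenance):
        self.entries[name] = {"location": location, "provenance": provenance}

def transfer(src, dst):                     # stand-in for a Globus transfer
    print(f"transfer {src} -> {dst}")
    return dst

def run_app(script, inputs):                # stand-in for a Swift-driven run
    print(f"run {script} on {inputs}")
    return "/compute/out/EL1.recon", {"script": script, "inputs": inputs}

catalog = Catalog({"EL1.layer": {"location": "/storage/EL1.layer"}})

def run_workflow(script, dataset):                         # 1: run script
    record = catalog.lookup(dataset)                       # 2: look up file
    staged = transfer(record["location"], "/compute/in/")  # 3: transfer inputs
    out, prov = run_app(script, staged)                    # 4: run app
    final = transfer(out, "/storage/results/")             # 5: transfer results
    catalog.register("EL1.recon", final, prov)             # 6: update catalogs
    return final

run_workflow("reconstruct.swift", "EL1.layer")
```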
32. Towards a science of workflow performance
Develop, evaluate, and refine component and end-to-end models: models from the literature; fluid models for network flows; the SKOPE modeling system.
Develop and apply data-driven estimation methods: differential regression; surrogate models; other methods from the literature.
Run automated experiments to test models and build a database: experiment design; testbeds.
Develop easy-to-use tools that give end users actionable advice: a runtime advisor integrated with Globus services.
(“Robust Analytical Modeling for Science at Extreme Scales”)
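As a toy instance of the data-driven estimation idea, here is a fit of transfer time as a startup cost plus size over bandwidth, followed by a prediction for a larger transfer; the numbers are invented for illustration.

```python
import numpy as np

gb = np.array([1, 4, 16, 64, 256])      # transfer sizes (GB), invented
secs = np.array([9, 14, 45, 150, 570])  # observed times (s), invented

slope, intercept = np.polyfit(gb, secs, 1)   # secs ≈ intercept + slope * GB
print(f"startup ~{intercept:.0f} s, bandwidth ~{1.0 / slope:.2f} GB/s")
print(f"predicted time for 1 TB: {intercept + slope * 1024:.0f} s")
```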
33. Discovery engines for energy science
Science automation services: scripting, security, storage, cataloging, transfer.
Simulation (characterize, predict): precompute material databases; assimilate data; steer data acquisition.
Data analysis: reconstruct images, detect features, auto-correlate, compute particle distributions, …; workloads range from batch to immediate, and from ~0.001 to 100+ PFlops.
Integration (optimize, fit, …): configure, check, guide.
Data rates today (×10 expected in 5 years): ~0.001–0.5 GB/s per flow; ~2 GB/s total burst; ~200 TB/month; ~10 concurrent flows.
Scientific opportunities: probe material structure and function at unprecedented scales.
Technical challenges: many experimental modalities; data rates and computation needs vary widely and are increasing; knowledge management, integration, and synthesis; new methods demand rapid access to large amounts of data and computing.
34. Next steps
From six beamlines to 60 beamlines
From 60 facility users to 6000 facility users
From one lab to all labs
From data management and analysis to knowledge
management, integration, and analysis
From per-user to per-discipline (and trans-discipline)
data repositories, publication, and discovery
From terabytes to petabytes
From three months to three hours to build pipelines
From intuitive to analytical understanding of systems
Editor's Notes
Atomic disorder, both in the form of point defects and in the nanoscale self-organization that often accompanies them, plays an important role in controlling the bulk properties of complex materials.
Use experiments to constrain models of material structure, and vice versa
Experiments: Single crystal diffuse scattering of, e.g., bilayer manganites, yielding pair distribution functions
Simulations: Molecular dynamics for candidate structures, yielding simulated scattering and simulated pair distribution functions
Experiment genome!
Simulation genome!
Add CNM
Add Tao beamlines
This diagram shows the big picture.
New types of computer systems enable high-speed data access, high-speed analysis, and on-demand computing
Integrated networking, data transfer, and security solutions enable ultra-rapid, secure communication among resources
New data and workflow services enable automation and provenance tracking for data-driven applications
Simple APIs enable rapid development of domain-specific tools, applications, and portals
We consider Petrel here to be “within APS”
A uniform data fabric across the lab
… with seamless access to large data…
… for use in computation, collaboration and distribution …
… that is project focused and self managed …
… and is described and discoverable
This diagram shows the CMTS project’s vision for a cyberinfrastructure that we believe will further enhance our scientific productivity. It shows the major components of the CMTS cyberinfrastructure we are integrating, how CMTS will use it, and how it will help them.
0. Develop script: A CMTS researcher will go to the project script library to locate existing scripts to run, or to find components from which to compose or adapt a new script. The library will point to script codes managed by revision control systems like Git or Subversion, and the many public and private servers that host these services. The Swift parallel scripting language is central to our approach, because it imparts a uniform, high-level interface to script components, and it is implicitly parallel.
1. Run script: When Swift runs your script, it automatically handles several previously difficult aspects of script development and execution for the scientist: 1) it manages parallel execution (dataflow, throttling, etc.); 2) it abstracts the interfaces to diverse and distributed clusters; 3) it automates data transfer; 4) it records provenance; 5) it retries failing application runs. All of these would otherwise have to be programmed manually, if they were done at all.
2. Locate input files via the dataset catalog: CMTS scripts will locate major datasets through a networked catalog that enables scripts to be written with no dependencies on where datasets are located or replicated.
3. Transfer inputs: Swift will automatically transport input datasets to the selected computational resource for an application run (if needed).
4. Run app: Swift will then run the application, retrying failures (if requested) and recording a “provenance log” that traces where the app ran, with what runtime and memory usage, and with what arguments and environment settings.
5. Transfer results: Swift will automatically transport output datasets to the selected archival or temporary storage resource (if needed).
6. Update catalogs: It will then update CMTS collaboration catalogs with new dataset locations, derived metadata annotations on those datasets, and the provenance of the data and runs.
Collaborate! (2 clicks): All of this facilitates collaboration, both by project team members and by external collaborators, whether across the hall or across the world.
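The implicit parallelism described above can be sketched with Python futures: independent calls run concurrently, and downstream work starts only when its inputs resolve, which is the dataflow behavior Swift provides automatically. The functions here are illustrative, not Swift itself.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def analyze(sample):
    time.sleep(1)                  # stand-in for an application run
    return f"{sample}.result"

def combine(a, b):
    return f"combined({a}, {b})"

with ThreadPoolExecutor() as pool:
    f1 = pool.submit(analyze, "sampleA")   # both submitted immediately;
    f2 = pool.submit(analyze, "sampleB")   # execution order is dataflow order
    print(combine(f1.result(), f2.result()))
```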