Talk at DOE CIO's Big Data Tech Summit: my latest take on the why and wherefore of software as a service (SaaS) for science, and the Globus Online work we are doing, with various DOE examples.
Big Process for Big Data
1. Big process for big data
Process automation for data-driven science
Ian Foster
Computation Institute
Mathematics and Computer Science Division
Department of Computer Science
Argonne National Laboratory & The University of Chicago
Talk at DOE Big Data Technology Summit, Washington DC, October 9, 2012
www.ci.anl.gov
www.ci.uchicago.edu
2. Big data is not new at DOE
Large Hadron Collider: Higgs discovery "only possible because of the extraordinary achievements of … grid computing" —Rolf Heuer, CERN Director General
• 15 PB/year
• 173 TB/day
• 500 MB/sec
• LHC Computing Grid (10+ GB/sec)
3. But it is now ubiquitous: e.g., genomics
Source: Kahn, Science, 331 (6018): 728-729
4. But it is now ubiquitous: e.g., genomics
In 6 years: computing x10 (x30 at DOE)
5. But it is now ubiquitous: e.g., genomics
In 6 years: computing x10 (x30 at DOE); genome sequencing x10⁵
6. Now ubiquitous: e.g., light sources
[Figure: 18 orders of magnitude in 5 decades, vs. 12 orders of magnitude in 6 decades. Credit: Linda Young]
7. Now ubiquitous: e.g., light sources
Source: Francesco de Carlo
8. Local flows already exceed those of LHC
[Diagram of estimated Argonne data flows, in TB/day: external data sources (9), Advanced Photon Source (163), and Argonne Leadership Computing Facility (150), with further flows (143, 100, 100, 50, 10) among short-term storage, long-term storage, and data analysis; other sources remain to be quantified]
9. Big data demands new analysis models
[Figure: today's analysis model vs. desired analysis model. Source: Francesco de Carlo]
10. It’s velocity and variety as well as volume
[Diagram: genomes, transcriptomics, proteomics, metabolomics, phenotypes, and growth curves feed assembly and annotation; metabolic, regulatory, and integrated models are reconciled to yield flux predictions, phenotype predictions, regulon predictions, pathway designs, and hypotheses. Credit: Chris Henry et al.]
11. Exponentially increasing complexity
Run experiment
Collect data
Move data
Check data
Annotate data
Share data
Find similar data
Link to literature
Analyze data
Publish data
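The steps above can, in principle, be chained into an automated pipeline so that each experiment triggers its downstream data-handling tasks without manual effort. A minimal sketch, with hypothetical stand-in steps (the names and stages are illustrative, not any particular system's API):

```python
def run_pipeline(record, steps):
    """Apply each named processing step in order, keeping a provenance log."""
    provenance = []
    for name, step in steps:
        record = step(record)
        provenance.append(name)
    return record, provenance

# Hypothetical stand-ins for the check / annotate / analyze stages above:
steps = [
    ("check",    lambda r: r),  # e.g., verify checksums before proceeding
    ("annotate", lambda r: {**r, "instrument": "beamline-2"}),
    ("analyze",  lambda r: {**r, "mean": sum(r["values"]) / len(r["values"])}),
]

result, provenance = run_pipeline({"values": [1.0, 2.0, 3.0]}, steps)
```

Each stage records what it did, so the pipeline also yields the provenance trail that sharing and publishing steps depend on.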
13. Tripit exemplifies process automation
Me: book flights, book hotel.
Other services: record flights, suggest hotel, record hotel, get weather, prepare maps, share info, monitor prices, monitor flight.
14. Big data requires big process
Run experiment; collect, move, check, annotate, and share data; find similar data; link to literature; analyze data; publish data.
Research IT as a service: outsourced, intuitive, integrative, secure, performant, reliable.
15. Characterizing big process requirements
In millions of labs worldwide, researchers struggle with massive data, advanced software, complex protocols, and burdensome reporting.
[Diagram: telescope, simulation, and next-gen genome sequencer feed staging, ingest, registry, community repository, analysis, archive, and mirror services]
Accelerate discovery and innovation by outsourcing difficult tasks.
16. Characterizing big process requirements
In millions of labs worldwide, researchers struggle with massive data, advanced software, complex protocols, and burdensome reporting.
Data movement is a frequent challenge:
• Between facilities, archives, researchers
• Many files, large data volumes
• With security, reliability, performance
[Diagram: telescope, simulation, and next-gen genome sequencer feed staging, ingest, registry, community repository, analysis, archive, and mirror services]
Accelerate discovery and innovation by outsourcing difficult tasks.
17. Globus Online: Big process for big data
Data movement as a service: secure, automated, reliable, high-speed movement and synchronization of many files.
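The "secure, automated, reliable" movement described here amounts to the service taking over chores users would otherwise script by hand: retrying failed transfers and verifying integrity end to end. A minimal sketch in plain Python of that verify-and-retry pattern (this is an illustration of the idea, not Globus Online's actual implementation):

```python
import hashlib
import os
import shutil
import tempfile

def sha256(path):
    """Checksum a file incrementally, in 1 MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def reliable_copy(src, dst, retries=3):
    """Copy a file and verify integrity by checksum, retrying on mismatch."""
    want = sha256(src)
    for _ in range(retries):
        shutil.copyfile(src, dst)
        if sha256(dst) == want:
            return True
    raise IOError("verification failed after %d attempts" % retries)

# Demonstration on a throwaway file:
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "src.dat")
dst = os.path.join(workdir, "dst.dat")
with open(src, "wb") as f:
    f.write(b"experiment output " * 1000)
ok = reliable_copy(src, dst)
```

A real transfer service adds the hard parts this sketch omits: parallel streams over the wide-area network, credential management, and restart of partially completed multi-file jobs.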
19. Examples of Globus Online in action
• K. Heitmann (ANL) moves 22 TB of cosmology data at 5 Gb/s (LANL → ANL)
• B. Winjum (UCLA) moves 900K-file plasma physics datasets (UCLA → NERSC)
• Dan Kozak (Caltech) replicates 1 PB of LIGO astronomy data for resilience
• Supercomputer centers, genome facilities, light sources, and universities all recommend it
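The Heitmann example above implies a concrete transfer time; a quick back-of-the-envelope check of what 22 TB at a sustained 5 Gb/s means in practice:

```python
# 22 TB moved at a sustained 5 Gb/s, as in the Heitmann cosmology example:
bits = 22e12 * 8             # 22 TB expressed in bits
rate = 5e9                   # 5 Gb/s in bits per second
seconds = bits / rate        # 35,200 s
hours = seconds / 3600       # roughly 9.8 hours
```

That is an overnight job rather than a multi-week ordeal, which is the point of sustaining high rates end to end.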
20. Automation expands use of networks
[Scatter plot: sizes of transfers (bytes, 1e0 to 1e12, log scale) vs. time, Jan-June 2012; circle size ∝ log(transfer rate); red = NERSC/LBL/ESnet, green = ORNL/BNL, blue = ANL, yellow = FNAL, grey = other]
21. Need much more than data movement
In millions of labs worldwide, researchers struggle with massive data, advanced software, complex protocols, and burdensome reporting.
[Diagram: telescope, simulation, and next-gen genome sequencer feed staging, ingest, registry, community repository, analysis, archive, and mirror services]
Accelerate discovery and innovation by outsourcing difficult tasks.
22. Need much more than data movement
Ingest, cataloging, integration; sharing, collaboration, annotation; identity, groups, security; analysis, simulation, visualization; ...
[Diagram: staging, ingest, registry, community repository, analysis, archive, and mirror services]
Accelerate discovery and innovation by outsourcing difficult tasks.
23. Earth System Grid: Data movement
• Outsource data transfer
– Client data download
– Replication between sites
• No ESGF client software needed
• 20+ times faster than HTTP
earthsystemgrid.org
24. KBase: Identity, groups, data movement
kbase.science.energy.gov
25. Genomics: Data movement and analysis
Globus Online provides high-performance, fault-tolerant, secure file transfer between all data endpoints: sequencing centers, public data libraries, integrated storage, the research lab's local cluster, and the cloud.
Galaxy-based workflow management: web-based UI; drag-n-drop workflow creation; easily add new tools; analytical tools run on scalable cloud computers (Galaxy in the cloud).
Data management + data analysis. Source: Ravi Madduri
26. Integrating observation and simulation
1. Cloud properties and precipitation characteristics in large-scale models and cloud-resolving models (e.g., CMIP5 models, GCRM)
2. Retrieve, compare, and construct a structured 4-D atmospheric state ("CAN")
3. Analytics: precipitating storm structures; storm lifecycles; statistical representation of storm-scale properties; predictive cloud models
[Figure: percentage of mapped radar domain in Darwin with returns >10 dBz over the period 19 to 22 January 2006]
Credit: Scott Collis
29. In summary: Big process for big data
Accelerate discovery and innovation worldwide
by providing research IT as a service
Outsource time-consuming tasks to
• provide large numbers of researchers with
unprecedented access to powerful tools;
• enable a massive shortening of cycle times in
time-consuming research processes; and
• reduce research IT costs via economies of scale
Accelerate existing science; enable new science
We will hear numerous talks today on issues relating to the management and analysis of big data: data that stresses our capabilities in terms of its volume, velocity, variety, or variability. I'd like to spend my time speaking to the importance of the related problems of process. I'll do so from the perspective of the sciences, because that is where I have the most experience. As data volumes increase exponentially, the individual's ability to operate on that data has to improve exponentially too, if big data is to be an opportunity and not a curse. This is especially true as the number of data sources grows rapidly, so that even the smallest lab (or company) is exposed to the data deluge.
A single next-generation sequencing machine can generate 40 Gbase/day; a gap of >1000, and many more systems as people jump on the bandwagon. Meanwhile, other resources (money, people) stay flat.
Storage statistics synthesis
See http://en.wikipedia.org/wiki/File:LLNL_US_Energy_Flow_2009.png for inspiration. Data rates are in TB/day; line thicknesses are 5 TB/day/pt. Numbers: APS is 163 TB/day (preliminary data from de Carlo); ALCF is 150 TB/day (a number given in Carns et al., though presumably meant there to be input *and* output?); external sources are 8.6 TB/day (100 MB/s), a WAG; the others are made up, just WAGs. By comparison, all observational and simulation data from the LHC is 15 PB/yr, about 475 MB/s (Wikipedia).
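The unit conversions quoted in this note are easy to sanity-check:

```python
# Sanity check of the data-rate conversions quoted in the note above.
SECONDS_PER_DAY = 86400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

# 100 MB/s sustained from external sources, expressed in TB/day:
external_tb_per_day = 100e6 * SECONDS_PER_DAY / 1e12   # 8.64 TB/day

# 15 PB/year of LHC data, expressed in MB/s:
lhc_mb_per_s = 15e15 / SECONDS_PER_YEAR / 1e6          # about 476 MB/s
```

Both match the figures in the note (8.6 TB/day and roughly 475 MB/s).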
http://labmed.ascpjournals.org/content/40/1/5/F7.expansion.html. Old tools (PCs, spreadsheets, etc.) can't handle these issues effectively.
Aside: another area in which I encounter substantial and growing complexity is travel. This being consumer space, there's an app for that: a "software as a service" (aka cloud) app.
Small labs …. Potential solution: outsource complex, time-consuming, mundane activities to third parties, that is, to software-as-a-service (SaaS) providers, to a "research cloud" focused on process automation. Question: which steps can we outsource in that way?