Big process for big data
Process automation for data-driven science

Ian Foster
Computation Institute
Mathematics and Computer Science Division
Department of Computer Science
Argonne National Laboratory & The University of Chicago

Talk at DOE Big Data Technology Summit, Washington DC, October 9, 2012
                                                             www.ci.anl.gov
                                                             www.ci.uchicago.edu
Big data is not new at DOE
Large Hadron Collider
15 PB/year
173 TB/day
500 MB/sec

Higgs discovery “only possible because of the extraordinary
achievements of … grid computing” (Rolf Heuer, CERN DG)


              LHC Computing
              Grid (10+ GB/sec)
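
The rates on this slide are quoted in different units. As a rough sketch, assuming decimal (SI) prefixes and a 365-day year (neither is stated on the slide), the conversion below puts them on a common scale: 15 PB/year is a little under 500 MB/sec sustained, and 173 TB/day is about 2 GB/sec. The function names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365  # assumption: non-leap year


def pb_per_year_to_mb_per_sec(pb_per_year: float) -> float:
    """Convert petabytes/year to megabytes/second (decimal prefixes)."""
    bytes_per_year = pb_per_year * 1e15
    return bytes_per_year / (DAYS_PER_YEAR * SECONDS_PER_DAY) / 1e6


def tb_per_day_to_gb_per_sec(tb_per_day: float) -> float:
    """Convert terabytes/day to gigabytes/second (decimal prefixes)."""
    bytes_per_day = tb_per_day * 1e12
    return bytes_per_day / SECONDS_PER_DAY / 1e9


print(f"15 PB/year = {pb_per_year_to_mb_per_sec(15):.0f} MB/sec")
print(f"173 TB/day = {tb_per_day_to_gb_per_sec(173):.2f} GB/sec")
```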


But it is now ubiquitous: e.g., genomics




Over 6 years:
     Computing x10 (x30 at DOE)
     Genome sequencing x10^5

Kahn, Science, 331 (6018): 728-729
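
To make the gap between these growth rates concrete, the sketch below converts a growth factor over 6 years into an equivalent doubling time. It takes the factors on this slide as 10 and 30 for computing and 10^5 for genome sequencing, and assumes steady exponential growth, which the slide does not claim.

```python
import math


def doubling_time_years(growth_factor: float, period_years: float) -> float:
    """Doubling time implied by growing `growth_factor`-fold over `period_years`."""
    return period_years / math.log2(growth_factor)


for label, factor in [("Computing", 10), ("Computing at DOE", 30), ("Genome sequencing", 1e5)]:
    years = doubling_time_years(factor, 6)
    print(f"{label}: doubles every {years:.2f} years ({years * 12:.1f} months)")
```

Under these assumptions computing doubles roughly every 1.8 years, while sequencing output doubles roughly every 4 months.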
Now ubiquitous: e.g., light sources




[Figure: two growth curves, annotated "12 orders of magnitude in 6 decades" and "18 orders of magnitude in 5 decades!"]
Credit: Linda Young
Now ubiquitous: e.g., light sources




Source: Francesco de Carlo
Local flows already exceed those of LHC
[Figure: Argonne data flows in TB/day (estimates). Nodes: External sources; Advanced Photon Source; Argonne Leadership Computing Facility; Other sources that remain to be quantified; Short-term storage; Long-term storage; Data analysis. Flow labels shown: 163, 9, 9, 143, 10, 100, 50, 150, 100.]
Big data demands new analysis models
[Figure: two panels, "Today" and "Desired"]
Source: Francesco de Carlo
It’s velocity and variety as well as volume


[Figure: diagram linking Genomes, Proteomics, Phenotypes, Transcriptomics, Growth curves, Metabolomics, Assembly, Annotation, Regulon prediction, Metabolic model, Reconciled model, Regulatory model, Integrated model, Phenotype predictions, Flux predictions, Hypotheses, Pathway designs]
Credit: Chris Henry et al.
Exponentially increasing complexity
     Run experiment
        Collect data
        Move data
        Check data
      Annotate data
        Share data
     Find similar data
     Link to literature
       Analyze data
       Publish data
Tripit exemplifies process automation

Me:
     Book flights
     Book hotel
Other services:
     Record flights
     Suggest hotel
     Record hotel
     Get weather
     Prepare maps
     Share info
     Monitor prices
     Monitor flight
Big data requires big process
     Run experiment
     Collect data
     Move data
     Check data
     Annotate data
     Share data
     Find similar data
     Link to literature
     Analyze data
     Publish data
Research IT as a service: Outsourced, Intuitive, Integrative, Secure, Performant, Reliable
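
One way to picture "big process" is as the steps above chained so that each runs without manual hand-offs. The sketch below is purely illustrative: the three step functions are hypothetical placeholders named after slide items, and nothing here reflects how any service described in this talk is implemented.

```python
from typing import Callable

Record = dict
Step = Callable[[Record], Record]


def collect(record: Record) -> Record:
    # Placeholder: pretend the instrument produced three readings.
    return {**record, "data": [1.0, 2.0, 3.0]}


def check(record: Record) -> Record:
    # Placeholder validation: refuse to continue with an empty data set.
    if not record.get("data"):
        raise ValueError("no data collected")
    return {**record, "checked": True}


def annotate(record: Record) -> Record:
    # Placeholder annotation: attach simple descriptive metadata.
    return {**record, "metadata": {"n_points": len(record["data"])}}


def run_pipeline(steps: list[tuple[str, Step]], record: Record) -> tuple[Record, list[str]]:
    """Run each named step in order; return the final record and the steps completed."""
    completed = []
    for name, step in steps:
        record = step(record)
        completed.append(name)
    return record, completed


STEPS = [("Collect data", collect), ("Check data", check), ("Annotate data", annotate)]
result, completed = run_pipeline(STEPS, {"experiment": "demo"})
print(completed)
```

Because every step takes and returns the same record type, further steps from the list (share, analyze, publish) could be appended without changing the runner.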
Characterizing big process requirements
In millions of labs worldwide, researchers struggle with massive data,
advanced software, complex protocols, burdensome reporting

[Figure: diagram with elements Simulation, Telescope, Next-gen genome sequencer, Staging, Ingest, Analysis, Registry, Community Repository, Archive, Mirror]

Accelerate discovery and innovation by outsourcing difficult tasks
Characterizing big process requirements
Data movement is a frequent challenge
•    Between facilities, archives, researchers
•    Many files, large data volumes
•    With security, reliability, performance

Accelerate discovery and innovation by outsourcing difficult tasks
Globus Online: Big process for big data




Data movement as a service
Secure, automated, reliable,
 high-speed movement,
 synchronization of many files
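
As an illustration of what "reliable" and "synchronization" can mean in practice, here is a minimal local-filesystem sketch: it copies only files that are missing or differ at the destination, verifies each copy by checksum, and retries on failure. This is not Globus Online's implementation or API; every name below is hypothetical, and a real service would also handle authentication, network transport, and restart of partial transfers.

```python
import hashlib
import shutil
import time
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_verified(src: Path, dst: Path, retries: int = 3, delay: float = 0.0) -> None:
    """Copy src to dst, retrying until the checksums match or retries run out."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    for _attempt in range(retries):
        try:
            shutil.copyfile(src, dst)
            if sha256_of(src) == sha256_of(dst):
                return
        except OSError:
            pass  # treat an I/O error like a failed attempt and retry
        time.sleep(delay)
    raise RuntimeError(f"could not copy {src} after {retries} attempts")


def sync_tree(src_root: Path, dst_root: Path) -> list[str]:
    """Copy files that are missing or differ at the destination.

    Returns the relative paths that were copied, in sorted order.
    """
    copied = []
    for src in sorted(p for p in src_root.rglob("*") if p.is_file()):
        rel = src.relative_to(src_root)
        dst = dst_root / rel
        if dst.is_file() and sha256_of(dst) == sha256_of(src):
            continue  # already in sync
        copy_verified(src, dst)
        copied.append(rel.as_posix())
    return copied
```

Calling `sync_tree` a second time on unchanged directories copies nothing, which is the property that makes repeated or interrupted runs safe.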




6,000 users
500 M files, 7 PB moved
99.9% availability
Examples of Globus Online in action
•    K. Heitmann (ANL) moves 22 TB of
     cosmology data at 5 Gb/s from LANL to ANL

•    B. Winjum (UCLA) moves 900K-file
     plasma physics datasets from UCLA to NERSC

•    Dan Kozak (Caltech) replicates 1 PB of
     LIGO astronomy data for resilience

•    Supercomputer centers, genome facilities, light
     sources, and universities all recommend it
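
For a sense of scale, the sketch below works out two figures implied by the numbers quoted in the talk: how long 22 TB takes at a sustained 5 Gb/s, and the average file size implied by 7 PB moved in 500 M files. It assumes decimal prefixes and a constant rate with no protocol overhead, which real transfers will not achieve exactly.

```python
def transfer_hours(size_tb: float, rate_gbps: float) -> float:
    """Hours needed to move `size_tb` terabytes at `rate_gbps` gigabits/second."""
    bits = size_tb * 1e12 * 8
    seconds = bits / (rate_gbps * 1e9)
    return seconds / 3600


def average_file_mb(total_pb: float, n_files: float) -> float:
    """Average file size in megabytes for `total_pb` petabytes over `n_files` files."""
    return total_pb * 1e15 / n_files / 1e6


print(f"22 TB at 5 Gb/s: {transfer_hours(22, 5):.1f} hours")
print(f"7 PB in 500 M files: {average_file_mb(7, 500e6):.0f} MB per file on average")
```

The first figure comes to just under 10 hours; the second to about 14 MB per file, which fits the talk's emphasis on moving many files as well as large volumes.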
Automation expands use of networks
[Figure: transfers, January to June 2012. Size in bytes transferred (log scale, 1e+00 to 1e+12) versus time (Jan to Jul). Circle size is captioned both as proportional to log size and as ∝ log(transfer rate). Colors: Red = NERSC/LBL/ESnet; Green = ORNL/BNL (a second legend reads "ORNL, LBL"); Blue = ANL; Yellow = FNAL; Grey = Other.]
Need much more than data movement
•    Ingest, cataloging, integration
•    Sharing, collaboration, annotation
•    Identity, groups, security
•    Analysis, simulation, visualization
•    ...

[Figure: the diagram from "Characterizing big process requirements", with these services overlaid]

Accelerate discovery and innovation by outsourcing difficult tasks
Earth System Grid: Data movement




•    Outsource data transfer
     –   Client data download
     –   Replication between sites
•    No ESGF client software needed
•    20+ times faster than HTTP

earthsystemgrid.org
KBase: Identity, groups, data movement




kbase.science.energy.gov
Genomics: Data movement and analysis



Data management
Globus Online provides
•    High-performance
•    Fault-tolerant
•    Secure
file transfer between all data endpoints
[Figure elements: Sequencing centers, Seq center, Public data, Research lab, Storage, Local cluster/cloud]

Data analysis
Galaxy-based workflow management
•    Globus Online integrated
•    Web-based UI
•    Drag-n-drop workflow creation
•    Easily add new tools
•    Analytical tools run on scalable computers
[Figure elements: Galaxy data libraries, Galaxy in Cloud]

Source: Ravi Madduri
Integrating observation and simulation
Retrieve: Construct structured 4-D atmospheric state (“CAN”)
[Figure caption: Percentage of mapped radar domain in Darwin with returns >10 dBz over the period 19 to 22 January 2006.]

Analytics: Precipitating storm structures; storm lifecycles; statistical
representation of storm scale properties; predictive cloud models

Compare: Cloud properties and precipitation characteristics in large-scale
models and cloud-resolving models (e.g., CMIP5 models, GCRM)

Scott Collis
Integrating observation and simulation


Level 1: PBs
Level 2: TBs
Level 3: GBs

Salman Habib, Katrin Heitmann
Integrating observation and simulation




Salman Habib, Katrin Heitmann
In summary: Big process for big data

Accelerate discovery and innovation worldwide
by providing research IT as a service
Outsource time-consuming tasks to
• provide large numbers of researchers with
   unprecedented access to powerful tools;
• enable a massive shortening of cycle times in
   time-consuming research processes; and
• reduce research IT costs via economies of scale
Accelerate existing science; enable new science
Thank you!


foster@anl.gov
www.ci.anl.gov
www.mcs.anl.gov
www.globusonline.org

More Related Content

Viewers also liked

Jsm madduri-august-2015
Jsm madduri-august-2015Jsm madduri-august-2015
Jsm madduri-august-2015Ravi Madduri
 
re:Invent 2013-foster-madduri
re:Invent 2013-foster-maddurire:Invent 2013-foster-madduri
re:Invent 2013-foster-madduriRavi Madduri
 
Internet2 Bio IT 2016 v2
Internet2 Bio IT 2016 v2Internet2 Bio IT 2016 v2
Internet2 Bio IT 2016 v2Dan Taylor
 
Big Data and Genomics
Big Data and GenomicsBig Data and Genomics
Big Data and GenomicsAl Costa
 
May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools
May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer toolsMay 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools
May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer toolsYahoo Developer Network
 
Globus Genomics: Democratizing NGS Analysis
Globus Genomics: Democratizing NGS AnalysisGlobus Genomics: Democratizing NGS Analysis
Globus Genomics: Democratizing NGS AnalysisRavi Madduri
 
基因大数据分析入门 Slideshare
基因大数据分析入门   Slideshare基因大数据分析入门   Slideshare
基因大数据分析入门 SlideshareWu Bigo
 
Deep two-photon brain imaging with a red-shifted fluorometric Ca2+ indicator
Deep two-photon brain imaging with a red-shifted fluorometric Ca2+ indicatorDeep two-photon brain imaging with a red-shifted fluorometric Ca2+ indicator
Deep two-photon brain imaging with a red-shifted fluorometric Ca2+ indicatorPetteriTeikariPhD
 
Focused Ultrasound Neuromodulation
Focused Ultrasound NeuromodulationFocused Ultrasound Neuromodulation
Focused Ultrasound NeuromodulationPetteriTeikariPhD
 

Viewers also liked (19)

Jsm madduri-august-2015
Jsm madduri-august-2015Jsm madduri-august-2015
Jsm madduri-august-2015
 
re:Invent 2013-foster-madduri
re:Invent 2013-foster-maddurire:Invent 2013-foster-madduri
re:Invent 2013-foster-madduri
 
Supporting Barack Obama for President
Supporting Barack Obama for PresidentSupporting Barack Obama for President
Supporting Barack Obama for President
 
Internet2 Bio IT 2016 v2
Internet2 Bio IT 2016 v2Internet2 Bio IT 2016 v2
Internet2 Bio IT 2016 v2
 
Big Data and Genomics
Big Data and GenomicsBig Data and Genomics
Big Data and Genomics
 
May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools
May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer toolsMay 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools
May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools
 
Raskar UIST Keynote 2015 November
Raskar UIST Keynote 2015 NovemberRaskar UIST Keynote 2015 November
Raskar UIST Keynote 2015 November
 
Coded Photography - Ramesh Raskar
Coded Photography - Ramesh RaskarCoded Photography - Ramesh Raskar
Coded Photography - Ramesh Raskar
 
Leap Motion Development (Rohan Puri)
Leap Motion Development (Rohan Puri)Leap Motion Development (Rohan Puri)
Leap Motion Development (Rohan Puri)
 
What is Media in MIT Media Lab, Why 'Camera Culture'
What is Media in MIT Media Lab, Why 'Camera Culture'What is Media in MIT Media Lab, Why 'Camera Culture'
What is Media in MIT Media Lab, Why 'Camera Culture'
 
What is SIGGRAPH NEXT? Intro by Ramesh Raskar
What is SIGGRAPH NEXT? Intro by Ramesh RaskarWhat is SIGGRAPH NEXT? Intro by Ramesh Raskar
What is SIGGRAPH NEXT? Intro by Ramesh Raskar
 
Globus Genomics: Democratizing NGS Analysis
Globus Genomics: Democratizing NGS AnalysisGlobus Genomics: Democratizing NGS Analysis
Globus Genomics: Democratizing NGS Analysis
 
Google Glass Breakdown
Google Glass BreakdownGoogle Glass Breakdown
Google Glass Breakdown
 
Stereo and 3D Displays - Matt Hirsch
Stereo and 3D Displays - Matt HirschStereo and 3D Displays - Matt Hirsch
Stereo and 3D Displays - Matt Hirsch
 
Multiview Imaging HW Overview
Multiview Imaging HW OverviewMultiview Imaging HW Overview
Multiview Imaging HW Overview
 
基因大数据分析入门 Slideshare
基因大数据分析入门   Slideshare基因大数据分析入门   Slideshare
基因大数据分析入门 Slideshare
 
Deep two-photon brain imaging with a red-shifted fluorometric Ca2+ indicator
Deep two-photon brain imaging with a red-shifted fluorometric Ca2+ indicatorDeep two-photon brain imaging with a red-shifted fluorometric Ca2+ indicator
Deep two-photon brain imaging with a red-shifted fluorometric Ca2+ indicator
 
Focused Ultrasound Neuromodulation
Focused Ultrasound NeuromodulationFocused Ultrasound Neuromodulation
Focused Ultrasound Neuromodulation
 
Introduction to Camera Challenges - Ramesh Raskar
Introduction to Camera Challenges - Ramesh RaskarIntroduction to Camera Challenges - Ramesh Raskar
Introduction to Camera Challenges - Ramesh Raskar
 

Similar to Big Process for Big Data

Mexico talk foster march 2012
Mexico talk foster march 2012Mexico talk foster march 2012
Mexico talk foster march 2012Ian Foster
 
Rethinking how we provide science IT in an era of massive data but modest bud...
Rethinking how we provide science IT in an era of massive data but modest bud...Rethinking how we provide science IT in an era of massive data but modest bud...
Rethinking how we provide science IT in an era of massive data but modest bud...Ian Foster
 
Aaas Data Intensive Science And Grid
Aaas Data Intensive Science And GridAaas Data Intensive Science And Grid
Aaas Data Intensive Science And GridIan Foster
 
Services For Science April 2009
Services For Science April 2009Services For Science April 2009
Services For Science April 2009Ian Foster
 
Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling
Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data HandlingScott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling
Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data HandlingGigaScience, BGI Hong Kong
 
Opportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architecturesOpportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architecturesIan Foster
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and KnowledgeIan Foster
 
The Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data ScienceThe Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data ScienceRobert Grossman
 
Cyberinfrastructure Day 2010: Applications in Biocomputing
Cyberinfrastructure Day 2010: Applications in BiocomputingCyberinfrastructure Day 2010: Applications in Biocomputing
Cyberinfrastructure Day 2010: Applications in BiocomputingJeremy Yang
 
Running Hot October 2008
Running Hot October 2008Running Hot October 2008
Running Hot October 2008Ian Foster
 
Research Automation for Data-Driven Discovery
Research Automation for Data-Driven DiscoveryResearch Automation for Data-Driven Discovery
Research Automation for Data-Driven DiscoveryIan Foster
 
Research Automation for Data-Driven Discovery
Research Automationfor Data-Driven DiscoveryResearch Automationfor Data-Driven Discovery
Research Automation for Data-Driven DiscoveryGlobus
 
SOLE: Linking Research Papers with Science Objects
SOLE: Linking Research Papers with Science ObjectsSOLE: Linking Research Papers with Science Objects
SOLE: Linking Research Papers with Science ObjectsTanu Malik
 
An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)Robert Grossman
 
Utility HPC: Right Systems, Right Scale, Right Science
Utility HPC: Right Systems, Right Scale, Right ScienceUtility HPC: Right Systems, Right Scale, Right Science
Utility HPC: Right Systems, Right Scale, Right ScienceChef Software, Inc.
 
Bionimbus Cambridge Workshop (3-28-11, v7)
Bionimbus Cambridge Workshop (3-28-11, v7)Bionimbus Cambridge Workshop (3-28-11, v7)
Bionimbus Cambridge Workshop (3-28-11, v7)Robert Grossman
 
Seth A. Faith - Building a PaaS for Forensic DNA analysis using AWS
 Seth A. Faith - Building a PaaS for Forensic DNA analysis using AWS Seth A. Faith - Building a PaaS for Forensic DNA analysis using AWS
Seth A. Faith - Building a PaaS for Forensic DNA analysis using AWSAWS Chicago
 

Similar to Big Process for Big Data (20)

Mexico talk foster march 2012
Mexico talk foster march 2012Mexico talk foster march 2012
Mexico talk foster march 2012
 
Rethinking how we provide science IT in an era of massive data but modest bud...
Rethinking how we provide science IT in an era of massive data but modest bud...Rethinking how we provide science IT in an era of massive data but modest bud...
Rethinking how we provide science IT in an era of massive data but modest bud...
 
Multiscale Modeling
Multiscale ModelingMultiscale Modeling
Multiscale Modeling
 
Aaas Data Intensive Science And Grid
Aaas Data Intensive Science And GridAaas Data Intensive Science And Grid
Aaas Data Intensive Science And Grid
 
Services For Science April 2009
Services For Science April 2009Services For Science April 2009
Services For Science April 2009
 
Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling
Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data HandlingScott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling
Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling
 
Opportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architecturesOpportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architectures
 
Trip Report Seattle
Trip Report SeattleTrip Report Seattle
Trip Report Seattle
 
2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
 
The Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data ScienceThe Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data Science
 
Cyberinfrastructure Day 2010: Applications in Biocomputing
Cyberinfrastructure Day 2010: Applications in BiocomputingCyberinfrastructure Day 2010: Applications in Biocomputing
Cyberinfrastructure Day 2010: Applications in Biocomputing
 
Running Hot October 2008
Running Hot October 2008Running Hot October 2008
Running Hot October 2008
 
Research Automation for Data-Driven Discovery
Research Automation for Data-Driven DiscoveryResearch Automation for Data-Driven Discovery
Research Automation for Data-Driven Discovery
 
Research Automation for Data-Driven Discovery
Research Automationfor Data-Driven DiscoveryResearch Automationfor Data-Driven Discovery
Research Automation for Data-Driven Discovery
 
SOLE: Linking Research Papers with Science Objects
SOLE: Linking Research Papers with Science ObjectsSOLE: Linking Research Papers with Science Objects
SOLE: Linking Research Papers with Science Objects
 
An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)
 
Utility HPC: Right Systems, Right Scale, Right Science
Utility HPC: Right Systems, Right Scale, Right ScienceUtility HPC: Right Systems, Right Scale, Right Science
Utility HPC: Right Systems, Right Scale, Right Science
 
Bionimbus Cambridge Workshop (3-28-11, v7)
Bionimbus Cambridge Workshop (3-28-11, v7)Bionimbus Cambridge Workshop (3-28-11, v7)
Bionimbus Cambridge Workshop (3-28-11, v7)
 
Seth A. Faith - Building a PaaS for Forensic DNA analysis using AWS
 Seth A. Faith - Building a PaaS for Forensic DNA analysis using AWS Seth A. Faith - Building a PaaS for Forensic DNA analysis using AWS
Seth A. Faith - Building a PaaS for Forensic DNA analysis using AWS
 

More from Ian Foster

Global Services for Global Science March 2023.pptx
Global Services for Global Science March 2023.pptxGlobal Services for Global Science March 2023.pptx
Global Services for Global Science March 2023.pptxIan Foster
 
The Earth System Grid Federation: Origins, Current State, Evolution
The Earth System Grid Federation: Origins, Current State, EvolutionThe Earth System Grid Federation: Origins, Current State, Evolution

Big Process for Big Data

  • 1. Big process for big data: Process automation for data-driven science. Ian Foster, Computation Institute, Mathematics and Computer Science Division, Department of Computer Science, Argonne National Laboratory & The University of Chicago. Talk at DOE Big Data Technology Summit, Washington DC, October 9, 2012. www.ci.anl.gov www.ci.uchicago.edu
  • 2. Big data is not new at DOE. Large Hadron Collider: 15 PB/year, 173 TB/day, 500 MB/sec. LHC Computing Grid (10+ GB/sec). Higgs discovery “only possible because of the extraordinary achievements of … grid computing” (Rolf Heuer, CERN DG)
  • 3. But it is now ubiquitous: e.g., genomics. Source: Kahn, Science, 331 (6018): 728-729
  • 4. But it is now ubiquitous: e.g., genomics. In 6 years: Computing x10 (x30 at DOE). Source: Kahn, Science, 331 (6018): 728-729
  • 5. But it is now ubiquitous: e.g., genomics. In 6 years: Computing x10 (x30 at DOE); Genome sequencing x10^5. Source: Kahn, Science, 331 (6018): 728-729
  • 6. Now ubiquitous: e.g., light sources. 18 orders of magnitude in 5 decades! 12 orders of magnitude in 6 decades. Credit: Linda Young
  • 7. Now ubiquitous: e.g., light sources. Source: Francesco de Carlo
  • 8. Local flows already exceed those of LHC. [Flow diagram: Argonne data flows in TB/day (estimates), linking external sources, the Advanced Photon Source, the Argonne Leadership Computing Facility, short-term storage, long-term storage, data analysis, and other sources that remain to be quantified. Flow values shown: 163, 9, 9, 143, 10, 100, 50, 150, 100]
  • 9. Big data demands new analysis models. Today vs. Desired. Source: Francesco de Carlo
  • 10. It’s velocity and variety as well as volume. [Diagram labels: Proteomics, Phenotypes, Transcriptomics, Genomes, Growth curves, Metabolomics, Assembly, Annotation, Metabolic model, Reconciled model, Integrated model, Regulatory model, Phenotype predictions, Flux predictions, Regulon prediction, Pathway designs, Hypotheses] Credit: Chris Henry et al.
  • 11. Exponentially increasing complexity: Run experiment, Collect data, Move data, Check data, Annotate data, Share data, Find similar data, Link to literature, Analyze data, Publish data
  • 12. [No slide text]
  • 13. TripIt exemplifies process automation. Columns: Me; Other services. Steps: Book flights, Record flights, Suggest hotel, Book hotel, Record hotel, Get weather, Prepare maps, Share info, Monitor prices, Monitor flight
  • 14. Big data requires big process. Steps: Run experiment, Collect data, Move data, Check data, Annotate data, Share data, Find similar data, Link to literature, Analyze data, Publish data. Research IT as a service: Outsourced, Intuitive, Integrative, Secure, Performant, Reliable
  • 15. Characterizing big process requirements. In millions of labs worldwide, researchers struggle with massive data, advanced software, complex protocols, burdensome reporting. [Diagram labels: Telescope, Simulation, Next-gen genome sequencer, Staging, Ingest, Registry, Community Repository, Analysis, Archive, Mirror] Accelerate discovery and innovation by outsourcing difficult tasks
  • 16. Characterizing big process requirements. In millions of labs worldwide, researchers struggle with massive data, advanced software, complex protocols, burdensome reporting. Data movement is a frequent challenge • Between facilities, archives, researchers • Many files, large data volumes • With security, reliability, performance. [Diagram labels: Telescope, Simulation, Next-gen genome sequencer, Staging, Ingest, Registry, Community Repository, Analysis, Archive, Mirror] Accelerate discovery and innovation by outsourcing difficult tasks
  • 17. Globus Online: Big process for big data. Data movement as a service. Secure, automated, reliable, high-speed movement, synchronization of many files
  • 18. 6,000 users; 500 M files, 7 PB moved; 99.9% availability
  • 19. Examples of Globus Online in action • K. Heitmann (ANL) moves 22 TB cosmology data at 5 Gb/s LANL → ANL • B. Winjum (UCLA) moves 900K-file plasma physics datasets UCLA - NERSC • Dan Kozak (Caltech) replicates 1 PB LIGO astronomy data for resilience • Supercomputer centers, genome facilities, light sources, universities all recommend it
  • 20. Automation expands use of networks. Sizes of transfers Jan-Jun; size of circles prop. to log size. Red=NERSC/LBL/ESnet; Green=ORNL/BNL; Blue=ANL; Yellow=FNAL; Grey=Other. [Plot: Transfers Jan-June 2012, size (bytes) vs time; vertical axis bytes_xfered from 1e+00 to 1e+12, horizontal axis Jan to Jul; size ∝ log(transfer rate); legend: Red: NERSC/LBL/ESnet, Green: ORNL, LBL, Blue: ANL, Yellow: FNAL, Grey: Other]
  • 21. Need much more than data movement. In millions of labs worldwide, researchers struggle with massive data, advanced software, complex protocols, burdensome reporting. [Diagram labels: Telescope, Simulation, Next-gen genome sequencer, Staging, Ingest, Registry, Community Repository, Analysis, Archive, Mirror] Accelerate discovery and innovation by outsourcing difficult tasks
  • 22. Need much more than data movement. Ingest, cataloging, integration; Sharing, collaboration, annotation; Identity, groups, security; Analysis, simulation, visualization; ... [Diagram labels: Next-gen genome sequencer, Staging, Ingest, Registry, Community Repository, Analysis, Archive, Mirror] Accelerate discovery and innovation by outsourcing difficult tasks
  • 23. Earth System Grid: Data movement • Outsource data transfer – Client data download – Replication between sites • No ESGF client software needed • 20+ times faster than HTTP. earthsystemgrid.org
  • 24. KBase: Identity, group, data movement. kbase.science.energy.gov
  • 25. Genomics: Data movement and analysis. Galaxy-based workflow management: • Globus Online integrated Galaxy • Web-based UI • Drag-n-drop workflow creation • Easily add new tools • Analytical tools run on scalable computers. Globus Online provides: • High-performance • Fault-tolerant • Secure file transfer between all data endpoints. [Diagram labels: Public Data, Sequencing Centers, Seq Center, Research Lab, Storage, Data libraries, Local Cluster/Cloud, Galaxy in Cloud, Data management, Data analysis] Source: Ravi Madduri
  • 26. Integrating observation and simulation. Numbered elements (1 to 3): Cloud properties and precipitation characteristics in large-scale models and cloud-resolving models (e.g., CMIP5 models, GCRM); Construct structured 4-D atmospheric state (“CAN”); Precipitating storm structures, storm lifecycles, statistical representation of storm scale properties, predictive cloud models. Arrow labels: Retrieve, Compare, Analytics. Figure caption: Percentage of mapped radar domain in Darwin with returns >10 dBz over the period 19 to 22 January 2006. Credit: Scott Collis
  • 27. Integrating observation and simulation. Level 1: PBs; Level 2: TBs; Level 3: GBs. Credit: Salman Habib, Katrin Heitmann
  • 28. Integrating observation and simulation. Credit: Salman Habib, Katrin Heitmann
  • 29. In summary: Big process for big data. Accelerate discovery and innovation worldwide by providing research IT as a service. Outsource time-consuming tasks to • provide large numbers of researchers with unprecedented access to powerful tools; • enable a massive shortening of cycle times in time-consuming research processes; and • reduce research IT costs via economies of scale. Accelerate existing science; enable new science
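
Slide 17 describes data movement as a service: secure, automated, reliable movement and synchronization of many files. As a rough illustration of what checksum-verified synchronization with retries involves, here is a minimal Python sketch. It is not Globus Online's implementation or API. The names `sha256_of`, `sync_tree`, and `max_attempts` are hypothetical, chosen for this example, and the sketch copies between two local directories rather than across a network.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sync_tree(src: Path, dst: Path, max_attempts: int = 3) -> list[str]:
    """Copy files that are missing or different at dst, verifying each copy.

    Returns the relative paths (POSIX style) that were copied or re-copied.
    Hypothetical helper for illustration only.
    """
    copied = []
    for src_file in sorted(p for p in src.rglob("*") if p.is_file()):
        rel = src_file.relative_to(src)
        dst_file = dst / rel
        expected = sha256_of(src_file)

        # Skip files whose destination copy already matches the source.
        if dst_file.is_file() and sha256_of(dst_file) == expected:
            continue

        dst_file.parent.mkdir(parents=True, exist_ok=True)
        for _attempt in range(max_attempts):
            shutil.copyfile(src_file, dst_file)
            # Verify the copy; retry if the checksum does not match.
            if sha256_of(dst_file) == expected:
                copied.append(rel.as_posix())
                break
        else:
            raise IOError(
                f"checksum mismatch for {rel} after {max_attempts} attempts"
            )
    return copied
```

Running `sync_tree` a second time on an unchanged tree copies nothing, which is the property that makes repeated, automated synchronization cheap. A hosted service would add authentication, parallel streams, and restart of partial transfers, none of which is modeled here.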

Editor's Notes

  1. We will hear numerous talks today on issues relating to the management and analysis of big data: data that stresses our capabilities in terms of its volume, velocity, variety, or variability. I’d like to spend my time speaking to the importance of the related problems of process. I’ll do so from the perspective of the sciences, because that is where I have the most experience. As data volumes increase exponentially, the individual’s ability to operate on that data has to improve exponentially too, if big data is to be an opportunity and not a curse. This is especially true as the number of data sources grows rapidly and thus even the smallest lab (or company) is exposed to the data deluge.
  2. A single next-generation sequencing machine can generate 40 Gbase/day. Gap of >1000, AND many more systems as people jump on the bandwagon. Meanwhile, other resources [money, people] stay flat.
  3. Storage statistics synthesis
  4. See http://en.wikipedia.org/wiki/File:LLNL_US_Energy_Flow_2009.png for inspiration. Data rates are in TB/day; line thicknesses are 5 TB/day/pt. Numbers: APS is 163 TB/day, preliminary data from de Carlo. ALCF is 150 TB/day, a number given in Carns et al., but presumably that is meant to be input *and* output?? External sources is 8.6 TB/day (100 MB/s), a WAG. Others are made up (just WAGs). By comparison: all observational and simulation data from LHC is 15 PB/yr (Wikipedia): 475 MB/s.
  5. http://labmed.ascpjournals.org/content/40/1/5/F7.expansion.html Old tools (PCs, spreadsheets, etc.) can’t handle these issues effectively.
  6. Aside: Another area in which I encounter substantial and growing complexity is travel. This being consumer space, there’s an app for that! A “software as a service” (aka cloud) app.
  7. Small labs … Potential solution? Outsource complex, time-consuming, mundane activities to third parties: to software-as-a-service (SaaS) providers, to a “research cloud” focused on process automation. Question: Which steps can we outsource in that way?
  8. https://plasmasim.physics.ucla.edu/research/winjum
  9. Automated ingest; cataloging
  10. Diagnosis; provenance
  11. Geophysical variables: Wind speeds, rainfall rate, temperatures, liquid water content, raindrop shape properties, etc.
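
The unit conversions in note 4, and the 22 TB at 5 Gb/s transfer mentioned on slide 19, can be checked with a few lines of Python. This is a sketch under stated assumptions: decimal units (1 TB = 10^12 bytes), a 365-day year, and a link that sustains its full rate for the whole transfer. The function names are made up for illustration.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # assumes a 365-day year


def tb_per_day_to_mb_per_s(tb_per_day: float) -> float:
    """Convert a sustained rate from TB/day to MB/s (decimal units)."""
    return tb_per_day * 1e12 / SECONDS_PER_DAY / 1e6


def mb_per_s_to_tb_per_day(mb_per_s: float) -> float:
    """Convert a sustained rate from MB/s to TB/day (decimal units)."""
    return mb_per_s * 1e6 * SECONDS_PER_DAY / 1e12


def pb_per_year_to_mb_per_s(pb_per_year: float) -> float:
    """Convert a yearly volume in PB to an average rate in MB/s."""
    return pb_per_year * 1e15 / SECONDS_PER_YEAR / 1e6


def transfer_hours(terabytes: float, gigabits_per_s: float) -> float:
    """Hours needed to move a volume at a fully sustained link rate."""
    bits = terabytes * 1e12 * 8
    return bits / (gigabits_per_s * 1e9) / 3600


if __name__ == "__main__":
    # LHC: 15 PB/yr averages out to roughly 475.6 MB/s.
    print(f"15 PB/yr  = {pb_per_year_to_mb_per_s(15):.1f} MB/s")
    # External sources: 100 MB/s is 8.64 TB/day.
    print(f"100 MB/s  = {mb_per_s_to_tb_per_day(100):.2f} TB/day")
    # APS estimate: 163 TB/day is roughly 1.9 GB/s.
    print(f"163 TB/d  = {tb_per_day_to_mb_per_s(163) / 1000:.2f} GB/s")
    # Slide 19: 22 TB at 5 Gb/s takes a little under 10 hours.
    print(f"22 TB @ 5 Gb/s = {transfer_hours(22, 5):.2f} hours")
```

Under these assumptions the note's figures are consistent: 15 PB/yr is about 475 MB/s and 100 MB/s is about 8.6 TB/day, while the 163 TB/day APS estimate corresponds to roughly four times the LHC average rate.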