Big process for big data

            Ian Foster
         foster@anl.gov
   NASA Goddard, February 27, 2013


                                     computationinstitute.org
The Computation Institute
= UChicago + Argonne
= Cross-disciplinary nexus
= Home of the Research Cloud

Will data kill genomics?




                                           x10 in 6 years
                                     x10^5 in 6 years




Kahn, Science, 331 (6018): 728-729
Moore’s Law for X-Ray Sources




                                             18 orders
                                             of magnitude
                                             in 5 decades!
12 orders
of magnitude
in 6 decades!

1.2 PB of climate data
Delivered to 23,000 users

We have exceptional
infrastructure for the 1%




What about the 99%?



Big science. Small labs.
Need: A new way to deliver
research cyberinfrastructure

      Frictionless
      Affordable
      Sustainable
We asked ourselves:
 What if the research work flow
could be managed as easily as…

…our pictures
                …our e-mail
                              …home entertainment

What makes these services great?

    Great User Experience
                  +
       High performance
  (but invisible) infrastructure

We aspire (initially) to create a
  great user experience for
research data management

 What would a “dropbox for
   science” look like?
• Collect   • Annotate
• Move      • Publish
• Sync      • Search
• Share     • Backup
• Analyze   • Archive

BIG DATA
A common work flow…

[Diagram components: Staging Store, Ingest Store, Analysis Store, Community Store, Registry, Archive, Mirror]
… with common challenges

Data movement, sync, and sharing
• Between facilities, archives, researchers
• Many files, large data volumes
• With security, reliability, performance
• Collect   • Annotate
• Move      • Publish
• Sync      • Search
• Share     • Backup
• Analyze   • Archive

Capabilities delivered using
Software-as-a-Service (SaaS) model
1 User initiates transfer request
2 Globus Online moves/syncs files from Data Source to Data Destination
3 Globus Online notifies user
1 User A selects file(s) to share; selects user/group, sets share permissions
2 Globus Online tracks shared files on the Data Source; no need to move files to cloud storage!
3 User B logs in to Globus Online and accesses shared file
Extreme ease of use

•   InCommon, OAuth, OpenID, X.509, …
•   Credential management
•   Group definition and management
•   Transfer management and optimization
•   Reliability via transfer retries
•   Web interface, REST API, command line
•   One-click “Globus Connect” install
•   5-minute Globus Connect Multi User install
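
The "reliability via transfer retries" bullet can be illustrated with a minimal Python sketch. This is not how Globus Online is implemented: `transfer_with_retries`, `TransientTransferError`, and the exponential backoff schedule are all hypothetical names and choices made for illustration.

```python
import time


class TransientTransferError(Exception):
    """Hypothetical error type for a fault worth retrying (e.g. a dropped connection)."""


def transfer_with_retries(transfer_once, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call transfer_once() until it succeeds or max_attempts is reached.

    Waits base_delay, 2*base_delay, 4*base_delay, ... seconds between attempts.
    The sleep function is injectable so the backoff can be observed without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return transfer_once()
        except TransientTransferError:
            if attempt == max_attempts:
                raise  # give up and surface the last failure to the caller
            sleep(base_delay * 2 ** (attempt - 1))
```

The point of the sketch is that the retry loop lives in the service, so the user submits a request once and does not have to babysit it.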
Early adoption is encouraging



 8,000 registered users; ~100 daily
      ~10 PB moved; ~1B files
10x (or better) performance vs. scp
         99.9% availability
      Entirely hosted on AWS


Delivering a great user
    experience relies on
high performance network
       infrastructure



Science DMZ
+   optimizes
    performance




What is a Science DMZ?
Three key components, all required:
• “Friction free” network path
   –   Highly capable network devices (wire-speed, deep queues)
   –   Virtual circuit connectivity option
   –   Security policy and enforcement specific to science workflows
   –   Located at or near site perimeter if possible
• Dedicated, high-performance Data Transfer Nodes (DTNs)
   – Hardware, operating system, libraries optimized for transfer
   – Optimized data transfer tools: Globus Online, GridFTP
• Performance measurement/test node
   – perfSONAR
Details at http://fasterdata.es.net/science-dmz/
Globus GridFTP architecture

[Diagram: GridFTP on top of Globus XIO, with transport options Parallel TCP, UDP or RDMA, and TCP; network labels LFN, Dedicated, and Shared]

Internal layered XIO architecture allows alternative network
and filesystem interfaces to be plugged in to the stack
GridFTP performance options

    •   TCP configuration
    •   Concurrency: Multiple flows per node
    •   Parallelism: Multiple nodes
    •   Pipelining of requests to support small files
    •   Multiple cores for integrity, encryption
    •   Alternative protocol selection*
    •   Use of circuits and multiple paths*

    Globus Online can configure these options
    based on what it knows about a transfer
* Experimental
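
Why pipelining helps with small files can be seen from a rough cost model. The sketch below assumes that, without pipelining, every file pays one round trip of request latency before its data flows, and that with pipelining the requests overlap so only one round trip is exposed. The function name, the model, and the sample numbers are illustrative assumptions, not measurements of GridFTP.

```python
def transfer_time(n_files, file_bytes, bandwidth_bps, rtt_s, pipelined):
    """Estimate seconds to move n_files of file_bytes each over one connection.

    Simplified model: time on the wire plus one request round trip per file,
    or a single exposed round trip when requests are pipelined.
    """
    data_time = n_files * file_bytes * 8 / bandwidth_bps
    round_trips = 1 if pipelined else n_files
    return data_time + round_trips * rtt_s


# Hypothetical workload: 100,000 files of 1 MB over 1 Gb/s with a 50 ms round trip.
unpipelined = transfer_time(100_000, 1_000_000, 1e9, 0.05, pipelined=False)
pipelined = transfer_time(100_000, 1_000_000, 1e9, 0.05, pipelined=True)
print(f"without pipelining: {unpipelined:.0f} s, with pipelining: {pipelined:.0f} s")
```

Under these assumptions the data itself needs about 800 seconds on the wire, while the per-file round trips add about 5,000 seconds unless they are pipelined away.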
Exploiting multiple paths




   • Take advantage of multiple interfaces in multi-homed data
     transfer nodes
   • Use circuit as well as production IP link
   • Data will flow even while the circuit is being set up
   • Once circuit is set up, use both paths to improve throughput
Raj Kettimuthu, Ezra Kissel, Martin Swany, Jason Zurawski, Dan Gunter
Exploiting multiple paths

[Figure: two panels, "Transfer between NERSC and ANL" and "Transfer between UMich and Caltech", each with the multipath result labeled]

Default, commodity IP routes
+ Dedicated circuits
= Significant performance gains
Raj Kettimuthu, Ezra Kissel, Martin Swany, Jason Zurawski, Dan Gunter
[Figure: Duration of runs, in seconds, over time (2011 to 2012). Logarithmic duration axis from 1e-01 to 1e+07, with reference marks at 1 second, 1 minute, 1 hour, 1 day, and 1 week. Red: >10 TB transfer; green: >1 TB transfer.]
K. Heitmann (Argonne)
moves 22 TB of cosmology
data LANL → ANL at 5 Gb/s

B. Winjum (UCLA) moves
900K-file plasma physics
datasets UCLA NERSC

Dan Kozak (Caltech)
replicates 1 PB LIGO
astronomy data for resilience

• Collect   • Annotate
• Move      • Publish
• Sync      • Search
• Share     • Backup
• Analyze   • Archive

BIG DATA
Many more capabilities planned …

Globus Online Research Data Management-as-a-Service (SaaS)
• Ingest, Cataloging, Integration
• Sharing, Collaboration, Annotation
• Backup, Archival, Retrieval
• …

Globus Integrate (Globus Nexus, Globus Connect) (PaaS)
A platform for integration




Catalog as a service

Approach
• Hosted user-defined catalogs
• Based on tag model <subject, name, value>
• Optional schema constraints
• Integrated with other Globus services

Three REST APIs
• /query/: Retrieve subjects
• /tags/: Create, delete, retrieve tags
• /tagdef/: Create, delete, retrieve tag definitions

Builds on USC Tagfiler project (C. Kesselman et al.)
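
The <subject, name, value> tag model can be made concrete with a small in-memory Python sketch. It is not the Tagfiler or Globus catalog API: the `TagCatalog` class, its method names, and the sample subjects and tags are hypothetical, chosen only to mirror the three operations named on the slide (retrieve subjects by query; create, delete, and retrieve tags).

```python
from collections import defaultdict


class TagCatalog:
    """Toy catalog storing <subject, name, value> triples in memory."""

    def __init__(self):
        self._tags = defaultdict(dict)  # subject -> {tag name: value}

    def set_tag(self, subject, name, value):
        """Create or overwrite one tag on a subject."""
        self._tags[subject][name] = value

    def delete_tag(self, subject, name):
        """Remove a tag if present; missing tags are ignored."""
        self._tags[subject].pop(name, None)

    def get_tags(self, subject):
        """Return a copy of all tags on a subject."""
        return dict(self._tags.get(subject, {}))

    def query(self, **criteria):
        """Return the sorted subjects whose tags match every name=value pair."""
        return sorted(
            subject
            for subject, tags in self._tags.items()
            if all(tags.get(name) == value for name, value in criteria.items())
        )


catalog = TagCatalog()
catalog.set_tag("run42.h5", "instrument", "beamline-A")  # hypothetical subject and tag
catalog.set_tag("run43.h5", "instrument", "beamline-B")
print(catalog.query(instrument="beamline-A"))
```

A hosted service would add persistence, access control, and the optional schema constraints (tag definitions) that the slide lists; none of that is modeled here.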
Other early successes in
services for science…




Other innovative science
SaaS projects




Our vision for a 21st century
     cyberinfrastructure
To provide more capability for
more people at substantially
lower cost by creatively
aggregating (“cloud”) and
federating (“grid”) resources

“Science as a service”
Thank you to our sponsors!





Big Process for Big Data @ NASA

  • 1. Big process for big data Ian Foster foster@anl.gov NASA Goddard, February 27, 2013 computationinstitute.org
  • 2. The Computation Institute = UChicago + Argonne = Cross-disciplinary nexus = Home of the Research Cloud
  • 4. Will data kill genomics? x10 in 6 years x10^5 in 6 years Kahn, Science, 331 (6018): 728-729
  • 5. Moore’s Law for X-Ray Sources 18 orders of magnitude in 5 decades! 12 orders of magnitude in 6 decades!
  • 6. 1.2 PB of climate data Delivered to 23,000 users
  • 7. We have exceptional infrastructure for the 1%
  • 8. What about the 99%?
  • 9. Big science. Small labs.
  • 10. Need: A new way to deliver research cyberinfrastructure • Frictionless • Affordable • Sustainable
  • 11. We asked ourselves: What if the research work flow could be managed as easily as… …our pictures …our e-mail …home entertainment
  • 12. What makes these services great? Great User Experience + High performance (but invisible) infrastructure
  • 13. We aspire (initially) to create a great user experience for research data management What would a “dropbox for science” look like?
  • 14. • Collect • Annotate • Move • Publish • Sync • Search • Share • Backup • Analyze • Archive BIG DATA
  • 15. A common work flow… [Diagram labels: Registry, Staging Store, Ingest Store, Community Store, Analysis Store, Archive, Mirror]
  • 16. … with common challenges: Data movement, sync, and sharing • Between facilities, archives, researchers • Many files, large data volumes • With security, reliability, performance [Diagram labels: Registry, Staging Store, Ingest Store, Community Store, Analysis Store, Archive, Mirror]
  • 17. • Collect • Annotate • Move • Publish • Sync • Search • Share • Backup • Analyze • Archive Capabilities delivered using Software-as-a-Service (SaaS) model
  • 18. 1 User initiates transfer request 2 Globus Online moves/syncs files 3 Globus Online notifies user [Diagram labels: Data Source, Data Destination]
  • 19. 1 User A selects file(s) to share; selects user/group, sets share permissions 2 Globus Online tracks shared files; no need to move files to cloud storage! 3 User B logs in to Globus Online and accesses shared file [Diagram label: Data Source]
  • 20. Extreme ease of use • InCommon, OAuth, OpenID, X.509, … • Credential management • Group definition and management • Transfer management and optimization • Reliability via transfer retries • Web interface, REST API, command line • One-click “Globus Connect” install • 5-minute Globus Connect Multi User install
  • 21. Early adoption is encouraging
  • 22. Early adoption is encouraging • 8,000 registered users; ~100 daily • ~10 PB moved; ~1B files • 10x (or better) performance vs. scp • 99.9% availability • Entirely hosted on AWS
  • 23. Delivering a great user experience relies on high performance network infrastructure
  • 24. Science DMZ + optimizes performance
  • 25. What is a Science DMZ? Three key components, all required: • “Friction free” network path – Highly capable network devices (wire-speed, deep queues) – Virtual circuit connectivity option – Security policy and enforcement specific to science workflows – Located at or near site perimeter if possible • Dedicated, high-performance Data Transfer Nodes (DTNs) – Hardware, operating system, libraries optimized for transfer – Optimized data transfer tools: Globus Online, GridFTP • Performance measurement/test node – perfSONAR Details at http://fasterdata.es.net/science-dmz/
  • 26. Globus GridFTP architecture: Internal layered XIO architecture allows alternative network and filesystem interfaces to be plugged in to the stack [Diagram labels: GridFTP, Globus XIO, Parallel TCP, UDP or RDMA, TCP, LFN, Dedicated, Shared]
  • 27. GridFTP performance options • TCP configuration • Concurrency: Multiple flows per node • Parallelism: Multiple nodes • Pipelining of requests to support small files • Multiple cores for integrity, encryption • Alternative protocol selection* • Use of circuits and multiple paths* Globus Online can configure these options based on what it knows about a transfer (* Experimental)
  • 28. Exploiting multiple paths • Take advantage of multiple interfaces in multi-homed data transfer nodes • Use circuit as well as production IP link • Data will flow even while the circuit is being set up • Once circuit is set up, use both paths to improve throughput Raj Kettimuthu, Ezra Kissel, Martin Swany, Jason Zurawski, Dan Gunter
  • 29. Exploiting multiple paths [Plots: multipath transfer between NERSC and ANL; multipath transfer between UMich and Caltech] Default, commodity IP routes + Dedicated circuits = Significant performance gains Raj Kettimuthu, Ezra Kissel, Martin Swany, Jason Zurawski, Dan Gunter
  • 30. [Plot] Duration of runs, in seconds, over time (2011 to 2012). Red: >10 TB transfer; green: >1 TB transfer. Log-scale duration axis from 1e-01 to 1e+07 seconds, with markers at 1 second, 1 minute, 1 hour, 1 day, and 1 week.
  • 31. K. Heitmann (Argonne) moves 22 TB of cosmology data LANL → ANL at 5 Gb/s
  • 32. B. Winjum (UCLA) moves 900K-file plasma physics datasets UCLA → NERSC
  • 33. Dan Kozak (Caltech) replicates 1 PB LIGO astronomy data for resilience
  • 34. • Collect • Annotate • Move • Publish • Sync • Search • Share • Backup • Analyze • Archive BIG DATA
  • 35. • Collect • Annotate • Move • Publish • Sync • Search • Share • Backup • Analyze • Archive BIG DATA
  • 36. Many more capabilities planned … SaaS: Globus Online Research Data Management-as-a-Service • Ingest, Cataloging, Integration • Sharing, Collaboration, Annotation • Backup, Archival, Retrieval • … PaaS: Globus Integrate (Globus Nexus, Globus Connect)
  • 37. A platform for integration
  • 38. Catalog as a service. Approach: • Hosted user-defined catalogs • Based on tag model <subject, name, value> • Optional schema constraints • Integrated with other Globus services. Three REST APIs: /query/ • Retrieve subjects; /tags/ • Create, delete, retrieve tags; /tagdef/ • Create, delete, retrieve tag definitions. Builds on USC Tagfiler project (C. Kesselman et al.)
  • 39. Other early successes in services for science…
  • 42. Other innovative science SaaS projects
  • 43. Other innovative science SaaS projects
  • 44. Our vision for a 21st century cyberinfrastructure To provide more capability for more people at substantially lower cost by creatively aggregating (“cloud”) and federating (“grid”) resources “Science as a service”
  • 45. Thank you to our sponsors!
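
Slide 31 reports 22 TB moved at 5 Gb/s. As a rough sanity check (not part of the talk), the sketch below converts a data volume and a sustained rate into a duration. It assumes decimal units (1 TB = 10^12 bytes, 1 Gb/s = 10^9 bits/s) and a constant rate with no protocol overhead; the function and constant names are hypothetical.

```python
def transfer_seconds(num_bytes: float, bits_per_second: float) -> float:
    """Idealized transfer time: total bits divided by a constant line rate."""
    if bits_per_second <= 0:
        raise ValueError("rate must be positive")
    return num_bytes * 8 / bits_per_second


def transfer_hours(num_bytes: float, bits_per_second: float) -> float:
    """Same estimate, expressed in hours."""
    return transfer_seconds(num_bytes, bits_per_second) / 3600


TB = 10**12    # decimal terabyte in bytes (an assumption)
GBPS = 10**9   # gigabit per second in bits/s (an assumption)

if __name__ == "__main__":
    hours = transfer_hours(22 * TB, 5 * GBPS)
    print(f"22 TB at 5 Gb/s takes about {hours:.1f} hours")
```

Under these assumptions the slide 31 transfer works out to a little under ten hours of sustained transfer.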

Editor's Notes

  1. The Computation Institute (or CI): a joint initiative between UChicago and Argonne National Lab. A place where researchers from multiple disciplines come together and engage in research that is fundamentally enabled by computation. More recently … we’ve been talking about it as the home of the research cloud … and I’ll describe what we mean by that throughout this talk.
  2. Here are some of the areas where we have active projects. Focus on areas of particular interest to I2/ESnet, namely HEP, climate change, genomics (up and coming).
  3. And the reason is pretty obvious… This chart and others like it are becoming a cliché in next-gen sequencing and big data presentations … but the point is that while Moore’s law translates to a roughly 10x increase in processor power … data volumes are growing many orders of magnitude faster. AND MEANWHILE, other necessary resources [money, people] are staying pretty flat. So we have a crisis … and we hear that the magic bullet of “the cloud” is going to solve it. Well, as far as cost goes, clouds are helping, but many issues remain.
  4. Another example is the Earth System Grid, which provides data and tools to over 20,000 climate scientists around the world. So what’s notable about these examples? It’s the combination of the amount of data being managed and the number of people that need access to that data. We heard Martin Leach tell us that the Broad Institute hit 10 PB of spinning disk last year … and that it’s not a big deal. To a select few, these numbers are routine … And for the projects I just talked about, the IT infrastructure is in place. They have robust production solutions, built by substantial teams at great expense. Sustained, multi-year efforts. Application-specific solutions, built mostly on common/homogeneous technology platforms.
  5. The point is, the 1% of projects are in good shape
  6. But what about the 99%? There are hundreds of thousands of small and medium labs around the world that face similar data management challenges. They don’t have the resources to deal with these challenges, so their research suffers … and over time many may become irrelevant. So at the CI we asked ourselves a question … many questions, actually, about how we can help avert this crisis. And one question that kind of sums up a lot of our thinking is…
  7. There are hundreds of thousands of small and medium labs around the world that face similar data management challenges. They don’t have the resources to deal with these challenges, so their research suffers … and over time many may become irrelevant. So at the CI we asked ourselves a question … many questions, actually, about how we can help avert this crisis. And one question that kind of sums up a lot of our thinking is…
  8. Lewis Carroll. End-to-end crisis.
  9. Can’t just expect to throw more people and $$$ at the problem … already seeing the limits.
  10. Many in this room are probably users of Dropbox or similar services for keeping their files synced across multiple machines. Well, the scientific research equivalent is a little different.
  11. We figured it needs to allow a group of collaborating researchers to do many or all of these things with their data … and not just the 2 GB of PowerPoints … or the 100 GB of family photos and videos … but the petabytes and exabytes of data that will soon be the norm for many.
  12. So how would such a dropbox for science be used? Let’s look at a very typical scientific data work flow… Data is generated by some instrument (a sequencer at JGI or a light source like APS/ALS) … since these instruments are in high demand, users have to get their data off the instrument to make way for the next user. So the data is typically moved from a staging area to some type of ingest store. Etcetera for analysis, sharing of results with collaborators, annotation with metadata for future search, backup/sync/archival, …
  13. We started with the seemingly simple/mundane task of transferring files … etc.
  14. Many in this room are probably users of Dropbox or similar services for keeping their files synced across multiple machines. Well, the scientific research equivalent is a little different.
  15. Extensible Session Protocol (XSP). A session provides context for a data transfer (OSI stack layer 5): connections, forwarding, application context, etc. XSP provides mechanisms to configure dynamic network circuits. Ezra Kissel and Martin Swany have developed a Globus XIO driver for XSP.
  16. Preliminary GridFTP test results have demonstrated that using the default, commodity IP routes in conjunction with dedicated circuits provides significant performance gains. In each case, our reservable circuit capacity was limited to 2 Gb/s because of capacity caps, although we note that, due to bandwidth “scavenging” enabled in the circuit service, we frequently see average rates above the defined bandwidth limit.
  17. XIO-XSP is a Globus XIO driver. It provides an integrated XSP client for GridFTP and includes path provisioning and instrumentation for transfers over XSP sessions. XSPd (daemon) implements the protocol frontend: it accepts on-demand reservation requests from clients, signals OSCARS, and monitors circuit status. OSCARS circuits are provisioned to end-hosts, either bandwidth or circuit on-demand.
  18. And when we spoke with IT folks at various research communities they insisted that some things were not up for negotiation
  19. And when we spoke with IT folks at various research communities they insisted that some things were not up for negotiation
  20. And when we spoke with IT folks at various research communities they insisted that some things were not up for negotiation
  21. We figured it needs to allow a group of collaborating researchers to do many or all of these things with their data … and not just the 2 GB of PowerPoints … or the 100 GB of family photos and videos … but the petabytes and exabytes of data that will soon be the norm for many.
  22. We figured it needs to allow a group of collaborating researchers to do many or all of these things with their data … and not just the 2 GB of PowerPoints … or the 100 GB of family photos and videos … but the petabytes and exabytes of data that will soon be the norm for many.
  23. http://datasets.globus.org/carl-catalog/query/propertyA=value1
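
Slide 38 describes the catalog as a set of <subject, name, value> tags with a /query/ API that retrieves subjects, and the URL in note 23 appears to ask a catalog for subjects whose tag `propertyA` equals `value1`. The sketch below is an in-memory illustration of that tag model only; it is not the Globus or Tagfiler API. The class `TagCatalog`, its methods, the helper `query_url`, and the sample subject names are all hypothetical, the URL layout is inferred solely from the single example in the notes, and each tag is simplified to hold one value per subject.

```python
from collections import defaultdict
from urllib.parse import quote


class TagCatalog:
    """Toy in-memory store of <subject, name, value> tags (hypothetical)."""

    def __init__(self):
        # subject -> {tag name -> tag value}; one value per tag is a simplification
        self._tags = defaultdict(dict)

    def set_tag(self, subject, name, value):
        self._tags[subject][name] = value

    def delete_tag(self, subject, name):
        # Silently ignore unknown subjects or tag names
        self._tags.get(subject, {}).pop(name, None)

    def get_tags(self, subject):
        return dict(self._tags.get(subject, {}))

    def query(self, name, value):
        """Return the sorted subjects whose tag `name` equals `value`."""
        return sorted(s for s, tags in self._tags.items() if tags.get(name) == value)


def query_url(base, catalog, name, value):
    """Build a URL shaped like the example in the notes (layout is assumed)."""
    return f"{base}/{quote(catalog)}/query/{quote(name)}={quote(value)}"


catalog = TagCatalog()
catalog.set_tag("run-001.h5", "propertyA", "value1")
catalog.set_tag("run-002.h5", "propertyA", "value2")
catalog.set_tag("run-003.h5", "propertyA", "value1")

if __name__ == "__main__":
    print(catalog.query("propertyA", "value1"))
    print(query_url("http://datasets.globus.org", "carl-catalog", "propertyA", "value1"))
```

Keeping tags as free-form name/value pairs per subject mirrors the "user-defined catalogs" idea on the slide: new properties can be attached without changing a fixed schema, and the optional schema constraints the slide mentions would be layered on top.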