
Big Process for Big Data @ NASA


A talk at NASA Goddard, February 27, 2013

Large and diverse data result in challenging data management problems that researchers and facilities are often ill-equipped to handle. I propose a new approach to these problems based on the outsourcing of research data management tasks to software-as-a-service providers. I argue that this approach can both achieve significant economies of scale and accelerate discovery by allowing researchers to focus on research rather than mundane information technology tasks. I present early results with the approach in the context of Globus Online.

Published in: Technology


  1. Big process for big data. Ian Foster, NASA Goddard, February 27, 2013
  2. The Computation Institute = UChicago + Argonne = cross-disciplinary nexus = home of the Research Cloud
  3.
  4. Will data kill genomics? ×10 in 6 years vs. ×10⁵ in 6 years. (Kahn, Science, 331 (6018): 728–729)
  5. Moore’s Law for X-ray sources: 18 orders of magnitude in 5 decades! 12 orders of magnitude in 6 decades!
  6. 1.2 PB of climate data delivered to 23,000 users
  7. We have exceptional infrastructure for the 1%
  8. What about the 99%?
  9. Big science. Small labs.
  10. Need: a new way to deliver research cyberinfrastructure that is frictionless, affordable, and sustainable
  11. We asked ourselves: what if the research workflow could be managed as easily as… our pictures, our e-mail, our home entertainment?
  12. What makes these services great? Great user experience + high-performance (but invisible) infrastructure
  13. We aspire (initially) to create a great user experience for research data management. What would a “Dropbox for science” look like?
  14. BIG DATA: collect • move • sync • share • analyze • annotate • publish • search • backup • archive
  15. A common workflow… [diagram: registry; staging, ingest store, community store, analysis store, archive, mirror]
  16. … with common challenges: data movement, sync, and sharing • between facilities, archives, and researchers • many files, large data volumes • with security, reliability, and performance
  17. Capabilities (collect, move, sync, share, analyze, annotate, publish, search, backup, archive) delivered using the software-as-a-service (SaaS) model
  18. Transfer: (1) user initiates transfer request; (2) Globus Online moves/syncs files from data source to destination; (3) Globus Online notifies user
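The three-step flow on slide 18 can be sketched as a tiny state machine. All names below are hypothetical illustrations of the pattern, not the actual Globus Online API:

```python
# Minimal sketch of the slide-18 flow: (1) user submits a transfer
# request, (2) the service moves/syncs files from source to destination,
# (3) the service notifies the user. Names are illustrative only.
class TransferService:
    def __init__(self):
        self.notifications = []

    def submit(self, user, source, destination, files):
        # Step 1: user initiates the transfer request.
        task = {"user": user, "source": source,
                "destination": destination, "files": files,
                "status": "ACTIVE"}
        self._move(task)    # Step 2: service moves/syncs files.
        self._notify(task)  # Step 3: service notifies the user.
        return task

    def _move(self, task):
        # Stand-in for the actual data movement (e.g. GridFTP under the hood).
        task["moved"] = list(task["files"])
        task["status"] = "SUCCEEDED"

    def _notify(self, task):
        self.notifications.append(
            f"{task['user']}: transfer to {task['destination']} "
            f"{task['status']}")

svc = TransferService()
task = svc.submit("alice", "lab-server", "archive", ["run1.dat", "run2.dat"])
print(task["status"])        # SUCCEEDED
print(svc.notifications[0])  # alice: transfer to archive SUCCEEDED
```

The point of the pattern is that the user interacts only with the service; the endpoints never need to know about each other.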
  19. Sharing: (1) User A selects file(s) to share, selects user/group, sets share permissions; (2) Globus Online tracks shared files; no need to move files to cloud storage! (3) User B logs in to Globus Online and accesses shared file
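Slide 19's in-place sharing can be modeled as an access-control list kept by the service while the bytes stay on the source endpoint. A hypothetical sketch of that idea, not the real implementation:

```python
# Sketch of in-place sharing: the service stores only permissions,
# never the file data, which remains on the source endpoint.
class SharingService:
    def __init__(self):
        self.acl = {}  # path -> set of users/groups with read access

    def share(self, owner, path, grantee):
        # Step 1: User A selects a file and grants access to a user/group.
        self.acl.setdefault(path, set()).add(grantee)

    def access(self, user, path):
        # Step 3: User B accesses the shared file, if permitted.
        if user in self.acl.get(path, set()):
            return f"redirect:{path}"  # data is served from the source endpoint
        raise PermissionError(f"{user} may not read {path}")

svc = SharingService()
svc.share("userA", "/data/results.csv", "userB")
print(svc.access("userB", "/data/results.csv"))  # redirect:/data/results.csv
```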
  20. Extreme ease of use: • InCommon, OAuth, OpenID, X.509, … • credential management • group definition and management • transfer management and optimization • reliability via transfer retries • web interface, REST API, command line • one-click “Globus Connect” install • 5-minute Globus Connect Multi-User install
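“Reliability via transfer retries” is the standard retry-with-exponential-backoff pattern. A generic sketch of the pattern, not Globus Online's actual retry policy:

```python
import time

def transfer_with_retries(do_transfer, max_attempts=5, base_delay=0.01):
    """Retry a flaky transfer with exponential backoff.

    do_transfer: callable that raises on transient failure.
    Delays are tiny here for illustration; a real service waits
    seconds to minutes between attempts.
    """
    for attempt in range(max_attempts):
        try:
            return do_transfer()
        except IOError:
            if attempt == max_attempts - 1:
                raise  # permanent failure after exhausting retries
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff

# Simulated endpoint that fails twice, then succeeds.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise IOError("transient network error")
    return "transfer complete"

result = transfer_with_retries(flaky)
print(result)  # transfer complete (on the 3rd attempt)
```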
  21. 21. Early adoption is encouraging
  22. Early adoption is encouraging: 8,000 registered users (~100 daily); ~10 PB moved; ~1B files; 10x (or better) performance vs. scp; 99.9% availability; entirely hosted on AWS
  23. Delivering a great user experience relies on high-performance network infrastructure
  24. Science DMZ optimizes performance
  25. What is a Science DMZ? Three key components, all required:
      • “Friction-free” network path: highly capable network devices (wire-speed, deep queues); virtual circuit connectivity option; security policy and enforcement specific to science workflows; located at or near the site perimeter if possible
      • Dedicated, high-performance Data Transfer Nodes (DTNs): hardware, operating system, and libraries optimized for transfer; optimized data transfer tools: Globus Online, GridFTP
      • Performance measurement/test node: perfSONAR
      Details at
  26. Globus GridFTP architecture. [diagram: GridFTP atop Globus XIO; parallel TCP, UDP, or RDMA; dedicated or shared links] The internal layered XIO architecture allows alternative network and filesystem interfaces to be plugged into the stack
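The idea behind XIO's layered, pluggable stack can be shown with a toy driver stack in which each layer transforms data and delegates to the layer below, so transports can be swapped at composition time. This is illustrative only; the real Globus XIO is a C framework:

```python
# Toy version of a layered I/O stack, in the spirit of Globus XIO:
# each driver transforms data and hands it to the layer below.
class Driver:
    def __init__(self, below=None):
        self.below = below
    def write(self, data):
        return self.below.write(data) if self.below else data

class CompressDriver(Driver):
    def write(self, data):
        # Fake "compression" so the transformation is visible.
        return self.below.write(data.replace(b"aaaa", b"a*4"))

class TcpDriver(Driver):
    def __init__(self):
        super().__init__()
        self.sent = []
    def write(self, data):
        self.sent.append(data)  # stand-in for a real TCP send
        return data

# Stacks are composed at runtime, like XIO driver stacks; swapping
# TcpDriver for, say, an RDMA driver would not touch the upper layers.
tcp = TcpDriver()
stack = CompressDriver(below=tcp)
stack.write(b"aaaabbbb")
print(tcp.sent)  # [b'a*4bbbb']
```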
  27. GridFTP performance options: • TCP configuration • concurrency: multiple flows per node • parallelism: multiple nodes • pipelining of requests to support small files • multiple cores for integrity, encryption • alternative protocol selection* • use of circuits and multiple paths*. Globus Online can configure these options based on what it knows about a transfer. (* experimental)
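The benefit of concurrency (several transfers in flight at once) is easy to demonstrate with a thread pool when each small-file transfer is dominated by per-request latency. A toy illustration of the effect, not GridFTP's actual code:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_transfer(name, delay=0.05):
    # Stand-in for one small-file transfer dominated by round-trip latency.
    time.sleep(delay)
    return name

files = [f"file{i}.dat" for i in range(8)]

start = time.time()
sequential = [fake_transfer(f) for f in files]   # one at a time
seq_time = time.time() - start

start = time.time()
with ThreadPoolExecutor(max_workers=4) as pool:  # 4 concurrent flows
    concurrent = list(pool.map(fake_transfer, files))
conc_time = time.time() - start

print(f"sequential: {seq_time:.2f}s, concurrent: {conc_time:.2f}s")
# With 4 workers the latency-bound runs overlap, so the concurrent
# pass finishes in roughly a quarter of the sequential time.
```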
  28. Exploiting multiple paths: • take advantage of multiple interfaces in multi-homed data transfer nodes • use circuit as well as production IP link • data will flow even while the circuit is being set up • once the circuit is set up, use both paths to improve throughput. (Raj Kettimuthu, Ezra Kissel, Martin Swany, Jason Zurawski, Dan Gunter)
  29. Exploiting multiple paths. [plots: transfers between NERSC and ANL, and between UMich and Caltech, multipath vs. single path] Default commodity IP routes + dedicated circuits = significant performance gains. (Raj Kettimuthu, Ezra Kissel, Martin Swany, Jason Zurawski, Dan Gunter)
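In the ideal case, the gain from adding a dedicated circuit to the default route is simple arithmetic: splitting the bytes across paths in proportion to bandwidth gives a completion time of total_bits / (bw1 + bw2). A sketch with made-up numbers (the bandwidths here are illustrative, not measurements from the slide):

```python
def completion_time(total_gbits, path_gbps):
    """Time (s) to move total_gbits when data is split across paths
    in proportion to each path's bandwidth (ideal, zero overhead)."""
    return total_gbits / sum(path_gbps)

data = 8000.0  # 8000 Gbit, i.e. about 1 TB

ip_only = completion_time(data, [5.0])          # 5 Gb/s commodity IP route
multipath = completion_time(data, [5.0, 10.0])  # plus a 10 Gb/s circuit

print(f"IP only:   {ip_only:.0f} s")    # 1600 s
print(f"multipath: {multipath:.0f} s")  # 533 s, a 3x gain
```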
  30. [plot: duration of runs, in seconds, over time (2011–2012), log scale from under a second to over a week. Red: >10 TB transfer; green: >1 TB transfer.]
  31. K. Heitmann (Argonne) moves 22 TB of cosmology data LANL → ANL at 5 Gb/s
  32. B. Winjum (UCLA) moves 900K-file plasma physics datasets UCLA → NERSC
  33. Dan Kozak (Caltech) replicates 1 PB LIGO astronomy data for resilience
  34. BIG DATA: collect • move • sync • share • analyze • annotate • publish • search • backup • archive
  35. BIG DATA: collect • move • sync • share • analyze • annotate • publish • search • backup • archive
  36. Many more capabilities planned… [diagram: Globus Online research-data-management-as-a-service (SaaS): ingest, cataloging, integration; sharing, collaboration, annotation; backup, archival, retrieval. Underneath, Globus Integrate (Globus Nexus, Globus Connect) provides the PaaS layer.]
  37. A platform for integration
  38. Catalog as a service. Approach: • hosted user-defined catalogs • based on tag model <subject, name, value> • optional schema constraints • integrated with other Globus services. Three REST APIs: /query/ (retrieve subjects), /tags/ (create, delete, retrieve tags), /tagdef/ (create, delete, retrieve tag definitions). Builds on USC Tagfiler project (C. Kesselman et al.)
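The tag model of slide 38 can be sketched as an in-memory catalog exposing the three operations behind the REST APIs. A toy stand-in, not the Tagfiler implementation:

```python
# Toy catalog using the <subject, name, value> tag model of slide 38.
class TagCatalog:
    def __init__(self):
        self.tagdefs = {}  # name -> expected value type (optional schema)
        self.tags = set()  # (subject, name, value) triples

    def define_tag(self, name, value_type=str):
        # /tagdef/: create a tag definition (optional schema constraint).
        self.tagdefs[name] = value_type

    def add_tag(self, subject, name, value):
        # /tags/: create a tag; enforce the schema if one was defined.
        if name in self.tagdefs and not isinstance(value, self.tagdefs[name]):
            raise TypeError(f"tag {name!r} expects {self.tagdefs[name].__name__}")
        self.tags.add((subject, name, value))

    def query(self, name, value):
        # /query/: retrieve subjects carrying a given tag.
        return sorted(s for (s, n, v) in self.tags if n == name and v == value)

cat = TagCatalog()
cat.define_tag("instrument")
cat.add_tag("run-001.h5", "instrument", "APS-beamline-2")
cat.add_tag("run-002.h5", "instrument", "APS-beamline-2")
print(cat.query("instrument", "APS-beamline-2"))  # ['run-001.h5', 'run-002.h5']
```

Hosting such catalogs as a service is what turns a per-lab metadata problem into a shared, multi-tenant capability.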
  39. Other early successes in services for science…
  40.
  41.
  42. Other innovative science SaaS projects
  43. Other innovative science SaaS projects
  44. Our vision for a 21st-century cyberinfrastructure: to provide more capability for more people at substantially lower cost by creatively aggregating (“cloud”) and federating (“grid”) resources. “Science as a service”
  45. Thank you to our sponsors!