
Big Process for Big Data @ NASA


A talk at NASA Goddard, February 27, 2013


Large and diverse data result in challenging data management problems that researchers and facilities are often ill-equipped to handle. I propose a new approach to these problems based on the outsourcing of research data management tasks to software-as-a-service providers. I argue that this approach can both achieve significant economies of scale and accelerate discovery by allowing researchers to focus on research rather than mundane information technology tasks. I present early results with this approach in the context of Globus Online.


Big Process for Big Data @ NASA

  1. Big process for big data. Ian Foster, foster@anl.gov. NASA Goddard, February 27, 2013.
  2. The Computation Institute: UChicago + Argonne; a cross-disciplinary nexus; home of the Research Cloud.
  3. (image slide)
  4. Will data kill genomics? x10 in 6 years; x10^5 in 6 years. (Kahn, Science, 331 (6018): 728-729.)
  5. Moore's Law for X-ray sources: 18 orders of magnitude in 5 decades! 12 orders of magnitude in 6 decades!
  6. 1.2 PB of climate data delivered to 23,000 users.
  7. We have exceptional infrastructure for the 1%.
  8. What about the 99%?
  9. Big science. Small labs.
  10. Need: a new way to deliver research cyberinfrastructure that is frictionless, affordable, and sustainable.
  11. We asked ourselves: what if the research workflow could be managed as easily as… our pictures, our e-mail, home entertainment?
  12. What makes these services great? Great user experience + high-performance (but invisible) infrastructure.
  13. We aspire (initially) to create a great user experience for research data management. What would a "Dropbox for science" look like?
  14. BIG DATA: collect, annotate, move, publish, sync, search, share, backup, analyze, archive.
  15. A common workflow… (Diagram: registry; staging store; ingest store; community store; analysis store; archive; mirror.)
  16. … with common challenges: data movement, sync, and sharing between facilities, archives, and researchers; many files and large data volumes; with security, reliability, and performance. (Same workflow diagram as slide 15.)
  17. Capabilities delivered using the software-as-a-service (SaaS) model: collect, annotate, move, publish, sync, search, share, backup, analyze, archive.
  18. (Transfer diagram.) Step 1: the user initiates a transfer request. Step 2: Globus Online moves/syncs files from the data source to the destination. Step 3: Globus Online notifies the user. (A minimal Python sketch of this flow appears after the slide transcript.)
  19. (Sharing diagram.) Step 1: User A selects file(s) to share, selects a user or group, and sets share permissions. Step 2: Globus Online tracks the shared files; no need to move files to cloud storage! Step 3: User B logs in to Globus Online and accesses the shared file. (A sharing sketch appears after the slide transcript.)
  20. Extreme ease of use: InCommon, OAuth, OpenID, X.509, …; credential management; group definition and management; transfer management and optimization; reliability via transfer retries; web interface, REST API, and command line; one-click "Globus Connect" install; 5-minute Globus Connect Multi User install.
  21. Early adoption is encouraging.
  22. Early adoption is encouraging: 8,000 registered users, ~100 daily; ~10 PB moved, ~1B files; 10x (or better) performance vs. scp; 99.9% availability; entirely hosted on AWS.
  23. Delivering a great user experience relies on high-performance network infrastructure.
  24. A Science DMZ optimizes performance.
  25. What is a Science DMZ? Three key components, all required: (1) a "friction-free" network path: highly capable network devices (wire-speed, deep queues), a virtual circuit connectivity option, security policy and enforcement specific to science workflows, located at or near the site perimeter if possible; (2) dedicated, high-performance Data Transfer Nodes (DTNs): hardware, operating system, and libraries optimized for transfer, with optimized data transfer tools (Globus Online, GridFTP); (3) a performance measurement/test node running perfSONAR. Details at http://fasterdata.es.net/science-dmz/
  26. Globus GridFTP architecture. (Diagram: GridFTP built on Globus XIO, with parallel TCP, UDP, or RDMA transports over dedicated or shared networks.) The internal layered XIO architecture allows alternative network and filesystem interfaces to be plugged into the stack.
  27. GridFTP performance options: TCP configuration; concurrency (multiple flows per node); parallelism (multiple nodes); pipelining of requests to support small files; multiple cores for integrity and encryption; alternative protocol selection*; use of circuits and multiple paths*. Globus Online can configure these options based on what it knows about a transfer. (*Experimental.) (An illustrative tuning example appears after the slide transcript.)
  28. Exploiting multiple paths: take advantage of multiple interfaces in multi-homed data transfer nodes; use a circuit as well as the production IP link; data will flow even while the circuit is being set up; once the circuit is set up, use both paths to improve throughput. (Raj Kettimuthu, Ezra Kissel, Martin Swany, Jason Zurawski, Dan Gunter.)
  29. Exploiting multiple paths. (Plots: transfers between NERSC and ANL and between UMich and Caltech, with and without multipath.) Default, commodity IP routes + dedicated circuits = significant performance gains. (Raj Kettimuthu, Ezra Kissel, Martin Swany, Jason Zurawski, Dan Gunter.)
  30. (Scatter plot: duration of runs, in seconds, over time, 2011-2012; red: >10 TB transfers, green: >1 TB transfers.)
  31. K. Heitmann (Argonne) moves 22 TB of cosmology data LANL → ANL at 5 Gb/s.
  32. B. Winjum (UCLA) moves 900K-file plasma physics datasets UCLA → NERSC.
  33. Dan Kozak (Caltech) replicates 1 PB of LIGO astronomy data for resilience.
  34. BIG DATA capabilities revisited: collect, annotate, move, publish, sync, search, share, backup, analyze, archive.
  35. (Same capability list, repeated.)
  36. Many more capabilities planned… Globus Online: research data management as a service. SaaS layer: ingest, cataloging, and integration; sharing, collaboration, and annotation; backup, archival, and retrieval, … PaaS layer: Globus Integrate (Globus Nexus, Globus Connect).
  37. A platform for integration.
  38. Catalog as a service. Approach: hosted user-defined catalogs; based on a tag model <subject, name, value>; optional schema constraints; integrated with other Globus services. Three REST APIs: /query/ (retrieve subjects); /tags/ (create, delete, retrieve tags); /tagdef/ (create, delete, retrieve tag definitions). Builds on the USC Tagfiler project (C. Kesselman et al.). (An illustrative REST sketch appears after the slide transcript.)
  39. Other early successes in services for science…
  40. (image slide)
  41. (image slide)
  42. Other innovative science SaaS projects.
  43. Other innovative science SaaS projects.
  44. Our vision for a 21st-century cyberinfrastructure: to provide more capability for more people at substantially lower cost by creatively aggregating ("cloud") and federating ("grid") resources. "Science as a service."
  45. Thank you to our sponsors!
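
Slide 18's three-step transfer flow can be illustrated in a few lines of code. The sketch below uses the present-day Globus Python SDK (globus_sdk), which postdates this 2013 talk but drives the same transfer service; the endpoint UUIDs, paths, and access token are placeholders rather than values from the talk.

```python
# Minimal sketch of the slide-18 flow using the Globus Python SDK (globus_sdk).
# Endpoint UUIDs, paths, and the token are placeholders.
import globus_sdk

SOURCE_ENDPOINT = "source-endpoint-uuid"            # placeholder
DESTINATION_ENDPOINT = "destination-endpoint-uuid"  # placeholder
TRANSFER_TOKEN = "..."                              # obtained via Globus Auth

# Step 1: the user initiates a transfer request.
tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
)
task_data = globus_sdk.TransferData(
    source_endpoint=SOURCE_ENDPOINT,
    destination_endpoint=DESTINATION_ENDPOINT,
    sync_level="checksum",  # re-copy only files whose checksums differ
)
task_data.add_item("/source/path/dataset/", "/dest/path/dataset/", recursive=True)

# Step 2: Globus moves/syncs the files on the user's behalf ("fire and forget").
task = tc.submit_transfer(task_data)

# Step 3: the service retries failures and notifies the user; here we simply poll.
tc.task_wait(task["task_id"], timeout=3600)
print("transfer status:", tc.get_task(task["task_id"])["status"])
```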
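Slide 19's sharing flow can be sketched the same way. In 2013 this was done through the Globus Online web UI and REST API; the sketch below again uses the later Python SDK, and the shared endpoint ID, identity UUID, and token are placeholders.

```python
# Minimal sketch of slide-19 sharing: grant another user read access to a path
# on a shared endpoint. No data is copied to cloud storage; Globus only records
# the access rule and the files stay where they are.
import globus_sdk

SHARED_ENDPOINT = "shared-endpoint-uuid"   # placeholder; User A must administer it
USER_B_IDENTITY = "user-b-identity-uuid"   # placeholder Globus identity
TRANSFER_TOKEN = "..."

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
)

# Step 1: User A selects the folder, the user (or group), and the permissions.
rule = {
    "DATA_TYPE": "access",
    "principal_type": "identity",   # or "group"
    "principal": USER_B_IDENTITY,
    "path": "/shared/dataset/",
    "permissions": "r",
}
tc.add_endpoint_acl_rule(SHARED_ENDPOINT, rule)

# Steps 2-3: Globus tracks the shared files in place; User B logs in to Globus
# and reads /shared/dataset/ directly from the source endpoint.
```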
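The GridFTP tuning knobs named on slide 27 (parallelism, concurrency, pipelining) can be set by hand with the globus-url-copy client that ships with the Globus Toolkit; Globus Online normally chooses them automatically. The sketch below is illustrative only: it assumes a Globus Toolkit install with valid GSI credentials, and the endpoints, paths, and flag values are placeholders.

```python
# Rough illustration of slide-27 performance options via globus-url-copy.
# Assumes globus-url-copy is installed and credentials are already in place.
import subprocess

cmd = [
    "globus-url-copy",
    "-p", "4",    # parallelism: 4 TCP streams per file
    "-cc", "8",   # concurrency: 8 files in flight at once
    "-pp",        # pipeline requests, which helps lots-of-small-files workloads
    "-fast",      # reuse data channels between files
    "-r",         # recursive directory copy
    "gsiftp://source.example.org/data/run1/",   # placeholder source
    "gsiftp://dest.example.org/archive/run1/",  # placeholder destination
]
subprocess.run(cmd, check=True)
```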
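Slide 38's catalog-as-a-service idea can also be sketched. Only the three resource paths (/query/, /tags/, /tagdef/) and the <subject, name, value> tag model come from the slide; the base URL, payload shapes, and auth header below are assumptions for illustration, not the actual Tagfiler/Globus catalog wire format.

```python
# Illustrative sketch of the slide-38 catalog REST APIs (hypothetical details).
import requests

BASE = "https://catalog.example.org/mycatalog"   # hypothetical hosted catalog
HEADERS = {"Authorization": "Bearer <token>"}    # placeholder credential

# /tagdef/: define a tag, optionally with a schema constraint on its type.
requests.post(f"{BASE}/tagdef/",
              json={"name": "instrument", "type": "text"},
              headers=HEADERS)

# /tags/: attach a <subject, name, value> triple to a subject (e.g. a file).
requests.post(f"{BASE}/tags/",
              json={"subject": "/experiments/run42/data.h5",
                    "name": "instrument",
                    "value": "APS beamline 2-BM"},
              headers=HEADERS)

# /query/: retrieve subjects matching a tag predicate.
resp = requests.get(f"{BASE}/query/",
                    params={"instrument": "APS beamline 2-BM"},
                    headers=HEADERS)
print(resp.json())
```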
