SlideShare a Scribd company logo
1 of 23
Download to read offline
Demonstrating a Pre-Exascale,
Cost-Effective Multi-Cloud Environment
for Scientific Computing
Igor Sfiligoi – UC San Diego
together with
Frank Würthwein – UC San Diego
Steve Barnet, Vladimir Brik, Benedikt Riedel & David Schultz – UW Madison
How we extended
the IceCube computing environment
to Pre-Exascale size
using “cheap” Cloud resources
Igor Sfiligoi – UC San Diego
together with
Frank Würthwein – UC San Diego
Steve Barnet, Vladimir Brik, Benedikt Riedel & David Schultz – UW Madison
The big picture idea
• Demonstrate the viability of a Pre-Exascale High Throughput Computing
pool using GPU instances from all major Cloud providers
• Without using any reservations or long-term commitments,
and using only ”cheap” instances
• Sustain the pool for a whole workday
• Running a real scientific application – IceCube simulation
• We went and did it
• Reached about 170 PFLOP32s
• Integrating an EFLOP32 hour
• At a $60k price tag
The Science
The Neutrino detector
How does one detect neutrinos?
• Cherenkov light - Sonic boom with light
• Cherenkov light appears when a
charged particle travels through matter
faster than light can
Ice happens to be
a great medium
• Solid
• Yet transparent
Classic example from nuclear reactor
• Using natural ice has its disadvantages
• Ice optical properties not homogeneous
(deposited over millennia)
Aya Ishihara
7
• Detector calibration via ray-tracing simulation
• Several ExaFLOP months of compute needed
• GPUs very effective for computing
photon propagation (ray-tracing)
• Orders of magnitude more effective than CPUs
• OpenCL makes it easy to use any GPU type
• Intrinsically an HTC problem
(High Throughput Computing)
• Each photon independent (pleasantly parallel)
• A statistical problem
Science and compute meet
Optical Prop
Aya Ishihara
Combining all the possible inform
These features are included in s
We re al a s be de eloping he
Nature never tell us a perfect a
satisfactory agreeme
The Cloud Burst
IceCube is an OSG user
• IceCube already doing HTC at scale through the Open Science Grid (OSG)
• Which is based on HTCondor as its Workload Management System (WMS)
• HTCondor makes is trivial to join
any kind of compute resource into a single HTC pool
• No special network requirements, no shared anything
• Cloud resources one of the easiest to add
• The desired scale was beyond what they normally use
• Just added some additional hardware to handle the additional load
• All jobs could still run in either on-prem (OSG, XSEDE or PRP/k8s)
or Cloud resources
10
AWS region map
MS Azure region map
Provisioning using native Cloud APIs
• Native Cloud APIs easy to use
• But be aware that each Cloud region completely independent
• Using group provisioning mechanisms in all three
• AWS – Spot Fleet
• Azure – Virtual Machine Scale Set (VMSS)
• Google – Instance Groups
• VM Image
• OS + HTCondor + CVMFS for software distribution
• Tailored for Cloud provider
• To use appropriate tools and drivers
• Parametrized to deal with region differences
(based on Cloud Metadata catalog info fetched during boot)
Integrated an EFLOP hour
• Executed on a random Tuesday in February 2020
• Provisioning from all over the world (from AWS+Azure+Google Cloud)
• Reached about 170 fp32 PFLOPS
• Sustained for a whole workday (6h at plateau)
• Happened to integrate just about one fp32 EFLOP hour
Spending about $60k
Negligible egress costs
during post-run data export
Comparable to TACC’s Frontera
1/3rd
of OLCF’s Summit
• Requested only spot/preemptible instances( ~1/3rd the cost of on-demand)
• T4s the most cost effective – 40 PFLOPS at ~$1k/hour
• Used also P40 and V100 to reach 170 PFLOPS at ~$10k/hour
• No network related costs
• Only incoming traffic during the Cloud Burst
Minimal impact of preemption
• Preemption only a minor nuisance
• Incurred less than 10% waste
• Great deal for a 70% discount
• Recovery completely transparent to users
• IceCube jobs ideal for such environment
• Job runtimes in the 20-60 minutes range
Input file handling
Just-in-time input fetching
• Each IceCube job requires a unique input file
• Created asynchronously ahead of time, stored in IceCube@UW storage
• About 45 MB in size
• Job wrapper pulls the appropriate file as first step
• We used standard HTTP fetch using aria2
• Photon propagation code reads it from local disk
• Waste due to file transfer minimal
• A few seconds out of O(1k) job runtime
• Even for remote resources
Plenty of network capacity
• In preparation for the actual burst
demonstrated close to 100 Gbps
• See poster 197 for more details
“Demonstrating 100 Gbps in and out of the public Clouds”
• IceCube Cloud Burst required
only about 4 Gbps on average
WAN bandwidth
• Short peaks of up to 6 Gbps
ThroughputinBytespersecond
Conclusions
The Cloud is an option
• We showed that there is plenty of capacity in the Cloud
(at least, that was the case in February 2020)
• We reached and sustained about 170 fp32 PFLOPS for a whole workday
• It is relatively easy to expand an HTC system into the Clouds
• Without any advanced planning or long-term commitments
• At a reasonable cost ($30k-$60k for an integrated fp32 EFLOP hour)
• While still running real scientific computing on it
• Nice experience, we are confident we could do this again anytime
• If interested, talk to me or OSG
19
This work was partially funded by
the US National Science Foundation (NSF)
under grants OAC-1941481, MPS-1148698,
OAC-1841530, OAC-1826967,
OAC-1904444 and OPP-1600823.
Acknowledgments
Backup slides
Integrated an EFLOP hour
• Executed on a random Tuesday in February 2020
• Provisioning from all over the world (from AWS+Azure+Google Cloud)
• Sustained for a whole workday (6h at plateau)
• Reached and sustained about 170 fp32 PFLOPS
• Happened to integrate just about one fp32 EFLOP hour
Quick note:
This was our 2nd Cloud Burst with IceCube
The 1st one was executed
in November 2019
and was aimed at showing
peak capacity (not sustained one)
and without budget considerations
More details in my ISC20 talk
https://youtu.be/VNhQxIVJOXw
380 fp32 PFLOPS at peak
About 2 hours total
FLOPS a good metric
• We talked about PFLOPS because it is easy to measure
• # GPUs * NVIDIA-provided specs
• But science output is what matters
• Turns out # jobs and # FLOPS correlate very well

More Related Content

What's hot

Using commercial Clouds to process IceCube jobs
Using commercial Clouds to process IceCube jobsUsing commercial Clouds to process IceCube jobs
Using commercial Clouds to process IceCube jobsIgor Sfiligoi
 
Managing Cloud networking costs for data-intensive applications by provisioni...
Managing Cloud networking costs for data-intensive applications by provisioni...Managing Cloud networking costs for data-intensive applications by provisioni...
Managing Cloud networking costs for data-intensive applications by provisioni...Igor Sfiligoi
 
Finding New Sub-Atomic Particles on the AWS Cloud (BDT402) | AWS re:Invent 2013
Finding New Sub-Atomic Particles on the AWS Cloud (BDT402) | AWS re:Invent 2013Finding New Sub-Atomic Particles on the AWS Cloud (BDT402) | AWS re:Invent 2013
Finding New Sub-Atomic Particles on the AWS Cloud (BDT402) | AWS re:Invent 2013Amazon Web Services
 
Near Exascale Computing in the Cloud
Near Exascale Computing in the CloudNear Exascale Computing in the Cloud
Near Exascale Computing in the CloudFrank Wuerthwein
 
Analyzing Larger RasterData in a Jupyter Notebook with GeoPySpark on AWS - FO...
Analyzing Larger RasterData in a Jupyter Notebook with GeoPySpark on AWS - FO...Analyzing Larger RasterData in a Jupyter Notebook with GeoPySpark on AWS - FO...
Analyzing Larger RasterData in a Jupyter Notebook with GeoPySpark on AWS - FO...Rob Emanuele
 
The OpenStack Cloud at CERN - OpenStack Nordic
The OpenStack Cloud at CERN - OpenStack NordicThe OpenStack Cloud at CERN - OpenStack Nordic
The OpenStack Cloud at CERN - OpenStack NordicTim Bell
 
20150924 rda federation_v1
20150924 rda federation_v120150924 rda federation_v1
20150924 rda federation_v1Tim Bell
 
OpenStack @ CERN, by Tim Bell
OpenStack @ CERN, by Tim BellOpenStack @ CERN, by Tim Bell
OpenStack @ CERN, by Tim BellAmrita Prasad
 
inGeneoS: Intercontinental Genetic sequencing over trans-Pacific networks and...
inGeneoS: Intercontinental Genetic sequencing over trans-Pacific networks and...inGeneoS: Intercontinental Genetic sequencing over trans-Pacific networks and...
inGeneoS: Intercontinental Genetic sequencing over trans-Pacific networks and...Andrew Howard
 
20161025 OpenStack at CERN Barcelona
20161025 OpenStack at CERN Barcelona20161025 OpenStack at CERN Barcelona
20161025 OpenStack at CERN BarcelonaTim Bell
 
OpenStack at CERN : A 5 year perspective
OpenStack at CERN : A 5 year perspectiveOpenStack at CERN : A 5 year perspective
OpenStack at CERN : A 5 year perspectiveTim Bell
 
Empowering Congress with Data-Driven Analytics (BDT304) | AWS re:Invent 2013
Empowering Congress with Data-Driven Analytics (BDT304) | AWS re:Invent 2013Empowering Congress with Data-Driven Analytics (BDT304) | AWS re:Invent 2013
Empowering Congress with Data-Driven Analytics (BDT304) | AWS re:Invent 2013Amazon Web Services
 
How a Particle Accelerator Monitors Scientific Experiments Using InfluxDB
How a Particle Accelerator Monitors Scientific Experiments Using InfluxDBHow a Particle Accelerator Monitors Scientific Experiments Using InfluxDB
How a Particle Accelerator Monitors Scientific Experiments Using InfluxDBInfluxData
 
Stabilising the jenga tower
Stabilising the jenga towerStabilising the jenga tower
Stabilising the jenga towerGordon Chung
 
20181219 ucc open stack 5 years v3
20181219 ucc open stack 5 years v320181219 ucc open stack 5 years v3
20181219 ucc open stack 5 years v3Tim Bell
 
Cloud computing and bioinformatics
Cloud computing and bioinformaticsCloud computing and bioinformatics
Cloud computing and bioinformaticsEnis Afgan
 
Cycle Computing Record-breaking Petascale HPC Run
Cycle Computing Record-breaking Petascale HPC RunCycle Computing Record-breaking Petascale HPC Run
Cycle Computing Record-breaking Petascale HPC Runinside-BigData.com
 
Updates on the Fake Object Pipeline for HSC Survey
Updates on the Fake Object Pipeline for HSC Survey Updates on the Fake Object Pipeline for HSC Survey
Updates on the Fake Object Pipeline for HSC Survey Song Huang
 
Anatomy of an action
Anatomy of an actionAnatomy of an action
Anatomy of an actionGordon Chung
 
유연하고 확장성 있는 빅데이터 처리
유연하고 확장성 있는 빅데이터 처리유연하고 확장성 있는 빅데이터 처리
유연하고 확장성 있는 빅데이터 처리NAVER D2
 

What's hot (20)

Using commercial Clouds to process IceCube jobs
Using commercial Clouds to process IceCube jobsUsing commercial Clouds to process IceCube jobs
Using commercial Clouds to process IceCube jobs
 
Managing Cloud networking costs for data-intensive applications by provisioni...
Managing Cloud networking costs for data-intensive applications by provisioni...Managing Cloud networking costs for data-intensive applications by provisioni...
Managing Cloud networking costs for data-intensive applications by provisioni...
 
Finding New Sub-Atomic Particles on the AWS Cloud (BDT402) | AWS re:Invent 2013
Finding New Sub-Atomic Particles on the AWS Cloud (BDT402) | AWS re:Invent 2013Finding New Sub-Atomic Particles on the AWS Cloud (BDT402) | AWS re:Invent 2013
Finding New Sub-Atomic Particles on the AWS Cloud (BDT402) | AWS re:Invent 2013
 
Near Exascale Computing in the Cloud
Near Exascale Computing in the CloudNear Exascale Computing in the Cloud
Near Exascale Computing in the Cloud
 
Analyzing Larger RasterData in a Jupyter Notebook with GeoPySpark on AWS - FO...
Analyzing Larger RasterData in a Jupyter Notebook with GeoPySpark on AWS - FO...Analyzing Larger RasterData in a Jupyter Notebook with GeoPySpark on AWS - FO...
Analyzing Larger RasterData in a Jupyter Notebook with GeoPySpark on AWS - FO...
 
The OpenStack Cloud at CERN - OpenStack Nordic
The OpenStack Cloud at CERN - OpenStack NordicThe OpenStack Cloud at CERN - OpenStack Nordic
The OpenStack Cloud at CERN - OpenStack Nordic
 
20150924 rda federation_v1
20150924 rda federation_v120150924 rda federation_v1
20150924 rda federation_v1
 
OpenStack @ CERN, by Tim Bell
OpenStack @ CERN, by Tim BellOpenStack @ CERN, by Tim Bell
OpenStack @ CERN, by Tim Bell
 
inGeneoS: Intercontinental Genetic sequencing over trans-Pacific networks and...
inGeneoS: Intercontinental Genetic sequencing over trans-Pacific networks and...inGeneoS: Intercontinental Genetic sequencing over trans-Pacific networks and...
inGeneoS: Intercontinental Genetic sequencing over trans-Pacific networks and...
 
20161025 OpenStack at CERN Barcelona
20161025 OpenStack at CERN Barcelona20161025 OpenStack at CERN Barcelona
20161025 OpenStack at CERN Barcelona
 
OpenStack at CERN : A 5 year perspective
OpenStack at CERN : A 5 year perspectiveOpenStack at CERN : A 5 year perspective
OpenStack at CERN : A 5 year perspective
 
Empowering Congress with Data-Driven Analytics (BDT304) | AWS re:Invent 2013
Empowering Congress with Data-Driven Analytics (BDT304) | AWS re:Invent 2013Empowering Congress with Data-Driven Analytics (BDT304) | AWS re:Invent 2013
Empowering Congress with Data-Driven Analytics (BDT304) | AWS re:Invent 2013
 
How a Particle Accelerator Monitors Scientific Experiments Using InfluxDB
How a Particle Accelerator Monitors Scientific Experiments Using InfluxDBHow a Particle Accelerator Monitors Scientific Experiments Using InfluxDB
How a Particle Accelerator Monitors Scientific Experiments Using InfluxDB
 
Stabilising the jenga tower
Stabilising the jenga towerStabilising the jenga tower
Stabilising the jenga tower
 
20181219 ucc open stack 5 years v3
20181219 ucc open stack 5 years v320181219 ucc open stack 5 years v3
20181219 ucc open stack 5 years v3
 
Cloud computing and bioinformatics
Cloud computing and bioinformaticsCloud computing and bioinformatics
Cloud computing and bioinformatics
 
Cycle Computing Record-breaking Petascale HPC Run
Cycle Computing Record-breaking Petascale HPC RunCycle Computing Record-breaking Petascale HPC Run
Cycle Computing Record-breaking Petascale HPC Run
 
Updates on the Fake Object Pipeline for HSC Survey
Updates on the Fake Object Pipeline for HSC Survey Updates on the Fake Object Pipeline for HSC Survey
Updates on the Fake Object Pipeline for HSC Survey
 
Anatomy of an action
Anatomy of an actionAnatomy of an action
Anatomy of an action
 
유연하고 확장성 있는 빅데이터 처리
유연하고 확장성 있는 빅데이터 처리유연하고 확장성 있는 빅데이터 처리
유연하고 확장성 있는 빅데이터 처리
 

Similar to Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scientific Computing

Bursting into the public Cloud - Sharing my experience doing it at large scal...
Bursting into the public Cloud - Sharing my experience doing it at large scal...Bursting into the public Cloud - Sharing my experience doing it at large scal...
Bursting into the public Cloud - Sharing my experience doing it at large scal...Igor Sfiligoi
 
Big Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapBig Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapSrinath Perera
 
HPC Cluster Computing from 64 to 156,000 Cores 
HPC Cluster Computing from 64 to 156,000 Cores HPC Cluster Computing from 64 to 156,000 Cores 
HPC Cluster Computing from 64 to 156,000 Cores inside-BigData.com
 
Using the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchUsing the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchRobert Grossman
 
Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...
Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...
Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...Alluxio, Inc.
 
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio, Inc.
 
BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...
BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...
BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...Dirk Petersen
 
Panel: Open Infrastructure for an Open Society: OSG, Commercial Clouds, and B...
Panel: Open Infrastructure for an Open Society: OSG, Commercial Clouds, and B...Panel: Open Infrastructure for an Open Society: OSG, Commercial Clouds, and B...
Panel: Open Infrastructure for an Open Society: OSG, Commercial Clouds, and B...Larry Smarr
 
Panel: Open Infrastructure for an Open Society: OSG, Commercial Clouds, and B...
Panel: Open Infrastructure for an Open Society: OSG, Commercial Clouds, and B...Panel: Open Infrastructure for an Open Society: OSG, Commercial Clouds, and B...
Panel: Open Infrastructure for an Open Society: OSG, Commercial Clouds, and B...Larry Smarr
 
Panel: Open Infrastructure for an Open Society: OSG, Commercial Clouds, and B...
Panel: Open Infrastructure for an Open Society: OSG, Commercial Clouds, and B...Panel: Open Infrastructure for an Open Society: OSG, Commercial Clouds, and B...
Panel: Open Infrastructure for an Open Society: OSG, Commercial Clouds, and B...Larry Smarr
 
Rendering Takes Flight
Rendering Takes FlightRendering Takes Flight
Rendering Takes FlightAvere Systems
 
Rendering Takes Flight
Rendering Takes FlightRendering Takes Flight
Rendering Takes FlightMichelle Ray
 
Kubernetes - Hosted OSG Services
Kubernetes - Hosted OSG ServicesKubernetes - Hosted OSG Services
Kubernetes - Hosted OSG ServicesIgor Sfiligoi
 
CLIMB System Introduction Talk - CLIMB Launch
CLIMB System Introduction Talk - CLIMB LaunchCLIMB System Introduction Talk - CLIMB Launch
CLIMB System Introduction Talk - CLIMB LaunchTom Connor
 
Alluxio (Formerly Tachyon): Unify Data At Memory Speed at Global Big Data Con...
Alluxio (Formerly Tachyon): Unify Data At Memory Speed at Global Big Data Con...Alluxio (Formerly Tachyon): Unify Data At Memory Speed at Global Big Data Con...
Alluxio (Formerly Tachyon): Unify Data At Memory Speed at Global Big Data Con...Alluxio, Inc.
 
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah Bard
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah BardUsing Containers and HPC to Solve the Mysteries of the Universe by Deborah Bard
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah BardDocker, Inc.
 
BIOIT14: Deploying very low cost cloud storage technology in a traditional re...
BIOIT14: Deploying very low cost cloud storage technology in a traditional re...BIOIT14: Deploying very low cost cloud storage technology in a traditional re...
BIOIT14: Deploying very low cost cloud storage technology in a traditional re...Dirk Petersen
 
Ceph on the Brain: Storage and Data-Movement Supporting the Human Brain Project
Ceph on the Brain: Storage and Data-Movement Supporting the Human Brain ProjectCeph on the Brain: Storage and Data-Movement Supporting the Human Brain Project
Ceph on the Brain: Storage and Data-Movement Supporting the Human Brain Projectinside-BigData.com
 

Similar to Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scientific Computing (20)

Bursting into the public Cloud - Sharing my experience doing it at large scal...
Bursting into the public Cloud - Sharing my experience doing it at large scal...Bursting into the public Cloud - Sharing my experience doing it at large scal...
Bursting into the public Cloud - Sharing my experience doing it at large scal...
 
Big Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapBig Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and Roadmap
 
HPC Cluster Computing from 64 to 156,000 Cores 
HPC Cluster Computing from 64 to 156,000 Cores HPC Cluster Computing from 64 to 156,000 Cores 
HPC Cluster Computing from 64 to 156,000 Cores 
 
Using the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchUsing the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science Research
 
Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...
Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...
Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...
 
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
 
Thoughts on Cybersecurity
Thoughts on CybersecurityThoughts on Cybersecurity
Thoughts on Cybersecurity
 
BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...
BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...
BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...
 
Panel: Open Infrastructure for an Open Society: OSG, Commercial Clouds, and B...
Panel: Open Infrastructure for an Open Society: OSG, Commercial Clouds, and B...Panel: Open Infrastructure for an Open Society: OSG, Commercial Clouds, and B...
Panel: Open Infrastructure for an Open Society: OSG, Commercial Clouds, and B...
 
Panel: Open Infrastructure for an Open Society: OSG, Commercial Clouds, and B...
Panel: Open Infrastructure for an Open Society: OSG, Commercial Clouds, and B...Panel: Open Infrastructure for an Open Society: OSG, Commercial Clouds, and B...
Panel: Open Infrastructure for an Open Society: OSG, Commercial Clouds, and B...
 
Panel: Open Infrastructure for an Open Society: OSG, Commercial Clouds, and B...
Panel: Open Infrastructure for an Open Society: OSG, Commercial Clouds, and B...Panel: Open Infrastructure for an Open Society: OSG, Commercial Clouds, and B...
Panel: Open Infrastructure for an Open Society: OSG, Commercial Clouds, and B...
 
Rendering Takes Flight
Rendering Takes FlightRendering Takes Flight
Rendering Takes Flight
 
Rendering Takes Flight
Rendering Takes FlightRendering Takes Flight
Rendering Takes Flight
 
Kubernetes - Hosted OSG Services
Kubernetes - Hosted OSG ServicesKubernetes - Hosted OSG Services
Kubernetes - Hosted OSG Services
 
CLIMB System Introduction Talk - CLIMB Launch
CLIMB System Introduction Talk - CLIMB LaunchCLIMB System Introduction Talk - CLIMB Launch
CLIMB System Introduction Talk - CLIMB Launch
 
Supercomputer @ manarat university by reza
Supercomputer  @ manarat university by rezaSupercomputer  @ manarat university by reza
Supercomputer @ manarat university by reza
 
Alluxio (Formerly Tachyon): Unify Data At Memory Speed at Global Big Data Con...
Alluxio (Formerly Tachyon): Unify Data At Memory Speed at Global Big Data Con...Alluxio (Formerly Tachyon): Unify Data At Memory Speed at Global Big Data Con...
Alluxio (Formerly Tachyon): Unify Data At Memory Speed at Global Big Data Con...
 
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah Bard
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah BardUsing Containers and HPC to Solve the Mysteries of the Universe by Deborah Bard
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah Bard
 
BIOIT14: Deploying very low cost cloud storage technology in a traditional re...
BIOIT14: Deploying very low cost cloud storage technology in a traditional re...BIOIT14: Deploying very low cost cloud storage technology in a traditional re...
BIOIT14: Deploying very low cost cloud storage technology in a traditional re...
 
Ceph on the Brain: Storage and Data-Movement Supporting the Human Brain Project
Ceph on the Brain: Storage and Data-Movement Supporting the Human Brain ProjectCeph on the Brain: Storage and Data-Movement Supporting the Human Brain Project
Ceph on the Brain: Storage and Data-Movement Supporting the Human Brain Project
 

More from Igor Sfiligoi

Preparing Fusion codes for Perlmutter - CGYRO
Preparing Fusion codes for Perlmutter - CGYROPreparing Fusion codes for Perlmutter - CGYRO
Preparing Fusion codes for Perlmutter - CGYROIgor Sfiligoi
 
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...Igor Sfiligoi
 
Comparing single-node and multi-node performance of an important fusion HPC c...
Comparing single-node and multi-node performance of an important fusion HPC c...Comparing single-node and multi-node performance of an important fusion HPC c...
Comparing single-node and multi-node performance of an important fusion HPC c...Igor Sfiligoi
 
The anachronism of whole-GPU accounting
The anachronism of whole-GPU accountingThe anachronism of whole-GPU accounting
The anachronism of whole-GPU accountingIgor Sfiligoi
 
Auto-scaling HTCondor pools using Kubernetes compute resources
Auto-scaling HTCondor pools using Kubernetes compute resourcesAuto-scaling HTCondor pools using Kubernetes compute resources
Auto-scaling HTCondor pools using Kubernetes compute resourcesIgor Sfiligoi
 
Speeding up bowtie2 by improving cache-hit rate
Speeding up bowtie2 by improving cache-hit rateSpeeding up bowtie2 by improving cache-hit rate
Speeding up bowtie2 by improving cache-hit rateIgor Sfiligoi
 
Performance Optimization of CGYRO for Multiscale Turbulence Simulations
Performance Optimization of CGYRO for Multiscale Turbulence SimulationsPerformance Optimization of CGYRO for Multiscale Turbulence Simulations
Performance Optimization of CGYRO for Multiscale Turbulence SimulationsIgor Sfiligoi
 
Comparing GPU effectiveness for Unifrac distance compute
Comparing GPU effectiveness for Unifrac distance computeComparing GPU effectiveness for Unifrac distance compute
Comparing GPU effectiveness for Unifrac distance computeIgor Sfiligoi
 
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory AccessAccelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory AccessIgor Sfiligoi
 
Modest scale HPC on Azure using CGYRO
Modest scale HPC on Azure using CGYROModest scale HPC on Azure using CGYRO
Modest scale HPC on Azure using CGYROIgor Sfiligoi
 
Scheduling a Kubernetes Federation with Admiralty
Scheduling a Kubernetes Federation with AdmiraltyScheduling a Kubernetes Federation with Admiralty
Scheduling a Kubernetes Federation with AdmiraltyIgor Sfiligoi
 
Accelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACCAccelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACCIgor Sfiligoi
 
Porting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUsPorting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUsIgor Sfiligoi
 
Demonstrating 100 Gbps in and out of the public Clouds
Demonstrating 100 Gbps in and out of the public CloudsDemonstrating 100 Gbps in and out of the public Clouds
Demonstrating 100 Gbps in and out of the public CloudsIgor Sfiligoi
 
TransAtlantic Networking using Cloud links
TransAtlantic Networking using Cloud linksTransAtlantic Networking using Cloud links
TransAtlantic Networking using Cloud linksIgor Sfiligoi
 
Demonstrating 100 Gbps in and out of the Clouds
Demonstrating 100 Gbps in and out of the CloudsDemonstrating 100 Gbps in and out of the Clouds
Demonstrating 100 Gbps in and out of the CloudsIgor Sfiligoi
 
Serving HTC Users in Kubernetes by Leveraging HTCondor
Serving HTC Users in Kubernetes by Leveraging HTCondorServing HTC Users in Kubernetes by Leveraging HTCondor
Serving HTC Users in Kubernetes by Leveraging HTCondorIgor Sfiligoi
 
Characterizing network paths in and out of the Clouds
Characterizing network paths in and out of the CloudsCharacterizing network paths in and out of the Clouds
Characterizing network paths in and out of the CloudsIgor Sfiligoi
 
GRP 19 - Nautilus, IceCube and LIGO
GRP 19 - Nautilus, IceCube and LIGOGRP 19 - Nautilus, IceCube and LIGO
GRP 19 - Nautilus, IceCube and LIGOIgor Sfiligoi
 
The Open Science Grid and how it relates to PRAGMA
The Open Science Grid and how it relates to PRAGMAThe Open Science Grid and how it relates to PRAGMA
The Open Science Grid and how it relates to PRAGMAIgor Sfiligoi
 

More from Igor Sfiligoi (20)

Preparing Fusion codes for Perlmutter - CGYRO
Preparing Fusion codes for Perlmutter - CGYROPreparing Fusion codes for Perlmutter - CGYRO
Preparing Fusion codes for Perlmutter - CGYRO
 
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...
 
Comparing single-node and multi-node performance of an important fusion HPC c...
Comparing single-node and multi-node performance of an important fusion HPC c...Comparing single-node and multi-node performance of an important fusion HPC c...
Comparing single-node and multi-node performance of an important fusion HPC c...
 
The anachronism of whole-GPU accounting
The anachronism of whole-GPU accountingThe anachronism of whole-GPU accounting
The anachronism of whole-GPU accounting
 
Auto-scaling HTCondor pools using Kubernetes compute resources
Auto-scaling HTCondor pools using Kubernetes compute resourcesAuto-scaling HTCondor pools using Kubernetes compute resources
Auto-scaling HTCondor pools using Kubernetes compute resources
 
Speeding up bowtie2 by improving cache-hit rate
Speeding up bowtie2 by improving cache-hit rateSpeeding up bowtie2 by improving cache-hit rate
Speeding up bowtie2 by improving cache-hit rate
 
Performance Optimization of CGYRO for Multiscale Turbulence Simulations
Performance Optimization of CGYRO for Multiscale Turbulence SimulationsPerformance Optimization of CGYRO for Multiscale Turbulence Simulations
Performance Optimization of CGYRO for Multiscale Turbulence Simulations
 
Comparing GPU effectiveness for Unifrac distance compute
Comparing GPU effectiveness for Unifrac distance computeComparing GPU effectiveness for Unifrac distance compute
Comparing GPU effectiveness for Unifrac distance compute
 
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory AccessAccelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
 
Modest scale HPC on Azure using CGYRO
Modest scale HPC on Azure using CGYROModest scale HPC on Azure using CGYRO
Modest scale HPC on Azure using CGYRO
 
Scheduling a Kubernetes Federation with Admiralty
Scheduling a Kubernetes Federation with AdmiraltyScheduling a Kubernetes Federation with Admiralty
Scheduling a Kubernetes Federation with Admiralty
 
Accelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACCAccelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACC
 
Porting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUsPorting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUs
 
Demonstrating 100 Gbps in and out of the public Clouds
Demonstrating 100 Gbps in and out of the public CloudsDemonstrating 100 Gbps in and out of the public Clouds
Demonstrating 100 Gbps in and out of the public Clouds
 
TransAtlantic Networking using Cloud links
TransAtlantic Networking using Cloud linksTransAtlantic Networking using Cloud links
TransAtlantic Networking using Cloud links
 
Demonstrating 100 Gbps in and out of the Clouds
Demonstrating 100 Gbps in and out of the CloudsDemonstrating 100 Gbps in and out of the Clouds
Demonstrating 100 Gbps in and out of the Clouds
 
Serving HTC Users in Kubernetes by Leveraging HTCondor
Serving HTC Users in Kubernetes by Leveraging HTCondorServing HTC Users in Kubernetes by Leveraging HTCondor
Serving HTC Users in Kubernetes by Leveraging HTCondor
 
Characterizing network paths in and out of the Clouds
Characterizing network paths in and out of the CloudsCharacterizing network paths in and out of the Clouds
Characterizing network paths in and out of the Clouds
 
GRP 19 - Nautilus, IceCube and LIGO
GRP 19 - Nautilus, IceCube and LIGOGRP 19 - Nautilus, IceCube and LIGO
GRP 19 - Nautilus, IceCube and LIGO
 
The Open Science Grid and how it relates to PRAGMA
The Open Science Grid and how it relates to PRAGMAThe Open Science Grid and how it relates to PRAGMA
The Open Science Grid and how it relates to PRAGMA
 

Recently uploaded

Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 

Recently uploaded (20)

Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 

Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scientific Computing

  • 1. Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scientific Computing Igor Sfiligoi – UC San Diego together with Frank Würthwein – UC San Diego Steve Barnet, Vladimir Brik, Benedikt Riedel & David Schultz – UW Madison
  • 2. How we extended the IceCube computing environment to Pre-Exascale size using “cheap” Cloud resources Igor Sfiligoi – UC San Diego together with Frank Würthwein – UC San Diego Steve Barnet, Vladimir Brik, Benedikt Riedel & David Schultz – UW Madison
  • 3. The big picture idea • Demonstrate the viability of a Pre-Exascale High Throughput Computing pool using GPU instances from all major Cloud providers • Without using any reservations or long-term commitments, and using only ”cheap” instances • Sustain the pool for a whole workday • Running a real scientific application – IceCube simulation • We went and did it • Reached about 170 PFLOP32s • Integrating an EFLOP32 hour • At a $60k price tag
  • 6. How does one detect neutrinos? • Cherenkov light - Sonic boom with light • Cherenkov light appears when a charged particle travels through matter faster than light can Ice happens to be a great medium • Solid • Yet transparent Classic example from nuclear reactor
  • 7. • Using natural ice has its disadvantages • Ice optical properties not homogeneous (deposited over millennia) Aya Ishihara 7
  • 8. • Detector calibration via ray-tracing simulation • Several ExaFLOP months of compute needed • GPUs very effective for computing photon propagation (ray-tracing) • Orders of magnitude more effective than CPUs • OpenCL makes it easy to use any GPU type • Intrinsically an HTC problem (High Throughput Computing) • Each photon independent (pleasantly parallel) • A statistical problem Science and compute meet Optical Prop Aya Ishihara Combining all the possible inform These features are included in s We re al a s be de eloping he Nature never tell us a perfect a satisfactory agreeme
  • 10. IceCube is an OSG user • IceCube already doing HTC at scale through the Open Science Grid (OSG) • Which is based on HTCondor as its Workload Management System (WMS) • HTCondor makes is trivial to join any kind of compute resource into a single HTC pool • No special network requirements, no shared anything • Cloud resources one of the easiest to add • The desired scale was beyond what they normally use • Just added some additional hardware to handle the additional load • All jobs could still run in either on-prem (OSG, XSEDE or PRP/k8s) or Cloud resources 10
  • 11. AWS region map MS Azure region map Provisioning using native Cloud APIs • Native Cloud APIs easy to use • But be aware that each Cloud region completely independent • Using group provisioning mechanisms in all three • AWS – Spot Fleet • Azure – Virtual Machine Scale Set (VMSS) • Google – Instance Groups • VM Image • OS + HTCondor + CVMFS for software distribution • Tailored for Cloud provider • To use appropriate tools and drivers • Parametrized to deal with region differences (based on Cloud Metadata catalog info fetched during boot)
  • 12. Integrated an EFLOP hour • Executed on a random Tuesday in February 2020 • Provisioning from all over the world (from AWS+Azure+Google Cloud) • Reached about 170 fp32 PFLOPS • Sustained for a whole workday (6h at plateau) • Happened to integrate just about one fp32 EFLOP hour
  • 13. Spending about $60k Negligible egress costs during post-run data export Comparable to TACC’s Frontera 1/3rd of OLCF’s Summit • Requested only spot/preemptible instances( ~1/3rd the cost of on-demand) • T4s the most cost effective – 40 PFLOPS at ~$1k/hour • Used also P40 and V100 to reach 170 PFLOPS at ~$10k/hour • No network related costs • Only incoming traffic during the Cloud Burst
  • 14. Minimal impact of preemption • Preemption only a minor nuisance • Incurred less than 10% waste • Great deal for a 70% discount • Recovery completely transparent to users • IceCube jobs ideal for such environment • Job runtimes in the 20-60 minutes range
  • 16. Just-in-time input fetching • Each IceCube job requires a unique input file • Created asynchronously ahead of time, stored in IceCube@UW storage • About 45 MB in size • Job wrapper pulls the appropriate file as first step • We used standard HTTP fetch using aria2 • Photon propagation code reads it from local disk • Waste due to file transfer minimal • A few seconds out of O(1k) job runtime • Even for remote resources
  • 17. Plenty of network capacity • In preparation for the actual burst demonstrated close to 100 Gbps • See poster 197 for more details “Demonstrating 100 Gbps in and out of the public Clouds” • IceCube Cloud Burst required only about 4 Gbps on average WAN bandwidth • Short peaks of up to 6 Gbps ThroughputinBytespersecond
  • 19. The Cloud is an option • We showed that there is plenty of capacity in the Cloud (at least, that was the case in February 2020) • We reached and sustained about 170 fp32 PFLOPS for a whole workday • It is relatively easy to expand an HTC system into the Clouds • Without any advanced planning or long-term commitments • At a reasonable cost ($30k-$60k for an integrated fp32 EFLOP hour) • While still running real scientific computing on it • Nice experience, we are confident we could do this again anytime • If interested, talk to me or OSG 19
  • 20. This work was partially funded by the US National Science Foundation (NSF) under grants OAC-1941481, MPS-1148698, OAC-1841530, OAC-1826967, OAC-1904444 and OPP-1600823. Acknowledgments
  • 22. Integrated an EFLOP hour • Executed on a random Tuesday in February 2020 • Provisioning from all over the world (from AWS+Azure+Google Cloud) • Sustained for a whole workday (6h at plateau) • Reached and sustained about 170 fp32 PFLOPS • Happened to integrate just about one fp32 EFLOP hour Quick note: This was our 2nd Cloud Burst with IceCube The 1st one was executed in November 2019 and was aimed at showing peak capacity (not sustained one) and without budget considerations More details in my ISC20 talk https://youtu.be/VNhQxIVJOXw 380 fp32 PFLOPS at peak About 2 hours total
  • 23. FLOPS a good metric • We talked about PFLOPS because it is easy to measure • # GPUs * NVIDIA-provided specs • But science output is what matters • Turns out # jobs and # FLOPS correlate very well