Data-intensive IceCube Cloud Burst
leveraging Internet2 Cloud Connect Service
By Igor Sfiligoi – UCSD
For the rest of the IceCube Cloud Burst team (UCSD+UW Madison)
NRP-Pilot Weekly Meeting, Nov 12th 2020
IceCube GPU-based Cloud bursts
• Large amount of photon
propagation simulation needed
to properly calibrate natural Ice
• Simulation compute intensive,
and ideal for GPU compute
Nov’20
https://icecube.wisc.edu
Previous: https://doi.org/10.1145/3311790.3396625
Integral: 225 fp32 PFLOP hours
Egress data intensive Cloud run
• This IceCube simulation was
relatively heavy in egress data
• 2 GB per job
• Job length ~= 0.5 hour
• And very spiky
• The whole file
is transferred
after compute completed
• Input sizes small-ish
• 0.25 GB
• Cloud burst exceeded 10 GBps
• To make good use of a large fraction
of available Cloud GPUs
https://www.linkedin.com/pulse/cloudy-100-pflops-gbps-icecube-igor-sfiligoi
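As a back-of-the-envelope check on these numbers, the sketch below estimates how many concurrent jobs would correspond to a given aggregate egress rate. It assumes a steady state in which each job's output is spread evenly over its runtime, which is our simplification: the slide stresses that the real traffic was spiky, so actual peaks would be higher than this average. The function names are illustrative only.

```python
# Figures taken from the slide; the steady-state model is an assumption.
OUTPUT_GB_PER_JOB = 2.0      # egress data per job
JOB_LENGTH_HOURS = 0.5       # approximate job length
TARGET_EGRESS_GBPS = 10.0    # gigabytes per second, as exceeded by the burst


def average_egress_per_job_gbps(output_gb: float, job_hours: float) -> float:
    """Average egress rate of one job, if its output were spread over its runtime."""
    return output_gb / (job_hours * 3600.0)


def jobs_for_target(target_gbps: float, output_gb: float, job_hours: float) -> float:
    """Number of concurrently running jobs whose average egress adds up to the target."""
    return target_gbps / average_egress_per_job_gbps(output_gb, job_hours)


if __name__ == "__main__":
    per_job = average_egress_per_job_gbps(OUTPUT_GB_PER_JOB, JOB_LENGTH_HOURS)
    jobs = jobs_for_target(TARGET_EGRESS_GBPS, OUTPUT_GB_PER_JOB, JOB_LENGTH_HOURS)
    print(f"Average egress per job: {per_job * 1000:.2f} MB/s")
    print(f"Concurrent jobs for {TARGET_EGRESS_GBPS:.0f} GB/s: {jobs:.0f}")
```

Under these assumptions a single job averages only about 1 MB/s, so a 10 GBps aggregate implies on the order of thousands of jobs running at once.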
Storage backends
• UW Madison is IceCube’s
home institution
• Large Lustre-based filesystem
• 5x dedicated GridFTP servers,
each with 25 Gbps NIC
• At UCSD we used SDSC Qumulo
• Available as NFS mounts
inside the UCSD network
• Deployed GridFTP pods in Nautilus
• 3 x pods on 3 nodes at UCSD
• Each with 100 Gbps NIC
• Each pod had 5x NFS mountpoints
Using Internet2 Cloud Connect Service
• Egress costs are notoriously high
• Using dedicated links cheaper
• If provisioned on demand
• Internet2 acts as provider for
the research community
• For AWS, Azure and GCP
• No 100 Gbps links available
• Had to stitch together 20+ links,
each 10 Gbps, 5 Gbps or 2 Gbps
Each color band belongs
to one network link
https://internet2.edu/services/cloud-connect/
Produced 130 TB of data
• List price for commercial path: $11k
• We paid: $6k
Compute: $26k
Simplified list price comparison
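The price figures above can be turned into per-TB costs and a relative saving. The sketch below uses only the rounded numbers from the slide (130 TB, $11k list, $6k paid), so the results are approximate; the function names are illustrative.

```python
# Rounded figures from the slide.
DATA_TB = 130
LIST_PRICE_USD = 11_000   # list price for the commercial path
PAID_USD = 6_000          # what was paid using dedicated links


def usd_per_tb(total_usd: float, data_tb: float) -> float:
    """Networking cost per TB of data moved."""
    return total_usd / data_tb


def savings_fraction(list_usd: float, paid_usd: float) -> float:
    """Fraction of the list price saved."""
    return 1.0 - paid_usd / list_usd


if __name__ == "__main__":
    print(f"List price: ${usd_per_tb(LIST_PRICE_USD, DATA_TB):.0f} per TB")
    print(f"Paid:       ${usd_per_tb(PAID_USD, DATA_TB):.0f} per TB")
    print(f"Saving:     {savings_fraction(LIST_PRICE_USD, PAID_USD):.0%}")
```

With these rounded inputs the dedicated links come out at a little under half the list price per TB.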
The need for many links
• Internet2 has mostly 2x 10 Gbps links with Cloud providers
• The only bright exception is the California link to Azure at 2x 100 Gbps
• The links are shared, so one can never get the whole link for oneself
• 5 Gbps limit in AWS and GCP
• 10 Gbps limit in Azure
• The link speeds are rigidly defined
• 1, 2, 5, 10 Gbps
• To fill an (almost) empty 10 Gbps link,
one needs three links: 5 + 2 + 2
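The 5 + 2 + 2 example can be reproduced with a simple greedy selection. This is only a sketch of the arithmetic, not the procedure the team used: it assumes that "almost empty" means about 9 Gbps of headroom on a 10 Gbps link, and `pick_links` is a hypothetical helper.

```python
from typing import List

# Link speeds that can be provisioned, as listed on the slide.
ALLOWED_SPEEDS_GBPS = (10, 5, 2, 1)


def pick_links(available_gbps: float, per_link_cap_gbps: int) -> List[int]:
    """Greedily pick link speeds that fit into the available capacity.

    available_gbps    -- headroom on the shared physical link (an assumption)
    per_link_cap_gbps -- largest single link the provider allows
    """
    usable = sorted(
        (s for s in ALLOWED_SPEEDS_GBPS if s <= per_link_cap_gbps), reverse=True
    )
    chosen: List[int] = []
    remaining = available_gbps
    for speed in usable:
        # Take as many links of this speed as still fit.
        while speed <= remaining:
            chosen.append(speed)
            remaining -= speed
    return chosen


if __name__ == "__main__":
    # AWS or GCP style: 5 Gbps cap per link, about 9 Gbps of headroom assumed.
    print(pick_links(available_gbps=9, per_link_cap_gbps=5))
```

With a 5 Gbps cap, two 5 Gbps links would need the full 10 Gbps, which a shared link never has, so the selection falls back to one 5 Gbps and two 2 Gbps links.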
Screenshot mesh of provisioned links
20x UW Madison + 2x UCSD
Very different provisioning in the 3 Clouds
• AWS the most complex
• And requires initiation by
on-prem network engineer
• Many steps after initial request
• Create VPC and subnets
• Accept connection request
• Create VPG
• Associate VPG with VPC
• Create DCG
• Create VIF
• Relay back to on-prem the BGP key
• Establish VPC -> VPG routing
• Associate DCG -> VPG
• And don’t forget the Internet routers
• GCP the simplest
• Create VPC and subnets
• Create Cloud Router
• Create Interconnect
• Provide key to on-prem
• Azure not much harder
• Create VN and subnets
• Make sure the VN has Gateway subnet
• Create ExpressRoute (ER)
• Provide key to on-prem
• Create VNG
• Create connection between ER and VNG
• Note: Azure comes with many more options
to choose from
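One way to keep track of the three procedures is an ordered checklist per provider. The sketch below only encodes the Cloud-side steps as they are listed above (abbreviations kept as on the slide); it does not call any Cloud API, and the reminders about on-prem initiation and Internet routers are left out of the step counts.

```python
# Cloud-side steps per provider, in the order listed on the slide.
PROVISIONING_STEPS = {
    "AWS": [
        "Create VPC and subnets",
        "Accept connection request",
        "Create VPG",
        "Associate VPG with VPC",
        "Create DCG",
        "Create VIF",
        "Relay back to on-prem the BGP key",
        "Establish VPC -> VPG routing",
        "Associate DCG -> VPG",
    ],
    "GCP": [
        "Create VPC and subnets",
        "Create Cloud Router",
        "Create Interconnect",
        "Provide key to on-prem",
    ],
    "Azure": [
        "Create VN and subnets",
        "Make sure the VN has Gateway subnet",
        "Create ExpressRoute (ER)",
        "Provide key to on-prem",
        "Create VNG",
        "Create connection between ER and VNG",
    ],
}


def providers_by_step_count(steps: dict) -> list:
    """Providers ordered from fewest to most listed steps."""
    return sorted(steps, key=lambda provider: len(steps[provider]))


if __name__ == "__main__":
    for provider in providers_by_step_count(PROVISIONING_STEPS):
        print(f"{provider}: {len(PROVISIONING_STEPS[provider])} steps")
        for number, step in enumerate(PROVISIONING_STEPS[provider], start=1):
            print(f"  {number}. {step}")
```

Counting the listed steps gives the same ordering as the slide: GCP is the shortest, Azure is slightly longer, and AWS is the longest.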
Additional on-prem networking setup needed
• Quote from Michael Hare, UW Madison Network engineer:
In addition to network configuration [at] UW Madison (AS59), we
provisioned BGP based Layer 3 MPLS VPNs (L3VPNs) towards Internet2
via our regional aggregator, BTAA OmniPop.
This work involved reaching out to the BTAA NOC to coordinate on VLAN
numbers and paths and to [the] Internet2 NOC to make sure the newly
provisioned VLANs were configurable inside OESS.
Due to limitations in programmability or knowledge at the time regarding
duplicate IP address towards the cloud (GCP, Azure, AWS) endpoints, we
built several discrete L3VPNs inside the Internet2 network to accomplish
the desired topology.
• Tom Hutton did the UCSD part
Spiky nature of workload tricky for networking
• We could not actually do
a “burst” this time
• Results in too many
spikes and valleys
• We tried it at smaller scale
• Noticed that links to different providers behave differently
• Some capped, some flexible
• Long upload times when congested
Figure: Aggregated UW Madison Storage Network – Smaller scale bursty test (panels: Capped, Flexible)
Much more careful during big “burst”
• Ramp up for over 2 hours
• Still not perfect
• But much smoother
Figure: egress throughput (GBps, averaged over 10 mins) and fp32 PFLOPS over time; phases: Ramp up, Stable, Final push
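A gradual ramp-up can be expressed as a target job count that grows with elapsed time. The sketch below assumes a linear ramp, which the slides do not specify, and all numbers in the usage example are hypothetical; `ramp_target` is an illustrative helper, not part of any scheduler.

```python
def ramp_target(elapsed_min: float, ramp_min: float, full_target: int) -> int:
    """Target number of running jobs at a given time during a linear ramp-up.

    elapsed_min -- minutes since the start of the run
    ramp_min    -- length of the ramp-up phase in minutes (hypothetical)
    full_target -- job count to reach at the end of the ramp (hypothetical)
    """
    if ramp_min <= 0:
        return full_target
    # Clamp the progress fraction to [0, 1] so the target never overshoots.
    fraction = min(max(elapsed_min / ramp_min, 0.0), 1.0)
    return int(full_target * fraction)


if __name__ == "__main__":
    # Hypothetical example: ramp to 1000 jobs over 120 minutes.
    for minute in (0, 30, 60, 120, 180):
        print(minute, ramp_target(minute, ramp_min=120, full_target=1000))
```

Spreading the job starts this way also spreads the end-of-job uploads, which is what keeps the egress curve smoother than a single large burst.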
Summary
• Using dedicated links made this Cloud run
a little more challenging
• But cost savings worth it
• Showed that data-intensive high-throughput
Cloud computing is doable
• With plenty of science data generated to show for it
Acknowledgments
• I would like to thank NSF for their support of this endeavor as part of
grants OAC-1941481, MPS-1148698, OAC-1841530, OAC-1826967
and OPP-1600823.
• And all of this would of course not be possible without the hard work
of Michael Hare, David Schultz, Benedikt Riedel, Vladimir Brik,
Steve Barnet, Frank Wuerthwein, Tom Hutton, Matt Zekauskas and
John Hicks.
Backup slides
Application logs only provide dt+MBs for egress
• Different averaging techniques give slightly different insights
Figure: egress throughput (GBps, averaged over 1 min) and fp32 PFLOPS over time; phases: Ramp up, Stable, Final push
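To illustrate how the averaging window changes the picture, the sketch below turns per-transfer records of duration and size into a throughput time series. It assumes each log entry carries an end timestamp, a duration (dt) and a size in MB, and that each transfer ran at a constant rate; the record layout and the function name are assumptions, not the actual log format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# One record per transfer: (end_time_s, duration_s, size_MB). Assumed layout.
Transfer = Tuple[float, float, float]


def binned_gbps(transfers: Iterable[Transfer], window_s: float) -> Dict[int, float]:
    """Average throughput in GB/s per time window.

    Each transfer's bytes are spread uniformly over [end - dt, end] and
    attributed to every window that interval overlaps.
    """
    megabytes = defaultdict(float)  # window index -> MB attributed to it
    for end_s, dt_s, size_mb in transfers:
        if dt_s <= 0:
            # No usable duration: put everything in the window of the end time.
            megabytes[int(end_s // window_s)] += size_mb
            continue
        start_s = end_s - dt_s
        rate_mb_per_s = size_mb / dt_s
        first, last = int(start_s // window_s), int(end_s // window_s)
        for idx in range(first, last + 1):
            lo = max(start_s, idx * window_s)
            hi = min(end_s, (idx + 1) * window_s)
            if hi > lo:
                megabytes[idx] += rate_mb_per_s * (hi - lo)
    # Convert MB per window into GB/s (decimal units).
    return {idx: mb / 1000.0 / window_s for idx, mb in sorted(megabytes.items())}


if __name__ == "__main__":
    # Hypothetical records: two 2 GB uploads finishing shortly after each other.
    log = [(90.0, 60.0, 2000.0), (100.0, 40.0, 2000.0)]
    print(binned_gbps(log, window_s=60))    # 1-minute averaging
    print(binned_gbps(log, window_s=600))   # 10-minute averaging
```

Short windows keep the spikes visible, while long windows smooth them into a lower, flatter curve for the same total volume.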
Internet2 Cloud Connect Explained
• Each Cloud provider has its own
“dedicated link” mechanism
• Similar in spirit,
but technically different
• AWS has Direct Connect
https://aws.amazon.com/directconnect/
• Azure has ExpressRoute
https://azure.microsoft.com/en-us/services/expressroute/
• GCP has Cloud Interconnect
https://cloud.google.com/network-connectivity/docs/interconnect
• Internet2 acts as a
service provider for
all three major Cloud providers
• Providing
• The physical network infrastructure
• A portal for on-prem operators
Figure: Azure example
Figure: Example AWS network monitoring
Figure: Example Azure network monitoring
Figure: Example GCP network monitoring
Parties responsible for the 130 TB produced
Figure labels: UW Madison, UCSD, GCP, AWS, Azure
Each outside slice represents one network link

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 

Data-intensive IceCube Cloud Burst

  • 1. Data-intensive IceCube Cloud Burst leveraging Internet2 Cloud Connect Service By Igor Sfiligoi – UCSD For the rest of the IceCube Cloud Burst team (UCSD+UW Madison) NRP-Pilot Weekly Meeting, Nov 12th 2020
  • 2. IceCube GPU-based Cloud bursts • Large amount of photon propagation simulation needed to properly calibrate natural Ice • Simulation compute intensive, and ideal for GPU compute Nov’20 https://icecube.wisc.edu Previous: https://doi.org/10.1145/3311790.3396625 Integral: 225 PFLOP hours (fp32)
  • 3. Egress data intensive Cloud run • This IceCube simulation was relatively heavy in egress data • 2 GB per job • Job length ~= 0.5 hour • And very spiky • The whole file is transferred after compute completed • Input sizes small-ish • 0.25 GB • Cloud burst exceeded 10 GBps • To make good use of a large fraction of available Cloud GPUs https://www.linkedin.com/pulse/cloudy-100-pflops-gbps-icecube-igor-sfiligoi
  • 4. Storage backends • UW Madison is IceCube’s home institution • Large Lustre-based filesystem • 5x dedicated GridFTP servers, each with 25 Gbps NIC • At UCSD we used SDSC Qumulo • Available as NFS mounts inside the UCSD network • Deployed GridFTP pods in Nautilus • 3 x pods on 3 nodes at UCSD • Each with 100 Gbps NIC • Each pod had 5x NFS mountpoints
  • 5. Using Internet2 Cloud Connect Service • Egress costs are notoriously high • Using dedicated links cheaper • If provisioned on demand • Internet2 acts as provider for the research community • For AWS, Azure and GCP • No 100Gbps links available • Had to stitch together 20+ links, each 10Gbps, 5Gbps and 2 Gbps Each color band belongs to one network link https://internet2.edu/services/cloud-connect/ Simplified list price comparison
  • 6. Using Internet2 Cloud Connect Service • Egress costs are notoriously high • Using dedicated links cheaper • If provisioned on demand • Internet2 acts as provider for the research community • For AWS, Azure and GCP • No 100Gbps links available • Had to stitch together 20+ links, each 10Gbps, 5Gbps and 2 Gbps Each color band belongs to one network link https://internet2.edu/services/cloud-connect/ Produced 130 TB of data • List price for commercial path: $11k • We paid: $6k Compute: $26k Simplified list price comparison
  • 7. The need for many links • Internet2 has mostly 2x 10 Gbps links with Cloud providers • The only bright exception is the California link to Azure at 2x 100 Gbps • The links are shared, so one can never get the whole link for oneself • 5 Gbps limit in AWS and GCP • 10 Gbps limit in Azure • The link speeds are rigidly defined • 1, 2, 5, 10 Gbps • To fill an (almost) empty 10 Gbps link, one needs three links: 5 + 2 + 2
  • 8. Screenshot mesh of provisioned links 20x UW Madison + 2x UCSD
  • 9. Very different provisioning in the 3 Clouds • AWS the most complex • And requires initiation by on-prem network engineer • Many steps after initial request • Create VPC and subnets • Accept connection request • Create VPG • Associate VPG with VPC • Create DCG • Create VIF • Relay back to on-prem the BGP key • Establish VPC -> VPG routing • Associate DCG -> VPG • And don’t forget the Internet routers • GCP the simplest • Create VPC and subnets • Create Cloud Router • Create Interconnect • Provide key to on-prem • Azure not much harder • Create VN and subnets • Make sure the VN has Gateway subnet • Create ExpressRoute (ER) • Provide key to on-prem • Create VNG • Create connection between ER and VNG • Note: Azure comes with many more options to choose from
  • 10. Additional on-prem networking setup needed • Quote from Michael Hare, UW Madison Network engineer: In addition to network configuration [at] UW Madison (AS59), we provisioned BGP based Layer 3 MPLS VPNs (L3VPNs) towards Internet2 via our regional aggregator, BTAA OmniPop. This work involved reaching out to the BTAA NOC to coordinate on VLAN numbers and paths and to [the] Internet2 NOC to make sure the newly provisioned VLANs were configurable inside OESS. Due to limitations in programmability or knowledge at the time regarding duplicate IP address towards the cloud (GCP, Azure, AWS) endpoints, we built several discrete L3VPNs inside the Internet2 network to accomplish the desired topology. • Tom Hutton did the UCSD part
  • 11. Spiky nature of workload tricky for networking • We could not actually do a “burst” this time • Results in too many spikes and valleys • We tried it at smaller scale • Noticed that links to different providers behave differently • Some capped, some flexible • Long upload times when congested Capped Flexible Aggregated UW Madison Storage Network – Smaller scale bursty test
  • 12. Much more careful during big “burst” • Ramp up for over 2 hours • Still not perfect • But much smoother [Plots: GBps averaged over 10 mins; fp32 PFLOPS; phases: Ramp up, Stable, Final push]
  • 13. Summary • Using dedicated links made this Cloud run a little more challenging • But cost savings worth it • Showed that data-intensive high-throughput Cloud computing is doable • With plenty of science data generated to show for it
  • 14. Acknowledgments • I would like to thank NSF for their support of this endeavor as part of awards OAC-1941481, MPS-1148698, OAC-1841530, OAC-1826967 and OPP-1600823. • And all of this would of course not be possible without the hard work of Michael Hare, David Schultz, Benedikt Riedel, Vladimir Brik, Steve Barnet, Frank Wuerthwein, Tom Hutton, Matt Zekauskas and John Hicks.
  • 16. Application logs only provide dt+MBs for egress • Different averaging techniques give slightly different insights [Plots: GBps averaged over 1 min; fp32 PFLOPS; phases: Ramp up, Stable, Final push]
  • 17. Internet2 Cloud Connect Explained • Each Cloud provider has its own “dedicated link” mechanism • Similar in spirit, but technically different • AWS has Direct Connect https://aws.amazon.com/directconnect/ • Azure has Express Route https://azure.microsoft.com/en-us/services/expressroute/ • GCP has Cloud Interconnect https://cloud.google.com/network-connectivity/docs/interconnect • Internet2 acts as a service provider for all three major Cloud providers • Providing • The physical network infrastructure • A portal for on-prem operators Azure example
  • 21. Parties responsible for the 130 TB produced [Chart labels: UW Madison, UCSD, GCP, AWS, Azure] Each outside slice represents one network link
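
The figures quoted on slides 3, 6 and 7 can be sanity-checked with a little arithmetic. The Python sketch below is not from the talk. The function names are hypothetical, and it rests on two assumptions: that "filling" a shared link means choosing the fewest fixed-speed links whose total stays strictly below the physical capacity, and that egress is spread evenly over the job lifetime (the real workload was spiky, so actual peaks would differ).

```python
from itertools import combinations_with_replacement

# Link speeds offered, in Gbps, as listed on slide 7.
LINK_SPEEDS_GBPS = (1, 2, 5, 10)


def best_fill(physical_gbps, per_link_cap_gbps, max_links=4):
    """Pick the fewest links that get closest to, but stay below, the
    physical capacity (assumption: a shared link is never fully free)."""
    allowed = sorted(
        (s for s in LINK_SPEEDS_GBPS if s <= per_link_cap_gbps), reverse=True
    )
    best = ()
    for n in range(1, max_links + 1):
        for combo in combinations_with_replacement(allowed, n):
            total = sum(combo)
            # Strict '>' keeps the smaller link count when totals tie.
            if total < physical_gbps and total > sum(best):
                best = combo
    return best


def concurrent_jobs(egress_gbytes_per_s, gbytes_per_job, job_seconds):
    """Jobs that must run at once to sustain the given average egress,
    assuming each job uploads its whole output once per job length."""
    return egress_gbytes_per_s * job_seconds / gbytes_per_job


def cost_per_tb(total_usd, total_tb):
    """Average cost per TB moved."""
    return total_usd / total_tb


# 10 Gbps physical link, 5 Gbps per-link limit (AWS and GCP, slide 7).
print(best_fill(10, 5))                 # (5, 2, 2)

# 10 GBps of egress, 2 GB per job, 0.5 hour jobs (slide 3).
print(concurrent_jobs(10, 2, 1800))     # 9000.0

# 130 TB produced: $11k commercial list price vs $6k paid (slide 6).
print(round(cost_per_tb(11_000, 130)))  # 85
print(round(cost_per_tb(6_000, 130)))   # 46
```

Under these assumptions the 5 + 2 + 2 combination from slide 7 falls out directly, sustaining 10 GBps corresponds to roughly 9,000 jobs running at once, and the dedicated links work out to about $46 per TB against about $85 per TB at commercial list price.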