Fusion simulations have traditionally required leadership-scale HPC resources in order to produce advances in physics. One such package is CGYRO, a premier tool for multi-scale plasma turbulence simulation. CGYRO is a typical HPC application that will not fit in a single node, as cutting-edge simulations require several terabytes of memory and O(100) TFLOPS of compute capability. CGYRO also requires high-throughput and low-latency networking, due to its reliance on global FFT computations. While in the past such compute may have required hundreds, or even thousands, of nodes, recent advances in hardware capabilities allow just tens of nodes to deliver the necessary compute power. We explored the feasibility of running CGYRO on Cloud resources provided by Microsoft on their Azure platform, using the InfiniBand-connected HPC resources in spot mode. We observed both that CPU-only resources were very efficient and that running in spot mode was doable, with minimal side effects. The GPU-enabled resources were less cost effective but allowed for higher scaling.
1. Modest-scale HPC on Azure using CGYRO
Igor Sfiligoi – UC San Diego
Jeff Candy – General Atomics
A poster presented at SC20
2. What is CGYRO?
• CGYRO is a premier tool for multi-scale plasma turbulence simulation
and has been in use by the fusion community for several years
• An Eulerian gyrokinetic solver that relies heavily on global FFT computations
• Fusion research is still very active
• Several aspects of fusion energy physics are still not well understood
• Experimental methods are essential for exploring new operational modes
• But simulations are used to validate basic theory,
plan experiments, interpret results on present devices,
and ultimately to design future devices.
3. Motivation for this exploratory work
• Leadership-class HPC centers are heavily sought after and typically over-subscribed
• Can we do fusion research in other venues?
• Commercial Clouds offer an appealing option
• They promise immediate access to resources
• All you need is $$$
• But can they deliver?
• Are there true HPC-class resources available?
• Can we afford it?
4. Microsoft Azure and HPC
• Among the commercial Cloud providers,
Microsoft Azure has the most HPC-like resources
• Several instance types with Infiniband connectivity
• The two most promising:
• NDv2 – 8x NVIDIA V100 GPUs with 100 Gbps EDR IB
• HBv2 – 120-core AMD EPYC CPUs with 200 Gbps HDR IB
5. Verifying IB performance
• CGYRO is extremely network latency and throughput sensitive
• One could say that it is network-bound
• Microsoft Azure's IB shows excellent characteristics
when measured with the OSU Micro-Benchmarks tools (sketch below)
Measured network latencies in µs, as reported by the osu_latency tool.
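For illustration, below is a minimal MPI ping-pong latency sketch in Python using mpi4py. It measures one-way latency in the same spirit as osu_latency, but it is a simplified stand-in, not the OSU benchmark itself; mpi4py and numpy are assumed to be installed.

    # Minimal ping-pong latency sketch in the spirit of osu_latency (not the OSU code).
    # Assumes mpi4py and numpy are installed; run with: mpirun -np 2 python pingpong.py
    import time
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    buf = np.zeros(1, dtype='b')      # 1-byte message, so timing is latency-dominated
    warmup, iters = 1000, 10000

    t0 = 0.0
    for i in range(warmup + iters):
        if i == warmup:               # start timing only after the warm-up rounds
            comm.Barrier()
            t0 = time.perf_counter()
        if rank == 0:
            comm.Send(buf, dest=1, tag=0)
            comm.Recv(buf, source=1, tag=0)
        else:
            comm.Recv(buf, source=0, tag=0)
            comm.Send(buf, dest=0, tag=0)
    t1 = time.perf_counter()

    if rank == 0:
        # One-way latency is half the average round-trip time
        print(f"avg one-way latency: {(t1 - t0) / iters / 2 * 1e6:.2f} us")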
6. Submit environment
• Unlike HPC centers, Microsoft Azure does not provide
an HPC batch system or a shared file system out of the box
• But CycleCloud is available as a free add-on option
• CycleCloud provides several batch system options
• We chose SLURM, mostly due to our familiarity with that system
• Comes with ssh access and auto-scaling capabilities out-of-the-box
• More advanced options require the use of their API
• Initial setup was relatively easy, but not trivial
• Mostly a documentation issue
• We also hit a couple of bugs in the advanced options (e.g. spot HPC use), since fixed
7. Execution environment
• CycleCloud does most of the system/batch config
• Also comes with basic compiler and MPI config
• However, it is optimized for CPU instances
• No out-of-the-box GPU support
• To use the GPU instances we had to make some manual changes (sanity-check sketch below)
• Create a GPU-enabled HPC VM image and point CycleCloud to it
• Install the PGI compilers (note: now called the NVIDIA HPC SDK)
• Install and configure the MPI library that comes with the PGI compilers
• We used the head-node NFS shared filesystem setup
• Not a true HPC storage solution, but good enough for CGYRO
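As a quick way to verify such a hand-built GPU environment, a minimal sanity-check sketch is shown below. It assumes mpi4py and the nvidia-smi CLI are available on every node; it only illustrates the kind of check we mean and is not a tool that ships with CycleCloud or CGYRO.

    # Minimal sanity check for a GPU-enabled MPI environment (illustrative only).
    # Assumes mpi4py and the nvidia-smi CLI are available on every node.
    # Run with e.g.: srun -N 2 --ntasks-per-node=1 python gpu_check.py
    import socket
    import subprocess
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    # Each rank lists the GPUs it can see on its node
    gpus = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout
    report = f"rank {comm.Get_rank()} on {socket.gethostname()}:\n{gpus.strip()}"

    # Rank 0 gathers and prints all per-node reports
    for text in comm.gather(report, root=0) or []:
        print(text)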
8. Benchmarking CGYRO on Azure – With real science
• The main “benchmarking tool” was a brand new, cutting-edge
CGYRO simulation:
• A multi-scale simulation
• N_RADIAL=1024, N_TOROIDAL=128
• https://github.com/scidac/atom-open-doc/blob/master/2020.11-SC20/multiscale_input/input.cgyro
• Most of the compute time on Azure was spent advancing the above simulation
• And most of that time was using Spot pricing (very little preemption incurred)
• We also ran some smaller test simulations for completeness
• nl03 and sh02, which represent more commonly used simulation profiles
• These benchmark tests used a minimal fraction of total resources
9–10. Multi-scale benchmark results on Azure
• We started with the GPU-providing NDv2 instances (8x NVIDIA V100)
• But observed that a very high fraction of the time was spent in communication
• We thus switched the simulation to the CPU-providing instances
• HC uses “traditional” Intel Xeon CPUs
• HBv2 uses the latest AMD EPYC CPUs
• The CPU instances are slower per node, so more nodes are used
• Azure also has a well-defined per-hour price for each instance type,
making for an easy cost-effectiveness comparison (see the sketch below)
• We focused on spot pricing, which seems feasible at these scales
• The AMD CPU-based HBv2 is the clear winner:
comparable speed to the NDv2, at a much lower cost

Instance  Nodes  Total time  Comm. time  Total cost
NDv2        24      161s        139s       $5.24
NDv2         8      369s        316s       $4.00
HBv2        36      272s        104s       $1.81
HBv2        18      441s        190s       $1.47
HC          35      416s        113s       $2.42
HC          18      763s        151s       $2.28

All numbers represent one typical step during the simulation. Cost is computed using spot instance pricing.
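The per-step costs in the table follow directly from the node count, the per-step wall time, and the per-node hourly price. The sketch below shows the arithmetic; the spot price used is a placeholder, not an actual Azure quote.

    # Cost of one simulation step = nodes * hourly price per node * (step time / 3600).
    # The spot price below is a placeholder, not an actual Azure quote.
    def cost_per_step(nodes, step_time_s, spot_price_per_node_hour):
        """Dollar cost of advancing the simulation by one typical step."""
        return nodes * spot_price_per_node_hour * step_time_s / 3600.0

    # Example with placeholder numbers: 36 nodes, 272 s/step, hypothetical $0.65/node-hour
    print(f"${cost_per_step(36, 272, 0.65):.2f} per step")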
11. Comparing to on-prem HPC centers (multi-scale)
• To have a frame of reference, we also ran on
two on-prem HPC centers we had access to
• ORNL Summit – 6x NVIDIA V100 GPUs and 2x 100 Gbps IB per node
• NERSC Cori – Intel Xeon Phi (KNL) CPUs and 56 Gbps IB per node
• The Azure CPU instances are comparable to the Cori results
• Summit is significantly faster
• Its better networking shows
Azure instances:
Instance  Nodes  Total time  Comm. time  Total cost
NDv2        24      161s        139s       $5.24
NDv2         8      369s        316s       $4.00
HBv2        36      272s        104s       $1.81
HBv2        18      441s        190s       $1.47
HC          35      416s        113s       $2.42
HC          18      763s        151s       $2.28

On-prem systems:
System  Nodes  Total time  Comm. time
Summit    32       86s         67s
Cori     128      165s         62s
Cori      48      339s        160s

All numbers represent one typical step during the simulation. Cost is computed using spot instance pricing.
12. Smaller benchmark simulation – nl03
• Very similar insights when looking at the smaller nl03 test case
• The AMD CPU-based HBv2 is still the most cost effective
• And an excellent alternative to Cori
• The NDv2 instances are still network limited
• Summit again scales better
Azure instances:
Instance  Nodes  Total time  Comm. time  Total cost
NDv2        16      121s         92s       $2.64
NDv2         4      397s        293s       $2.15
HBv2        36       87s         45s       $0.58
HBv2         9      289s         64s       $0.48
HC          24      223s         60s       $0.89
HC          12      431s         96s       $0.86

On-prem systems:
System  Nodes  Total time  Comm. time
Summit    16       82s         46s
Cori      64      112s         46s
Cori      16      372s        120s

All numbers represent one typical step during the simulation. Cost is computed using spot instance pricing.
13. Suitability of spot pricing for HPC
• Spot instances definitely have downsides:
• lower availability and
• potential preemption during runtime
• But they typically cost 66%–88% less than “normal”, i.e. on-demand, instances
• CGYRO can deal with occasional preemption
• By checkpointing every couple of hours, with minimal overhead (see the sketch below)
• At smaller node counts, we typically experienced at most a couple of preemptions per day
• But it does get worse with node count
• And we were not able to reliably get more than 24 NDv2 or 36 HBv2 nodes
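CGYRO writes its own restart files; the sketch below only illustrates the general time-based checkpointing pattern that makes spot preemption tolerable, with a placeholder state, step function, and file name, and is not CGYRO's actual restart code.

    # Generic time-based checkpointing pattern for surviving spot preemption.
    # The state, step function, and file name are placeholders, not CGYRO internals.
    import os
    import pickle
    import time

    CHECKPOINT = "state.pkl"
    CHECKPOINT_EVERY = 2 * 3600      # seconds, i.e. "every couple of hours"
    TOTAL_STEPS = 1000               # placeholder step count

    def advance_one_step(state):
        time.sleep(0.01)             # stand-in for one real solver step

    def load_state():
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT, "rb") as f:
                return pickle.load(f)     # resume after a preemption
        return {"step": 0}                # fresh start

    def save_state(state):
        tmp = CHECKPOINT + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, CHECKPOINT)       # atomic rename: never leave a torn checkpoint

    state = load_state()
    last = time.monotonic()
    while state["step"] < TOTAL_STEPS:
        advance_one_step(state)
        state["step"] += 1
        if time.monotonic() - last >= CHECKPOINT_EVERY:
            save_state(state)
            last = time.monotonic()
    save_state(state)                     # final checkpoint at the end of the run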
14. Summary and conclusions
• We explored the feasibility of running CGYRO on Azure HPC resources
• With an emphasis on using them in spot mode
• We observed both that
• CPU-only resources were very efficient, and
• that running in spot mode was doable, with minimal side effects.
• The GPU-enabled resources were less cost effective
but allowed for higher scaling
• When a Cloud budget is available, Azure is an excellent platform for running CGYRO
15. Acknowledgements
• This presentation is based on the poster accepted and presented at SC20:
https://sc20.supercomputing.org/presentation/?id=rpost106&sess=sess337
• The creation of this presentation was supported by the
U.S. Department of Energy under awards
DE-FG02-95ER54309 (General Atomics Theory grant) and
DE-SC0017992 (AToM SciDAC-4 project).
Computing resources were provided by the Oak Ridge Leadership Computing
Facility under Contract DE-AC05-00OR22725 (ALCC program) and the National
Energy Research Scientific Computing Center under Contract DE-AC02-05CH11231
16. Updates since the poster was created
• Microsoft Azure announced a few new improvements
• A new GPU-based HPC instance
• NDv4 – 8x NVIDIA A100 GPUs with 8x 200 Gbps IB (1.6 Tbps total) per node
• https://azure.microsoft.com/en-us/blog/bringing-ai-supercomputing-to-customers/
• An updated version of CycleCloud
• https://techcommunity.microsoft.com/t5/azure-compute/azure-cyclecloud-8-1-is-now-available/ba-p/1898011