Lustre, RoCE and MAN
Łukasz Flis, Marek Magryś
Dominika Kałafut, Patryk Lasoń, Adrian Marszalik, Maciej Pawlik
Academic Computer Centre Cyfronet AGH
● The biggest Polish Academic Computer Centre
○ Over 45 years of experience in IT provision
○ Centre of excellence in HPC and Grid Computing
○ Home for Prometheus and Zeus supercomputers
● Legal status: an autonomous unit within the AGH University of Science and Technology
● Staff: > 160, ca. 60 in R&D
● Leader of PLGrid: Polish Grid and Cloud Infrastructure for Science
● NGI Coordination in EGI e-Infrastructure
2
Network backbone
● 4 main links to achieve maximum reliability
● Each link with 7x 10 Gbps capacity
● Additional 2x 100 Gbps dedicated links
● Direct connection with GEANT scientific network
● Over 40 switches
● Security
● Monitoring
3
Academic Computer Centre Cyfronet AGH
Prometheus
● 2.4 PFLOPS
● 53 604 cores
● 1st HPC system in Poland (174th on TOP500, 38th in 2015)
4
Zeus
● 374 TFLOPS
● 25 468 cores
● 1st HPC system in Poland
(from 2009 to 2015, highest
rank on Top500 – 81st in 2011)
Computing portals and
frameworks
● OneData
● PLG-Data
● DataNet
● Rimrock
● InSilicoLab
Data Centres
● 3 independent data centres
● dedicated backbone links
Research & Development
● distributed computing environments
● computing acceleration
● machine learning
● software development & optimization
Storage
● 48 PB
● hierarchical data management
Computational Cloud
● based on OpenStack
HPC@Cyfronet 5
●Prometheus and Zeus clusters
○ 6475 active users (at the end of 2018)
○ 350+ computational grants
○ 8+ million jobs in 2018
○ 371+ million CPU hours used in 2018
○ Biggest jobs in 2018
■ 27 648 cores
■ 261 152 CPU hours in one job
○ 900+ (Prometheus) and 600+ (Zeus) software modules
○ Custom user helper tools developed in-house
The fastest supercomputer in Poland:
Prometheus 6
● Installed in Q2 2015 (upgraded in Q4 2015)
● CentOS 7 + SLURM
● HP Apollo 8000, direct warm-water cooled system, PUE 1.06
○ 20 racks (4 CDU, 16 compute)
● 2235 nodes, 53 604 CPU cores (Haswell, Xeon E5-2680v3 12C 2.5GHz), 282 TB RAM
○ 2160 regular nodes (2 CPUs, 128 GB RAM)
○ 72 nodes with GPGPUs (2x NVIDIA Tesla K40 XL)
○ 4 islands
● Main storage based on Lustre
○ Scratch: 5 PB, 120 GB/s, 4x DDN SFA12kx
○ Archive: 5 PB, 60 GB/s, 2x DDN SFA12kx
● 2.4 PFLOPS total performance (Rpeak)
● < 850 kW power (including cooling)
● TOP500: currently 174th, highest: 38th (November 2015)
Project background 7
● Industrial partner
● Areas:
○ Data storage
■ POSIX
■ 10s of PBs
■ Incremental growth
○ HPC
○ Networking
○ Consulting
● PoC in 2017
● Infrastructure tests and design in 2018
● Production in Q1 2019
Photo: wikipedia.org
Challenges 8
● How to separate industrial and academic workloads?
○ Isolated storage platform
○ Dedicated network + dedicated IB partition
○ Custom compute OS image
○ Scheduler (SLURM) setup
○ Do not mix funding sources
● Which hardware platform to use?
○ ZFS JBOD vs RAID
○ InfiniBand vs Ethernet
○ Capacity/performance ratio
○ Single vs partitioned namespace
Location 9
Storage to compute distance: 14 km over fibre (81 µs)
DC Nawojki
DC Pychowice
Map: openstreetmap.org
MAN backup link
Dark fibre
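A quick back-of-envelope check of the 14 km / 81 µs figures above (a sketch with an assumed fibre group index, not taken from the slides): propagation alone accounts for most of the quoted latency, and the bandwidth-delay product at 100 Gb/s shows how much data must stay in flight over this distance.

```python
# Back-of-envelope check of the 14 km / 81 us figures above.
# The fibre group index (~1.47) is an assumption, not taken from the slides.
C = 299_792_458          # speed of light in vacuum, m/s
N_FIBRE = 1.47           # typical group index of single-mode fibre (assumption)
DISTANCE_M = 14_000

one_way_s = DISTANCE_M * N_FIBRE / C
print(f"propagation only: {one_way_s * 1e6:.1f} us one-way")   # ~68.6 us
# The remaining ~12 us of the quoted 81 us would come from switches,
# transceivers and host stacks along the path.

# Bandwidth-delay product at 100 Gb/s over the measured round trip:
rtt_s = 2 * 81e-6
bdp_bytes = 100e9 / 8 * rtt_s
print(f"BDP: {bdp_bytes / 2**20:.2f} MiB in flight to keep the link full")  # ~1.9 MiB
# With 1 MiB Lustre RPCs this already implies several outstanding requests
# per peer, which is why the long link later needs higher ko2iblnd credits.
```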
Infrastructure overview 10
Solution 11
● DDN SFA200NV for Lustre MDT
○ 10x 1.5 TB NVMe + 1 spare
● DDN ES7990 building block for OST (capacity sanity check below)
○ > 4 PiB usable space
○ ~ 20 GB/s performance
○ 450x 14 TB NL SAS
○ 4x 100 Gb/s Ethernet
○ Embedded EXAScaler
● Juniper QFX10008
○ Deep buffers (100ms)
● Vertiv DCM racks
○ 48 U, custom depth: 130 cm
○ 1500 kg static load
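As flagged above, a rough capacity check for one ES7990 building block. The slide gives only the drive count and the usable space; the 8+2 declustered-parity layout assumed here is an illustration, not a statement of the actual configuration.

```python
# Rough capacity check for one ES7990 building block.
# The 8+2 parity layout is an assumption used only for illustration.
drives = 450
drive_tb = 14                              # decimal TB per NL-SAS drive
raw_pib = drives * drive_tb * 10**12 / 2**50
print(f"raw: {raw_pib:.2f} PiB")           # ~5.60 PiB

data_fraction = 8 / 10                     # assumed 8+2 declustered RAID
print(f"~{raw_pib * data_fraction:.2f} PiB before spares and filesystem overhead")
# ~4.5 PiB, consistent with the '> 4 PiB usable space' quoted for the block.
```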
Network: RDMA over Converged Ethernet
RoCE v1:
● L2 - Ethernet Link Layer Protocol (Ethertype 0x8915)
● requires link level flow control for lossless Ethernet
(PAUSE frames or Priority Flow Control)
● not routable
RoCE v2:
● L3 - uses UDP/IP packets, port 4791
● link level flow control optional
● can use ECN (Explicit Congestion Notification) for
controlling flows on lossy networks
● routable
Mellanox ConnectX HCAs implement hardware offload for
RoCE protocols
12
LNET: TCP vs RoCE v2
LNET selftest, default tuning for ksocklnd and ko2iblnd, Lustre 2.10.5, ConnectX-4 adapters, 100 GbE, congestion-free environment, MTU 9216 (RoCE uses a 4k MTU max)
Local: MAX TCP: 4114.7 MiB/s @ 4 RPCs vs MAX RoCE v2: 10874.4 MiB/s @ 16 RPCs
Remote: MAX TCP: 3662.2 MiB/s @ 4 RPCs vs MAX RoCE v2: 6805.7 MiB/s @ 32 RPCs
Theoretical max: 11682 MiB/s (12250 MB/s) – derivation sketched below
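A sketch of where the quoted theoretical maximum plausibly comes from: the 100 GbE line rate reduced by per-frame Ethernet and RoCE v2 (IP/UDP/BTH/ICRC) overhead at the 4096-byte RoCE MTU. The overhead byte counts are standard header sizes, but treat the arithmetic as an approximation rather than the authors' exact derivation.

```python
# Approximate derivation of the ~12 250 MB/s theoretical maximum quoted above
# for RoCE v2 on 100 GbE with a 4096-byte RoCE MTU (approximation, not the
# authors' exact math).
line_rate_Bps = 100e9 / 8              # 12.5 GB/s on the wire

payload = 4096                         # RoCE MTU caps at 4096 even with a 9216 Ethernet MTU
eth_overhead = 8 + 14 + 4 + 12         # preamble + Ethernet header + FCS + inter-frame gap
roce_v2_overhead = 20 + 8 + 12 + 4     # IPv4 + UDP + IB BTH + ICRC

efficiency = payload / (payload + eth_overhead + roce_v2_overhead)
goodput = line_rate_Bps * efficiency
print(f"{goodput / 1e6:.0f} MB/s = {goodput / 2**20:.0f} MiB/s")
# ~12 255 MB/s (~11 687 MiB/s), matching the slide's 12 250 MB/s / 11 682 MiB/s
# to within rounding; the 10 874.4 MiB/s local RoCE result is ~93% of it.
```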
LNET: TCP vs RoCE v2
Short summary: TCP vs RoCE v2 point-to-point (no congestion)
Short range test:
● RoCE v2 out-of-the-box LNET bandwidth 2.6x better than TCP
● link saturation: 93%
Long range test (14 km):
● out-of-the-box LNET: RoCE v2 1.85x better than TCP
● link saturation: 58% (default settings)
● tuning required: ko2iblnd concurrent_sends=4, peer_credits=64 gives 11332.66 MiB/s (97% saturation; recomputed below)
HW-offloaded RoCE allows full link utilization at low CPU usage.
A single LNET router is easily able to saturate a 100 Gb/s link.
14
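The saturation percentages quoted on the last two slides follow directly from the measured bandwidths and the ~11 682 MiB/s ceiling; a minimal recomputation:

```python
# Recomputing the saturation figures quoted above against the ~11 682 MiB/s
# theoretical maximum for RoCE v2 on 100 GbE.
THEORETICAL_MAX = 11682.0    # MiB/s (previous slide)

results = {
    "local RoCE v2, defaults":                10874.4,
    "remote RoCE v2 (14 km), defaults":        6805.7,
    "remote RoCE v2 (14 km), tuned credits":  11332.66,
}
for label, mib_s in results.items():
    print(f"{label:38s} {mib_s / THEORETICAL_MAX:6.1%}")
# -> ~93%, ~58% and ~97%; the tuned case corresponds to
#    ko2iblnd concurrent_sends=4, peer_credits=64 from the slide.
```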
Explicit Congestion Notification
● RoCEv2 can be used over lossy links
● Packet drops == retransmissions == bandwidth hiccups
● Enabling ECN effectively reduces packet drops on
congested ports
● ECN must be enabled on all devices along the path
● If the HCA sees an ECN mark on a received packet:
○ 1. a CNP packet is sent back to the sender
○ 2. the sender reduces its transmission rate in reaction to the CNP (a toy model of this loop follows below)
15
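A toy model of the sender reaction described above: cut the rate when a CNP arrives, recover gradually otherwise. This is a simplified sketch in the spirit of DCQCN-style rate control, not the actual ConnectX implementation; the decrease factor and recovery step are made-up illustrative values.

```python
# Toy model of the CNP reaction loop described above (illustrative only;
# the real NIC rate control is more elaborate).
def next_rate(rate_gbps: float, cnp_received: bool,
              line_rate: float = 100.0,
              decrease_factor: float = 0.5,        # made-up value
              recovery_step: float = 5.0) -> float:  # made-up value
    if cnp_received:
        # CNP from the receiver: back off multiplicatively.
        return max(rate_gbps * decrease_factor, 1.0)
    # No CNP this interval: probe back towards line rate.
    return min(rate_gbps + recovery_step, line_rate)

rate = 100.0
for step, cnp in enumerate([False, True, False, False, True, False]):
    rate = next_rate(rate, cnp)
    print(f"t={step}: cnp={str(cnp):5s} rate={rate:6.1f} Gb/s")
```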
ECN how-to
1. Use ECN-capable switches
2. Use RoCE-capable host adapters (ConnectX-4 and ConnectX-5 were tested)
3. Use the DSCP field in the IP header to tag RDMA and CNP packets on the host (cma_roce_tos; see the DSCP-to-ToS sketch below)
4. Enable ECN for RoCE traffic on switches
5. Prioritize CNP packets to ensure proper congestion signaling
6. Enjoy stable transfers and significantly reduced frame drops
7. Optionally use L3 and OSPF or BGP to handle backup routes
16
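The DSCP-to-ToS sketch referenced in step 3. The DSCP values are common choices from vendor examples rather than anything mandated by the slides, and the configfs path is the one the Mellanox cma_roce_tos helper is understood to write to; verify both on your own deployment.

```python
# DSCP-to-ToS arithmetic behind step 3 (cma_roce_tos tagging).
# DSCP values below are common vendor-example choices, not from the slides.
def dscp_to_tos(dscp: int) -> int:
    # The 6-bit DSCP sits in the upper bits of the 8-bit ToS / traffic-class byte.
    return dscp << 2

ROCE_DATA_DSCP = 26     # often used for RoCE RDMA traffic (assumption)
CNP_DSCP = 48           # often used for CNP priority (assumption)

print(f"RDMA traffic: DSCP {ROCE_DATA_DSCP} -> ToS {dscp_to_tos(ROCE_DATA_DSCP)}")
print(f"CNP traffic:  DSCP {CNP_DSCP} -> ToS {dscp_to_tos(CNP_DSCP)}")
# The RDMA CM ToS would typically end up in something like
# /sys/kernel/config/rdma_cm/<hca>/ports/1/default_roce_tos
# (path as used by the cma_roce_tos helper; check your OFED version).
```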
LNET: congested long link
Lustre 2.10.5, DC1 to DC2 2x 100 GbE, test: write 4:2
Congestion appears on the DC1 to DC2 link due to the 4:2 link reduction (utilization figures recomputed below)
17
RoCE v2, no flow control: 12818.9 MiB/s (54.86%)
TCP, no flow control: 15368.3 MiB/s (65.78%)
RoCE v2 with ECN: 19426.8 MiB/s (83.14%)
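The utilization percentages above are simply the measured aggregate bandwidth over the 2x 100 GbE inter-DC link divided by its RoCE v2 ceiling (twice the ~11 682 MiB/s single-link maximum):

```python
# Recomputing the utilization figures above for the 2x 100 GbE inter-DC link.
AGGREGATE_MAX = 2 * 11682.0      # MiB/s, two 100 GbE links

measured = {
    "RoCE v2, no flow control": 12818.9,
    "TCP, no flow control":     15368.3,
    "RoCE v2 with ECN":         19426.8,
}
for label, mib_s in measured.items():
    print(f"{label:26s} {mib_s:9.1f} MiB/s  {mib_s / AGGREGATE_MAX:6.2%}")
# -> 54.87%, 65.78% and 83.15%, matching the slide to within rounding.
```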
RoCEv2: ECN vs no ECN
Effects of disabling ECN
18
Real life test
2x DDN ES7990 (4 OSS), 4 LNET routers (RoCE <-> IB FDR), 14 km
Bandwidth: IOR 112 tasks @ 28 client nodes
Max Write: 29872.21 MiB/sec (31323.28 MB/sec)
Max Read: 34368.27 MiB/sec (36037.74 MB/sec)
19
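A quick comparison of the IOR numbers above against the nominal backend bandwidth (~20 GB/s per ES7990 building block, two blocks deployed); a rough ratio under that quoted per-block figure, not a vendor-verified efficiency.

```python
# IOR results above vs. the nominal ~20 GB/s per ES7990 block (2 blocks).
NOMINAL_MBps = 2 * 20_000

ior_mbps = {"write": 31323.28, "read": 36037.74}   # MB/s from the slide
for op, mbps in ior_mbps.items():
    print(f"{op:5s}: {mbps:8.1f} MB/s = {mbps / NOMINAL_MBps:5.1%} of nominal")
# Write reaches ~78% and read ~90% of the nominal backend bandwidth, even
# with the 14 km RoCE hop and RoCE <-> IB FDR LNET routing in the path.
```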
Conclusions 20
● For bandwidth-oriented workloads, latency over MAN distances is not an issue
● ECN for RoCE needs to be enabled to significantly reduce packet drops during congestion
● Link aggregation (LACP + Adaptive Load Balancing, or ECMP for L3) allows bandwidth to scale linearly by evenly utilizing the available links
● RoCE allows more flexibility in transport links than IB, e.g. backup routing and a cheaper, more scalable infrastructure
Acknowledgements 21
Thanks for the test infrastructure and support
22
Visit us at booth H-710!
(and taste some krówka)
Thank you!
