In this deck from the DDN User Group at ISC 2019, Marek Magryś from Cyfronet presents: Lustre, RoCE, and MAN.
"This talk will describe the architecture and implementation of high capacity Lustre file system for the need of a data intensive project. Storage is based on DDN ES7700 building block and uses RDMA over Converged Ethernet as network transport. What is unusual is that the storage system is located over 10 kilometers away from the supercomputer. Challenges, performance benchmarks and tuning will be the main topic of the presentation."
Watch the video: https://wp.me/p3RLHQ-kAn
Learn more: http://www.cyfronet.krakow.pl/ and https://www.ddn.com/company/events/isc-user-group/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
1. Lustre, RoCE and MAN
Łukasz Flis, Marek Magryś
Dominika Kałafut, Patryk Lasoń, Adrian Marszalik, Maciej Pawlik
2. Academic Computer Centre Cyfronet AGH
● The biggest Polish Academic Computer Centre
○ Over 45 years of experience in IT provision
○ Centre of excellence in HPC and Grid Computing
○ Home for Prometheus and Zeus supercomputers
● Legal status: an autonomous unit within AGH University of Science and Technology
● Staff: > 160, ca. 60 in R&D
● Leader of PLGrid: Polish Grid and Cloud Infrastructure for Science
● NGI Coordination in EGI e-Infrastructure
3. Network backbone
● 4 main links to achieve maximum reliability
● Each link with 7x 10 Gbps capacity
● Additional 2x 100 Gbps dedicated links
● Direct connection to the GÉANT scientific network
● Over 40 switches
● Security
● Monitoring
4. Academic Computer Centre Cyfronet AGH
Prometheus
● 2.4 PFLOPS
● 53 604 cores
● 1st HPC system in Poland (currently 174th on TOP500, 38th in 2015)
Zeus
● 374 TFLOPS
● 25 468 cores
● 1st HPC system in Poland (from 2009 to 2015; highest TOP500 rank: 81st in 2011)
Computing portals and
frameworks
● OneData
● PLG-Data
● DataNet
● Rimrock
● InSilicoLab
Data Centres
● 3 independent data centres
● dedicated backbone links
Research & Development
● distributed computing environments
● computing acceleration
● machine learning
● software development & optimization
Storage
● 48 PB
● hierarchical data management
Computational Cloud
● based on OpenStack
5. HPC@Cyfronet
● Prometheus and Zeus clusters
○ 6475 active users (at the end of 2018)
○ 350+ computational grants
○ 8+ million jobs in 2018
○ 371+ million CPU hours used in 2018
○ Biggest jobs in 2018
■ 27 648 cores
■ 261 152 CPU hours in one job
○ 900+ (Prometheus) and 600+ (Zeus) software modules
○ Custom users helper tools developed in-house
6. The fastest supercomputer in Poland: Prometheus
● Installed in Q2 2015 (upgraded in Q4 2015)
● CentOS 7 + SLURM
● HP Apollo 8000 - direct warm cooled system – PUE 1.06
○ 20 racks (4 CDU, 16 compute)
● 2235 nodes, 53 604 CPU cores (Haswell, Xeon E5-2680v3 12C 2.5GHz), 282 TB RAM
○ 2160 regular nodes (2 CPUs, 128 GB RAM)
○ 72 nodes with GPGPUs (2x NVIDIA Tesla K40 XL)
○ 4 islands
● Main storage based on Lustre
○ Scratch: 5 PB, 120 GB/s, 4x DDN SFA12kx
○ Archive: 5 PB, 60 GB/s, 2x DDN SFA12kx
● 2.4 PFLOPS total performance (Rpeak)
● < 850 kW power (including cooling)
● TOP500: currently 174th, highest: 38th (November 2015)
7. Project background
● Industrial partner
● Areas:
○ Data storage
■ POSIX
■ 10s of PBs
■ Incremental growth
○ HPC
○ Networking
○ Consulting
● PoC in 2017
● Infrastructure tests and design in 2018
● Production in Q1 2019
Photo: wikipedia.org
8. Challenges
● How to separate industrial and academic workloads?
○ Isolated storage platform
○ Dedicated network + dedicated IB partition
○ Custom compute OS image
○ Scheduler (SLURM) setup
○ Do not mix funding sources
● Which hardware platform to use?
○ ZFS JBOD vs RAID
○ Infiniband vs Ethernet
○ Capacity/performance ratio
○ Single vs partitioned namespace
9. Location
Storage-to-compute distance: 14 km over fibre (81 µs)
DC Nawojki
DC Pychowice
Map: openstreetmap.org
MAN backup link
Dark fibre
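The 81 µs figure is consistent with fibre propagation delay. As a rough check (assuming a group index of about 1.47 for standard single-mode fibre, an assumption not stated in the talk):

```latex
t \approx \frac{d}{c/n} = \frac{14\,\text{km}}{(3\times 10^{5}\,\text{km/s})/1.47} \approx 69\,\mu\text{s}
```

The remaining ~12 µs is plausibly attributable to switches, transceivers, and extra path length in the fibre run.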
11. Solution
● DDN SFA200NV for Lustre MDT
○ 10x 1.5 TB NVMe + 1 spare
● DDN ES7990 building block for OST
○ > 4 PiB usable space
○ ~ 20 GB/s performance
○ 450x 14 TB NL SAS
○ 4x 100 Gb/s Ethernet
○ Embedded EXAScaler
● Juniper QFX10008
○ Deep buffers (100ms)
● Vertiv DCM racks
○ 48 U, custom depth: 130 cm
○ 1500 kg static load
12. Network: RDMA over Converged Ethernet
RoCE v1:
● L2 - Ethernet Link Layer Protocol (Ethertype 0x8915)
● requires link level flow control for lossless Ethernet
(PAUSE frames or Priority Flow Control)
● not routable
RoCE v2:
● L3 - uses UDP/IP packets, port 4791
● link level flow control optional
● can use ECN (Explicit Congestion Notification) for
controlling flows on lossy networks
● routable
Mellanox ConnectX HCAs implement hardware offload for
RoCE protocols
13. LNET: TCP vs RoCE v2
LNET selftest, default tuning for ksocklnd and ko2iblnd; Lustre 2.10.5, ConnectX-4 adapters, 100 GbE, congestion-free environment, MTU 9216
(RoCE uses 4k max)
Local: MAX TCP: 4114.7 MiB/s @ 4 RPCs vs MAX RoCE v2: 10874.4 MiB/s @ 16 RPCs
Remote: MAX TCP: 3662.2 MiB/s @ 4 RPCs vs MAX RoCE v2: 6805.7 MiB/s @ 32 RPCs
Theoretical Max: 11682 MiB/s (12250 MB/s)
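The point-to-point numbers above come from LNET selftest. A run like the one summarized here can be sketched as follows; the NIDs, group names, and concurrency value are placeholders, not the exact ones used in the talk:

```shell
# Minimal LNET selftest sketch: bulk write from one client NID to one server NID.
# All NIDs and names below are hypothetical.
export LST_SESSION=$$
lst new_session rocetest
lst add_group clients 10.0.0.10@o2ib
lst add_group servers 10.0.0.20@o2ib
lst add_batch bulk_rw
# --concurrency maps to the RPCs-in-flight axis of the benchmark above
lst add_test --batch bulk_rw --concurrency 16 \
    --from clients --to servers brw write size=1M
lst run bulk_rw
lst stat clients servers     # reports MiB/s per group while the batch runs
lst end_session
```

Sweeping `--concurrency` over 4, 8, 16, 32 reproduces the RPC axis of the TCP-vs-RoCE comparison.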
14. LNET: TCP vs RoCE v2
Short summary: TCP vs RoCE v2, point-to-point (no congestion)
Short range test:
● RoCE v2 out-of-box LNET bandwidth 2.6x better than TCP
● link saturation 93%
Long range test (14km):
● out-of-box LNET: RoCE v2 1.85x better than TCP
● link saturation: 58% (default settings)
● tuning required: ko2iblnd concurrent_sends=4, peer_credits=64 gives 11332.66 MiB/s (97% saturation)
HW-offloaded RoCE allows full link utilization at low CPU usage.
A single LNET router can easily saturate a 100 Gb/s link.
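The long-range tuning mentioned above is normally applied through module options. A sketch, assuming the usual modprobe.d convention on Lustre 2.10 clients and routers:

```shell
# /etc/modprobe.d/ko2iblnd.conf -- long-link tuning values quoted on this slide.
# LNET (ko2iblnd) must be reloaded, or the node rebooted, before they take effect.
options ko2iblnd concurrent_sends=4 peer_credits=64
```

The active values can then be checked under `/sys/module/ko2iblnd/parameters/`.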
15. Explicit Congestion Notification
● RoCEv2 can be used over lossy links
● Packet drops == retransmissions == bandwidth hiccups
● Enabling ECN effectively reduces packet drops on
congested ports
● ECN must be enabled on all devices over the path
● If HCA sees ECN mark on received packet:
○ 1. CNP packet is sent back to the sender
○ 2. Sender reduces transmission speed in reaction to CNP
16. ECN how-to
1. Use ECN capable switches
2. Use RoCE capable host adapters (CX4 and CX5 were tested)
3. Use DSCP field in IP header to tag RDMA and CNP packets
on host (cma_roce_tos)
4. Enable ECN for RoCE traffic on switches
5. Prioritize CNP packets to assure proper congestion signaling
6. Enjoy stable transfers and significantly reduced frame drops
7. Optionally use L3 and OSPF or BGP to handle backup routes
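On Mellanox adapters, steps 3–5 are host-side settings. A sketch using the MLNX_OFED sysfs/configfs knobs; the interface name, device name, and DSCP values here are illustrative examples, not the production values from the talk:

```shell
IF=ens1f0          # hypothetical 100 GbE interface
DEV=mlx5_0         # hypothetical RDMA device
# Step 3: tag RoCE traffic with a DSCP value via RDMA-CM ToS (ToS = DSCP << 2)
mkdir -p /sys/kernel/config/rdma_cm/$DEV
echo 106 > /sys/kernel/config/rdma_cm/$DEV/ports/1/default_roce_tos
# Step 4: enable the ECN reaction point (sender) and notification point
# (receiver) for all traffic priorities
for prio in 0 1 2 3 4 5 6 7; do
    echo 1 > /sys/class/net/$IF/ecn/roce_rp/enable/$prio
    echo 1 > /sys/class/net/$IF/ecn/roce_np/enable/$prio
done
# Step 5: mark CNP packets with a DSCP that the switches prioritize
echo 48 > /sys/class/net/$IF/ecn/roce_np/cnp_dscp
```

The switch side (ECN marking thresholds on the RoCE queue, strict priority for the CNP queue) is vendor-specific and configured separately.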
17. LNET: congested long link
Lustre 2.10.5, DC1 to DC2 2x100 GbE, test: write 4:2
Congestion appears on the DC1 to DC2 link due to 4:2 link reduction
● RoCEv2, no flow control: 12818.9 MiB/s (54.86% link utilization)
● TCP, no flow control: 15368.3 MiB/s (65.78%)
● RoCEv2 with ECN: 19426.8 MiB/s (83.14%)
19. Real-life test
2x DDN ES7990 (4 OSS), 4 LNET routers (RoCE <-> IB FDR), 14 km
Bandwidth: IOR 112 tasks @ 28 client nodes
Max Write: 29872.21 MiB/sec (31323.28 MB/sec)
Max Read: 34368.27 MiB/sec (36037.74 MB/sec)
20. Conclusions
● For bandwidth workloads, latency over MAN distances is not an issue
● ECN for RoCE needs to be enabled to significantly reduce packet drops during congestion
● Link aggregation (LACP + Adaptive Load Balancing, or ECMP at L3) allows bandwidth to scale linearly by evenly utilizing the available links
● RoCE allows more flexibility in transport links than InfiniBand, e.g. backup routing and cheaper, more scalable infrastructure
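As one illustration of the link-aggregation point, an LACP (802.3ad) bond over two 100 GbE ports can be sketched with iproute2; interface names are placeholders, and the ALB and ECMP variants mentioned above are configured differently:

```shell
# Hypothetical two-member LACP bond; requires matching LACP config on the switch.
ip link add bond0 type bond mode 802.3ad xmit_hash_policy layer3+4
ip link set ens1f0 down && ip link set ens1f0 master bond0
ip link set ens1f1 down && ip link set ens1f1 master bond0
ip link set bond0 up
# layer3+4 hashing spreads distinct LNET flows (by IP and port) across members
```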