Managing Memory Bandwidth Antagonism @ Scale
David Lo, Dragos Sbirlea, Rohit Jnagal
Borg Model
● Large clusters with multi-tenant hosts.
● Run a mix of:
○ high- and low-priority workloads.
○ latency-sensitive and batch workloads.
● Isolation through bare-metal containers (cgroups/namespaces).
○ cgroups and perf to monitor host and job performance.
○ cgroups and h/w controls to manage on-node performance.
○ Cluster scheduling and balancing manage service performance.
(Diagram: Efficiency, Availability, Performance)
The Memory Bandwidth Problem
● Large variation in performance on multi-tenant hosts.
● On average, saturation events are few, but:
○ they periodically cause significant cluster-wide performance degradation.
● Some workloads are much more seriously affected than others.
○ The impact does not necessarily correlate with the victim's own memory bandwidth use.
(Plot: latency over time, spiking after an antagonist task starts.)
Note: This talk focuses on the memory BW problem for general servers and does not cover GPUs and other special devices; similar techniques apply there too.
Memory BW Saturation is Increasing Over Time
(Plot: fraction of machines that experienced mem BW saturation, Jan 2018 through Nov 2018; the fraction grows over time.)
Why It Is a (Bigger) Problem Now
● Large machines need to pack more jobs to maintain utilization, resulting in more “noisy neighbor” problems.
● ML workloads are memory BW intensive.
Understanding the Scope: Socket-Level Monitoring
● Track per-socket local and remote memory bandwidth use
● Identify per-platform thresholds for performance dips (saturation)
● Characterize saturation by platform and cluster
(Diagram: per-socket local and remote read/write memory traffic for Socket 0 and Socket 1.)
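A minimal sketch of the socket-level saturation check described above, in Python. The per-platform thresholds, socket IDs, and bandwidth numbers are made-up assumptions; in practice the per-socket samples would come from uncore IMC counters or similar.

# Hypothetical per-platform saturation thresholds in GiB/s (illustrative values only).
SATURATION_THRESHOLD_GIBPS = {"platformA": 60.0, "platformB": 90.0}

def saturated_sockets(platform, per_socket_bw_gibps):
    """Given {socket_id: (local_gibps, remote_gibps)} samples, return the saturated sockets."""
    threshold = SATURATION_THRESHOLD_GIBPS[platform]
    return [s for s, (local, remote) in per_socket_bw_gibps.items()
            if local + remote > threshold]

# Example: socket 0 exceeds the assumed platformA threshold, socket 1 does not.
print(saturated_sockets("platformA", {0: (55.0, 12.0), 1: (20.0, 3.0)}))  # -> [0]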
Platform and Cluster Variation
Saturation behavior varies with platform and cluster, due to:
● hardware differences (membw/core ratio)
● workload (large CPU consumers run on bigger platforms)
(Charts: saturation by platform and by cluster.)
Monitoring Sockets ↣ Monitoring Tasks
● Socket-level information gives the magnitude of the problem and the hot spots.
● Need task-level information to identify:
○ Abusers: tasks using a disproportionate amount of bandwidth
○ Victims: tasks seeing a performance drop
● New platforms provide task-level memory bandwidth monitoring, but:
○ the RDT cgroup was on its way out
○ we have no data on older platforms
For our purposes, a rough attribution of memory bandwidth was good enough.
(Charts: total memory bandwidth against the saturation threshold, and per-task memory BW breakdown.)
Per-task Memory Bandwidth Estimation
● Summary of requirements:
○ Local and remote bandwidth breakdown
○ Compatible with the cgroup model
● What's available in hardware?
○ Uncore counters (IMC, CHA)
■ Difficult to attribute to a HyperThread => cgroup
○ CPU PMU counters
■ Counters are HyperThread-local
■ Work with cgroup profiling mode
(Diagram: DDR and the IMC shared across CPU cores/CHAs, each core running two hyperthreads HT0 and HT1.)
Which CPU Perfmon to Use?
● OFFCORE_RESPONSE for Intel CPUs
● Programmable filter to specify the events of interest (e.g., DRAM local and DRAM remote)
● Captures both demand-load and HW-prefetcher traffic
● Online documentation of the meaning of the bits, per CPU (download.01.org); see also Intel SDM Vol. 3
● How to interpret: cache lines/sec × 64 B/cache line = BW
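A worked version of the conversion above: given two readings of a cache-line-granular counter (for example an OFFCORE_RESPONSE event counted per cgroup via perf's cgroup mode), bandwidth is the delta times 64 bytes per cache line, divided by the sampling interval. The counts below are invented for illustration.

CACHE_LINE_BYTES = 64

def bandwidth_gib_per_s(count_start, count_end, interval_s):
    """Convert a delta of cache-line events into GiB/s."""
    return (count_end - count_start) * CACHE_LINE_BYTES / interval_s / 2**30

# Example: ~30M cache lines observed in 1 second is roughly 1.8 GiB/s.
print(round(bandwidth_gib_per_s(0, 30_000_000, 1.0), 2))  # -> 1.79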
Insights from Task Measurement
Abuser insights
● A large percentage of the time, a single consumer uses up most of the bandwidth.
● That consumer's share of CPU is much lower than its share of membw.
Victim insights
● Many jobs are sensitive to membw saturation.
● Jobs are sensitive even when they are not big users of membw.
Guidance on enforcement options
● How much saturation would we avoid if we did X?
● Which jobs would get caught in the crossfire?
(Charts: number of jobs vs. CPI degradation on saturation (as a fraction), and combinations of jobs by CPU requirements during saturation.)
Enforcement: Controlling Different Workloads
Policy matrix (priority × BW usage):
● High priority: Moderate BW → Isolate; Heavy BW → Disable
● Medium priority: Moderate BW → Isolate; Heavy BW → Reactive rescheduling
● Low priority: Moderate BW → Throttle; Heavy BW → Throttle
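A toy encoding of the policy matrix above. The cell assignments mirror the reconstruction of the slide and should be read as illustrative, not as the exact production policy.

# Illustrative (priority, BW usage) -> action lookup, mirroring the matrix above.
POLICY = {
    ("high", "moderate"): "isolate",
    ("high", "heavy"): "disable",
    ("medium", "moderate"): "isolate",
    ("medium", "heavy"): "reactive_reschedule",
    ("low", "moderate"): "throttle",
    ("low", "heavy"): "throttle",
}

def enforcement_action(priority, bw_usage):
    return POLICY[(priority, bw_usage)]

print(enforcement_action("low", "heavy"))  # -> throttle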
What Can We Do? Node- and Cluster-Level Actuators
Node
● Memory Bandwidth Allocation in hardware: use HW QoS to apply max limits to tasks overusing memory bandwidth.
● CPU throttling for indirect control: limit the CPU access of over-using tasks to indirectly limit the memory bandwidth they use.
Cluster
● Reactive evictions & re-scheduling: hosts experiencing memory BW saturation signal the scheduler to redistribute the bigger memory bandwidth users to lightly loaded machines.
● Disabling heavy antagonist workloads: a task that saturates a socket by itself cannot be effectively redistributed; if slowing it down is not an option, de-schedule it.
Node: CPU Throttling
+ Very effective in reducing saturation
+ Works on all platforms
- Too coarse in granularity
- Interacts poorly with autoscaling & load balancing
(Diagram: CPUs running memBW over-users on Socket 0 (saturated) and Socket 1.)
Throttling - Enforcement Algorithm
● Every x seconds, the socket memory BW saturation detector reads the socket perf counters.
● If socket BW > saturation threshold, the cgroup memory BW estimator profiles potentially eligible tasks using socket and cgroup perf counters.
● A policy filter selects the eligible tasks for throttling.
● The memory BW enforcer throttles the selected tasks by restricting their CPU runnable mask.
● If socket BW < unthrottle threshold, unthrottle tasks.
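A compressed sketch of this loop in Python. The detector, estimator, policy filter, and CPU-mask actuator are passed in as callables because their real implementations are platform-specific; the thresholds, period, and one-task-per-iteration pacing are illustrative assumptions, not the production values.

import time

def enforcement_loop(read_socket_bw,    # () -> GiB/s, from socket perf counters
                     estimate_task_bw,  # () -> {task: GiB/s}, from cgroup perf counters
                     is_eligible,       # task -> bool, the policy filter
                     set_allowed_cpus,  # (task, cpu_set) -> None, the CPU runnable mask
                     all_cpus, throttled_cpus,
                     saturation_gibps=80.0, unthrottle_gibps=60.0, period_s=5):
    """Throttle the largest eligible memory-BW consumers while the socket is saturated."""
    throttled = set()
    while True:
        bw = read_socket_bw()
        if bw > saturation_gibps:
            # Profile candidates and throttle the biggest eligible consumer first.
            for task, _ in sorted(estimate_task_bw().items(), key=lambda kv: -kv[1]):
                if task not in throttled and is_eligible(task):
                    set_allowed_cpus(task, throttled_cpus)
                    throttled.add(task)
                    break  # throttle one task per period, then re-evaluate
        elif bw < unthrottle_gibps and throttled:
            set_allowed_cpus(throttled.pop(), all_cpus)
        time.sleep(period_s)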
Node: Memory Bandwidth Allocation
Intel RDT Memory Bandwidth Allocation, supported through resctrl in the kernel (more on that later).
+ Reduces bandwidth without lowering CPU utilization.
+ Somewhat finer-grained than CPU-level controls.
- Newer platforms only.
- Can't isolate well between hyperthreads.
Cluster: Reactive Re-Scheduling
In many cases there are:
● a low percentage of saturated sockets in the cluster, and
● multiple tasks contributing to saturation.
Re-scheduling those tasks to less loaded machines can avoid slow-downs.
Does not help with large antagonists that can saturate any socket they run on.
(Diagram: a saturated host calls the observer/scheduler for help (1); a big bandwidth user is evicted (2) and rescheduled (3) onto a lightly loaded host.)
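A toy sketch of the rebalancing decision: move the biggest bandwidth consumer off the saturated host to the machine with the most membw headroom, if it fits. Job and host names, and the single-move heuristic, are purely illustrative.

def pick_move(task_bw_on_saturated_host, host_headroom_gibps):
    """Return (task, target_host) for one candidate move, or None if nothing fits."""
    task, bw = max(task_bw_on_saturated_host.items(), key=lambda kv: kv[1])
    host, headroom = max(host_headroom_gibps.items(), key=lambda kv: kv[1])
    return (task, host) if bw <= headroom else None

# Example: jobA (30 GiB/s) moves to hostC, which has the most spare bandwidth.
print(pick_move({"jobA": 30.0, "jobB": 8.0}, {"hostC": 45.0, "hostD": 12.0}))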
Handling Cluster-Wide Saturation
Low-priority jobs can be dealt with at the node level through throttling.
If SLOs do not permit throttling and the antagonists cannot be redistributed:
● Disable (kick out of the cluster).
● Users can then reconfigure their service to use a different product.
● Area of continual work.
Alternative:
● Colocate multiple antagonists (that's just working around SLOs).
(Charts: cluster membw distributions, one amenable to rescheduling and one amenable to job disabling, each shown against the saturation threshold.)
Results: CPU Throttling + Rescheduling
Results: Rebalancing
resctrl: HW QoS Support in Kernel
● New, unified interface: resctrl
● resctrl is a big improvement over the previous non-standard cgroup interface
● Uniform way of monitoring/controlling HW QoS across vendors/architectures
○ AMD, ARM, Intel
● (Non-exhaustive) list of HW features supported:
○ Memory BW monitoring
○ Memory BW throttling
○ L3 cache usage monitoring
○ L3 cache partitioning
Intro to HW QoS Terms and Concepts
● The below uses x86 terminology.
● Class of Service ID (CLOSID): maps to a QoS configuration. Typically O(10) unique ones in HW.
● Resource Monitoring ID (RMID): used to tag workloads and the resources they use, so that resource usage can be aggregated. Typically O(100) unique ones in HW.
(Example: CLOSID 0 = high priority with 100% L3 cache and 100% mem BW; CLOSID 1 = low priority with 50% L3 cache and 20% mem BW; workloads A, B, and C are tagged with RMID0–RMID4.)
Overview of resctrl Filesystem
Documentation: https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
resctrl/
|- groupA/
| |- mon_groups/
| | |- monA/
| | | |- mon_data/
| | | |- tasks
| | | |- ...
| | |- monB/
| | |- mon_data/
| | |- ...
| |- schemata
| |- tasks
| |- ...
|- groupB/
|- ...
Legend:
● groupA/, groupB/: resource control groups; each represents one unique HW CLOSID.
● schemata: QoS configuration for the resource control group.
● tasks (in a control group): TIDs in the resource control group.
● mon_groups/monA/, monB/: monitoring groups; each represents one unique HW RMID.
● tasks (in a monitoring group): TIDs in the monitoring group.
● mon_data: resource usage data, for the entire resource control group or for a monitoring group.
Example Usage of resctrl Interfaces
$ cat groupA/schemata
L3:0=ff;1=ff
MB:0=90;1=90
(L3 line: allowed to use 8 cache ways of L3 on both sockets. MB line: per-core memory BW constrained to 90% on both sockets.)
$ READING0=$(cat groupA/mon_data/mon_L3_00/mbm_total_bytes)
$ sleep 1
$ READING1=$(cat groupA/mon_data/mon_L3_00/mbm_total_bytes)
$ echo $((READING1-READING0))
1816234126
(Compute memory BW by taking a rate; in this case, BW ≈ 1.8 GB/s.)
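The same rate computation as a small polling sketch. The resctrl mount point, group name, and monitoring-domain directory follow the layout shown above; treat them as assumptions for your system.

import time

def mbm_bytes(group, domain="mon_L3_00", root="/sys/fs/resctrl"):
    """Read the cumulative memory-traffic byte counter for a resctrl group and L3 domain."""
    with open(f"{root}/{group}/mon_data/{domain}/mbm_total_bytes") as f:
        return int(f.read())

def group_bandwidth_gbps(group, interval_s=1.0):
    """Memory bandwidth of a resctrl group over one sampling interval, in GB/s."""
    start = mbm_bytes(group)
    time.sleep(interval_s)
    return (mbm_bytes(group) - start) / interval_s / 1e9

# Example (requires resctrl mounted and a control group named groupA):
# print(group_bandwidth_gbps("groupA"))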
Reconciling resctrl and cgroups: First Try
Use case: dynamically apply memory BW throttling if the machine is in trouble.
1. Node SW creates 2 resctrl groups: no_throttle and bw_throttled
2. On cgroup creation, logically assign cgroupX to no_throttle
3. Create a mongroup for cgroupX in no_throttle
4. Start cgroupX
5. Move TIDs into no_throttle/tasks (recurring)
6. Move TIDs into no_throttle/mon_groups/cgroupX/tasks (recurring)
7. Move TIDs of a high-BW user into bw_throttled

resctrl/
|- no_throttle/
| |- mon_groups/
| | |- cgroupX/
| | | |- mon_data/
| | | |- tasks
| | | |- ...
| | |- monB/
| | |- mon_data/
| | |- ...
| |- schemata
| |- tasks
| |- ...
|- bw_throttled/
|- ...
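A rough sketch of steps 1–7 against the documented resctrl files. The group names and the cgroup thread-listing helper (cgroup v2's cgroup.threads here) are assumptions; the per-TID moves in steps 5–7 are exactly where the race discussed on the next slide shows up.

import os

RESCTRL = "/sys/fs/resctrl"

def write_tids(tasks_file, tids):
    # resctrl's tasks file takes one TID per write.
    for tid in tids:
        with open(tasks_file, "w") as f:
            f.write(str(tid))

def list_tids(cgroup_path):
    """Assumed helper: thread IDs of a cgroup, via cgroup v2's cgroup.threads."""
    with open(os.path.join(cgroup_path, "cgroup.threads")) as f:
        return [int(line) for line in f if line.strip()]

def setup():
    # Step 1: two resctrl control groups.
    os.makedirs(f"{RESCTRL}/no_throttle", exist_ok=True)
    os.makedirs(f"{RESCTRL}/bw_throttled", exist_ok=True)

def track_cgroup(name, cgroup_path):
    # Step 3: a monitoring group for the cgroup inside no_throttle.
    os.makedirs(f"{RESCTRL}/no_throttle/mon_groups/{name}", exist_ok=True)
    # Steps 5-6 (repeated as threads appear): racy if the cgroup keeps creating TIDs.
    tids = list_tids(cgroup_path)
    write_tids(f"{RESCTRL}/no_throttle/tasks", tids)
    write_tids(f"{RESCTRL}/no_throttle/mon_groups/{name}/tasks", tids)

def throttle_cgroup(cgroup_path):
    # Step 7: move a high-BW user's TIDs into the throttled group.
    write_tids(f"{RESCTRL}/bw_throttled/tasks", list_tids(cgroup_path))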
Challenges with the Naive Approach
Two problems with the steps above:
● Moving TIDs (steps 5–7) races with the cgroup creating new threads; it is expensive when there are many TIDs, and expensive to deal with the race.
● Moving tasks between groups retags them with a new RMID, so L3 cache occupancy data desynchronizes: existing data is still tagged with the old RMID.
A Better Approach for resctrl and cgroups
● What if we could have a 1:1 mapping of cgroups to resctrl groups?
○ To change QoS configs, just rewrite schemata
○ More efficient: removes the need to move TIDs around
○ Keeps the existing RMID, preventing the L3 occupancy desynchronization issue
○ 100% compatible with the existing resctrl abstraction
● CHALLENGE: with the existing system, we would run out of CLOSIDs very quickly
● SOLUTION: share CLOSIDs between resource control groups with the same schemata
● Google-developed kernel patch for this functionality to be released soon
● Demonstrates the need to make the cgroup model a first-class consideration for QoS interfaces
cgroups and resctrl After the Change
Use case: dynamically apply memory BW throttling if the machine is in trouble.
1. Create a resctrl group cgroupX
2. Write a no-throttling configuration to cgroupX/schemata
3. Start cgroupX
4. Move TIDs into cgroupX/tasks (recurring)
5. Rewrite the schemata of the high-BW-using cgroup to throttle it

resctrl/
|- cgroupX/
| |- mon_groups/
| | |- mon_data/
| | |- ...
| |- schemata
| |- tasks
| |- ...
|- high_bw_cgroup/
| |- schemata
| |- ...
|- ...
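A sketch of the per-cgroup flow above: one resctrl group per cgroup, and throttling becomes a schemata rewrite with no TID migration between groups. The MB percentages and group names are illustrative; the MB schemata syntax follows the earlier example.

import os

RESCTRL = "/sys/fs/resctrl"

def set_mem_bw_percent(name, percent):
    """Throttle or unthrottle a group by rewriting its MB schemata line (both sockets,
    as in the earlier example); only the memory-BW setting is being changed here."""
    with open(f"{RESCTRL}/{name}/schemata", "w") as f:
        f.write(f"MB:0={percent};1={percent}\n")

def create_group(name, mem_bw_percent=100):
    # Steps 1-2: one resctrl group per cgroup, written with an unthrottled MB config.
    os.makedirs(f"{RESCTRL}/{name}", exist_ok=True)
    set_mem_bw_percent(name, mem_bw_percent)

# Step 5: throttle the heavy BW user by rewriting its schemata only -- no TIDs move.
# create_group("cgroupX")                    # steps 1-2
# set_mem_bw_percent("high_bw_cgroup", 20)   # step 5 (illustrative 20%)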
µArch Features & Container Runtimes
● Measuring µArch impact is not a first-class component of most container runtimes.
○ Can't manage what we can't see...
● Most container runtimes expose isolation knobs per container.
● Managing µArch isolation requires node- and cluster-level feedback loops.
○ Dual operating mode: admins & users.
○ Performance isolation is not necessarily controllable by end users.
We would love to contribute to a standard framework around performance management for container runtimes.
(Diagram: Efficiency, Availability, Performance)
Takeaways and Future Work
● Memory bandwidth and low-level isolation issues are becoming more significant.
● Continuous monitoring is critical to run successful multi-tenant hosts.
● Defining requirements for h/w providers and s/w interfaces on QoS knobs.
○ Critical to have these solutions work for containers / process groups.
● Increasing the success rate of the current approach:
○ Handling minimum guaranteed membw usage
○ Handling logically related jobs (Borg allocs)
● A general framework would help collaboration.
● Future: memory BW scheduling (based on hints)
○ Based on membw usage
○ Based on membw sensitivity
Thanks!
Find us at the conf or reach out at: davidlo@, dragoss@, jnagal@, eranian@ (all at google.com).