SlideShare a Scribd company logo
1 of 58
Download to read offline
Stop the Guessing
Performance Methodologies for
Production Systems
Brendan Gregg
Lead Performance Engineer, Joyent
Wednesday, June 19, 13
Audience
 This is for developers, support, DBAs, sysadmins
 When perf isn’t your day job, but you want to:
- Fix common performance issues, quickly
- Have guidance for using performance monitoring tools
 Environments with small to large scale production systems
Wednesday, June 19, 13
whoami
 Lead Performance Engineer: analyze everything from apps to metal
 Work/Research: tools, visualizations, methodologies
 Methodologies is the focus of my next book
Wednesday, June 19, 13
Joyent
 High-Performance Cloud Infrastructure
- Public/private cloud provider
 OS Virtualization for bare metal performance
 KVM for Linux and Windows guests
 Core developers of SmartOS and node.js
Wednesday, June 19, 13
Performance Analysis
 Where do I start?
 Then what do I do?
Wednesday, June 19, 13
Performance Methodologies
 Provide
- Beginners: a starting point
- Casual users: a checklist
- Guidance for using existing tools: pose questions to ask
 The following six are for production system monitoring
Wednesday, June 19, 13
Production System Monitoring
 Guessing Methodologies
- 1. Traffic Light Anti-Method
- 2. Average Anti-Method
- 3. Concentration Game Anti-Method
 Not Guessing Methodologies
- 4. Workload Characterization Method
- 5. USE Method
- 6. Thread State Analysis Method
Wednesday, June 19, 13
Traffic Light Anti-Method
Wednesday, June 19, 13
Traffic Light Anti-Method
 1. Open monitoring dashboard
 2. All green? Everything good, mate.
= BAD
= GOOD
Wednesday, June 19, 13
Traffic Light Anti-Method, cont.
 Performance is subjective
- Depends on environment, requirements
- No universal thresholds for good/bad
 Latency outlier example:
- customer A) 200 ms is bad
- customer B) 2 ms is bad (an “eternity”)
 Developer may have chosen thresholds by guessing
Wednesday, June 19, 13
Traffic Light Anti-Method, cont.
 Performance is complex
- Not just one threshold required, but multiple different tests
 For example, a disk traffic light:
- Utilization-based: one disk at 100% for less than 2 seconds means green
(variance), for more than 2 seconds is red (outliers or imbalance), but if all
disks are at 100% for more than 2 seconds, that may be green (FS flush)
provided it is async write I/O, if sync then red, also if their IOPS is less than
10 each (errors), that’s red (sloth disks), unless those I/O are actually huge,
say, 1 Mbyte each or larger, as that can be green, ... etc ...
- Latency-based: I/O more than 100 ms means red, except for async writes
which are green, but slowish I/O more than 20 ms can red in combination,
unless they are more than 1 Mbyte each as that can be green ...
Wednesday, June 19, 13
Traffic Light Anti-Method, cont.
 Types of error:
- I. False positive: red instead of green
- Team wastes time
- II. False negative: green insead of red
- Performance issues remain undiagnosed
- Team wastes more time looking elsewhere
Wednesday, June 19, 13
Traffic Light Anti-Method, cont.
 Subjective metrics (opinion):
- utilization, IOPS, latency
 Objective metrics (fact):
- errors, alerts, SLAs
 For subjective metrics, use
weather icons
- implies an inexact science,
with no hard guarantees
- also attention grabbing
 A dashboard can use both as
appropriate for the metric
http://dtrace.org/blogs/brendan/2008/11/10/status-dashboard
Wednesday, June 19, 13
Traffic Light Anti-Method, cont.
 Pros:
- Intuitive, attention grabbing
- Quick (initially)
 Cons:
- Type I error (red not green): time wasted
- Type II error (green not red): more time wasted & undiagnosed errors
- Misleading for subjective metrics: green might not mean what you think it
means - depends on tests
- Over-simplification
Wednesday, June 19, 13
Average Anti-Method
Wednesday, June 19, 13
Average Anti-Method
 1. Measure the average (mean)
 2. Assume a normal-like distribution (unimodal)
 3. Focus investigation on explaining the average
Wednesday, June 19, 13
Average Anti-Method: You Have
mean stddevstddev 99th
Latency
Wednesday, June 19, 13
Average Anti-Method: You Guess
mean stddevstddev 99th
Latency
Wednesday, June 19, 13
Average Anti-Method: Reality
mean stddevstddev 99th
Latency
Wednesday, June 19, 13
Average Anti-Method: Reality x50
http://dtrace.org/blogs/brendan/2013/06/19/frequency-trails
Wednesday, June 19, 13
Average Anti-Method: Examine the Distribution
 Many distributions aren’t normal, gaussian, or unimodal
 Many distributions have outliers
- seen by the max; may not be visible in the 99...th percentiles
- influence mean and stddev
Wednesday, June 19, 13
Average Anti-Method: Outliers
mean stddev 99th
Latency
Wednesday, June 19, 13
Average Anti-Method: Visualizations
 Distribution is best understood by examining it
- Histogram summary
- Density Plot detailed summary (shown earlier)
- Frequency Trail detailed summary, highlights outliers (previous slides)
- Scatter Plot show distribution over time
- Heat Map show distribution over time, and is scaleable
Wednesday, June 19, 13
Average Anti-Method: Heat Map
http://dtrace.org/blogs/brendan/2013/05/19/revealing-hidden-latency-patterns
http://queue.acm.org/detail.cfm?id=1809426
Time (s)
Latency(us)
Wednesday, June 19, 13
Average Anti-Method
 Pros:
- Averages are versitile: time series line graphs, Little’s Law
 Cons:
- Misleading for multimodal distributions
- Misleading when outliers are present
- Averages are average
Wednesday, June 19, 13
Concentration Game Anti-Method
Wednesday, June 19, 13
Concentration Game Anti-Method
 1. Pick one metric
 2. Pick another metric
 3. Do their time series look the same?
- If so, investigate correlation!
 4. Problem not solved? goto 1
Wednesday, June 19, 13
Concentration Game Anti-Method, cont.
App Latency
Wednesday, June 19, 13
Concentration Game Anti-Method, cont.
NO
App Latency
Wednesday, June 19, 13
Concentration Game Anti-Method, cont.
YES!
App Latency
Wednesday, June 19, 13
Concentration Game Anti-Method, cont.
 Pros:
- Ages 3 and up
- Can discover important correlations between distant systems
 Cons:
- Time consuming: can discover many symptoms before the cause
- Incomplete: missing metrics
Wednesday, June 19, 13
Workload Characterization Method
Wednesday, June 19, 13
Workload Characterization Method
 1. Who is causing the load?
 2. Why is the load called?
 3. What is the load?
 4. How is the load changing over time?
Wednesday, June 19, 13
Workload Characterization Method, cont.
 1. Who: PID, user, IP addr, country, browser
 2. Why: code path, logic
 3. What: targets, URLs, I/O types, request rate (IOPS)
 4. How: minute, hour, day
 The target is the system input (the workload)
not the resulting performance
SystemWorkload
Wednesday, June 19, 13
Workload Characterization Method, cont.
 Pros:
- Potentially largest wins: eliminating unnecessary work
 Cons:
- Only solves a class of issues – load
- Can be time consuming and discouraging – most attributes examined will not
be a problem
Wednesday, June 19, 13
USE Method
Wednesday, June 19, 13
USE Method
 For every resource, check:
 1. Utilization
 2. Saturation
 3. Errors
Wednesday, June 19, 13
USE Method, cont.
 For every resource, check:
 1. Utilization: time resource was busy, or degree used
 2. Saturation: degree of queued extra work
 3. Errors: any errors
 Identifies resource bottnecks
quickly
Saturation
Errors
X
Utilization
Wednesday, June 19, 13
USE Method, cont.
 Hardware Resources:
- CPUs
- Main Memory
- Network Interfaces
- Storage Devices
- Controllers
- Interconnects
 Find the functional diagram and examine every item in the data path...
Wednesday, June 19, 13
USE Method, cont.: System Functional Diagram
DRAM
CPU
1
DRAM
CPU
Interconnect
Memory
Bus
Memory
Bus
I/O Bridge
I/O Controller Network Controller
Disk Disk Port Port
I/O Bus
Expander Interconnect
Interface
Transports
CPU
1
For each check:
1. Utilization
2. Saturation
3. Errors
Wednesday, June 19, 13
USE Method, cont.: Linux System Checklist
Resource Type Metric
CPU Utilization
per-cpu: mpstat -P ALL 1, “%idle”; sar -P ALL, “%idle”;
system-wide: vmstat 1, “id”; sar -u, “%idle”; dstat -c, “idl”;
per-process:top, “%CPU”; htop, “CPU%”; ps -o pcpu; pidstat
1, “%CPU”; per-kernel-thread: top/htop (“K” to toggle), where VIRT
== 0 (heuristic).
CPU Saturation
system-wide: vmstat 1, “r” > CPU count [2]; sar -q, “runq-sz” >
CPU count; dstat -p, “run” > CPU count; per-process: /proc/PID/
schedstat 2nd field (sched_info.run_delay); perf sched latency
(shows “Average” and “Maximum” delay per-schedule); dynamic
tracing, eg, SystemTap schedtimes.stp “queued(us)”
CPU Errors
perf (LPE) if processor specific error events (CPC) are available; eg,
AMD64′s “04Ah Single-bit ECC Errors Recorded by Scrubber”
... ... ...
http://dtrace.org/blogs/brendan/2012/03/07/the-use-method-linux-performance-checklist
Wednesday, June 19, 13
USE Method, cont.: Monitoring Tools
 Average metrics don’t work: individual components can become bottlenecks
 Eg, CPU utilization
 Utilization heat map on the right
shows 5,312 CPUs for 60 secs;
can still identify “hot CPUs”
100
0
Utilization
Time
darkness == # of CPUs
hot CPUs
http://dtrace.org/blogs/brendan/2011/12/18/visualizing-device-utilization
Wednesday, June 19, 13
USE Method, cont.: Other Targets
 For cloud computing, must study any resource limits as well as physical; eg:
- physical network interface U.S.E.
- AND instance network cap U.S.E.
 Other software resources can also be studied with USE metrics:
- Mutex Locks
- Thread Pools
 The application environment can also be studied
- Find or draw a functional diagram
- Decompose into queueing systems
Wednesday, June 19, 13
USE Method, cont.: Homework
 Your ToDo:
- 1. find a system functional diagram
- 2. based on it, create a USE checklist on your internal wiki
- 3. fill out metrics based on your available toolset
- 4. repeat for your application environment
 You get:
- A checklist for all staff for quickly finding bottlenecks
- Awareness of what you cannot measure:
- unknown unknowns become known unknowns
- ... and known unknowns can become feature requests!
Wednesday, June 19, 13
USE Method, cont.
 Pros:
- Complete: all resource bottlenecks and errors
- Not limited in scope by available metrics
- No unknown unknowns – at least known unknowns
- Efficient: picks three metrics for each resource –
from what may be hundreds available
 Cons:
- Limited to a class of issues: resource bottlenecks
Wednesday, June 19, 13
Thread State Analysis Method
Wednesday, June 19, 13
Thread State Analysis Method
 1. Divide thread time into operating system states
 2. Measure states for each application thread
 3. Investigate largest non-idle state
Wednesday, June 19, 13
Thread State Analysis Method, cont.: 2 State
 A minimum of two states:
On-CPU
Off-CPU
Wednesday, June 19, 13
Thread State Analysis Method, cont.: 2 State
 A minimum of two states:
 Simple, but off-CPU state ambiguous without further division
On-CPU
executing
spinning on a lock
Off-CPU
waiting for a turn on-CPU
waiting for storage or network I/O
waiting for swap ins or page ins
blocked on a lock
idle waiting for work
Wednesday, June 19, 13
Thread State Analysis Method, cont.: 6 State
 Six states, based on Unix process states:
Executing
Runnable
Anonymous Paging
Sleeping
Lock
Idle
Wednesday, June 19, 13
Thread State Analysis Method, cont.: 6 State
 Six states, based on Unix process states:
 Generic: works for all applications
Executing on-CPU
Runnable and waiting for a turn on CPU
Anonymous Paging runnable, but blocked waiting for page ins
Sleeping waiting for I/O: storage, network, and data/text page ins
Lock waiting to acquire a synchronization lock
Idle waiting for work
Wednesday, June 19, 13
Thread State Analysis Method, cont.
 As with other methodologies, these pose questions to answer
- Even if they are hard to answer
 Measuring states isn’t currently easy, but can be done
- Linux: /proc, schedstats, delay accounting, I/O accounting, DTrace
- SmartOS: /proc, microstate accounting, DTrace
 Idle state may be the most difficult: applications use different techniques to
wait for work
Wednesday, June 19, 13
Thread State Analysis Method, cont.
 States lead to further investigation and actionable items:
Executing Profile stacks; split into usr/sys; sys = analyze syscalls
Runnable Examine CPU load for entire system, and caps
Anonymous Paging Check main memory free, and process memory usage
Sleeping Identify resource thread is blocked on; syscall analysis
Lock Lock analysis
Wednesday, June 19, 13
Thread State Analysis Method, cont.
 Compare to database query time. This alone can be misleading, including:
- swap time (anonymous paging) due to a memory misconfig
- CPU scheduler latency due to another application
 Same for any “time spent in ...” metric
- is it really in ...?
Wednesday, June 19, 13
Thread State Analysis Method, cont.
 Pros:
- Identifies common problem sources, including from other applications
- Quantifies application effects: compare times numerically
- Directs further analysis and actions
 Cons:
- Currently difficult to measure all states
Wednesday, June 19, 13
More Methodologies
 Include:
- Drill Down Analysis
- Latency Analysis
- Event Tracing
- Scientific Method
- Micro Benchmarking
- Baseline Statistics
- Modelling
 For when performance is your day job
Wednesday, June 19, 13
Stop the Guessing
 The anti-methodolgies involved:
- guesswork
- beginning with the tools or metrics (answers)
 The actual methodolgies posed questions, then sought metrics to answer them
 You don’t need to guess – post-DTrace, practically everything can be known
 Stop guessing and start asking questions!
Wednesday, June 19, 13
Thank You!
 email: brendan@joyent.com
 twitter: @brendangregg
 github: https://github.com/brendangregg
 blog: http://dtrace.org/blogs/brendan
 blog resources:
- http://dtrace.org/blogs/brendan/2008/11/10/status-dashboard
- http://dtrace.org/blogs/brendan/2013/06/19/frequency-trails
- http://dtrace.org/blogs/brendan/2013/05/19/revealing-hidden-latency-patterns
- http://dtrace.org/blogs/brendan/2012/03/07/the-use-method-linux-performance-checklist
- http://dtrace.org/blogs/brendan/2011/12/18/visualizing-device-utilization
Wednesday, June 19, 13

More Related Content

What's hot

Secure container: Kata container and gVisor
Secure container: Kata container and gVisorSecure container: Kata container and gVisor
Secure container: Kata container and gVisorChing-Hsuan Yen
 
Flexible Authentication Strategies with SASL/OAUTHBEARER (Michael Kaminski, T...
Flexible Authentication Strategies with SASL/OAUTHBEARER (Michael Kaminski, T...Flexible Authentication Strategies with SASL/OAUTHBEARER (Michael Kaminski, T...
Flexible Authentication Strategies with SASL/OAUTHBEARER (Michael Kaminski, T...confluent
 
LISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceBrendan Gregg
 
Container Performance Analysis
Container Performance AnalysisContainer Performance Analysis
Container Performance AnalysisBrendan Gregg
 
All about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdfAll about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdfAltinity Ltd
 
YOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixYOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixBrendan Gregg
 
Scaling WebRTC applications with Janus
Scaling WebRTC applications with JanusScaling WebRTC applications with Janus
Scaling WebRTC applications with JanusLorenzo Miniero
 
Solving PostgreSQL wicked problems
Solving PostgreSQL wicked problemsSolving PostgreSQL wicked problems
Solving PostgreSQL wicked problemsAlexander Korotkov
 
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity PlanningFrom Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planningconfluent
 
How We Reduced Performance Tuning Time by Orders of Magnitude with Database O...
How We Reduced Performance Tuning Time by Orders of Magnitude with Database O...How We Reduced Performance Tuning Time by Orders of Magnitude with Database O...
How We Reduced Performance Tuning Time by Orders of Magnitude with Database O...ScyllaDB
 
Patroni - HA PostgreSQL made easy
Patroni - HA PostgreSQL made easyPatroni - HA PostgreSQL made easy
Patroni - HA PostgreSQL made easyAlexander Kukushkin
 
Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBScyllaDB
 
Building zero data loss pipelines with apache kafka
Building zero data loss pipelines with apache kafkaBuilding zero data loss pipelines with apache kafka
Building zero data loss pipelines with apache kafkaAvinash Ramineni
 
[KubeCon EU 2022] Running containerd and k3s on macOS
[KubeCon EU 2022] Running containerd and k3s on macOS[KubeCon EU 2022] Running containerd and k3s on macOS
[KubeCon EU 2022] Running containerd and k3s on macOSAkihiro Suda
 
Efficient monitoring and alerting
Efficient monitoring and alertingEfficient monitoring and alerting
Efficient monitoring and alertingTobias Schmidt
 
Etsy Activity Feeds Architecture
Etsy Activity Feeds ArchitectureEtsy Activity Feeds Architecture
Etsy Activity Feeds ArchitectureDan McKinley
 
Server monitoring using grafana and prometheus
Server monitoring using grafana and prometheusServer monitoring using grafana and prometheus
Server monitoring using grafana and prometheusCeline George
 
Performance Analysis: The USE Method
Performance Analysis: The USE MethodPerformance Analysis: The USE Method
Performance Analysis: The USE MethodBrendan Gregg
 
Kafka on ZFS: Better Living Through Filesystems
Kafka on ZFS: Better Living Through Filesystems Kafka on ZFS: Better Living Through Filesystems
Kafka on ZFS: Better Living Through Filesystems confluent
 
ClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEO
ClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEOClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEO
ClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEOAltinity Ltd
 

What's hot (20)

Secure container: Kata container and gVisor
Secure container: Kata container and gVisorSecure container: Kata container and gVisor
Secure container: Kata container and gVisor
 
Flexible Authentication Strategies with SASL/OAUTHBEARER (Michael Kaminski, T...
Flexible Authentication Strategies with SASL/OAUTHBEARER (Michael Kaminski, T...Flexible Authentication Strategies with SASL/OAUTHBEARER (Michael Kaminski, T...
Flexible Authentication Strategies with SASL/OAUTHBEARER (Michael Kaminski, T...
 
LISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems Performance
 
Container Performance Analysis
Container Performance AnalysisContainer Performance Analysis
Container Performance Analysis
 
All about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdfAll about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdf
 
YOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixYOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at Netflix
 
Scaling WebRTC applications with Janus
Scaling WebRTC applications with JanusScaling WebRTC applications with Janus
Scaling WebRTC applications with Janus
 
Solving PostgreSQL wicked problems
Solving PostgreSQL wicked problemsSolving PostgreSQL wicked problems
Solving PostgreSQL wicked problems
 
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity PlanningFrom Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
 
How We Reduced Performance Tuning Time by Orders of Magnitude with Database O...
How We Reduced Performance Tuning Time by Orders of Magnitude with Database O...How We Reduced Performance Tuning Time by Orders of Magnitude with Database O...
How We Reduced Performance Tuning Time by Orders of Magnitude with Database O...
 
Patroni - HA PostgreSQL made easy
Patroni - HA PostgreSQL made easyPatroni - HA PostgreSQL made easy
Patroni - HA PostgreSQL made easy
 
Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDB
 
Building zero data loss pipelines with apache kafka
Building zero data loss pipelines with apache kafkaBuilding zero data loss pipelines with apache kafka
Building zero data loss pipelines with apache kafka
 
[KubeCon EU 2022] Running containerd and k3s on macOS
[KubeCon EU 2022] Running containerd and k3s on macOS[KubeCon EU 2022] Running containerd and k3s on macOS
[KubeCon EU 2022] Running containerd and k3s on macOS
 
Efficient monitoring and alerting
Efficient monitoring and alertingEfficient monitoring and alerting
Efficient monitoring and alerting
 
Etsy Activity Feeds Architecture
Etsy Activity Feeds ArchitectureEtsy Activity Feeds Architecture
Etsy Activity Feeds Architecture
 
Server monitoring using grafana and prometheus
Server monitoring using grafana and prometheusServer monitoring using grafana and prometheus
Server monitoring using grafana and prometheus
 
Performance Analysis: The USE Method
Performance Analysis: The USE MethodPerformance Analysis: The USE Method
Performance Analysis: The USE Method
 
Kafka on ZFS: Better Living Through Filesystems
Kafka on ZFS: Better Living Through Filesystems Kafka on ZFS: Better Living Through Filesystems
Kafka on ZFS: Better Living Through Filesystems
 
ClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEO
ClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEOClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEO
ClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEO
 

Viewers also liked

Velocity 2015 linux perf tools
Velocity 2015 linux perf toolsVelocity 2015 linux perf tools
Velocity 2015 linux perf toolsBrendan Gregg
 
ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016Brendan Gregg
 
Netflix: From Clouds to Roots
Netflix: From Clouds to RootsNetflix: From Clouds to Roots
Netflix: From Clouds to RootsBrendan Gregg
 
SREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsSREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsBrendan Gregg
 
Linux BPF Superpowers
Linux BPF SuperpowersLinux BPF Superpowers
Linux BPF SuperpowersBrendan Gregg
 
Velocity 2017 Performance analysis superpowers with Linux eBPF
Velocity 2017 Performance analysis superpowers with Linux eBPFVelocity 2017 Performance analysis superpowers with Linux eBPF
Velocity 2017 Performance analysis superpowers with Linux eBPFBrendan Gregg
 
Blazing Performance with Flame Graphs
Blazing Performance with Flame GraphsBlazing Performance with Flame Graphs
Blazing Performance with Flame GraphsBrendan Gregg
 
Performance Tuning EC2 Instances
Performance Tuning EC2 InstancesPerformance Tuning EC2 Instances
Performance Tuning EC2 InstancesBrendan Gregg
 
Kernel Recipes 2017: Using Linux perf at Netflix
Kernel Recipes 2017: Using Linux perf at NetflixKernel Recipes 2017: Using Linux perf at Netflix
Kernel Recipes 2017: Using Linux perf at NetflixBrendan Gregg
 
EuroBSDcon 2017 System Performance Analysis Methodologies
EuroBSDcon 2017 System Performance Analysis MethodologiesEuroBSDcon 2017 System Performance Analysis Methodologies
EuroBSDcon 2017 System Performance Analysis MethodologiesBrendan Gregg
 
Linux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPFLinux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPFBrendan Gregg
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing ToolsBrendan Gregg
 

Viewers also liked (12)

Velocity 2015 linux perf tools
Velocity 2015 linux perf toolsVelocity 2015 linux perf tools
Velocity 2015 linux perf tools
 
ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016
 
Netflix: From Clouds to Roots
Netflix: From Clouds to RootsNetflix: From Clouds to Roots
Netflix: From Clouds to Roots
 
SREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsSREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREs
 
Linux BPF Superpowers
Linux BPF SuperpowersLinux BPF Superpowers
Linux BPF Superpowers
 
Velocity 2017 Performance analysis superpowers with Linux eBPF
Velocity 2017 Performance analysis superpowers with Linux eBPFVelocity 2017 Performance analysis superpowers with Linux eBPF
Velocity 2017 Performance analysis superpowers with Linux eBPF
 
Blazing Performance with Flame Graphs
Blazing Performance with Flame GraphsBlazing Performance with Flame Graphs
Blazing Performance with Flame Graphs
 
Performance Tuning EC2 Instances
Performance Tuning EC2 InstancesPerformance Tuning EC2 Instances
Performance Tuning EC2 Instances
 
Kernel Recipes 2017: Using Linux perf at Netflix
Kernel Recipes 2017: Using Linux perf at NetflixKernel Recipes 2017: Using Linux perf at Netflix
Kernel Recipes 2017: Using Linux perf at Netflix
 
EuroBSDcon 2017 System Performance Analysis Methodologies
EuroBSDcon 2017 System Performance Analysis MethodologiesEuroBSDcon 2017 System Performance Analysis Methodologies
EuroBSDcon 2017 System Performance Analysis Methodologies
 
Linux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPFLinux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPF
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing Tools
 

Similar to Stop the Guessing: Performance Methodologies for Production Systems

LOPSA East 2013 - Building a More Effective Monitoring Environment
LOPSA East 2013 - Building a More Effective Monitoring EnvironmentLOPSA East 2013 - Building a More Effective Monitoring Environment
LOPSA East 2013 - Building a More Effective Monitoring EnvironmentMike Julian
 
Performance engineering methodologies
Performance engineering  methodologiesPerformance engineering  methodologies
Performance engineering methodologiesManeesh Chaturvedi
 
Velocity 2013 - Rum vs Synthetic
Velocity 2013 - Rum vs SyntheticVelocity 2013 - Rum vs Synthetic
Velocity 2013 - Rum vs Syntheticjfox85
 
Hotspot Garbage Collection - Tuning Guide
Hotspot Garbage Collection - Tuning GuideHotspot Garbage Collection - Tuning Guide
Hotspot Garbage Collection - Tuning GuidejClarity
 
V center operations enterprise standalone technical presentation
V center operations enterprise standalone technical presentationV center operations enterprise standalone technical presentation
V center operations enterprise standalone technical presentationsolarisyourep
 
Dynamic Population Discovery for Lateral Movement (Using Machine Learning)
Dynamic Population Discovery for Lateral Movement (Using Machine Learning)Dynamic Population Discovery for Lateral Movement (Using Machine Learning)
Dynamic Population Discovery for Lateral Movement (Using Machine Learning)Rod Soto
 
How to find_vulnerability_in_software
How to find_vulnerability_in_softwareHow to find_vulnerability_in_software
How to find_vulnerability_in_softwaresanghwan ahn
 
Building Scalable, Distributed Job Queues with Redis and Ruby
Building Scalable, Distributed Job Queues with Redis and Ruby Building Scalable, Distributed Job Queues with Redis and Ruby
Building Scalable, Distributed Job Queues with Redis and Ruby Francesco Laurita
 
Next generation alerting and fault detection, SRECon Europe 2016
Next generation alerting and fault detection, SRECon Europe 2016Next generation alerting and fault detection, SRECon Europe 2016
Next generation alerting and fault detection, SRECon Europe 2016Dieter Plaetinck
 
Big Data Science - hype?
Big Data Science - hype?Big Data Science - hype?
Big Data Science - hype?BalaBit
 
Scheduling techniques for reducing processor energy use in MacOS
Scheduling techniques for reducing processor energy use in MacOSScheduling techniques for reducing processor energy use in MacOS
Scheduling techniques for reducing processor energy use in MacOSbane5isp
 
The math behind big systems analysis.
The math behind big systems analysis.The math behind big systems analysis.
The math behind big systems analysis.Theo Schlossnagle
 
So long, and thanks for all the tests (Scottish Developers 2014)
So long, and thanks for all the tests (Scottish Developers 2014)So long, and thanks for all the tests (Scottish Developers 2014)
So long, and thanks for all the tests (Scottish Developers 2014)Seb Rose
 
Eric Proegler Early Performance Testing from CAST2014
Eric Proegler Early Performance Testing from CAST2014Eric Proegler Early Performance Testing from CAST2014
Eric Proegler Early Performance Testing from CAST2014Eric Proegler
 

Similar to Stop the Guessing: Performance Methodologies for Production Systems (20)

Quals
QualsQuals
Quals
 
Quals
QualsQuals
Quals
 
LOPSA East 2013 - Building a More Effective Monitoring Environment
LOPSA East 2013 - Building a More Effective Monitoring EnvironmentLOPSA East 2013 - Building a More Effective Monitoring Environment
LOPSA East 2013 - Building a More Effective Monitoring Environment
 
Performance engineering methodologies
Performance engineering  methodologiesPerformance engineering  methodologies
Performance engineering methodologies
 
Velocity 2013 - Rum vs Synthetic
Velocity 2013 - Rum vs SyntheticVelocity 2013 - Rum vs Synthetic
Velocity 2013 - Rum vs Synthetic
 
Hotspot Garbage Collection - Tuning Guide
Hotspot Garbage Collection - Tuning GuideHotspot Garbage Collection - Tuning Guide
Hotspot Garbage Collection - Tuning Guide
 
V center operations enterprise standalone technical presentation
V center operations enterprise standalone technical presentationV center operations enterprise standalone technical presentation
V center operations enterprise standalone technical presentation
 
Dynamic Population Discovery for Lateral Movement (Using Machine Learning)
Dynamic Population Discovery for Lateral Movement (Using Machine Learning)Dynamic Population Discovery for Lateral Movement (Using Machine Learning)
Dynamic Population Discovery for Lateral Movement (Using Machine Learning)
 
How to find_vulnerability_in_software
How to find_vulnerability_in_softwareHow to find_vulnerability_in_software
How to find_vulnerability_in_software
 
Building Scalable, Distributed Job Queues with Redis and Ruby
Building Scalable, Distributed Job Queues with Redis and Ruby Building Scalable, Distributed Job Queues with Redis and Ruby
Building Scalable, Distributed Job Queues with Redis and Ruby
 
Next generation alerting and fault detection, SRECon Europe 2016
Next generation alerting and fault detection, SRECon Europe 2016Next generation alerting and fault detection, SRECon Europe 2016
Next generation alerting and fault detection, SRECon Europe 2016
 
The Shape of Cloud to Come
The Shape of Cloud to ComeThe Shape of Cloud to Come
The Shape of Cloud to Come
 
Computer Fundamental
Computer FundamentalComputer Fundamental
Computer Fundamental
 
Big Data Science - hype?
Big Data Science - hype?Big Data Science - hype?
Big Data Science - hype?
 
Scheduling techniques for reducing processor energy use in MacOS
Scheduling techniques for reducing processor energy use in MacOSScheduling techniques for reducing processor energy use in MacOS
Scheduling techniques for reducing processor energy use in MacOS
 
Slides barcelona risk data
Slides barcelona risk dataSlides barcelona risk data
Slides barcelona risk data
 
The math behind big systems analysis.
The math behind big systems analysis.The math behind big systems analysis.
The math behind big systems analysis.
 
Unlock Security Insight from Machine Data
Unlock Security Insight from Machine DataUnlock Security Insight from Machine Data
Unlock Security Insight from Machine Data
 
So long, and thanks for all the tests (Scottish Developers 2014)
So long, and thanks for all the tests (Scottish Developers 2014)So long, and thanks for all the tests (Scottish Developers 2014)
So long, and thanks for all the tests (Scottish Developers 2014)
 
Eric Proegler Early Performance Testing from CAST2014
Eric Proegler Early Performance Testing from CAST2014Eric Proegler Early Performance Testing from CAST2014
Eric Proegler Early Performance Testing from CAST2014
 

More from Brendan Gregg

IntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingIntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingBrendan Gregg
 
Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Brendan Gregg
 
Systems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting StartedSystems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting StartedBrendan Gregg
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Brendan Gregg
 
BPF Internals (eBPF)
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)Brendan Gregg
 
Performance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedBrendan Gregg
 
YOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceYOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceBrendan Gregg
 
re:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflixre:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at NetflixBrendan Gregg
 
UM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareUM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareBrendan Gregg
 
LPC2019 BPF Tracing Tools
LPC2019 BPF Tracing ToolsLPC2019 BPF Tracing Tools
LPC2019 BPF Tracing ToolsBrendan Gregg
 
LSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityLSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityBrendan Gregg
 
YOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflixYOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflixBrendan Gregg
 
eBPF Perf Tools 2019
eBPF Perf Tools 2019eBPF Perf Tools 2019
eBPF Perf Tools 2019Brendan Gregg
 
NetConf 2018 BPF Observability
NetConf 2018 BPF ObservabilityNetConf 2018 BPF Observability
NetConf 2018 BPF ObservabilityBrendan Gregg
 
ATO Linux Performance 2018
ATO Linux Performance 2018ATO Linux Performance 2018
ATO Linux Performance 2018Brendan Gregg
 
Linux Performance 2018 (PerconaLive keynote)
Linux Performance 2018 (PerconaLive keynote)Linux Performance 2018 (PerconaLive keynote)
Linux Performance 2018 (PerconaLive keynote)Brendan Gregg
 
How Netflix Tunes EC2 Instances for Performance
How Netflix Tunes EC2 Instances for PerformanceHow Netflix Tunes EC2 Instances for Performance
How Netflix Tunes EC2 Instances for PerformanceBrendan Gregg
 
LISA17 Container Performance Analysis
LISA17 Container Performance AnalysisLISA17 Container Performance Analysis
LISA17 Container Performance AnalysisBrendan Gregg
 

More from Brendan Gregg (20)

IntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingIntelON 2021 Processor Benchmarking
IntelON 2021 Processor Benchmarking
 
Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)
 
Systems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting StartedSystems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting Started
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)
 
BPF Internals (eBPF)
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)
 
Performance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting Started
 
YOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceYOW2020 Linux Systems Performance
YOW2020 Linux Systems Performance
 
re:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflixre:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflix
 
UM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareUM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of Software
 
LPC2019 BPF Tracing Tools
LPC2019 BPF Tracing ToolsLPC2019 BPF Tracing Tools
LPC2019 BPF Tracing Tools
 
LSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityLSFMM 2019 BPF Observability
LSFMM 2019 BPF Observability
 
YOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflixYOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflix
 
eBPF Perf Tools 2019
eBPF Perf Tools 2019eBPF Perf Tools 2019
eBPF Perf Tools 2019
 
BPF Tools 2017
BPF Tools 2017BPF Tools 2017
BPF Tools 2017
 
NetConf 2018 BPF Observability
NetConf 2018 BPF ObservabilityNetConf 2018 BPF Observability
NetConf 2018 BPF Observability
 
FlameScope 2018
FlameScope 2018FlameScope 2018
FlameScope 2018
 
ATO Linux Performance 2018
ATO Linux Performance 2018ATO Linux Performance 2018
ATO Linux Performance 2018
 
Linux Performance 2018 (PerconaLive keynote)
Linux Performance 2018 (PerconaLive keynote)Linux Performance 2018 (PerconaLive keynote)
Linux Performance 2018 (PerconaLive keynote)
 
How Netflix Tunes EC2 Instances for Performance
How Netflix Tunes EC2 Instances for PerformanceHow Netflix Tunes EC2 Instances for Performance
How Netflix Tunes EC2 Instances for Performance
 
LISA17 Container Performance Analysis
LISA17 Container Performance AnalysisLISA17 Container Performance Analysis
LISA17 Container Performance Analysis
 

Recently uploaded

Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 

Recently uploaded (20)

Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 

Stop the Guessing: Performance Methodologies for Production Systems

  • 1. Stop the Guessing Performance Methodologies for Production Systems Brendan Gregg Lead Performance Engineer, Joyent Wednesday, June 19, 13
  • 2. Audience  This is for developers, support, DBAs, sysadmins  When perf isn’t your day job, but you want to: - Fix common performance issues, quickly - Have guidance for using performance monitoring tools  Environments with small to large scale production systems Wednesday, June 19, 13
  • 3. whoami  Lead Performance Engineer: analyze everything from apps to metal  Work/Research: tools, visualizations, methodologies  Methodologies is the focus of my next book Wednesday, June 19, 13
  • 4. Joyent  High-Performance Cloud Infrastructure - Public/private cloud provider  OS Virtualization for bare metal performance  KVM for Linux and Windows guests  Core developers of SmartOS and node.js Wednesday, June 19, 13
  • 5. Performance Analysis  Where do I start?  Then what do I do? Wednesday, June 19, 13
  • 6. Performance Methodologies  Provide - Beginners: a starting point - Casual users: a checklist - Guidance for using existing tools: pose questions to ask  The following six are for production system monitoring Wednesday, June 19, 13
  • 7. Production System Monitoring  Guessing Methodologies - 1. Traffic Light Anti-Method - 2. Average Anti-Method - 3. Concentration Game Anti-Method  Not Guessing Methodologies - 4. Workload Characterization Method - 5. USE Method - 6. Thread State Analysis Method Wednesday, June 19, 13
  • 9. Traffic Light Anti-Method  1. Open monitoring dashboard  2. All green? Everything good, mate. = BAD = GOOD Wednesday, June 19, 13
  • 10. Traffic Light Anti-Method, cont.  Performance is subjective - Depends on environment, requirements - No universal thresholds for good/bad  Latency outlier example: - customer A) 200 ms is bad - customer B) 2 ms is bad (an “eternity”)  Developer may have chosen thresholds by guessing Wednesday, June 19, 13
  • 11. Traffic Light Anti-Method, cont.  Performance is complex - Not just one threshold required, but multiple different tests  For example, a disk traffic light: - Utilization-based: one disk at 100% for less than 2 seconds means green (variance), for more than 2 seconds is red (outliers or imbalance), but if all disks are at 100% for more than 2 seconds, that may be green (FS flush) provided it is async write I/O, if sync then red, also if their IOPS is less than 10 each (errors), that’s red (sloth disks), unless those I/O are actually huge, say, 1 Mbyte each or larger, as that can be green, ... etc ... - Latency-based: I/O more than 100 ms means red, except for async writes which are green, but slowish I/O more than 20 ms can red in combination, unless they are more than 1 Mbyte each as that can be green ... Wednesday, June 19, 13
  • 12. Traffic Light Anti-Method, cont.  Types of error: - I. False positive: red instead of green - Team wastes time - II. False negative: green insead of red - Performance issues remain undiagnosed - Team wastes more time looking elsewhere Wednesday, June 19, 13
  • 13. Traffic Light Anti-Method, cont.  Subjective metrics (opinion): - utilization, IOPS, latency  Objective metrics (fact): - errors, alerts, SLAs  For subjective metrics, use weather icons - implies an inexact science, with no hard guarantees - also attention grabbing  A dashboard can use both as appropriate for the metric http://dtrace.org/blogs/brendan/2008/11/10/status-dashboard Wednesday, June 19, 13
  • 14. Traffic Light Anti-Method, cont.  Pros: - Intuitive, attention grabbing - Quick (initially)  Cons: - Type I error (red not green): time wasted - Type II error (green not red): more time wasted & undiagnosed errors - Misleading for subjective metrics: green might not mean what you think it means - depends on tests - Over-simplification Wednesday, June 19, 13
  • 16. Average Anti-Method  1. Measure the average (mean)  2. Assume a normal-like distribution (unimodal)  3. Focus investigation on explaining the average Wednesday, June 19, 13
  • 17. Average Anti-Method: You Have mean stddevstddev 99th Latency Wednesday, June 19, 13
  • 18. Average Anti-Method: You Guess mean stddevstddev 99th Latency Wednesday, June 19, 13
  • 19. Average Anti-Method: Reality mean stddevstddev 99th Latency Wednesday, June 19, 13
  • 20. Average Anti-Method: Reality x50 http://dtrace.org/blogs/brendan/2013/06/19/frequency-trails Wednesday, June 19, 13
  • 21. Average Anti-Method: Examine the Distribution  Many distributions aren’t normal, gaussian, or unimodal  Many distributions have outliers - seen by the max; may not be visible in the 99...th percentiles - influence mean and stddev Wednesday, June 19, 13
  • 22. Average Anti-Method: Outliers mean stddev 99th Latency Wednesday, June 19, 13
  • 23. Average Anti-Method: Visualizations  Distribution is best understood by examining it - Histogram summary - Density Plot detailed summary (shown earlier) - Frequency Trail detailed summary, highlights outliers (previous slides) - Scatter Plot show distribution over time - Heat Map show distribution over time, and is scaleable Wednesday, June 19, 13
  • 24. Average Anti-Method: Heat Map http://dtrace.org/blogs/brendan/2013/05/19/revealing-hidden-latency-patterns http://queue.acm.org/detail.cfm?id=1809426 Time (s) Latency(us) Wednesday, June 19, 13
  • 25. Average Anti-Method  Pros: - Averages are versitile: time series line graphs, Little’s Law  Cons: - Misleading for multimodal distributions - Misleading when outliers are present - Averages are average Wednesday, June 19, 13
  • 27. Concentration Game Anti-Method  1. Pick one metric  2. Pick another metric  3. Do their time series look the same? - If so, investigate correlation!  4. Problem not solved? goto 1 Wednesday, June 19, 13
  • 28. Concentration Game Anti-Method, cont. App Latency Wednesday, June 19, 13
  • 29. Concentration Game Anti-Method, cont. NO App Latency Wednesday, June 19, 13
  • 30. Concentration Game Anti-Method, cont. YES! App Latency Wednesday, June 19, 13
  • 31. Concentration Game Anti-Method, cont.  Pros: - Ages 3 and up - Can discover important correlations between distant systems  Cons: - Time consuming: can discover many symptoms before the cause - Incomplete: missing metrics Wednesday, June 19, 13
  • 33. Workload Characterization Method  1. Who is causing the load?  2. Why is the load called?  3. What is the load?  4. How is the load changing over time? Wednesday, June 19, 13
  • 34. Workload Characterization Method, cont.  1. Who: PID, user, IP addr, country, browser  2. Why: code path, logic  3. What: targets, URLs, I/O types, request rate (IOPS)  4. How: minute, hour, day  The target is the system input (the workload) not the resulting performance SystemWorkload Wednesday, June 19, 13
  • 35. Workload Characterization Method, cont.  Pros: - Potentially largest wins: eliminating unnecessary work  Cons: - Only solves a class of issues – load - Can be time consuming and discouraging – most attributes examined will not be a problem Wednesday, June 19, 13
  • 37. USE Method  For every resource, check:  1. Utilization  2. Saturation  3. Errors Wednesday, June 19, 13
  • 38. USE Method, cont.  For every resource, check:  1. Utilization: time resource was busy, or degree used  2. Saturation: degree of queued extra work  3. Errors: any errors  Identifies resource bottnecks quickly Saturation Errors X Utilization Wednesday, June 19, 13
  • 39. USE Method, cont.  Hardware Resources: - CPUs - Main Memory - Network Interfaces - Storage Devices - Controllers - Interconnects  Find the functional diagram and examine every item in the data path... Wednesday, June 19, 13
  • 40. USE Method, cont.: System Functional Diagram DRAM CPU 1 DRAM CPU Interconnect Memory Bus Memory Bus I/O Bridge I/O Controller Network Controller Disk Disk Port Port I/O Bus Expander Interconnect Interface Transports CPU 1 For each check: 1. Utilization 2. Saturation 3. Errors Wednesday, June 19, 13
  • 41. USE Method, cont.: Linux System Checklist Resource Type Metric CPU Utilization per-cpu: mpstat -P ALL 1, “%idle”; sar -P ALL, “%idle”; system-wide: vmstat 1, “id”; sar -u, “%idle”; dstat -c, “idl”; per-process:top, “%CPU”; htop, “CPU%”; ps -o pcpu; pidstat 1, “%CPU”; per-kernel-thread: top/htop (“K” to toggle), where VIRT == 0 (heuristic). CPU Saturation system-wide: vmstat 1, “r” > CPU count [2]; sar -q, “runq-sz” > CPU count; dstat -p, “run” > CPU count; per-process: /proc/PID/ schedstat 2nd field (sched_info.run_delay); perf sched latency (shows “Average” and “Maximum” delay per-schedule); dynamic tracing, eg, SystemTap schedtimes.stp “queued(us)” CPU Errors perf (LPE) if processor specific error events (CPC) are available; eg, AMD64′s “04Ah Single-bit ECC Errors Recorded by Scrubber” ... ... ... http://dtrace.org/blogs/brendan/2012/03/07/the-use-method-linux-performance-checklist Wednesday, June 19, 13
  • 42. USE Method, cont.: Monitoring Tools  Average metrics don’t work: individual components can become bottlenecks  Eg, CPU utilization  Utilization heat map on the right shows 5,312 CPUs for 60 secs; can still identify “hot CPUs” 100 0 Utilization Time darkness == # of CPUs hot CPUs http://dtrace.org/blogs/brendan/2011/12/18/visualizing-device-utilization Wednesday, June 19, 13
  • 43. USE Method, cont.: Other Targets  For cloud computing, must study any resource limits as well as physical; eg: - physical network interface U.S.E. - AND instance network cap U.S.E.  Other software resources can also be studied with USE metrics: - Mutex Locks - Thread Pools  The application environment can also be studied - Find or draw a functional diagram - Decompose into queueing systems Wednesday, June 19, 13
  • 44. USE Method, cont.: Homework  Your ToDo: - 1. find a system functional diagram - 2. based on it, create a USE checklist on your internal wiki - 3. fill out metrics based on your available toolset - 4. repeat for your application environment  You get: - A checklist for all staff for quickly finding bottlenecks - Awareness of what you cannot measure: - unknown unknowns become known unknowns - ... and known unknowns can become feature requests! Wednesday, June 19, 13
  • 45. USE Method, cont.  Pros: - Complete: all resource bottlenecks and errors - Not limited in scope by available metrics - No unknown unknowns – at least known unknowns - Efficient: picks three metrics for each resource – from what may be hundreds available  Cons: - Limited to a class of issues: resource bottlenecks Wednesday, June 19, 13
  • 46. Thread State Analysis Method Wednesday, June 19, 13
  • 47. Thread State Analysis Method  1. Divide thread time into operating system states  2. Measure states for each application thread  3. Investigate largest non-idle state Wednesday, June 19, 13
  • 48. Thread State Analysis Method, cont.: 2 State  A minimum of two states: On-CPU Off-CPU Wednesday, June 19, 13
  • 49. Thread State Analysis Method, cont.: 2 State  A minimum of two states:  Simple, but off-CPU state ambiguous without further division On-CPU executing spinning on a lock Off-CPU waiting for a turn on-CPU waiting for storage or network I/O waiting for swap ins or page ins blocked on a lock idle waiting for work Wednesday, June 19, 13
  • 50. Thread State Analysis Method, cont.: 6 State  Six states, based on Unix process states: Executing Runnable Anonymous Paging Sleeping Lock Idle Wednesday, June 19, 13
  • 51. Thread State Analysis Method, cont.: 6 State  Six states, based on Unix process states:  Generic: works for all applications Executing on-CPU Runnable and waiting for a turn on CPU Anonymous Paging runnable, but blocked waiting for page ins Sleeping waiting for I/O: storage, network, and data/text page ins Lock waiting to acquire a synchronization lock Idle waiting for work Wednesday, June 19, 13
  • 52. Thread State Analysis Method, cont.  As with other methodologies, these pose questions to answer - Even if they are hard to answer  Measuring states isn’t currently easy, but can be done - Linux: /proc, schedstats, delay accounting, I/O accounting, DTrace - SmartOS: /proc, microstate accounting, DTrace  Idle state may be the most difficult: applications use different techniques to wait for work Wednesday, June 19, 13
  • 53. Thread State Analysis Method, cont.  States lead to further investigation and actionable items: Executing Profile stacks; split into usr/sys; sys = analyze syscalls Runnable Examine CPU load for entire system, and caps Anonymous Paging Check main memory free, and process memory usage Sleeping Identify resource thread is blocked on; syscall analysis Lock Lock analysis Wednesday, June 19, 13
  • 54. Thread State Analysis Method, cont.  Compare to database query time. This alone can be misleading, including: - swap time (anonymous paging) due to a memory misconfig - CPU scheduler latency due to another application  Same for any “time spent in ...” metric - is it really in ...? Wednesday, June 19, 13
  • 55. Thread State Analysis Method, cont.  Pros: - Identifies common problem sources, including from other applications - Quantifies application effects: compare times numerically - Directs further analysis and actions  Cons: - Currently difficult to measure all states Wednesday, June 19, 13
  • 56. More Methodologies  Include: - Drill Down Analysis - Latency Analysis - Event Tracing - Scientific Method - Micro Benchmarking - Baseline Statistics - Modelling  For when performance is your day job Wednesday, June 19, 13
  • 57. Stop the Guessing  The anti-methodolgies involved: - guesswork - beginning with the tools or metrics (answers)  The actual methodolgies posed questions, then sought metrics to answer them  You don’t need to guess – post-DTrace, practically everything can be known  Stop guessing and start asking questions! Wednesday, June 19, 13
  • 58. Thank You!  email: brendan@joyent.com  twitter: @brendangregg  github: https://github.com/brendangregg  blog: http://dtrace.org/blogs/brendan  blog resources: - http://dtrace.org/blogs/brendan/2008/11/10/status-dashboard - http://dtrace.org/blogs/brendan/2013/06/19/frequency-trails - http://dtrace.org/blogs/brendan/2013/05/19/revealing-hidden-latency-patterns - http://dtrace.org/blogs/brendan/2012/03/07/the-use-method-linux-performance-checklist - http://dtrace.org/blogs/brendan/2011/12/18/visualizing-device-utilization Wednesday, June 19, 13