SlideShare a Scribd company logo
1 of 27
Download to read offline
Finding the Golden Signals with Prometheus
Implementing SLOs, Histograms, and Burn Rates
Jack Neely
jjneely@42lines.net
October 20, 2020
42 Lines, Inc.
What Is Prometheus?
prometheus.io
33K Stars, 1.2K Watchers, 5.2K Forks
Cloud Native Time Series Database
1
Use Prometheus Effectively
Solve these three simple problems before your business scales:
• Methods of Common Instrumentation
• Service Level Objectives and Arbitrary Percentiles
• Effective Alerting and Burn Rates
3
Methods of Common Insturmentation
3
Goal of Site Wide Instrumentation
Work with your developers to integrate metrics into the base kit used to build a
new project. Create and encourage a metric naming scheme!
Use the Prometheus Naming Best Practices as a guide. Use standard units of
measurement.
Example PromQL: View a Graph of HTTP/API Hits by Service
> sum by (job) (
rate(http_request_duration_seconds_count[1m])
)
4
Use a Method!
The USE Method
• Brendan Gregg, 2012
• Hardware or Resource Focused
• Utilization, Saturation, Errors
The RED Method
• Tom Wilkie, 2015
• Services and APIs
• Rate, Errors, Duration
The 4 Golden Signals
• Google’s Site Reliability Engineering
Book, 2016
• RED + S
• Latency, Traffic, Errors, Saturation
5
Methods Create Service Level Indicators
These metrics become a standard set of Service Level Indicators (SLIs). Actionable
Service Level Objectives (SLOs) are constructed from SLIs.
• Latency
• Traffic
• Errors
• Saturation
Performance SLO
SuccessRate
TrafficRate is Availability Ratio for ∆t
Capacity Planing
6
An Ideal Workflow: SRE Abstraction Layer
Templating
Code Generation
GitOps
ChatOps
CLI Tools
S3
7
Building Service Level Objectives with Prometheus
8
Articulating a Service Level Objective
1. Service Level Indicator: Usually from a Method
2. Threshold
• ≤ 2 Seconds
• Boolean True
3. Goal: 95% Availability
4. Time Window: 30 Days
Examples
• The service will have 95% availability over the past 30 days.
• The service will return HTTP API results within 2 seconds or less for 95% of
requests over the trailing 30 days.
10
PromQL: Availability Ratio SLO Example
- record: job:http_requests_availability_sli_ratio:rate30d
expr: >
sum by (job) (
rate(http_requests_total{status!~"5.."}[30d])
)
/
sum by (job) (
rate(http_requests_total[30d])
)
- alert: ServiceAvailabilitySLO
expr: >
job:http_requests_availability_sli_ratio:rate30d < 0.95
labels:
severity: page
11
PromQL: Latency SLO Example
- rule job:http_request_duration_seconds_sli:rate30d_q95
expr: >
histogram_quantile(0.95, sum by (job, le) (
rate(
http_request_duration_seconds_bucket[30d]
)
))
- alert: ServiceLatencySLO
expr: >
job:http_request_duration_seconds_sli:rate30d_q95 > 2
labels:
severity: page
But wait, what’s the problem with Histograms?
12
Prometheus Histograms
12
Histograms as a Sketch for Percentile Estimation
Histogram of data$graphite_retrevial_time
Seconds
Frequency
0 1 2 3 4 5
0200040006000
Mean
q(.95)
Standard Deviation
R Calculations on Raw Data
• Gamma Distribution
• µ = 0.2902809
• σ = 0.7330352
• q(.5) = 0.0487695
• q(.95) = 1.452352
• Index of q(.95) = 9, 501
13
Histograms as a Sketch for Percentile Estimation
Histogram of data$graphite_retrevial_time
Seconds
Frequency
0 1 2 3 4 5
0200040006000
Prometheus Estimation
• sample = 9, 501
• 8th bucket
• 93rd sample of 368
144 Buckets Stopping at
t = 29
14
Histograms as a Sketch for Percentile Estimation
Histogram of data$graphite_retrevial_time
Seconds
Frequency
0 1 2 3 4 5
010002000300040005000
Human Defined Histogram
Metric
• sample = 9, 501
• 5th bucket
• 711th sample of 1187
• Error = 234%
Buckets =
{.05, .1, .5, 1, 5, 10, 50, 100}
15
Histograms as a Sketch for Percentile Estimation
Histogram of log10(data$graphite_retrevial_time)
Seconds = 10x
Frequency
−4 −3 −2 −1 0 1 2
0100200300400500600700
Mean
q(.95)
Log-Linear Bucketing
Algorithm
16
Working Around Prometheus’s Limitations
Make the SLO a parameter to the application and export it!
http_request_slo_seconds 2
Use these 3 Counters:
http_requests_total{handler="foo",client="iOS"} 42
http_requests_errors_total{handler="foo",client="iOS"} 3
http_requests_over_slo_total{handler="foo",client="iOS"} 1
SLO alert for 95% of events processed under 2 seconds for last 30 days:
1 - rate(http_requests_over_slo_total[30d]) /
rate(http_requests_total[30d])
< 0.95
17
Alerting on SLOs
17
Alerting Terminology
Reset Time: The time between fixing an incident and the monitoring system
resolving the alert. Did the actions really fix the problem?
Detection Time: The time between the start of an incident and generating an
alert.
Precision: The proportion of alerts for a significant event compared to false
positives. Low traffic conditions can cause a loss of precision.
Recall: The proportion of significant events that resulted in alerts.
18
Error Budgets
Definition
ErrorBudget = 1 − SLO
Example
0.05 = 1 − 0.95
Example in Percentages
5% = 100% − 95%
5% of requests in the 30 day time window can respond with durations longer than
2 seconds.
19
Calculating Error Burn Rates
Definition
BurnRate =
ErrorRatio∆t
ErrorBudget
Ratio of Error Budget Consumed in ∆t
BudgetConsumedRatio∆t =
t
SLO TimeWindow
× BurnRate
Solve for BurnRate to create an alert that fires if 2% of the Error Budget is
consumed in the last hour.
20
PromQL: Burn Rate Example
- rule: job:http_requests_over_slo_ratio:rate1h
expr: >
sum by (job) (
rate(http_requests_over_slo_total[1h])
) /
sum by (job) (
rate(http_requests_total[1h])
)
- alert: BurnRateHigh
expr: >
job:http_requests_over_slo_ratio:rate1h > 14.4 * 0.05
labels:
severity: page
annotations:
summary: 2% of Error Budget consumed in the last hour
21
Conclusions
21
Takeway Points
Build a Schema and Kit as guard rails against garbage in garbage out patterns.
This produces a rule library.
SLO based measuring has many positive and powerful aspects. As they are based
on percentiles we need to get the math right to base good business decisions on
good data. Prometheus presents us with challenges here.
SLOs are actually reports. More like period progress updates than real time
alerting. Also, alerting should fire before we violate our SLO goals.
22
References and Links
Google’s Site Reliability Engineering Book, 2016
Google’s Site Reliability Workbook, 2018
42 Lines Consulting, Jack Neely, jjneely@42lines.net
Monarch: Google’s Planet-Scale In-Memory Time Series Database, August 2020
23
Thank You!
devops@42lines.net
Practical Operations Podcast: operations.fm
23

More Related Content

Similar to Finding the Golden Signals with Prometheus

DiscoveredByte - Java Performance Monitoring, Tuning and Optimization - Key P...
DiscoveredByte - Java Performance Monitoring, Tuning and Optimization - Key P...DiscoveredByte - Java Performance Monitoring, Tuning and Optimization - Key P...
DiscoveredByte - Java Performance Monitoring, Tuning and Optimization - Key P...DiscoveredByte
 
Performance hosting with Ninefold for Spree Apps and Stores
Performance hosting with Ninefold for Spree Apps and StoresPerformance hosting with Ninefold for Spree Apps and Stores
Performance hosting with Ninefold for Spree Apps and StoresAndrew Sharpe
 
PhD_defense_presentation_Oct2013
PhD_defense_presentation_Oct2013PhD_defense_presentation_Oct2013
PhD_defense_presentation_Oct2013Selvi Kadirvel
 
"Surviving highload with Node.js", Andrii Shumada
"Surviving highload with Node.js", Andrii Shumada "Surviving highload with Node.js", Andrii Shumada
"Surviving highload with Node.js", Andrii Shumada Fwdays
 
Fuzzy Control meets Software Engineering
Fuzzy Control meets Software EngineeringFuzzy Control meets Software Engineering
Fuzzy Control meets Software EngineeringPooyan Jamshidi
 
Prometheus Everything, Observing Kubernetes in the Cloud
Prometheus Everything, Observing Kubernetes in the CloudPrometheus Everything, Observing Kubernetes in the Cloud
Prometheus Everything, Observing Kubernetes in the CloudSneha Inguva
 
Droolsand Rule Based Systems 2008 Srping
Droolsand Rule Based Systems 2008 SrpingDroolsand Rule Based Systems 2008 Srping
Droolsand Rule Based Systems 2008 SrpingSrinath Perera
 
Ruby3x3: How are we going to measure 3x
Ruby3x3: How are we going to measure 3xRuby3x3: How are we going to measure 3x
Ruby3x3: How are we going to measure 3xMatthew Gaudet
 
Improved implementation of a Deadline Monotonic algorithm for aperiodic traff...
Improved implementation of a Deadline Monotonic algorithm for aperiodic traff...Improved implementation of a Deadline Monotonic algorithm for aperiodic traff...
Improved implementation of a Deadline Monotonic algorithm for aperiodic traff...Andrea Tino
 
Deadline Miss Detection with SCHED_DEADLINE
Deadline Miss Detection with SCHED_DEADLINEDeadline Miss Detection with SCHED_DEADLINE
Deadline Miss Detection with SCHED_DEADLINEYoshitake Kobayashi
 
How to reduce expenses on monitoring
How to reduce expenses on monitoringHow to reduce expenses on monitoring
How to reduce expenses on monitoringRomanKhavronenko
 
Keptn - Automated Operations & Continuous Delivery for k8s
Keptn - Automated Operations & Continuous Delivery for k8sKeptn - Automated Operations & Continuous Delivery for k8s
Keptn - Automated Operations & Continuous Delivery for k8sAndreas Grabner
 
SFScon 22 - Andrea Janes - Scalability assessment applied to microservice arc...
SFScon 22 - Andrea Janes - Scalability assessment applied to microservice arc...SFScon 22 - Andrea Janes - Scalability assessment applied to microservice arc...
SFScon 22 - Andrea Janes - Scalability assessment applied to microservice arc...South Tyrol Free Software Conference
 
stackconf 2023 | How to reduce expenses on monitoring with VictoriaMetrics by...
stackconf 2023 | How to reduce expenses on monitoring with VictoriaMetrics by...stackconf 2023 | How to reduce expenses on monitoring with VictoriaMetrics by...
stackconf 2023 | How to reduce expenses on monitoring with VictoriaMetrics by...NETWAYS
 
Performance hosting on Ninefold for Spree Stores and Apps
Performance hosting on Ninefold for Spree Stores and AppsPerformance hosting on Ninefold for Spree Stores and Apps
Performance hosting on Ninefold for Spree Stores and AppsNinefold
 
PID Tuning for Near Integrating Processes - Greg McMillan Deminar
PID Tuning for Near Integrating Processes - Greg McMillan DeminarPID Tuning for Near Integrating Processes - Greg McMillan Deminar
PID Tuning for Near Integrating Processes - Greg McMillan DeminarJim Cahill
 
A Guide to Event-Driven SRE-inspired DevOps
A Guide to Event-Driven SRE-inspired DevOpsA Guide to Event-Driven SRE-inspired DevOps
A Guide to Event-Driven SRE-inspired DevOpsAndreas Grabner
 

Similar to Finding the Golden Signals with Prometheus (20)

DiscoveredByte - Java Performance Monitoring, Tuning and Optimization - Key P...
DiscoveredByte - Java Performance Monitoring, Tuning and Optimization - Key P...DiscoveredByte - Java Performance Monitoring, Tuning and Optimization - Key P...
DiscoveredByte - Java Performance Monitoring, Tuning and Optimization - Key P...
 
Performance hosting with Ninefold for Spree Apps and Stores
Performance hosting with Ninefold for Spree Apps and StoresPerformance hosting with Ninefold for Spree Apps and Stores
Performance hosting with Ninefold for Spree Apps and Stores
 
PhD_defense_presentation_Oct2013
PhD_defense_presentation_Oct2013PhD_defense_presentation_Oct2013
PhD_defense_presentation_Oct2013
 
Manual estimation approach for Pre-sale phase of a project
Manual estimation approach for Pre-sale phase of a projectManual estimation approach for Pre-sale phase of a project
Manual estimation approach for Pre-sale phase of a project
 
"Surviving highload with Node.js", Andrii Shumada
"Surviving highload with Node.js", Andrii Shumada "Surviving highload with Node.js", Andrii Shumada
"Surviving highload with Node.js", Andrii Shumada
 
Fuzzy Control meets Software Engineering
Fuzzy Control meets Software EngineeringFuzzy Control meets Software Engineering
Fuzzy Control meets Software Engineering
 
Prometheus Everything, Observing Kubernetes in the Cloud
Prometheus Everything, Observing Kubernetes in the CloudPrometheus Everything, Observing Kubernetes in the Cloud
Prometheus Everything, Observing Kubernetes in the Cloud
 
Droolsand Rule Based Systems 2008 Srping
Droolsand Rule Based Systems 2008 SrpingDroolsand Rule Based Systems 2008 Srping
Droolsand Rule Based Systems 2008 Srping
 
Ruby3x3: How are we going to measure 3x
Ruby3x3: How are we going to measure 3xRuby3x3: How are we going to measure 3x
Ruby3x3: How are we going to measure 3x
 
Improved implementation of a Deadline Monotonic algorithm for aperiodic traff...
Improved implementation of a Deadline Monotonic algorithm for aperiodic traff...Improved implementation of a Deadline Monotonic algorithm for aperiodic traff...
Improved implementation of a Deadline Monotonic algorithm for aperiodic traff...
 
6 sigma LTE release management process improvement
6 sigma LTE release management process improvement6 sigma LTE release management process improvement
6 sigma LTE release management process improvement
 
Deadline Miss Detection with SCHED_DEADLINE
Deadline Miss Detection with SCHED_DEADLINEDeadline Miss Detection with SCHED_DEADLINE
Deadline Miss Detection with SCHED_DEADLINE
 
How to reduce expenses on monitoring
How to reduce expenses on monitoringHow to reduce expenses on monitoring
How to reduce expenses on monitoring
 
Keptn - Automated Operations & Continuous Delivery for k8s
Keptn - Automated Operations & Continuous Delivery for k8sKeptn - Automated Operations & Continuous Delivery for k8s
Keptn - Automated Operations & Continuous Delivery for k8s
 
SFScon 22 - Andrea Janes - Scalability assessment applied to microservice arc...
SFScon 22 - Andrea Janes - Scalability assessment applied to microservice arc...SFScon 22 - Andrea Janes - Scalability assessment applied to microservice arc...
SFScon 22 - Andrea Janes - Scalability assessment applied to microservice arc...
 
stackconf 2023 | How to reduce expenses on monitoring with VictoriaMetrics by...
stackconf 2023 | How to reduce expenses on monitoring with VictoriaMetrics by...stackconf 2023 | How to reduce expenses on monitoring with VictoriaMetrics by...
stackconf 2023 | How to reduce expenses on monitoring with VictoriaMetrics by...
 
Performance hosting on Ninefold for Spree Stores and Apps
Performance hosting on Ninefold for Spree Stores and AppsPerformance hosting on Ninefold for Spree Stores and Apps
Performance hosting on Ninefold for Spree Stores and Apps
 
PID Tuning for Near Integrating Processes - Greg McMillan Deminar
PID Tuning for Near Integrating Processes - Greg McMillan DeminarPID Tuning for Near Integrating Processes - Greg McMillan Deminar
PID Tuning for Near Integrating Processes - Greg McMillan Deminar
 
A Guide to Event-Driven SRE-inspired DevOps
A Guide to Event-Driven SRE-inspired DevOpsA Guide to Event-Driven SRE-inspired DevOps
A Guide to Event-Driven SRE-inspired DevOps
 
Fundamentals Performance Testing
Fundamentals Performance TestingFundamentals Performance Testing
Fundamentals Performance Testing
 

More from All Things Open

Building Reliability - The Realities of Observability
Building Reliability - The Realities of ObservabilityBuilding Reliability - The Realities of Observability
Building Reliability - The Realities of ObservabilityAll Things Open
 
Modern Database Best Practices
Modern Database Best PracticesModern Database Best Practices
Modern Database Best PracticesAll Things Open
 
Open Source and Public Policy
Open Source and Public PolicyOpen Source and Public Policy
Open Source and Public PolicyAll Things Open
 
Weaving Microservices into a Unified GraphQL Schema with graph-quilt - Ashpak...
Weaving Microservices into a Unified GraphQL Schema with graph-quilt - Ashpak...Weaving Microservices into a Unified GraphQL Schema with graph-quilt - Ashpak...
Weaving Microservices into a Unified GraphQL Schema with graph-quilt - Ashpak...All Things Open
 
The State of Passwordless Auth on the Web - Phil Nash
The State of Passwordless Auth on the Web - Phil NashThe State of Passwordless Auth on the Web - Phil Nash
The State of Passwordless Auth on the Web - Phil NashAll Things Open
 
Total ReDoS: The dangers of regex in JavaScript
Total ReDoS: The dangers of regex in JavaScriptTotal ReDoS: The dangers of regex in JavaScript
Total ReDoS: The dangers of regex in JavaScriptAll Things Open
 
What Does Real World Mass Adoption of Decentralized Tech Look Like?
What Does Real World Mass Adoption of Decentralized Tech Look Like?What Does Real World Mass Adoption of Decentralized Tech Look Like?
What Does Real World Mass Adoption of Decentralized Tech Look Like?All Things Open
 
How to Write & Deploy a Smart Contract
How to Write & Deploy a Smart ContractHow to Write & Deploy a Smart Contract
How to Write & Deploy a Smart ContractAll Things Open
 
Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlow
 Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlow Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlow
Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlowAll Things Open
 
DEI Challenges and Success
DEI Challenges and SuccessDEI Challenges and Success
DEI Challenges and SuccessAll Things Open
 
Scaling Web Applications with Background
Scaling Web Applications with BackgroundScaling Web Applications with Background
Scaling Web Applications with BackgroundAll Things Open
 
Supercharging tutorials with WebAssembly
Supercharging tutorials with WebAssemblySupercharging tutorials with WebAssembly
Supercharging tutorials with WebAssemblyAll Things Open
 
Using SQL to Find Needles in Haystacks
Using SQL to Find Needles in HaystacksUsing SQL to Find Needles in Haystacks
Using SQL to Find Needles in HaystacksAll Things Open
 
Configuration Security as a Game of Pursuit Intercept
Configuration Security as a Game of Pursuit InterceptConfiguration Security as a Game of Pursuit Intercept
Configuration Security as a Game of Pursuit InterceptAll Things Open
 
Scaling an Open Source Sponsorship Program
Scaling an Open Source Sponsorship ProgramScaling an Open Source Sponsorship Program
Scaling an Open Source Sponsorship ProgramAll Things Open
 
Build Developer Experience Teams for Open Source
Build Developer Experience Teams for Open SourceBuild Developer Experience Teams for Open Source
Build Developer Experience Teams for Open SourceAll Things Open
 
Deploying Models at Scale with Apache Beam
Deploying Models at Scale with Apache BeamDeploying Models at Scale with Apache Beam
Deploying Models at Scale with Apache BeamAll Things Open
 
Sudo – Giving access while staying in control
Sudo – Giving access while staying in controlSudo – Giving access while staying in control
Sudo – Giving access while staying in controlAll Things Open
 
Fortifying the Future: Tackling Security Challenges in AI/ML Applications
Fortifying the Future: Tackling Security Challenges in AI/ML ApplicationsFortifying the Future: Tackling Security Challenges in AI/ML Applications
Fortifying the Future: Tackling Security Challenges in AI/ML ApplicationsAll Things Open
 
Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...
Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...
Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...All Things Open
 

More from All Things Open (20)

Building Reliability - The Realities of Observability
Building Reliability - The Realities of ObservabilityBuilding Reliability - The Realities of Observability
Building Reliability - The Realities of Observability
 
Modern Database Best Practices
Modern Database Best PracticesModern Database Best Practices
Modern Database Best Practices
 
Open Source and Public Policy
Open Source and Public PolicyOpen Source and Public Policy
Open Source and Public Policy
 
Weaving Microservices into a Unified GraphQL Schema with graph-quilt - Ashpak...
Weaving Microservices into a Unified GraphQL Schema with graph-quilt - Ashpak...Weaving Microservices into a Unified GraphQL Schema with graph-quilt - Ashpak...
Weaving Microservices into a Unified GraphQL Schema with graph-quilt - Ashpak...
 
The State of Passwordless Auth on the Web - Phil Nash
The State of Passwordless Auth on the Web - Phil NashThe State of Passwordless Auth on the Web - Phil Nash
The State of Passwordless Auth on the Web - Phil Nash
 
Total ReDoS: The dangers of regex in JavaScript
Total ReDoS: The dangers of regex in JavaScriptTotal ReDoS: The dangers of regex in JavaScript
Total ReDoS: The dangers of regex in JavaScript
 
What Does Real World Mass Adoption of Decentralized Tech Look Like?
What Does Real World Mass Adoption of Decentralized Tech Look Like?What Does Real World Mass Adoption of Decentralized Tech Look Like?
What Does Real World Mass Adoption of Decentralized Tech Look Like?
 
How to Write & Deploy a Smart Contract
How to Write & Deploy a Smart ContractHow to Write & Deploy a Smart Contract
How to Write & Deploy a Smart Contract
 
Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlow
 Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlow Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlow
Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlow
 
DEI Challenges and Success
DEI Challenges and SuccessDEI Challenges and Success
DEI Challenges and Success
 
Scaling Web Applications with Background
Scaling Web Applications with BackgroundScaling Web Applications with Background
Scaling Web Applications with Background
 
Supercharging tutorials with WebAssembly
Supercharging tutorials with WebAssemblySupercharging tutorials with WebAssembly
Supercharging tutorials with WebAssembly
 
Using SQL to Find Needles in Haystacks
Using SQL to Find Needles in HaystacksUsing SQL to Find Needles in Haystacks
Using SQL to Find Needles in Haystacks
 
Configuration Security as a Game of Pursuit Intercept
Configuration Security as a Game of Pursuit InterceptConfiguration Security as a Game of Pursuit Intercept
Configuration Security as a Game of Pursuit Intercept
 
Scaling an Open Source Sponsorship Program
Scaling an Open Source Sponsorship ProgramScaling an Open Source Sponsorship Program
Scaling an Open Source Sponsorship Program
 
Build Developer Experience Teams for Open Source
Build Developer Experience Teams for Open SourceBuild Developer Experience Teams for Open Source
Build Developer Experience Teams for Open Source
 
Deploying Models at Scale with Apache Beam
Deploying Models at Scale with Apache BeamDeploying Models at Scale with Apache Beam
Deploying Models at Scale with Apache Beam
 
Sudo – Giving access while staying in control
Sudo – Giving access while staying in controlSudo – Giving access while staying in control
Sudo – Giving access while staying in control
 
Fortifying the Future: Tackling Security Challenges in AI/ML Applications
Fortifying the Future: Tackling Security Challenges in AI/ML ApplicationsFortifying the Future: Tackling Security Challenges in AI/ML Applications
Fortifying the Future: Tackling Security Challenges in AI/ML Applications
 
Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...
Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...
Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...
 

Recently uploaded

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 

Finding the Golden Signals with Prometheus

  • 1. Finding the Golden Signals with Prometheus Implementing SLOs, Histograms, and Burn Rates Jack Neely jjneely@42lines.net October 20, 2020 42 Lines, Inc.
  • 2. What Is Prometheus? prometheus.io 33K Stars, 1.2K Watchers, 5.2K Forks Cloud Native Time Series Database 1
  • 3. Use Prometheus Effectively Solve these three simple problems before your business scales: • Methods of Common Instrumentation • Service Level Objectives and Arbitrary Percentiles • Effective Alerting and Burn Rates 3
  • 4. Methods of Common Insturmentation 3
  • 5. Goal of Site Wide Instrumentation Work with your developers to integrate metrics into the base kit used to build a new project. Create and encourage a metric naming scheme! Use the Prometheus Naming Best Practices as a guide. Use standard units of measurement. Example PromQL: View a Graph of HTTP/API Hits by Service > sum by (job) ( rate(http_request_duration_seconds_count[1m]) ) 4
  • 6. Use a Method! The USE Method • Brendan Gregg, 2012 • Hardware or Resource Focused • Utilization, Saturation, Errors The RED Method • Tom Wilkie, 2015 • Services and APIs • Rate, Errors, Duration The 4 Golden Signals • Google’s Site Reliability Engineering Book, 2016 • RED + S • Latency, Traffic, Errors, Saturation 5
  • 7. Methods Create Service Level Indicators These metrics become a standard set of Service Level Indicators (SLIs). Actionable Service Level Objectives (SLOs) are constructed from SLIs. • Latency • Traffic • Errors • Saturation Performance SLO SuccessRate TrafficRate is Availability Ratio for ∆t Capacity Planing 6
  • 8. An Ideal Workflow: SRE Abstraction Layer Templating Code Generation GitOps ChatOps CLI Tools S3 7
  • 9. Building Service Level Objectives with Prometheus 8
  • 10. Articulating a Service Level Objective 1. Service Level Indicator: Usually from a Method 2. Threshold • ≤ 2 Seconds • Boolean True 3. Goal: 95% Availability 4. Time Window: 30 Days Examples • The service will have 95% availability over the past 30 days. • The service will return HTTP API results within 2 seconds or less for 95% of requests over the trailing 30 days. 10
  • 11. PromQL: Availability Ratio SLO Example - record: job:http_requests_availability_sli_ratio:rate30d expr: > sum by (job) ( rate(http_requests_total{status!~"5.."}[30d]) ) / sum by (job) ( rate(http_requests_total[30d]) ) - alert: ServiceAvailabilitySLO expr: > job:http_requests_availability_sli_ratio:rate30d < 0.95 labels: severity: page 11
  • 12. PromQL: Latency SLO Example - rule job:http_request_duration_seconds_sli:rate30d_q95 expr: > histogram_quantile(0.95, sum by (job, le) ( rate( http_request_duration_seconds_bucket[30d] ) )) - alert: ServiceLatencySLO expr: > job:http_request_duration_seconds_sli:rate30d_q95 > 2 labels: severity: page But wait, what’s the problem with Histograms? 12
  • 14. Histograms as a Sketch for Percentile Estimation Histogram of data$graphite_retrevial_time Seconds Frequency 0 1 2 3 4 5 0200040006000 Mean q(.95) Standard Deviation R Calculations on Raw Data • Gamma Distribution • µ = 0.2902809 • σ = 0.7330352 • q(.5) = 0.0487695 • q(.95) = 1.452352 • Index of q(.95) = 9, 501 13
  • 15. Histograms as a Sketch for Percentile Estimation Histogram of data$graphite_retrevial_time Seconds Frequency 0 1 2 3 4 5 0200040006000 Prometheus Estimation • sample = 9, 501 • 8th bucket • 93rd sample of 368 144 Buckets Stopping at t = 29 14
  • 16. Histograms as a Sketch for Percentile Estimation Histogram of data$graphite_retrevial_time Seconds Frequency 0 1 2 3 4 5 010002000300040005000 Human Defined Histogram Metric • sample = 9, 501 • 5th bucket • 711th sample of 1187 • Error = 234% Buckets = {.05, .1, .5, 1, 5, 10, 50, 100} 15
  • 17. Histograms as a Sketch for Percentile Estimation Histogram of log10(data$graphite_retrevial_time) Seconds = 10x Frequency −4 −3 −2 −1 0 1 2 0100200300400500600700 Mean q(.95) Log-Linear Bucketing Algorithm 16
  • 18. Working Around Prometheus’s Limitations Make the SLO a parameter to the application and export it! http_request_slo_seconds 2 Use these 3 Counters: http_requests_total{handler="foo",client="iOS"} 42 http_requests_errors_total{handler="foo",client="iOS"} 3 http_requests_over_slo_total{handler="foo",client="iOS"} 1 SLO alert for 95% of events processed under 2 seconds for last 30 days: 1 - rate(http_requests_over_slo_total[30d]) / rate(http_requests_total[30d]) < 0.95 17
  • 20. Alerting Terminology Reset Time: The time between fixing an incident and the monitoring system resolving the alert. Did the actions really fix the problem? Detection Time: The time between the start of an incident and generating an alert. Precision: The proportion of alerts for a significant event compared to false positives. Low traffic conditions can cause a loss of precision. Recall: The proportion of significant events that resulted in alerts. 18
  • 21. Error Budgets Definition ErrorBudget = 1 − SLO Example 0.05 = 1 − 0.95 Example in Percentages 5% = 100% − 95% 5% of requests in the 30 day time window can respond with durations longer than 2 seconds. 19
  • 22. Calculating Error Burn Rates Definition BurnRate = ErrorRatio∆t ErrorBudget Ratio of Error Budget Consumed in ∆t BudgetConsumedRatio∆t = t SLO TimeWindow × BurnRate Solve for BurnRate to create an alert that fires if 2% of the Error Budget is consumed in the last hour. 20
  • 23. PromQL: Burn Rate Example - rule: job:http_requests_over_slo_ratio:rate1h expr: > sum by (job) ( rate(http_requests_over_slo_total[1h]) ) / sum by (job) ( rate(http_requests_total[1h]) ) - alert: BurnRateHigh expr: > job:http_requests_over_slo_ratio:rate1h > 14.4 * 0.05 labels: severity: page annotations: summary: 2% of Error Budget consumed in the last hour 21
  • 25. Takeway Points Build a Schema and Kit as guard rails against garbage in garbage out patterns. This produces a rule library. SLO based measuring has many positive and powerful aspects. As they are based on percentiles we need to get the math right to base good business decisions on good data. Prometheus presents us with challenges here. SLOs are actually reports. More like period progress updates than real time alerting. Also, alerting should fire before we violate our SLO goals. 22
  • 26. References and Links Google’s Site Reliability Engineering Book, 2016 Google’s Site Reliability Workbook, 2018 42 Lines Consulting, Jack Neely, jjneely@42lines.net Monarch: Google’s Planet-Scale In-Memory Time Series Database, August 2020 23