Brian Brazil
Founder
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)
Who am I?
Engineer passionate about running software reliably in production.
Core developer of Prometheus
Studied Computer Science in Trinity College Dublin.
Google SRE for 7 years, working on high-scale reliable systems.
Contributor to many open source projects, including Prometheus, Ansible,
Python, Aurora and Zookeeper.
Founder of Robust Perception, provider of commercial support and consulting
What is Prometheus?
Prometheus is a metrics-based time series database, designed for whitebox
monitoring.
It supports labels (dimensions/tags).
Alerting and graphing are unified, using the same language.
Architecture
Counting
Counting is easy right?
Just 1, 2, 3?
Counting… to the Extreme!!!
What if we're counting monitoring-related events for a metrics system, though?
We're usually sampling data over the network => potential data loss.
What happens with the data we transfer at the other end?
Counters and Gauges
There are two base metric types.
Gauges are a snapshot of current state, such as memory usage or temperature.
They can go up and down.
Counters are the other base type.
To explain them, we need to go on a "small" detour.
Events are key
Events are the thing we want to track with Counters.
An event might be an HTTP request hitting our server, a function call being made or
an error being returned.
An event logging system would record each event individually.
A metrics-based system like Prometheus or Graphite has events aggregated
across time before they get to the TSDB. Therein lies the rub.
Approach #1: Resetting Count
There are a few common approaches to providing this aggregate over time.
The first and simplest is the resetting count.
You start at 0, and every time there's an event you increment the count.
On a regular interval you transfer the current count, and reset to 0.
Counter Reset, Normal Operation
Approach #1: Resetting Count Problems
If a transfer fails, you've lost that data.
Can't presume this effect will be random and unbiased, e.g. if a big spike in traffic
also saturates network links used for monitoring.
Doesn't work if you want to transfer data to more than one place for redundancy.
Each would get 1/n of the data.
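A minimal sketch of the resetting count and its failure mode, assuming made-up event numbers. The names ResettingCount and simulate are hypothetical and exist only for illustration.

```python
class ResettingCount:
    """Approach #1: count events, hand over the total, then reset to 0."""

    def __init__(self):
        self.count = 0

    def inc(self):
        self.count += 1

    def transfer(self):
        # Return the count for this interval and start again from 0.
        value, self.count = self.count, 0
        return value


def simulate(events_per_interval, failed_intervals):
    """Return the total the receiver sees when some transfers are dropped."""
    counter = ResettingCount()
    received = 0
    for i, events in enumerate(events_per_interval):
        for _ in range(events):
            counter.inc()
        value = counter.transfer()  # the client has already reset to 0
        if i not in failed_intervals:
            received += value  # a dropped transfer never reaches the receiver
    return received


events = [5, 7, 100, 6]  # illustrative: the third interval is a traffic spike
print(simulate(events, failed_intervals=set()))  # every transfer arrives
print(simulate(events, failed_intervals={2}))    # the spike's transfer is lost
```

Because the client resets before it knows whether the transfer succeeded, the receiver cannot recover the spike afterwards.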
Counter Reset, Failed Transfer
Approach #2: Exponential moving average
A number of instrumentation libraries offer this, such as Dropwizard's Meter.
Basically the same way Unix load averages work:
result(t) = (1 - w) * result(t-1) + (w) * events_this_period
Where t is the tick, and w is a weighting factor.
The weighting factor determines how quickly new data is incorporated.
Dropwizard evaluates the above every 5 seconds.
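The update rule above can be sketched directly. The weighting factor w = 0.5 used here is an arbitrary value for illustration, not the weighting Dropwizard uses, and ema_update and ema_series are hypothetical helper names.

```python
def ema_update(previous, events_this_period, w):
    """One tick: result(t) = (1 - w) * result(t-1) + w * events_this_period."""
    return (1 - w) * previous + w * events_this_period


def ema_series(events_per_tick, w, initial=0.0):
    """Apply the update once per tick and return the result after each tick."""
    result = initial
    history = []
    for events in events_per_tick:
        result = ema_update(result, events, w)
        history.append(result)
    return history


# A steady 10 events per tick: the average approaches 10 but never jumps there.
print(ema_series([10, 10, 10], w=0.5))
```

A larger w makes new data count for more on each tick, so the average reacts faster and is less smooth.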
Exponential Moving Average, Normal Operation
Approach #2: Exponential moving average Problems
Events aren't uniformly considered. If you're transferring data every 10s, then the
most recent 5s matter more.
Thus reconstructing what happened is hard for debugging, unless you get every
5s update.
You're bound to the 1m, 5m and 15m weightings that the implementation has
chosen.
Also means that it's not particularly resilient to missing a scrape.
Aside: Graphite's summarize() function
Summarize() returns events during e.g. the last hour.
Some have a belief that summarize is accurate. It isn't.
The problem is that with, say, 15m granularity, a data point at 13:02 will include data
from the 13 minutes before the 13:00-14:00 window, and similarly at the end.
If you want this accurately, need to use logs.
No metrics based system can report this accurately in the general case.
Graphite's summarize() and non-aligned data
Aliasing
Depending on the exact time offsets between the process start, metric
initialisation, data transfers and when the user makes a query you can get
different results.
A second in either direction can make a big difference to the patterns you see in
your graphs.
This is an expected signal processing effect, be aware of it.
Expressive Power
Both previous solutions are reasonable if the monitoring system is a fairly dumb
data store, often with little math beyond addition (if even that).
Losing data or having no redundancy are better than having nothing at all.
What if you have the option for your monitoring system to do math?
What if you control both ends?
Approach #3: Prometheus Counters
Like Approach #1, we have a counter that starts at 0 and increments at each event.
This is transferred regularly to Prometheus.
It's not reset on every transfer though, keeps on increasing.
Rate() function in Prometheus takes this in, and calculates how quickly it
increased over the given time period.
Prometheus rate(), Normal Operation
Approach #3: Prometheus Counters
Resilient to failed transfers (lose resolution, not data)
Can handle multiple places to transfer to
Can choose the time period you want to calculate over in monitoring system, thus
choose your level of smoothing e.g. rate(my_metric[5m]) or rate(my_metric[10m])
Uniform consideration of data
Easy to implement on client side
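A rough sketch of "lose resolution, not data", assuming made-up scrape values. The helper simple_rate is hypothetical and deliberately ignores the counter resets and extrapolation that the real rate() handles.

```python
def simple_rate(samples):
    """Per-second increase between the first and last (timestamp, value) sample.

    Simplification: no counter reset handling and no extrapolation.
    """
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


# Cumulative counter values scraped every 10 seconds (illustrative numbers).
all_scrapes = [(0, 0), (10, 50), (20, 150), (30, 160)]
# The same series with the scrape at t=20 missing.
missed_one = [s for s in all_scrapes if s[0] != 20]

print(simple_rate(all_scrapes))
print(simple_rate(missed_one))
```

Both calls give the same answer over the whole window because the counter keeps its running total. The missed scrape only hides when within the window the spike happened.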
Prometheus rate(), Failed Transfer
Prometheus Counters: Rate()
There are many details to getting the rate() function right.
Processes might restart
Scrapes might be missed
Time periods rarely align perfectly
Time series stop and start
Prometheus Counters: Resets
Counters can be reset to 0, such as if a process restarts or a network switch gets
rebooted.
If we see the value decreasing, then that's a counter reset so presume it went to 0.
So seeing 0 -> 10 -> 5 is 10 + 5 = 15 of an increase.
Graphite/InfluxDB's nonNegativeDerivative() function would ignore the drop, report
based on just the increase 10.
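The two treatments of a drop can be compared side by side. This is a sketch of the behaviour described above, not the code of either system, and both function names are hypothetical.

```python
def increase_with_resets(values):
    """Total increase, treating any decrease as a reset to 0."""
    total = 0
    previous = values[0]
    for value in values[1:]:
        if value < previous:
            # Counter reset: presume it restarted from 0, so all of
            # `value` is new increase.
            total += value
        else:
            total += value - previous
        previous = value
    return total


def non_negative_increase(values):
    """Total increase, simply ignoring any step where the value drops."""
    total = 0
    for previous, value in zip(values, values[1:]):
        if value >= previous:
            total += value - previous
    return total


series = [0, 10, 5]
print(increase_with_resets(series))   # 10 + 5
print(non_negative_increase(series))  # only the 0 -> 10 step
```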
Prometheus Counters: Missed scrapes
If we miss a scrape in the middle of a time period, no problem as we still have the
data points around it.
Little more complicated around the edges of the time period we're looking at
though.
Prometheus Counters: Alignment
It is rare that the user will request data exactly on the same cycle as the scrapes.
Especially when you're monitoring multiple servers with staggered scrapes.
Or given that timestamps have millisecond resolution, while the endpoint that graphs use
accepts only second-granularity input.
Thus we need to extrapolate out to the end of the rate()'s range.
increase() can return non-integers on integral data
This is why one of the more surprising behaviours of increase() happens.
So if we have data which is:
t= 1 10
t= 6 12
t=11 13
Request an increase() from t=0 for 15s, and you'll get an increase of 3 over 10s.
Extrapolating over the 15s, that's a result of 4.5.
This is the correct result on average. If you want exact answers, use logs.
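The arithmetic of this example, written out. scale_to_range is a hypothetical helper that only does the scaling step and assumes the extrapolation reaches both edges of the range.

```python
def scale_to_range(samples, range_seconds):
    """Scale the observed increase up to the full requested range."""
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    observed_increase = v_last - v_first  # 13 - 10 = 3
    covered_seconds = t_last - t_first    # 11 - 1 = 10
    return observed_increase * range_seconds / covered_seconds


samples = [(1, 10), (6, 12), (11, 13)]
print(scale_to_range(samples, range_seconds=15))  # 3 * 15 / 10
```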
Non-integral increase due to extrapolation
Prometheus Counters: Time series lifecycle
Time series are created and destroyed. If we always extrapolated out to the edge
of the rate() range we'd get much bigger results than we should.
So we detect that. We calculate the average interval between data points.
If the first/last data point is within 110% of the average interval of the start/end of
the range, then we extrapolate to the start/end. Allows for failed scrapes.
Otherwise we extrapolate 50% of the average interval.
We also know counters can't go negative, so don't extrapolate before the point
they'd be 0 at.
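A sketch of the extrapolation rules described above, assuming a counter with no resets inside the range and at least two samples. The function and variable names are hypothetical. For the real thing, see the functions.go source linked under Resources.

```python
def increase_with_lifecycle(samples, range_start, range_end):
    """Extrapolated increase over [range_start, range_end].

    samples: list of (timestamp_seconds, value), oldest first.
    Assumes no counter resets between the samples.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    t_first, v_first = samples[0]
    t_last, v_last = samples[-1]

    increase = v_last - v_first
    sampled = t_last - t_first
    average_interval = sampled / (len(samples) - 1)

    to_start = t_first - range_start
    to_end = range_end - t_last

    # Counters can't go negative: don't extrapolate back past the point
    # where the counter would have been 0.
    if increase > 0 and v_first >= 0:
        to_zero = sampled * (v_first / increase)
        to_start = min(to_start, to_zero)

    threshold = average_interval * 1.1  # the 110% rule
    extrapolate_to = sampled
    # Close enough to the edge: extrapolate all the way to it.
    # Otherwise the series probably started/stopped: add half an interval.
    extrapolate_to += to_start if to_start < threshold else average_interval / 2
    extrapolate_to += to_end if to_end < threshold else average_interval / 2

    return increase * extrapolate_to / sampled


# Samples close to both edges of a 15s range: extrapolated to the full range.
print(increase_with_lifecycle([(1, 10), (6, 12), (11, 13)], 0, 15))
# Series that only appears late in the range: half an interval at the start.
print(increase_with_lifecycle([(100, 50), (105, 55), (110, 60)], 0, 112))
```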
rate() extrapolation
Problem: Timeseries not always existing
The previous logic handles all the edge cases around counter resets, process
restarts and rolling restarts, on average.
What if a counter appears with the value 1 though long after the process has
started and doesn't increase again?
No increase in the history, so rate() doesn't see it. Can't tell when the increase
happened. Prometheus is designed to be available, not catch 100% of events.
Solution: Logs, or make sure all your counters are being initialised on process
start so it goes 0->1. Will only miss it prior to the first scrape then.
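Why initialising to 0 matters, as a small sketch. observed_increase is a hypothetical helper that sums the steps between scraped values and assumes no resets.

```python
def observed_increase(values):
    """Sum of the steps between consecutive scraped values (no resets)."""
    return sum(later - earlier for earlier, later in zip(values, values[1:]))


# Counter first appears already at 1: there is no step in the history.
appears_at_one = [1, 1, 1]
# Counter exported as 0 from process start: the 0 -> 1 step is visible.
initialised = [0, 1, 1]

print(observed_increase(appears_at_one))
print(observed_increase(initialised))
```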
Problem: Lag
All these solutions produce results that lag the actual data - already seen with
summarize().
A 5m Prometheus rate() at a given time is really the average from 5 minutes ago
to now. Similarly with resetting counters.
Exponential moving averages more complicated to explain, same issue though.
Always compare like with like, stick to one range for your rate()s.
Client Implications
The Prometheus Counter is very easy to implement, only need to increment a
single number.
Concurrency handling varies by language. Mutexes are the slowest, then atomics,
then per-processor values - which the Prometheus Java client approximates.
Dropwizard Meter has to increment 4 numbers and do the decay logic, so about
6x slower per benchmarks.
Dropwizard Counter (which is really a Gauge, as it can go down) is as fast as
Prometheus Counter.
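A minimal sketch of how little a counter needs on the client side, using a mutex for concurrency. This is not the code of any Prometheus client library. The class and the hammer helper are hypothetical.

```python
import threading


class Counter:
    """A single number that only goes up, guarded by a lock."""

    def __init__(self):
        self._value = 0.0
        self._lock = threading.Lock()

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only go up")
        with self._lock:
            self._value += amount

    def get(self):
        with self._lock:
            return self._value


def hammer(counter, threads=4, per_thread=1000):
    """Increment the counter from several threads at once."""
    def work():
        for _ in range(per_thread):
            counter.inc()

    workers = [threading.Thread(target=work) for _ in range(threads)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()


requests_total = Counter()
hammer(requests_total)
print(requests_total.get())
```

The lock is the simplest option. Atomics or per-processor values would trade that simplicity for speed.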
Other performance considerations
Values for each label value (called a "Child") are in a map in each metric.
That map lookup can be relatively expensive (~100ns), keep a pointer to the Child
if that could matter. Need to know the labels you'll be using in advance though.
Similarly, don't create a map from metric names to metric objects. Store metric
objects as pointers in simple variables after you create them.
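A sketch of keeping a reference to the Child instead of repeating the map lookup. LabelledCounter and Child here are hypothetical stand-ins, not a client library's classes, and locking is left out for brevity.

```python
class Child:
    """The value for one combination of label values."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        self.value += amount


class LabelledCounter:
    """A metric holding one Child per set of label values."""

    def __init__(self):
        self._children = {}

    def labels(self, *label_values):
        # This map lookup is the relatively expensive step.
        child = self._children.get(label_values)
        if child is None:
            child = Child()
            self._children[label_values] = child
        return child


REQUESTS = LabelledCounter()

# Look the Child up once, when the label values are known in advance...
GET_REQUESTS = REQUESTS.labels("GET")


def handle_get():
    # ...and increment it directly on the hot path, with no lookup.
    GET_REQUESTS.inc()


handle_get()
handle_get()
print(REQUESTS.labels("GET").value)
```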
Best Practices
Use seconds for timing. Prometheus values are all floats, so developers don't
need to choose and deal with a mix of ns, us, ms, s and minutes.
increase() function handy for display, but similarly for consistency only use it for
display. Use rate() for recording rules.
increase() is only syntactic sugar for rate().
irate(): The other rate function
Prometheus also has irate().
This looks at the last two points in the given range, and returns how fast they're
increasing per second.
Great for seeing very up to date data.
Can be hard to read if data is very spiky.
Need to be careful you're not missing data.
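A sketch of the irate() idea: use only the last two samples in the range. The helper name irate_sketch and the sample values are assumptions for illustration.

```python
def irate_sketch(samples):
    """Per-second increase between the last two (timestamp, value) samples.

    Returns None when the range holds fewer than two samples.
    """
    if len(samples) < 2:
        return None
    (t_previous, v_previous), (t_last, v_last) = samples[-2], samples[-1]
    if v_last < v_previous:
        # Counter reset: presume it restarted from 0.
        delta = v_last
    else:
        delta = v_last - v_previous
    return delta / (t_last - t_previous)


samples = [(0, 0), (10, 100), (20, 110)]
print(irate_sketch(samples))       # only the 100 -> 110 step counts
print(irate_sketch(samples[:-1]))  # without the last scrape the spike dominates
```

Everything before the last two points is ignored, which is why the result is so current and also why it can be spiky.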
irate(), Normal Operation
irate(), Failed Transfer
Steps and rate durations
The query_range HTTP endpoint has a step parameter, this is the resolution of
the data returned.
If you have a 10m step and 5m rate, you're going to be ignoring half your data.
To avoid this, make sure your rate range is at least your step for graphs.
For irate(), your step should be no more than your sample resolution.
Compound Types: Summary
How to track average latency? With two counters!
One for total requests (_count), one for total latency of those requests (_sum).
Take the rates, divide and you have average latency.
This is how the compound Summary metric works. It's a more convenient API over
doing the above by hand.
Some clients also offer quantiles. Beware, slow and unaggregatable.
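The two-counter calculation done by hand, as a sketch. The sample values are made up, the helper names are hypothetical, and the simple rate here ignores resets and extrapolation.

```python
def per_second(samples):
    """Simplified rate between the first and last (timestamp, value) sample."""
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


def average_latency(sum_samples, count_samples):
    """rate(_sum) / rate(_count); None if there were no requests."""
    count_rate = per_second(count_samples)
    if count_rate == 0:
        return None
    return per_second(sum_samples) / count_rate


# Over one minute: 30 more requests, taking 6 more seconds in total.
latency_sum = [(0, 2.0), (60, 8.0)]   # _sum: total seconds spent
latency_count = [(0, 10), (60, 40)]   # _count: total requests
print(average_latency(latency_sum, latency_count))
```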
Compound Types: Histogram
Histogram also includes the _count and _sum.
The main purpose is calculating quantiles in Prometheus.
The histogram has buckets, which are counters. You can take the rate() of these,
aggregate them and then use histogram_quantile() to calculate arbitrary quantiles.
Be wary of cardinality explosion, use sparingly.
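A simplified sketch of estimating a quantile from cumulative buckets by linear interpolation within the matching bucket. It assumes the bucket counts have already had rate() applied and been aggregated, that the list ends with a +Inf bucket, and that 0 < q <= 1. The real histogram_quantile() handles more edge cases than this.

```python
import math


def quantile_from_buckets(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    with at least two entries and math.inf as the last upper bound.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, ending with +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Can't interpolate into +Inf: report the highest finite bound.
                return buckets[-2][0]
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Illustrative latency buckets in seconds.
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(quantile_from_buckets(0.5, buckets))
print(quantile_from_buckets(0.95, buckets))
```

The accuracy depends entirely on the bucket boundaries chosen, and each extra bucket is one more time series per label combination.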
Resources
Official Project Website: prometheus.io
User Mailing List: prometheus-users@googlegroups.com
Developer Mailing List: prometheus-developers@googlegroups.com
Source code:
https://github.com/prometheus/prometheus/blob/master/promql/functions.go
Robust Perception Blog: www.robustperception.io/blog

More Related Content

What's hot

svn 능력자를 위한 git 개념 가이드
svn 능력자를 위한 git 개념 가이드svn 능력자를 위한 git 개념 가이드
svn 능력자를 위한 git 개념 가이드Insub Lee
 
Git Introduction Tutorial
Git Introduction TutorialGit Introduction Tutorial
Git Introduction TutorialThomas Rausch
 
How to test infrastructure code: automated testing for Terraform, Kubernetes,...
How to test infrastructure code: automated testing for Terraform, Kubernetes,...How to test infrastructure code: automated testing for Terraform, Kubernetes,...
How to test infrastructure code: automated testing for Terraform, Kubernetes,...Yevgeniy Brikman
 
우아한 모노리스
우아한 모노리스우아한 모노리스
우아한 모노리스Arawn Park
 
Pentesting GraphQL Applications
Pentesting GraphQL ApplicationsPentesting GraphQL Applications
Pentesting GraphQL ApplicationsNeelu Tripathy
 
Monitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusGrafana Labs
 
Securing AEM webapps by hacking them
Securing AEM webapps by hacking themSecuring AEM webapps by hacking them
Securing AEM webapps by hacking themMikhail Egorov
 
Test Driven Development With Python
Test Driven Development With PythonTest Driven Development With Python
Test Driven Development With PythonSiddhi
 
XSS Attacks Exploiting XSS Filter by Masato Kinugawa - CODE BLUE 2015
XSS Attacks Exploiting XSS Filter by Masato Kinugawa - CODE BLUE 2015XSS Attacks Exploiting XSS Filter by Masato Kinugawa - CODE BLUE 2015
XSS Attacks Exploiting XSS Filter by Masato Kinugawa - CODE BLUE 2015CODE BLUE
 
Introduction to Kotlin coroutines
Introduction to Kotlin coroutinesIntroduction to Kotlin coroutines
Introduction to Kotlin coroutinesRoman Elizarov
 
[오픈소스컨설팅] 쿠버네티스와 쿠버네티스 on 오픈스택 비교 및 구축 방법
[오픈소스컨설팅] 쿠버네티스와 쿠버네티스 on 오픈스택 비교  및 구축 방법[오픈소스컨설팅] 쿠버네티스와 쿠버네티스 on 오픈스택 비교  및 구축 방법
[오픈소스컨설팅] 쿠버네티스와 쿠버네티스 on 오픈스택 비교 및 구축 방법Open Source Consulting
 
The Zen of High Performance Messaging with NATS
The Zen of High Performance Messaging with NATS The Zen of High Performance Messaging with NATS
The Zen of High Performance Messaging with NATS NATS
 
Kubernetes Application Deployment with Helm - A beginner Guide!
Kubernetes Application Deployment with Helm - A beginner Guide!Kubernetes Application Deployment with Helm - A beginner Guide!
Kubernetes Application Deployment with Helm - A beginner Guide!Krishna-Kumar
 
RedisConf17 - Internet Archive - Preventing Cache Stampede with Redis and XFetch
RedisConf17 - Internet Archive - Preventing Cache Stampede with Redis and XFetchRedisConf17 - Internet Archive - Preventing Cache Stampede with Redis and XFetch
RedisConf17 - Internet Archive - Preventing Cache Stampede with Redis and XFetchRedis Labs
 
[PHP 也有 Day #64] PHP 升級指南
[PHP 也有 Day #64] PHP 升級指南[PHP 也有 Day #64] PHP 升級指南
[PHP 也有 Day #64] PHP 升級指南Shengyou Fan
 

What's hot (20)

svn 능력자를 위한 git 개념 가이드
svn 능력자를 위한 git 개념 가이드svn 능력자를 위한 git 개념 가이드
svn 능력자를 위한 git 개념 가이드
 
Git Introduction Tutorial
Git Introduction TutorialGit Introduction Tutorial
Git Introduction Tutorial
 
How to test infrastructure code: automated testing for Terraform, Kubernetes,...
How to test infrastructure code: automated testing for Terraform, Kubernetes,...How to test infrastructure code: automated testing for Terraform, Kubernetes,...
How to test infrastructure code: automated testing for Terraform, Kubernetes,...
 
우아한 모노리스
우아한 모노리스우아한 모노리스
우아한 모노리스
 
Pentesting GraphQL Applications
Pentesting GraphQL ApplicationsPentesting GraphQL Applications
Pentesting GraphQL Applications
 
Monitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with Prometheus
 
Securing AEM webapps by hacking them
Securing AEM webapps by hacking themSecuring AEM webapps by hacking them
Securing AEM webapps by hacking them
 
Test Driven Development With Python
Test Driven Development With PythonTest Driven Development With Python
Test Driven Development With Python
 
Prometheus monitoring
Prometheus monitoringPrometheus monitoring
Prometheus monitoring
 
XSS Attacks Exploiting XSS Filter by Masato Kinugawa - CODE BLUE 2015
XSS Attacks Exploiting XSS Filter by Masato Kinugawa - CODE BLUE 2015XSS Attacks Exploiting XSS Filter by Masato Kinugawa - CODE BLUE 2015
XSS Attacks Exploiting XSS Filter by Masato Kinugawa - CODE BLUE 2015
 
Introduction to Kotlin coroutines
Introduction to Kotlin coroutinesIntroduction to Kotlin coroutines
Introduction to Kotlin coroutines
 
[오픈소스컨설팅] 쿠버네티스와 쿠버네티스 on 오픈스택 비교 및 구축 방법
[오픈소스컨설팅] 쿠버네티스와 쿠버네티스 on 오픈스택 비교  및 구축 방법[오픈소스컨설팅] 쿠버네티스와 쿠버네티스 on 오픈스택 비교  및 구축 방법
[오픈소스컨설팅] 쿠버네티스와 쿠버네티스 on 오픈스택 비교 및 구축 방법
 
The Zen of High Performance Messaging with NATS
The Zen of High Performance Messaging with NATS The Zen of High Performance Messaging with NATS
The Zen of High Performance Messaging with NATS
 
git and github
git and githubgit and github
git and github
 
Golang Channels
Golang ChannelsGolang Channels
Golang Channels
 
Kubernetes Application Deployment with Helm - A beginner Guide!
Kubernetes Application Deployment with Helm - A beginner Guide!Kubernetes Application Deployment with Helm - A beginner Guide!
Kubernetes Application Deployment with Helm - A beginner Guide!
 
RedisConf17 - Internet Archive - Preventing Cache Stampede with Redis and XFetch
RedisConf17 - Internet Archive - Preventing Cache Stampede with Redis and XFetchRedisConf17 - Internet Archive - Preventing Cache Stampede with Redis and XFetch
RedisConf17 - Internet Archive - Preventing Cache Stampede with Redis and XFetch
 
Git
GitGit
Git
 
Junit
JunitJunit
Junit
 
[PHP 也有 Day #64] PHP 升級指南
[PHP 也有 Day #64] PHP 升級指南[PHP 也有 Day #64] PHP 升級指南
[PHP 也有 Day #64] PHP 升級指南
 

Similar to Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)

Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)Brian Brazil
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Brian Brazil
 
An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)Brian Brazil
 
My Postdoctoral Research
My Postdoctoral ResearchMy Postdoctoral Research
My Postdoctoral ResearchPo-Ting Wu
 
Algorithm Analysis.pdf
Algorithm Analysis.pdfAlgorithm Analysis.pdf
Algorithm Analysis.pdfMemMem25
 
Stevens-Benchmarking Sorting Algorithms
Stevens-Benchmarking Sorting AlgorithmsStevens-Benchmarking Sorting Algorithms
Stevens-Benchmarking Sorting AlgorithmsJames Stevens
 
QConSF 2014 talk on Netflix Mantis, a stream processing system
QConSF 2014 talk on Netflix Mantis, a stream processing systemQConSF 2014 talk on Netflix Mantis, a stream processing system
QConSF 2014 talk on Netflix Mantis, a stream processing systemDanny Yuan
 
Mantis: Netflix's Event Stream Processing System
Mantis: Netflix's Event Stream Processing SystemMantis: Netflix's Event Stream Processing System
Mantis: Netflix's Event Stream Processing SystemC4Media
 
Introduction to Algorithms
Introduction to AlgorithmsIntroduction to Algorithms
Introduction to AlgorithmsVenkatesh Iyer
 
Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)Brian Brazil
 
Microservices and Prometheus (Microservices NYC 2016)
Microservices and Prometheus (Microservices NYC 2016)Microservices and Prometheus (Microservices NYC 2016)
Microservices and Prometheus (Microservices NYC 2016)Brian Brazil
 
TIME EXECUTION OF DIFFERENT SORTED ALGORITHMS
TIME EXECUTION   OF  DIFFERENT SORTED ALGORITHMSTIME EXECUTION   OF  DIFFERENT SORTED ALGORITHMS
TIME EXECUTION OF DIFFERENT SORTED ALGORITHMSTanya Makkar
 
Six Sigma Dfss Application In Data Accarucy
Six Sigma Dfss Application In Data AccarucySix Sigma Dfss Application In Data Accarucy
Six Sigma Dfss Application In Data Accarucyxyhfun
 
Prometheus - Open Source Forum Japan
Prometheus  - Open Source Forum JapanPrometheus  - Open Source Forum Japan
Prometheus - Open Source Forum JapanBrian Brazil
 
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the EdgeUsing Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the EdgeDataWorks Summit
 
Process Synchronization -1.ppt
Process Synchronization -1.pptProcess Synchronization -1.ppt
Process Synchronization -1.pptjayverma27
 

Similar to Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017) (20)

Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
 
An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)
 
My Postdoctoral Research
My Postdoctoral ResearchMy Postdoctoral Research
My Postdoctoral Research
 
Algorithm Analysis.pdf
Algorithm Analysis.pdfAlgorithm Analysis.pdf
Algorithm Analysis.pdf
 
Stevens-Benchmarking Sorting Algorithms
Stevens-Benchmarking Sorting AlgorithmsStevens-Benchmarking Sorting Algorithms
Stevens-Benchmarking Sorting Algorithms
 
QConSF 2014 talk on Netflix Mantis, a stream processing system
QConSF 2014 talk on Netflix Mantis, a stream processing systemQConSF 2014 talk on Netflix Mantis, a stream processing system
QConSF 2014 talk on Netflix Mantis, a stream processing system
 
Mantis: Netflix's Event Stream Processing System
Mantis: Netflix's Event Stream Processing SystemMantis: Netflix's Event Stream Processing System
Mantis: Netflix's Event Stream Processing System
 
Introduction to Algorithms
Introduction to AlgorithmsIntroduction to Algorithms
Introduction to Algorithms
 
Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)
 
Algo and flowchart
Algo and flowchartAlgo and flowchart
Algo and flowchart
 
Analyzing algorithms
Analyzing algorithmsAnalyzing algorithms
Analyzing algorithms
 
Microservices and Prometheus (Microservices NYC 2016)
Microservices and Prometheus (Microservices NYC 2016)Microservices and Prometheus (Microservices NYC 2016)
Microservices and Prometheus (Microservices NYC 2016)
 
TIME EXECUTION OF DIFFERENT SORTED ALGORITHMS
TIME EXECUTION   OF  DIFFERENT SORTED ALGORITHMSTIME EXECUTION   OF  DIFFERENT SORTED ALGORITHMS
TIME EXECUTION OF DIFFERENT SORTED ALGORITHMS
 
Six Sigma Dfss Application In Data Accarucy
Six Sigma Dfss Application In Data AccarucySix Sigma Dfss Application In Data Accarucy
Six Sigma Dfss Application In Data Accarucy
 
Prometheus - Open Source Forum Japan
Prometheus  - Open Source Forum JapanPrometheus  - Open Source Forum Japan
Prometheus - Open Source Forum Japan
 
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the EdgeUsing Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
 
Is this normal?
Is this normal?Is this normal?
Is this normal?
 
Process Synchronization -1.ppt
Process Synchronization -1.pptProcess Synchronization -1.ppt
Process Synchronization -1.ppt
 
Major ppt
Major pptMajor ppt
Major ppt
 

More from Brian Brazil

OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)
OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)
OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)Brian Brazil
 
Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)Brian Brazil
 
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)Brian Brazil
 
Anatomy of a Prometheus Client Library (PromCon 2018)
Anatomy of a Prometheus Client Library (PromCon 2018)Anatomy of a Prometheus Client Library (PromCon 2018)
Anatomy of a Prometheus Client Library (PromCon 2018)Brian Brazil
 
Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)Brian Brazil
 
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)Brian Brazil
 
Prometheus for Monitoring Metrics (Percona Live Europe 2017)
Prometheus for Monitoring Metrics (Percona Live Europe 2017)Prometheus for Monitoring Metrics (Percona Live Europe 2017)
Prometheus for Monitoring Metrics (Percona Live Europe 2017)Brian Brazil
 
Evolution of the Prometheus TSDB (Percona Live Europe 2017)
Evolution of the Prometheus TSDB  (Percona Live Europe 2017)Evolution of the Prometheus TSDB  (Percona Live Europe 2017)
Evolution of the Prometheus TSDB (Percona Live Europe 2017)Brian Brazil
 
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)Staleness and Isolation in Prometheus 2.0 (PromCon 2017)
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)Brian Brazil
 
Rule 110 for Prometheus (PromCon 2017)
Rule 110 for Prometheus (PromCon 2017)Rule 110 for Prometheus (PromCon 2017)
Rule 110 for Prometheus (PromCon 2017)Brian Brazil
 
Prometheus: From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)
Prometheus:  From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)Prometheus:  From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)
Prometheus: From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)Brian Brazil
 
What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)Brian Brazil
 
Provisioning and Capacity Planning (Travel Meets Big Data)
Provisioning and Capacity Planning (Travel Meets Big Data)Provisioning and Capacity Planning (Travel Meets Big Data)
Provisioning and Capacity Planning (Travel Meets Big Data)Brian Brazil
 
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...Brian Brazil
 
So You Want to Write an Exporter
So You Want to Write an ExporterSo You Want to Write an Exporter
So You Want to Write an ExporterBrian Brazil
 
An Exploration of the Formal Properties of PromQL
An Exploration of the Formal Properties of PromQLAn Exploration of the Formal Properties of PromQL
An Exploration of the Formal Properties of PromQLBrian Brazil
 
Life of a Label (PromCon2016, Berlin)
Life of a Label (PromCon2016, Berlin)Life of a Label (PromCon2016, Berlin)
Life of a Label (PromCon2016, Berlin)Brian Brazil
 
Prometheus (Monitorama 2016)
Prometheus (Monitorama 2016)Prometheus (Monitorama 2016)
Prometheus (Monitorama 2016)Brian Brazil
 
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Brian Brazil
 
Ansible at FOSDEM (Ansible Dublin, 2016)
Ansible at FOSDEM (Ansible Dublin, 2016)Ansible at FOSDEM (Ansible Dublin, 2016)
Ansible at FOSDEM (Ansible Dublin, 2016)Brian Brazil
 

More from Brian Brazil (20)

OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)
OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)
OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)
 
Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)
 
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
 
Anatomy of a Prometheus Client Library (PromCon 2018)
Anatomy of a Prometheus Client Library (PromCon 2018)Anatomy of a Prometheus Client Library (PromCon 2018)
Anatomy of a Prometheus Client Library (PromCon 2018)
 
Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)
 
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
 
Prometheus for Monitoring Metrics (Percona Live Europe 2017)
Prometheus for Monitoring Metrics (Percona Live Europe 2017)Prometheus for Monitoring Metrics (Percona Live Europe 2017)
Prometheus for Monitoring Metrics (Percona Live Europe 2017)
 
Evolution of the Prometheus TSDB (Percona Live Europe 2017)
Evolution of the Prometheus TSDB  (Percona Live Europe 2017)Evolution of the Prometheus TSDB  (Percona Live Europe 2017)
Evolution of the Prometheus TSDB (Percona Live Europe 2017)
 
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)Staleness and Isolation in Prometheus 2.0 (PromCon 2017)
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)
 
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)

  • 2. Who am I? Engineer passionate about running software reliably in production. Core developer of Prometheus. Studied Computer Science in Trinity College Dublin. Google SRE for 7 years, working on high-scale reliable systems. Contributor to many open source projects, including Prometheus, Ansible, Python, Aurora and Zookeeper. Founder of Robust Perception, provider of commercial support and consulting.
  • 3. What is Prometheus? Prometheus is a metrics-based time series database, designed for whitebox monitoring. It supports labels (dimensions/tags). Alerting and graphing are unified, using the same language.
  • 5. Counting Counting is easy right? Just 1, 2, 3?
  • 6. Counting… to the Extreme!!! What if we're counting monitoring-related events for a metrics system, though? We're usually sampling data over the network => potential data loss. What happens with the data we transfer at the other end?
  • 7. Counters and Gauges There are two base metric types. Gauges are a snapshot of current state, such as memory usage or temperature. They can go up and down. Counters are the other base type. To explain them, we need to go on a "small" detour.
  • 8. Events are key Events are the thing we want to track with Counters. An event might be an HTTP request hitting our server, a function call being made or an error being returned. An event logging system would record each event individually. A metrics-based system like Prometheus or Graphite has events aggregated across time before they get to the TSDB. Therein lies the rub.
  • 9. Approach #1: Resetting Count There's a few common approaches to providing this aggregate over time. The first and simplest is the resetting count. You start at 0, and every time there's an event you increment the count. On a regular interval you transfer the current count, and reset to 0.
  • 11. Approach #1: Resetting Count Problems If a transfer fails, you've lost that data. Can't presume this effect will be random and unbiased, e.g. if a big spike in traffic also saturates network links used for monitoring. Doesn't work if you want to transfer data to more than one place for redundancy. Each would get 1/n of the data.
  • 13. Approach #2: Exponential moving average A number of instrumentation libraries offer this, such as Dropwizard's Meter. Basically the same way Unix load averages work: result(t) = (1 - w) * result(t-1) + w * events_this_period, where t is the tick and w is a weighting factor. The weighting factor determines how quickly new data is incorporated. Dropwizard evaluates the above every 5 seconds.
  • 14. Exponential Moving Average, Normal Operation
  • 15. Approach #2: Exponential moving average Problems Events aren't uniformly considered. If you're transferring data every 10s, then the most recent 5s matter more. Thus reconstructing what happened is hard for debugging, unless you get every 5s update. You're bound to the 1m, 5m and 15m weightings that the implementation has chosen. Also means that it's not particularly resilient to missing a scrape.
  • 16. Aside: Graphite's summarize() function Summarize() returns events during e.g. the last hour. Some believe that summarize is accurate. It isn't. The problem is that with, say, 15m granularity, the data point at 13:02 will include data from the 13 minutes before the 13:00-14:00 window, and similarly at the end. If you want this accurately, you need to use logs. No metrics-based system can report this accurately in the general case.
  • 17. Graphite's summarize() and non-aligned data
  • 18. Aliasing Depending on the exact time offsets between the process start, metric initialisation, data transfers and when the user makes a query you can get different results. A second in either direction can make a big difference to the patterns you see in your graphs. This is an expected signal processing effect, be aware of it.
  • 19. Expressive Power Both previous solutions are reasonable if the monitoring system is a fairly dumb data store, often with little math beyond addition (if even that). Losing data or having no redundancy are better than having nothing at all. What if you have the option for your monitoring system to do math? What if you control both ends?
  • 20. Approach #3: Prometheus Counters Like Approach #1, we have a counter that starts at 0 and increments at each event. This is transferred regularly to Prometheus. It's not reset on every transfer though; it keeps on increasing. The rate() function in Prometheus takes this in, and calculates how quickly it increased over the given time period.
  • 22. Approach #3: Prometheus Counters Resilient to failed transfers (lose resolution, not data). Can handle multiple places to transfer to. Can choose the time period you want to calculate over in the monitoring system, and thus choose your level of smoothing, e.g. rate(my_metric[5m]) or rate(my_metric[10m]). Uniform consideration of data. Easy to implement on the client side.
  • 24. Prometheus Counters: Rate() There are many details to getting the rate() function right. Processes might restart. Scrapes might be missed. Time periods rarely align perfectly. Time series stop and start.
  • 25. Prometheus Counters: Resets Counters can be reset to 0, such as if a process restarts or a network switch gets rebooted. If we see the value decreasing, then that's a counter reset, so presume it went to 0. So seeing 0 -> 10 -> 5 is an increase of 10 + 5 = 15. Graphite/InfluxDB's nonNegativeDerivative() function would ignore the drop and report only the increase of 10.
  • 26. Prometheus Counters: Missed scrapes If we miss a scrape in the middle of a time period, no problem as we still have the data points around it. Little more complicated around the edges of the time period we're looking at though.
  • 27. Prometheus Counters: Alignment It is rare that the user will request data exactly on the same cycle as the scrapes. Especially when you're monitoring multiple servers with staggered scrapes. Or given that timestamps are millisecond resolution, and the endpoint that graphs use accepts only second-granularity input. Thus we need to extrapolate out to the end of the rate()'s range.
  • 28. increase() can return non-integers on integral data This is where one of the more surprising behaviours of increase() comes from. Say we have the samples (t=1, 10), (t=6, 12) and (t=11, 13). Request an increase() from t=0 for 15s and you'll get an increase of 3 over 10s. Extrapolating over the 15s, that's a result of 4.5. This is the correct result on average. If you want exact answers, use logs.
  • 29. Non-integral increase due to extrapolation
  • 30. Prometheus Counters: Time series lifecycle Time series are created and destroyed. If we always extrapolated out to the edge of the rate() range we'd get much bigger results than we should. So we detect that. We calculate the average interval between data points. If the first/last data point is within 110% of the average interval of the start/end of the range, then we extrapolate to the start/end. This allows for failed scrapes. Otherwise we extrapolate by 50% of the average interval. We also know counters can't go negative, so we don't extrapolate back past the point where they'd be 0.
  • 32. Problem: Timeseries not always existing The previous logic handles all the edge cases around counter resets, process restarts and rolling restarts, on average. What if a counter appears with the value 1 long after the process has started, though, and doesn't increase again? There's no increase in the history, so rate() doesn't see it. We can't tell when the increase happened. Prometheus is designed to be available, not to catch 100% of events. Solution: Logs, or make sure all your counters are initialised on process start so they go 0->1. You will then only miss an increase that happens prior to the first scrape.
  • 33. Problem: Lag All these solutions produce results that lag the actual data - already seen with summarize(). A 5m Prometheus rate() at a given time, is really the average from 5 minutes ago to now. Similarly with resetting counters. Exponential moving averages more complicated to explain, same issue though. Always compare like with like, stick to one range for your rate()s.
  • 34. Client Implications The Prometheus Counter is very easy to implement, only need to increment a single number. Concurrency handling varies by language. Mutexes are the slowest, then atomics, then per-processor values - which the Prometheus Java client approximates. Dropwizard Meter has to increment 4 numbers and do the decay logic, so about 6x slower per benchmarks. Dropwizard Counter (which is really a Gauge, as it can go down) is as fast as Prometheus Counter.
  • 35. Other performance considerations Values for each label value (called a "Child") are in a map in each metric. That map lookup can be relatively expensive (~100ns), so keep a pointer to the Child if that could matter. You need to know the labels you'll be using in advance though. Similarly, don't create a map from metric names to metric objects. Store metric objects as pointers in simple variables after you create them.
  • 36. Best Practices Use seconds for timing. Prometheus values are all floats, so developers don't need to choose and deal with a mix of ns, us, ms, s and minutes. increase() function handy for display, but similarly for consistency only use it for display. Use rate() for recording rules. increase() is only syntactic sugar for rate().
  • 37. irate(): The other rate function Prometheus also has irate(). This looks at the last two points in the given range, and returns how fast they're increasing per second. Great for seeing very up to date data. Can be hard to read if data is very spiky. Need to be careful you're not missing data.
  • 40. Steps and rate durations The query_range HTTP endpoint has a step parameter, this is the resolution of the data returned. If you have a 10m step and 5m rate, you're going to be ignoring half your data. To avoid this, make sure your rate range is at least your step for graphs. For irate(), your step should be no more than your sample resolution.
  • 41. Compound Types: Summary How to track average latency? With two counters! One for total requests (_count), one for total latency of those requests (_sum). Take the rates, divide and you have average latency. This is how the compound Summary metric works. It's a more convenient API over doing the above by hand. Some clients also offer quantiles. Beware, slow and unaggregatable.
  • 42. Compound Types: Histogram Histogram also includes the _count and _sum. The main purpose is calculating quantiles in Prometheus. The histogram has buckets, which are counters. You can take the rate() of these, aggregate them and then use histogram_quantile() to calculate arbitrary quantiles. Be wary of cardinality explosion, use sparingly.
  • 43. Resources Official Project Website: prometheus.io User Mailing List: prometheus-users@googlegroups.com Developer Mailing List: prometheus-developers@googlegroups.com Source code: https://github.com/prometheus/prometheus/blob/master/promql/functions.go Robust Perception Blog: www.robustperception.io/blog
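
As a rough illustration of the exponential moving average update from slide 13, the rule can be sketched in Python. The names `ema_update` and `ema_series`, the weighting factor of 0.5 and the event counts are made up for this example; they are not Dropwizard's actual implementation or weights.

```python
def ema_update(previous, events_this_period, w):
    """One tick of result(t) = (1 - w) * result(t-1) + w * events_this_period."""
    return (1 - w) * previous + w * events_this_period


def ema_series(event_counts, w, initial=0.0):
    """Apply the update once per tick and return the result after each tick."""
    results = []
    current = initial
    for events in event_counts:
        current = ema_update(current, events, w)
        results.append(current)
    return results


# Hypothetical input: a burst of 8 events in one tick, then silence.
print(ema_series([8, 0, 0], w=0.5))
```

With these assumed inputs the burst decays as 4.0, 2.0, 1.0 across the three ticks, which shows why recent ticks weigh more than older ones and why the original event counts are hard to reconstruct from the averaged value.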
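
The counter reset handling from slide 25 can be sketched as follows. This is a minimal Python illustration, not the Prometheus source; `increase_with_resets` and `increase_ignoring_drops` are hypothetical names, and the second function only mimics the behaviour the slide attributes to nonNegativeDerivative().

```python
def increase_with_resets(values):
    """Total increase of a counter, treating any drop as a reset to 0."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        if cur < prev:
            # The counter restarted from 0, so everything up to cur is new.
            total += cur
        else:
            total += cur - prev
    return total


def increase_ignoring_drops(values):
    """Total increase that simply skips any step where the value went down."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        if cur >= prev:
            total += cur - prev
    return total


samples = [0, 10, 5]  # the 0 -> 10 -> 5 example from the slide
print(increase_with_resets(samples))     # 15.0
print(increase_ignoring_drops(samples))  # 10.0
```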
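
The extrapolation rules from slides 28 and 30 can be approximated in a few lines of Python. This is a simplified sketch based only on the description in the slides, not a copy of the logic in promql/functions.go; the function name `extrapolated_increase` and its signature are assumptions made for the example.

```python
def extrapolated_increase(samples, range_start, range_end):
    """Estimate a counter's increase over [range_start, range_end].

    samples is a list of (timestamp_seconds, value) pairs sorted by time,
    all inside the range.
    """
    if len(samples) < 2:
        return None  # fewer than two points: no increase can be computed

    # Raw increase between first and last sample, treating drops as resets.
    raw = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        raw += value if value < prev else value - prev
        prev = value

    first_t, first_v = samples[0]
    last_t = samples[-1][0]
    sampled = last_t - first_t
    if sampled <= 0:
        return None
    avg_interval = sampled / (len(samples) - 1)

    # Extrapolate to the range edge if the nearest sample is within 110% of
    # the average interval; otherwise only by half an average interval.
    threshold = 1.1 * avg_interval
    to_start = first_t - range_start
    to_end = range_end - last_t
    if to_start >= threshold:
        to_start = avg_interval / 2
    if to_end >= threshold:
        to_end = avg_interval / 2

    # Counters can't go negative: don't extrapolate back past the point
    # where the counter would have been 0.
    if raw > 0 and first_v >= 0:
        to_zero = sampled * (first_v / raw)
        to_start = min(to_start, to_zero)

    return raw * (sampled + to_start + to_end) / sampled


# Data from slide 28: increase of 3 over 10s, requested over a 15s range.
print(extrapolated_increase([(1, 10), (6, 12), (11, 13)], 0, 15))  # 4.5
```

The sampled increase is 3 over 10 seconds; both edges are within 110% of the 5 second average interval, so the result is scaled to the full 15 seconds, giving the non-integral 4.5 mentioned in the slide.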