Presented at the inaugural Singapore Prometheus Meetup; videos at https://www.meetup.com/Singapore-Prometheus-Meetup/events/240844291/
Links to the original slides are provided via the various blog posts referenced below.
2. So The Story Goes Like…
Full story: https://www.slideshare.net/grobie/the-history-of-prometheus-at-soundcloud
• 2012 - Joined SoundCloud
• Left Google in 2012 after 5+ years
• Side project: an open-source monitoring system for not only IT (econometrics, biochemistry, etc.)
• Started LevelDB-backed Prometheus
• Server, client_golang
• Protocol Buffers
• 2012 - Joined SoundCloud
• Left Google in 2012 after 2+ years
• Configuration, query language
• 2013 - Joined SoundCloud
• Left Google in 2013 after 7+ years
• Storage rewrite (LevelDB to Chunks): March 2014
• Public release: January 2015
• Joined the Cloud Native Computing Foundation (CNCF): May 2016
• Prometheus 2.0 announced: November 08, 2017
• Singapore Meetup: 23 November, 2017
3. Motivation Behind - Google SRE Best Practices
Read book: https://landing.google.com/sre/book.html
• SRE: Have software engineers do operations
• Do the same work as an operations team, but with automation instead of manual labour
• A 50% upper-bound cap on the amount of “ops” work
4. Google SLI, SLO, SLA
Full story: https://cloudplatform.googleblog.com/2017/01/availability-part-deux--CRE-life-lessons.html
Service Level Indicators (SLIs)
• A carefully defined quantitative measure of some aspect of the level of service that is provided
• E.g. request latency, error rate (often expressed as a % of all requests received), or system throughput
Service Level Objectives (SLOs)
• Lower bound ≤ SLI ≤ upper bound
• Define the lowest level of reliability, and state that as your Service Level Objective (SLO)
Service Level Agreements (SLAs)
• The SLA is a looser objective than the SLO; alternatively, the SLA might only specify a subset of the SLO metrics.
• E.g. an availability SLA of 99.9% over 1 month with an internal availability SLO of 99.95%
• A promise to someone using a service that its availability should meet a certain level over a certain period, and if it fails to do so, some kind of penalty will be paid (a partial refund of the subscription fee paid by customers for that period, or subscription time added for free)
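For instance, an SLI can be computed directly in PromQL. A minimal sketch, assuming a counter named http_requests_total with a status label (an illustrative metric, not one defined by the SRE book); comparing the result against 0.001 would check a 99.9% availability objective:
# Error-ratio SLI over the last 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))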
6. Four Golden Signals
Latency
• The time it takes to service a request.
• Successful vs. failed requests
• A slow error is even worse than a fast error. Track error latency.
Traffic
• A measure of how much demand is being placed on your system
• Usually HTTP requests per second (static vs dynamic content)
• Streaming system - network I/O rate or concurrent sessions
• Key-value storage system - transactions per second (TPS)
Errors
• The rate of requests that fail (e.g. HTTP 500s, or HTTP 200 coupled with the wrong content)
Saturation
• How "full" your service is. CPU, Memory, I/O
• Can your service properly handle double the traffic, handle only 10% more traffic, or handle even less traffic than it currently receives?
• Saturation is also concerned with predictions of impending saturation, such as “It looks like your database will fill its hard drive in 4 hours.”
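Each of the four signals maps onto a PromQL query. A minimal sketch, assuming the http_requests_total counter and http_requests_duration_seconds_bucket histogram used later in this deck, plus a node-exporter filesystem metric (node_filesystem_free, its name as of late 2017):
# Traffic: requests per second
sum(rate(http_requests_total[5m]))
# Errors: failed requests per second
sum(rate(http_requests_total{status=~"5.."}[5m]))
# Latency: 99th percentile over the last 5 minutes
histogram_quantile(0.99, sum by(le) (rate(http_requests_duration_seconds_bucket[5m])))
# Saturation: predict whether the disk fills within 4 hours
predict_linear(node_filesystem_free{mountpoint="/"}[1h], 4 * 3600) < 0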
7. Error Budget = 100% - SLO
Full story: https://cloudplatform.googleblog.com/2017/01/availability-part-deux--CRE-life-lessons.html
Move fast without breaking SLO
• 100% is the wrong reliability target
• Error Budgets balance the goals of:
• Product development teams (KPI is feature velocity, incentive to push code often)
• SRE teams (KPI is reliability of a service, incentive to pushback against change)
• Error budget can be spent on anything: launching features, etc.
• The error budget prompts discussion of phased rollouts and 1% experiments
Goal of SRE team isn’t “zero outages”
• SRE and product are incentive-aligned to spend the error budget and get maximum feature velocity
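A worked sketch of tracking budget spend, assuming a 99.9% availability SLO (a 0.1% error budget, i.e. roughly 43 minutes of full downtime per 30 days) and the illustrative http_requests_total counter used elsewhere in this deck:
# Fraction of the 30-day error budget consumed; a value above 1 means the budget is blown
(
  sum(rate(http_requests_total{status=~"5.."}[30d]))
  /
  sum(rate(http_requests_total[30d]))
) / 0.001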
8. Googlers use Borgmon (a.k.a. Borgmon rules)
Full story: https://landing.google.com/sre/book/chapters/practical-alerting.html
% curl http://webserver:80/varz
http_requests 37
errors_total 12
Each of the major languages used at Google has an implementation of the exported-variable interface that automagically registers with the HTTP server built into every Google binary by default. This is called “collection via /varz”.
(Diagrams on the original slide: “Time Series” and “Distributed”)
9. …Traditional Monitoring in the Kubernetes Era
Full story: https://www.slideshare.net/FabianReinartz/monitoring-a-kubernetes-backed-microservice-architecture-with-prometheus
A lot of traffic to monitor
Way more targets to monitor
…and they constantly change
Need a fleet-wide view (e.g. what’s my overall 99th-percentile latency?)
Still need to be able to drill down for troubleshooting
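Prometheus can answer both from the same data. A sketch, assuming the http_requests_duration_seconds_bucket histogram that reappears on the PromQL slide below:
# Fleet-wide view: overall 99th percentile latency
histogram_quantile(0.99, sum by(le) (rate(http_requests_duration_seconds_bucket[5m])))
# Drill-down: the same percentile broken out per instance for troubleshooting
histogram_quantile(0.99, sum by(le, instance) (rate(http_requests_duration_seconds_bucket[5m])))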
10. Prometheus Relies on Exporters
Full story: https://www.slideshare.net/FabianReinartz/monitoring-a-kubernetes-backed-microservice-architecture-with-prometheus
Exporters: an endpoint polled by the Prometheus server, answering its GET requests, is typically called an exporter; e.g. the host-level metrics exporter is node-exporter.
https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exporters.md
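For example, host CPU utilisation can be derived from node-exporter’s node_cpu counter (its name at the time of this talk; renamed node_cpu_seconds_total in later releases). A sketch:
# Per-instance CPU utilisation in percent, derived from the idle-time counter
100 - avg by(instance) (rate(node_cpu{mode="idle"}[5m])) * 100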
11. Prometheus Architecture
Full story: https://jaxenter.com/prometheus-monitoring-pros-cons-136019.html
PromQL:
The 3 path-method combinations with the highest number of failing requests?
topk(3,
  sum by(path, method) (
    rate(http_requests_total{status=~"5.."}[5m])
  )
)
The 99th percentile request latency by request path?
histogram_quantile(0.99, sum by(le, path) (
  rate(http_requests_duration_seconds_bucket[5m])
))
12. Prometheus Storage Architecture
• A monitoring system must be more reliable than the systems it is monitoring
• Prometheus's local storage is not meant as durable long-term storage.
• Chunks of data are in RAM, with WAL on disk
needed_disk_space =
retention_time_seconds *
ingested_samples_per_second *
bytes_per_sample [1…2 bytes]
• Possible LVM solution if _really_ desperate
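A worked example of the sizing formula above, with illustrative numbers (not from the talk): 15 days of retention (1,296,000 seconds) at 100,000 ingested samples per second and ~1.3 bytes per sample gives
needed_disk_space ≈ 1,296,000 × 100,000 × 1.3 bytes ≈ 168 GB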
As of this writing (Nov. 2017), remote integration via adapters is possible with:
Chronix, Cortex, CrateDB, Graphite, InfluxDB, OpenTSDB, PostgreSQL/TimescaleDB, SignalFx, ClickHouse, etc.
This is primarily intended for long-term storage. It is recommended that you carefully evaluate any solution in this space to confirm it can handle your data volumes.
Full story: https://prometheus.io/docs/operating/integrations/#remote-endpoints-and-storage
13. What Prometheus Is Not & Best Practices
• Not 100% accurate
• No logs, only metrics
• Not a durable long-term storage
• Not an anomaly-detection system
• Not a dashboarding solution
Full story: https://prometheus.io/docs/introduction/overview/#when-does-it-not-fit
Run one Prometheus server (or HA pair) in each failure domain / zone / cluster, monitoring jobs only in that zone.
Have a set of global Prometheus servers that monitor (federate from) the per-cluster ones.
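A sketch of the resulting query flow, with illustrative rule and label names (job:http_requests:rate5m and cluster are assumptions, not fixed conventions):
# On each per-cluster Prometheus, a recording rule pre-aggregates locally:
#   job:http_requests:rate5m = sum by(job) (rate(http_requests_total[5m]))
# The global servers scrape each cluster's /federate endpoint, e.g. with
#   match[]={__name__=~"job:.*"}
# and can then answer fleet-wide questions such as:
sum by(job, cluster) (job:http_requests:rate5m)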