Monitoring with Prometheus


A short presentation of Prometheus given at the CodeU Automation Night meetup in Aarhus on November 1st, 2016.

Published in: Technology


  1. Monitoring with Prometheus. By Kasper Nissen @phennex
  2. Hi! My name is Kasper @phennex
  3. What am I going to cover?
      • Monitoring - why and what?
      • Prometheus - an introduction
      • Short demo
  4. DEMO Part 1
  5. Why monitor?
  6. Analyzing long-term trends
  7. Comparing over time or experiment groups
  8. Alerting
  9. Building dashboards
  10. Conducting ad hoc retrospective analysis
  11. Purpose: What is broken, and why?
  12. What to monitor?
  13. What to monitor? Hosts: CPU, memory, I/O, network, filesystem
  14. What to monitor? Containers: CPU, memory, I/O, restarts, throttling
  15. What to monitor? Applications: throughput, latency
  16. The Four Golden Signals (Site Reliability Engineering - How Google Runs Production Systems)
  17. Latency: The time it takes to service a request. It is important to distinguish between the latency of successful requests and the latency of failed requests.
  18. Traffic: A measure of how much demand is being placed on your system, measured in a high-level system-specific metric.
  19. Errors: The rate of requests that fail, either explicitly (e.g. HTTP 500s) or implicitly (e.g. an HTTP 200 success response with the wrong content).
  20. Saturation: How “full” your service is. A measure of your system fraction, emphasizing the resources that are most constrained (e.g. in a memory-constrained system, show memory).
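As a rough illustration of the four golden signals above, the sketch below derives them from a batch of request records. The `Request` record shape, the `golden_signals` helper and the choice of the median as the latency summary are assumptions made for this example; they do not come from the talk or from Prometheus.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Request:
    """One served request (hypothetical record shape, for illustration only)."""
    duration_s: float  # time taken to service the request
    status: int        # HTTP status code returned
    body_ok: bool      # False when a success response carried the wrong content


def is_error(req):
    # Explicit failures (HTTP 5xx) and implicit ones (success status, wrong content).
    return req.status >= 500 or not req.body_ok


def golden_signals(requests, window_s, mem_used, mem_total):
    """Summarize latency, traffic, errors and saturation for one time window."""
    ok = [r.duration_s for r in requests if not is_error(r)]
    failed = [r.duration_s for r in requests if is_error(r)]
    return {
        # Latency is reported separately for successful and failed requests.
        "latency_ok_median_s": median(ok) if ok else None,
        "latency_failed_median_s": median(failed) if failed else None,
        # Traffic: requests per second over the window.
        "traffic_rps": len(requests) / window_s,
        # Errors: fraction of requests that failed.
        "error_ratio": len(failed) / len(requests) if requests else 0.0,
        # Saturation: how full the most constrained resource is (memory here).
        "saturation": mem_used / mem_total,
    }


batch = [
    Request(0.1, 200, True),
    Request(0.3, 200, True),
    Request(0.05, 500, True),   # explicit failure
    Request(0.2, 200, False),   # implicit failure: HTTP 200 with wrong content
]
signals = golden_signals(batch, window_s=2.0, mem_used=6.0, mem_total=8.0)
```

Keeping the two latency figures apart matters because a fast failure would otherwise pull the overall latency down and hide a slow success path.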
  21. Prometheus
  22. Prometheus: in Greek mythology, Prometheus was presented as the protector and benefactor of mankind.
  23. Prometheus
      • Heavily inspired by Borgmon
      • Built by ex-Googlers at SoundCloud
      • Pull-based (scrapes at regular intervals)
      • Many integration possibilities
  24. The 2nd project in CNCF
  25. What is Prometheus?
      • Monitoring system and time series database
      • Instrumentation
      • Metrics collection and storage
      • Querying
      • Alerting
      • Dashboard / Graphing / Trending
  26. Prometheus focuses on
      • Operational systems monitoring
      • Dynamic cloud environments
  27. Prometheus does not do
      • Raw log / event collection (use ELK stack)
      • Request tracing (use …)
      • “Magic” anomaly detection
      • Durable long-term storage
      • Automatic horizontal scaling
      • User / auth management
  28. Prometheus Architecture (diagram labels: long-lived jobs, short-lived jobs, Pushgateway, Alertmanager, Grafana)
  29. The Data model. Every time series is uniquely identified by its metric name and a set of key-value pairs, also known as labels.
      Notation: <metric name>{<label name>=<label value>, …}
      Example: api_http_requests_total{method="POST", handler="/messages"}
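To make the data model above concrete, here is a small sketch of how a metric name plus a label set can act as the identity of a time series. The helper names (`series_id`, `render`, `record`) and the in-memory `samples` dictionary are hypothetical, and label-value escaping is ignored; this is not how Prometheus stores data internally.

```python
def series_id(metric_name, labels):
    """Identity of a time series: the metric name plus its full label set."""
    # A frozenset makes the identity independent of label order.
    return (metric_name, frozenset(labels.items()))


def render(metric_name, labels):
    """Render an identity in the <metric name>{<label name>=<label value>, ...} notation."""
    pairs = ", ".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{metric_name}{{{pairs}}}"


samples = {}  # series identity -> list of (timestamp, value)


def record(metric_name, labels, timestamp, value):
    """Append one sample to the series selected by name and labels."""
    samples.setdefault(series_id(metric_name, labels), []).append((timestamp, value))


# Same name and labels (in any order) land in the same series...
record("api_http_requests_total", {"method": "POST", "handler": "/messages"}, 1, 10.0)
record("api_http_requests_total", {"handler": "/messages", "method": "POST"}, 2, 12.0)
# ...while changing any label value creates a separate series.
record("api_http_requests_total", {"method": "GET", "handler": "/messages"}, 2, 3.0)
```

One consequence worth keeping in mind: every distinct combination of label values is its own series, so labels with many possible values multiply the number of series to store.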
  30. How to get metrics? From directly instrumented software, or from software that is not directly instrumented by way of an exporter
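As a sketch of what direct instrumentation amounts to, the hand-rolled counter below keeps a running total per label set and renders it in the Prometheus text exposition format, which a process would typically serve over HTTP (conventionally at `/metrics`) for Prometheus to scrape. The `Counter` class here is written for this example only; it is not the official client library, which real code would normally use instead.

```python
class Counter:
    """Minimal counter, assumed for illustration (no escaping, no thread safety)."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self._values = {}  # sorted tuple of (label, value) pairs -> running total

    def inc(self, amount=1.0, **labels):
        # Counters are monotonic: they may only stay flat or increase.
        if amount < 0:
            raise ValueError("counters only go up")
        key = tuple(sorted(labels.items()))
        self._values[key] = self._values.get(key, 0.0) + amount

    def expose(self):
        """Render all series of this counter in the text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in key)
            suffix = f"{{{label_str}}}" if label_str else ""
            lines.append(f"{self.name}{suffix} {value}")
        return "\n".join(lines) + "\n"


requests_total = Counter("api_http_requests_total", "Total HTTP requests handled.")
requests_total.inc(method="POST", handler="/messages")
requests_total.inc(method="POST", handler="/messages")
exposition = requests_total.expose()
```

An exporter plays the same role for software that cannot be changed: it reads the target's own statistics and renders them in this format on the target's behalf.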
  31. @phennex
  32. Directly instrumented software: cAdvisor, Doorman, Etcd, Kubernetes-Mesos, Kubernetes, RobustIRC, SkyDNS, Weave Flux
  33. Official Prometheus Exporters: Node/system metrics exporter, AWS CloudWatch exporter, Blackbox exporter, Collectd exporter, Consul exporter, Graphite exporter, HAProxy exporter, InfluxDB exporter, JMX exporter, Memcached exporter, Mesos task exporter, MySQL server exporter, SNMP exporter, StatsD exporter
  34. 3rd party exporters
      • Databases: Aerospike exporter, ClickHouse exporter, CouchDB exporter, MongoDB exporter, PgBouncer exporter, PostgreSQL exporter, ProxySQL exporter, Redis exporter, RethinkDB exporter, SQL query result set metrics exporter
  35. 3rd party exporters
      • Hardware related: apcupsd exporter, IoT Edison exporter, IPMI exporter, knxd exporter, Ubiquiti UniFi exporter
      • Messaging systems: NATS exporter, NSQ exporter, RabbitMQ exporter, RabbitMQ Management Plugin exporter, Mirth Connect exporter
  36. 3rd party exporters
      • Storage: Ceph exporter, ScaleIO exporter
      • HTTP: Apache exporter, Nginx metric library, Passenger exporter, Varnish exporter, WebDriver exporter
      • APIs: Docker Hub exporter, GitHub exporter, OpenWeatherMap exporter, Rancher exporter, … exporter
      • Logging: Google's mtail log data extractor, Grok exporter
      • Other monitoring systems: Cloud Foundry Firehose exporter, scollector exporter, Heka dashboard exporter, Heka exporter, Munin exporter, New Relic exporter
      • Miscellaneous: BIG-IP exporter, BIND exporter, BOSH exporter, Jenkins exporter, Meteor JS web framework exporter, Minecraft exporter module, PowerDNS exporter, rTorrent exporter, SMTP/Maildir MDA blackbox prober, Xen exporter
  37. PromQL
      • Non-SQL query language
      • Better for metrics computation
      • Only does reads
  38. PromQL - Operators
      • Arithmetic: + (addition), - (subtraction), * (multiplication), / (division), % (modulo), ^ (exponentiation)
      • Comparison: == (equal), != (not-equal), > (greater-than), < (less-than), >= (greater-or-equal), <= (less-or-equal)
      • Logical/set: and (intersection), or (union), unless (complement)
      • … and vector matching
  39. PromQL - Aggregation Operators: sum, min, max, avg, stddev, stdvar, count, count_values, bottomk, topk, quantile
  40. PromQL - Examples
      • rate(api_http_requests_total[5m])
      • errors{job="foo"} / total{job="foo"}
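The two PromQL examples above can be approximated in plain Python to show what they compute. The `simple_rate` and `error_ratio` helpers below are simplified stand-ins written for this example: the real `rate()` also extrapolates to the edges of the window, which this sketch does not attempt, and the sample data is made up.

```python
def simple_rate(samples, window_s, now):
    """Per-second increase of a counter over the last window_s seconds.

    samples is a list of (timestamp, value) pairs sorted by timestamp.
    Returns None when fewer than two samples fall inside the window.
    """
    in_window = [(t, v) for t, v in samples if now - window_s <= t <= now]
    if len(in_window) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(in_window, in_window[1:]):
        # A drop means the counter was reset (e.g. a restart): count from zero.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = in_window[-1][0] - in_window[0][0]
    return increase / elapsed if elapsed > 0 else None


def error_ratio(error_samples, total_samples, window_s, now):
    """Rough analogue of dividing an error rate by a total rate."""
    errors = simple_rate(error_samples, window_s, now)
    total = simple_rate(total_samples, window_s, now)
    if errors is None or not total:
        return None
    return errors / total


# Made-up counter samples, one scrape per minute over five minutes.
total = [(0, 0.0), (60, 30.0), (120, 60.0), (180, 90.0), (240, 120.0), (300, 150.0)]
errors = [(0, 0.0), (300, 15.0)]
requests_per_second = simple_rate(total, window_s=300, now=300)
failed_fraction = error_ratio(errors, total, window_s=300, now=300)
```

Working with rates rather than raw counter values is what makes the result robust to restarts: the absolute value of a counter is meaningless after a reset, but its rate of increase is still comparable.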
  41. DEMO Part 2
  42. Alerting
  43. Symptom-based alerting: be proactive
  44. Prevent alert fatigue
      • Use ticketing systems (avoid email spam)
      • Warnings are tasks, like new features
  45. Provide runbooks
      • Keep them concise
      • Explanation, hints, links
      • Dynamic - include recent observations
  46. Practice outages: “Firedrills”, “Gamedays” - repeat regularly
  47. Start being proactive. Don't be firefighters.
  48. … and remember …
  49. Hope is NOT a strategy. Source: Site Reliability Engineering, How Google Runs Production Systems (2016), B. Beyer et al.
  50. If you wanna know more…
      • The Site Reliability Engineering book
      • Podcasts: … (prefers push-based, the opposite of Prometheus)
  51. The 3rd project in CNCF
  52. Thank you! @phennex