Singapore Meetup, 2017-11-23
Speaker: Arseny Chernov
So The Story Goes Like…
Full story: https://www.slideshare.net/grobie/the-history-of-prometheus-at-soundcloud
• 2012 - Joined SoundCloud
• Left Google in 2012 after 5+ years
• Side project: an open-source monitoring system for Not Only IT (econometrics, biochemical, etc.)
• Started LevelDB-backed Prometheus
• Server, client_golang
• Protocol Buffers
• 2012 - Joined SoundCloud
• Left Google in 2012 after 2+ years
• Configuration, query language
• 2013 - Joined SoundCloud
• Left Google in 2013 after 7+ years
• Storage rewrite (LevelDB to Chunks): March 2014
• Public release: January 2015
• Joined Cloud Native Computing Foundation (CNCF): May 2016
• Prometheus 2.0 announced: November 08, 2017
• Singapore Meetup: 23 November, 2017
Motivation Behind - Google SRE Best Practices
Read book: https://landing.google.com/sre/book.html
• SRE: Have software engineers do operations
• Do the same work as an operations team, but with automation instead of manual labour
• 50% cap on the amount of “ops” work
Google SLI, SLO, SLA
Full story: https://cloudplatform.googleblog.com/2017/01/availability-part-deux--CRE-life-lessons.html
Service Level Indicators (SLIs)
• A carefully defined quantitative measure of some aspect of the level of service that is provided
• Examples: request latency, error rate (often expressed as a percentage of all requests received), system throughput
Service Level Objectives (SLOs)
• Lower bound ≤ SLI ≤ upper bound
• Define the lowest acceptable level of reliability, and state that as your Service Level Objective (SLO).
Service Level Agreements (SLAs)
• An SLA is a looser objective than the SLO. Alternatively, the SLA might specify only a subset of the SLO metrics.
• E.g. an availability SLA of 99.9% over 1 month, with an internal availability SLO of 99.95%
• A promise to someone using a service that its availability should meet a certain level over a certain period; if it fails to do so, some kind of penalty is paid (a partial refund of the subscription fee paid by customers for that period, or subscription time added for free)
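To make the relationship between the three terms concrete, here is a minimal Python sketch. It computes an availability SLI from request counts and compares it with the 99.95% internal SLO and the 99.9% SLA quoted above. The request counts and the function name are invented for illustration only.

```python
def availability_sli(total_requests: int, failed_requests: int) -> float:
    """Availability SLI: the fraction of requests served successfully."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return (total_requests - failed_requests) / total_requests


SLO = 0.9995  # internal objective (from the slide)
SLA = 0.999   # external promise (from the slide)

# Hypothetical month: 10 million requests, 7,000 of them failed.
sli = availability_sli(10_000_000, 7_000)

slo_met = sli >= SLO  # stricter internal target
sla_met = sli >= SLA  # looser target promised to customers
print(f"SLI={sli:.4%} slo_met={slo_met} sla_met={sla_met}")
```

In this made-up month the service misses its internal SLO but still honours the SLA, which is the safety margin the looser SLA is meant to provide.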
Four Golden Signals
Latency
• The time it takes to service a request.
• Distinguish the latency of successful requests from that of failed requests.
• A slow error is even worse than a fast error, so track error latency too.
Traffic
• A measure of how much demand is being placed on your system
• Usually HTTP requests per second (static vs. dynamic content)
• Streaming system: network I/O rate or concurrent sessions
• Key-value storage system: transactions per second (TPS)
Errors
• The rate of requests that fail (e.g. HTTP 500s, or HTTP 200s coupled with wrong content)
Saturation
• How "full" your service is: CPU, memory, I/O
• Can your service properly handle double the traffic, handle only 10% more traffic, or handle even less traffic than it currently receives?
• Saturation is also concerned with predictions of impending saturation, such as "It looks like your database will fill its hard drive in 4 hours."
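A forecast such as "your database will fill its hard drive in 4 hours" can be approximated with a straight-line fit over recent free-space readings. The sketch below is one simple way to do it in plain Python, similar in spirit to PromQL's predict_linear(); the function name and the readings are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def seconds_until_full(samples: Sequence[Tuple[float, float]]) -> Optional[float]:
    """Estimate seconds from the last sample until free space reaches zero.

    samples: (unix_timestamp_seconds, free_bytes) pairs, oldest first.
    Returns None if there are too few samples or free space is not shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Least-squares slope in bytes per second.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope >= 0:
        return None  # free space is flat or growing
    intercept = mean_v - slope * mean_t
    t_zero = -intercept / slope  # time at which the fitted line hits zero
    return t_zero - samples[-1][0]


# Hypothetical readings: 1 GB less free space every 10 minutes, 24 GB left.
GB = 10**9
readings = [(i * 600.0, (30 - i) * GB) for i in range(7)]
eta = seconds_until_full(readings)
print(f"disk full in about {eta / 3600:.1f} hours")
```

A linear fit is a deliberate simplification: it works for steady growth and says nothing useful about bursty or periodic usage.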
Error Budget = 100% - SLO
Full story: https://cloudplatform.googleblog.com/2017/01/availability-part-deux--CRE-life-lessons.html
Move fast without breaking SLO
• 100% is the wrong reliability target
• Error Budgets balance the goals of:
• Product development teams (KPI is feature velocity, incentive to push code often)
• SRE teams (KPI is reliability of a service, incentive to push back against change)
• Error budget can be spent on anything: launching features, etc.
• Error budget prompts discussion of phased rollouts and 1% experiments
The goal of an SRE team isn’t “zero outages”
• SRE and product teams share an incentive to spend the error budget and get maximum feature velocity
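The formula Error Budget = 100% - SLO translates directly into minutes of allowed downtime. The following Python sketch works that arithmetic for a 30-day window; the function names and the 30-day default are assumptions made for the example, not part of the talk.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    if not 0.0 <= slo <= 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    budget_fraction = 1.0 - slo  # Error Budget = 100% - SLO
    return budget_fraction * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, window_days)
    if budget == 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return 1.0 - downtime_minutes / budget


for slo in (0.999, 0.9995):
    minutes = error_budget_minutes(slo)
    print(f"SLO {slo:.2%}: {minutes:.1f} minutes of budget per 30 days")
```

A 99.9% SLO leaves roughly 43 minutes per 30 days and a 99.95% SLO roughly 22 minutes, which is the budget that launches and experiments can spend.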
Googlers use Borgmon (a.k.a. Borgmon rules)
Full story: https://landing.google.com/sre/book/chapters/practical-alerting.html
% curl http://webserver:80/varz
http_requests 37
errors_total 12
Each of the major languages used at Google has an implementation of the exported variable interface that automagically registers with the HTTP server built into every Google binary by default. This is called “collection via /varz”.
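The idea of exporting variables over HTTP can be imitated with the Python standard library. This is only a sketch of the output shape shown in the curl example, not Google's implementation; EXPORTED_VARS, render_varz, VarzHandler and serve are hypothetical names.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process registry of exported variables.
EXPORTED_VARS = {"http_requests": 37, "errors_total": 12}


def render_varz(variables: dict) -> str:
    """Render variables as one 'name value' pair per line."""
    return "".join(f"{name} {value}\n" for name, value in variables.items())


class VarzHandler(BaseHTTPRequestHandler):
    """Serves the current variable values on GET /varz."""

    def do_GET(self):
        if self.path != "/varz":
            self.send_error(404)
            return
        body = render_varz(EXPORTED_VARS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Blocks forever; call it from the program's main function."""
    HTTPServer(("", port), VarzHandler).serve_forever()
```

Calling serve() and then fetching /varz on that port should return the two lines from the registry, in the same form as the curl output above.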
…traditional monitoring in kube era
Full story: https://www.slideshare.net/FabianReinartz/monitoring-a-kubernetes-backed-microservice-architecture-with-prometheus
A lot of traffic to monitor
Way more targets to monitor
…and they constantly change
Need a fleet-wide view (e.g. what’s my overall 99th percentile latency?)
Still need to be able to drill down for troubleshooting
Prometheus Relies on Exporters
Full story: https://www.slideshare.net/FabianReinartz/monitoring-a-kubernetes-backed-microservice-architecture-with-prometheus
Exporters: the endpoint that the Prometheus server polls, and that answers its GET requests, is typically called an exporter; e.g. the host-level metrics exporter is node_exporter.
https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exporters.md
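On the pulling side, the server receives a plain-text body from each exporter and turns it into samples. The sketch below parses a simplified subset of the Prometheus text format with the Python standard library. It assumes label values contain no escaped quotes and ignores timestamps; the sample body and the function name are made up for illustration.

```python
import re
from typing import Dict, List, Tuple

Sample = Tuple[str, Dict[str, str], float]

# Metric name, optional {labels}, then the value.
_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_exposition(text: str) -> List[Sample]:
    """Parse a simplified subset of the Prometheus text format.

    Assumption: label values contain no escaped quotes.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE
            continue
        match = _LINE.match(line)
        if match is None:
            continue  # a real scraper would report the malformed line
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical response body from an exporter's /metrics endpoint.
body = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="get",status="200"} 1027
http_requests_total{method="post",status="500"} 3
process_open_fds 12
"""
for sample in parse_exposition(body):
    print(sample)
```

Each parsed sample is a (name, labels, value) triple, which mirrors how Prometheus identifies a time series by its metric name plus label set.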
Prometheus Architecture
Full story: https://jaxenter.com/prometheus-monitoring-pros-cons-136019.html
PromQL:
The 3 path-method combinations with the highest number of failing requests?
topk(3,
  sum by(path, method) (
    rate(http_requests_total{status=~"5.."}[5m])
  )
)
The 99th percentile request latency by request path?
histogram_quantile(0.99, sum by(le, path) (
  rate(http_requests_duration_seconds_bucket[5m])
))
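The two queries lean on rate() and histogram_quantile(). To show roughly what those functions compute, here is a simplified Python sketch; it is not Prometheus's implementation. It skips the extrapolation that rate() performs at the window edges and assumes observations are spread evenly within each bucket. The function names and sample data are hypothetical.

```python
import math
from typing import List, Tuple


def counter_rate(samples: List[Tuple[float, float]]) -> float:
    """Per-second increase of a counter over the sampled window.

    samples: (timestamp_seconds, counter_value) pairs, oldest first.
    A drop in value is treated as a counter reset to zero.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


def bucket_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (upper bound "le", cumulative count) pairs sorted by bound,
    ending with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                return lower_bound  # cap at the highest finite bound
            # Linear interpolation inside the bucket that holds the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical counter with a reset between t=120 and t=180.
counter = [(0, 100), (60, 160), (120, 220), (180, 10), (240, 70), (300, 130)]
# Hypothetical latency buckets in seconds with cumulative counts.
latency = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]

print(counter_rate(counter))
print(bucket_quantile(0.95, latency))
```

The quantile is only as precise as the bucket layout: an estimate can never be finer than the width of the bucket it falls into.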
Prometheus Storage Architecture
• A monitoring system must be more reliable than the systems it is monitoring
• Prometheus's local storage is not meant as durable long-term storage.
• Chunks of data are in RAM, with WAL on disk
needed_disk_space =
retention_time_seconds *
ingested_samples_per_second *
bytes_per_sample [1…2 bytes]
• Possible LVM solution if _really_ desperate
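The disk-space formula above is easy to evaluate for a planned setup. The Python sketch below does so for an invented workload of 15 days of retention at 100,000 ingested samples per second, using the slide's range of 1 to 2 bytes per sample; the workload numbers are assumptions, not figures from the talk.

```python
def needed_disk_space_bytes(retention_time_seconds: float,
                            ingested_samples_per_second: float,
                            bytes_per_sample: float) -> float:
    """Direct transcription of the sizing formula from the slide."""
    return (retention_time_seconds
            * ingested_samples_per_second
            * bytes_per_sample)


# Hypothetical workload: 15 days retention, 100,000 samples per second.
retention = 15 * 24 * 60 * 60  # seconds
low = needed_disk_space_bytes(retention, 100_000, 1.0)
high = needed_disk_space_bytes(retention, 100_000, 2.0)
print(f"plan for {low / 1e9:.1f} GB to {high / 1e9:.1f} GB of disk")
```

Sizing for the upper end of the range, plus headroom, is the safer choice because the actual bytes per sample depend on how well the data compresses.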
At the time of writing (Nov. 2017), it is possible to integrate via adapters with:
Chronix, Cortex, CrateDB, Graphite, InfluxDB, OpenTSDB, PostgreSQL/TimescaleDB, SignalFx, ClickHouse, etc.
This is primarily intended for long-term storage. It is recommended that you carefully evaluate any solution in this space to confirm it can handle your data volumes.
Full story: https://prometheus.io/docs/operating/integrations/#remote-endpoints-and-storage
What Prometheus Is Not & Best Practice
• Not 100% accurate
• No logs, only metrics
• Not a durable long-term storage
• Not an anomaly detection system
• Not a dashboarding solution
Full story: https://prometheus.io/docs/introduction/overview/#when-does-it-not-fit
Run one Prometheus server (or HA pair) in each failure domain / zone / cluster, monitoring jobs only in that zone.
Have a set of global Prometheus servers that monitor (federate from) the per-cluster ones.
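In this layout, a global server pulls selected series from each per-cluster server through its /federate endpoint, passing one match[] parameter per series selector. The Python sketch below only builds those URLs to show the shape of the request. In practice this is set up in the global server's scrape configuration rather than in code, and the host names and selectors here are hypothetical.

```python
from typing import Iterable
from urllib.parse import urlencode


def federate_url(base_url: str, selectors: Iterable[str]) -> str:
    """Build a /federate URL with one match[] parameter per selector."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical per-cluster servers and series selectors.
cluster_servers = ["http://prom-zone-a:9090", "http://prom-zone-b:9090"]
selectors = ['{job="prometheus"}', '{__name__=~"job:.*"}']

for server in cluster_servers:
    print(federate_url(server, selectors))
```

Federating only aggregated series, such as the job-level selector above, keeps the global servers small while the per-cluster servers retain the detail needed for drill-down.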

Introduction to Prometheus Monitoring (Singapore Meetup)

  • 2. So The Story Goes Like… Full story: https://www.slideshare.net/grobie/the-history-of-prometheus-at-soundcloud • 2012 - Joined SoundCloud • Left Google in 2012 after 5+ years • Side-project for open-source monitoring system for Not Only IT (econometrics, biochemical etc.) • Started LevelDB-backed Prometheus • Server, client_golang • Protocol Buffers • 2012 - Joined SoundCloud • Left Google in 2012 after 2+ years • Configuration, query language • 2013 - Joined SoundCloud • Left Google in 2013 after 7+ years • Storage rewrite (LevelDB to Chunks): March 2014 • Public release: January 2015 • Joined Cloud Native Computing Foundation (CNCF): May 2016 • Prometheus 2.0 announced: November 08, 2017 • Singapore Meetup: 23 November, 2017
  • 3. Motivation Behind - Google SRE Best Practices Read book: https://landing.google.com/sre/book.html • SRE: Have software engineers do operations • Do the same work as an operations team, but with automation instead of manual labour • 50% upper bound cap on the amount of “ops”
  • 4. Google SLI, SLO, SLA Full story: https://cloudplatform.googleblog.com/2017/01/availability-part-deux--CRE-life-lessons.html Service Level Indicators (SLIs) • A carefully defined quantitative measure of some aspect of the level of service that is provided • Request latency / error rate (often expressed as % of all requests received) / system throughput Service Level Objectives (SLOs) • Lower bound ≤ SLI ≤ upper bound • Define the lowest acceptable level of reliability, and state that as your Service Level Objective (SLO). Service Level Agreements (SLAs) • A promise to someone using a service that its availability should meet a certain level over a certain period, and if it fails to do so then some kind of penalty will be paid (partial refund of subscription fee paid by customers for that period, or subscription time added for free) • The SLA is a looser objective than the SLO. Alternatively the SLA might only specify a subset of SLO metrics. • E.g. an availability SLA of 99.9% over 1 month with an internal availability SLO of 99.95%
  • 6. Four Golden Signals Latency • The time it takes to service a request. • Successful vs. failed requests • A slow error is even worse than a fast error. Track error latency. Traffic • A measure of how much demand is being placed on your system • Usually HTTP requests per second (static vs. dynamic content) • Streaming system: network I/O rate or concurrent sessions • Key-value storage system: TPS Errors • The rate of requests that fail (e.g. HTTP 500s, or HTTP 200 coupled with wrong content) Saturation • How "full" your service is: CPU, memory, I/O • Can your service properly handle double the traffic, handle only 10% more traffic, or handle even less traffic than it currently receives? • Saturation is also concerned with predictions of impending saturation, such as "It looks like your database will fill its hard drive in 4 hours."
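The saturation prediction quoted on this slide ("your database will fill its hard drive in 4 hours") can be approximated with a simple linear extrapolation. The sketch below is illustrative only: the function name, the two-sample input, and the disk figures are assumptions, not anything taken from the talk.

```python
def seconds_until_full(capacity_bytes, used_then, used_now, elapsed_seconds):
    """Linearly extrapolate disk usage from two samples.

    Returns the estimated seconds until the disk is full,
    or None if usage is flat or shrinking.
    """
    growth_per_second = (used_now - used_then) / elapsed_seconds
    if growth_per_second <= 0:
        return None
    return (capacity_bytes - used_now) / growth_per_second


GB = 10**9
# Hypothetical numbers: 800 GB disk, usage grew from 600 GB to 650 GB in one hour.
remaining_seconds = seconds_until_full(800 * GB, 600 * GB, 650 * GB, 3600)
hours_left = remaining_seconds / 3600
print(f"Disk full in about {hours_left:.1f} hours")
```

A straight line through two samples is the crudest possible forecast; it is shown only to make the idea of "impending saturation" concrete.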
  • 7. Error Budget = 100% - SLO Full story: https://cloudplatform.googleblog.com/2017/01/availability-part-deux--CRE-life-lessons.html Move fast without breaking SLO • 100% is the wrong reliability target • Error budgets balance the goals of: • Product development teams (KPI is feature velocity, incentive to push code often) • SRE teams (KPI is reliability of a service, incentive to push back against change) • Error budget can be spent on anything: launching features, etc. • Error budget prompts discussion of phased rollouts and 1% experiments Goal of SRE team isn’t “zero outages” • SRE and product teams are incentive-aligned to spend the error budget and get maximum feature velocity
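As a worked example of "Error Budget = 100% - SLO", the sketch below converts an availability target into allowed downtime. The 30-day window and the function name are assumptions chosen for illustration; the 99.95% and 99.9% figures are the SLO and SLA examples from slide 4.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Allowed downtime in minutes for an availability target over a window."""
    window_minutes = window_days * 24 * 60
    budget_fraction = (100.0 - slo_percent) / 100.0  # Error Budget = 100% - SLO
    return budget_fraction * window_minutes


internal_slo_budget = error_budget_minutes(99.95)  # stricter internal objective
external_sla_budget = error_budget_minutes(99.9)   # looser customer-facing promise

print(f"SLO 99.95%: about {internal_slo_budget:.1f} minutes per 30 days")
print(f"SLA 99.9%:  about {external_sla_budget:.1f} minutes per 30 days")
```

The gap between the two budgets is the safety margin the team has before a missed internal objective becomes a breached agreement.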
  • 8. Googlers use Borgmon (a.k.a. Borgmon rules) Full story: https://landing.google.com/sre/book/chapters/practical-alerting.html % curl http://webserver:80/varz http_requests 37 errors_total 12 Each of the major languages used at Google has an implementation of the exported variable interface that automagically registers with the HTTP server built into every Google binary by default. It’s called “Collection via /varz”. Time Series: Distributed:
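To show how little a collector needs to do with output like the /varz sample above, here is a minimal parsing sketch. It assumes one whitespace-separated "name value" pair per line, as in the two lines on the slide; the real /varz format is richer than that, and the function name is made up for this example.

```python
def parse_varz(body):
    """Parse 'name value' lines into a dict of floats (simplified format)."""
    variables = {}
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, value = line.split(None, 1)
        variables[name] = float(value)
    return variables


# The two variables shown on the slide.
sample = "http_requests 37\nerrors_total 12\n"
varz = parse_varz(sample)
error_ratio = varz["errors_total"] / varz["http_requests"]
print(f"error ratio: {error_ratio:.3f}")
```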
  • 9. …traditional monitoring in kube era Full story: https://www.slideshare.net/FabianReinartz/monitoring-a-kubernetes-backed-microservice-architecture-with-prometheus • A lot of traffic to monitor • Way more targets to monitor • …and they constantly change • Need a fleet-wide view (e.g. what’s my overall 99th percentile latency?) • Still need to be able to drill down for troubleshooting
  • 10. Prometheus Relies on Exporters Full story: https://www.slideshare.net/FabianReinartz/monitoring-a-kubernetes-backed-microservice-architecture-with-prometheus Exporters: The endpoint that is polled by the Prometheus server and answers its GET requests is typically called an exporter, e.g. the host-level metrics exporter is node-exporter. https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exporters.md
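The exporter idea can be sketched with nothing but the Python standard library: an HTTP endpoint that answers GET /metrics with plain-text samples. This is a toy under stated assumptions, not how node-exporter is built; the metric names, the port, and the helper names are hypothetical, and a real exporter would normally use an official client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counters; a real exporter would read these
# from the system it wraps.
METRICS = {"demo_requests_total": 37, "demo_errors_total": 12}


def render_metrics(metrics):
    """Render name/value pairs in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=8000):
    """Build (but do not start) the HTTP server for the exporter."""
    return HTTPServer(("127.0.0.1", port), MetricsHandler)
```

Calling make_server().serve_forever() would start the endpoint, after which a Prometheus server could be pointed at it as a scrape target.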
  • 11. Prometheus Architecture Full story: https://jaxenter.com/prometheus-monitoring-pros-cons-136019.html The 3 path-method combinations with the highest number of failing requests? topk(3, sum by(path, method) ( rate(http_requests_total{status=~"5.."}[5m])) ) The 99th percentile request latency by request path? histogram_quantile(0.99, sum by(le, path) ( rate(http_requests_duration_seconds_bucket[5m]) )) PromQL:
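To make the second query less of a black box, the sketch below approximates what histogram_quantile does with cumulative "le" buckets: find the bucket containing the requested rank and interpolate linearly inside it. It is a simplified illustration rather than the Prometheus implementation, and the bucket bounds and counts are invented sample data.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    with the last upper_bound being +Inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for i, (upper_bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[i - 1][0] if i > 0 else math.nan
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Linear interpolation within the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Invented data: (le, cumulative request count) for one path.
sample_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
p50 = histogram_quantile(0.5, sample_buckets)
p99 = histogram_quantile(0.99, sample_buckets)
print(f"p50 ~ {p50:.3f}s, p99 ~ {p99:.3f}s")
```

Because the result is interpolated, its accuracy depends on how the bucket boundaries are chosen, which is one reason the deck later lists "not 100% accurate" among the caveats.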
  • 12. Prometheus Storage Architecture • A monitoring system must be more reliable than the systems it is monitoring • Prometheus's local storage is not meant as durable long-term storage. • Chunks of data are in RAM, with WAL on disk • needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample [1…2 bytes] • Possible LVM solution if _really_ desperate At the time of writing (Nov. 2017) it is possible to integrate via adapters with: Chronix, Cortex, CrateDB, Graphite, InfluxDB, OpenTSDB, PostgreSQL/TimescaleDB, SignalFx, Clickhouse etc. This is primarily intended for long-term storage. It is recommended that you perform careful evaluation of any solution in this space to confirm it can handle your data volumes. Full story: https://prometheus.io/docs/operating/integrations/#remote-endpoints-and-storage
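The disk-sizing formula on this slide is easy to evaluate by hand, but a small sketch helps when comparing scenarios. The 15-day retention and 100,000 samples per second below are assumed inputs chosen for illustration; only the formula and the 1 to 2 bytes-per-sample range come from the slide.

```python
def needed_disk_space_bytes(retention_time_seconds,
                            ingested_samples_per_second,
                            bytes_per_sample):
    """needed_disk_space = retention * ingestion rate * bytes per sample."""
    return (retention_time_seconds
            * ingested_samples_per_second
            * bytes_per_sample)


retention = 15 * 24 * 60 * 60  # assumed: 15 days of retention, in seconds
samples_per_second = 100_000   # assumed ingestion rate

low = needed_disk_space_bytes(retention, samples_per_second, 1)
high = needed_disk_space_bytes(retention, samples_per_second, 2)
print(f"Estimated disk: {low / 1e9:.1f} to {high / 1e9:.1f} GB")
```

Treat the result as a planning estimate; the actual bytes per sample vary with how well the data compresses.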
  • 13. What Prometheus Is Not & Best Practice • Not 100% accurate • No logs, only metrics • Not durable long-term storage • Not an anomaly detection system • Not a dashboarding solution Full story: https://prometheus.io/docs/introduction/overview/#when-does-it-not-fit Run one Prometheus server (or HA pair) in each failure domain / zone / cluster, monitoring jobs only in that zone. Have a set of global Prometheus servers that monitor (federate from) the per-cluster ones.