Monitoring Microservices
with Prometheus
Tobias Schmidt - MicroCPH May 17, 2017
github.com/grobie @dagrobie tobidt@gmail.com
Monitoring
● Ability to observe and understand systems and their behavior.
○ Know when things go wrong
○ Understand and debug service misbehavior
○ Detect trends and act in advance
● Blackbox vs. Whitebox monitoring
○ Blackbox: Observes systems externally with periodic checks
○ Whitebox: Provides internally observed metrics
● Whitebox: Different levels of granularity
○ Logging
○ Tracing
○ Metrics
Monitoring
● Metrics monitoring system and time series database
○ Instrumentation (client libraries and exporters)
○ Metrics collection, processing and storage
○ Querying, alerting and dashboards
○ Analysis, trending, capacity planning
○ Focused on infrastructure, not business metrics
● Key features
○ Powerful query language for metrics with label dimensions
○ Stable and simple operation
○ Built for modern dynamic deploy environments
○ Easy setup
● What it’s not
○ Logging system
○ Designed for perfect answers
Prometheus
Instrumentation case study
Gusta: a simple like service
● Service to handle everything around liking a resource
○ List all likes on a resource
○ Create a like on a resource
○ Delete a like on a resource
● Implementation
○ Written in golang
○ Uses the gokit.io toolkit
Gusta overview
// Like represents all information of a single like.
type Like struct {
	ResourceID string    `json:"resourceID"`
	UserID     string    `json:"userID"`
	CreatedAt  time.Time `json:"createdAt"`
}
// Service describes all methods provided by the gusta service.
type Service interface {
	ListResourceLikes(resourceID string) ([]Like, error)
	LikeResource(resourceID, userID string) error
	UnlikeResource(resourceID, userID string) error
}
Gusta core
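The interface above can be satisfied by a very small map-backed implementation. The following is only a sketch under assumptions: `memoryService`, `ErrNotFound` and `ErrAlreadyExists` are hypothetical names, and the real gusta store is not shown in the slides.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// Like and Service repeat the definitions from the slide.
type Like struct {
	ResourceID string    `json:"resourceID"`
	UserID     string    `json:"userID"`
	CreatedAt  time.Time `json:"createdAt"`
}

type Service interface {
	ListResourceLikes(resourceID string) ([]Like, error)
	LikeResource(resourceID, userID string) error
	UnlikeResource(resourceID, userID string) error
}

// Hypothetical sentinel errors; the original error values are not shown.
var (
	ErrNotFound      = errors.New("not found")
	ErrAlreadyExists = errors.New("already exists")
)

// memoryService is an illustrative in-memory implementation of Service.
type memoryService struct {
	mu    sync.Mutex
	likes map[string]map[string]Like // resourceID -> userID -> Like
}

func newMemoryService() *memoryService {
	return &memoryService{likes: make(map[string]map[string]Like)}
}

func (s *memoryService) ListResourceLikes(resourceID string) ([]Like, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make([]Like, 0, len(s.likes[resourceID]))
	for _, l := range s.likes[resourceID] {
		out = append(out, l)
	}
	return out, nil
}

func (s *memoryService) LikeResource(resourceID, userID string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.likes[resourceID][userID]; ok {
		return ErrAlreadyExists
	}
	if s.likes[resourceID] == nil {
		s.likes[resourceID] = make(map[string]Like)
	}
	s.likes[resourceID][userID] = Like{ResourceID: resourceID, UserID: userID, CreatedAt: time.Now()}
	return nil
}

func (s *memoryService) UnlikeResource(resourceID, userID string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.likes[resourceID][userID]; !ok {
		return ErrNotFound
	}
	delete(s.likes[resourceID], userID)
	return nil
}

func main() {
	var s Service = newMemoryService()
	fmt.Println(s.LikeResource("r1", "u1"))   // first like succeeds
	fmt.Println(s.UnlikeResource("r1", "u2")) // unknown like is reported as an error
	likes, _ := s.ListResourceLikes("r1")
	fmt.Println(len(likes))
}
```

Keeping the implementation behind the interface is what allows middlewares (logging, instrumentation) to wrap it without touching the business logic.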
// main.go
var store gusta.Store
store = gusta.NewMemoryStore()
var s gusta.Service
s = gusta.NewService(store)
s = gusta.LoggingMiddleware(logger)(s)
var h http.Handler
h = gusta.MakeHTTPHandler(s, log.With(logger, "component", "HTTP"))
http.Handle("/", h)
if err := http.ListenAndServe(*httpAddr, nil); err != nil {
	logger.Log("exit error", err)
}
Gusta server
./gusta
ts=2017-05-16T19:39:34.938108068Z transport=HTTP addr=:8080
ts=2017-05-16T19:38:24.203071341Z method=LikeResource ResourceID=r1ee85512 UserID=ue86d7a01 took=10.466µs err=null
ts=2017-05-16T19:38:24.323002316Z method=ListResourceLikes ResourceID=r8669fd29 took=17.812µs err=null
ts=2017-05-16T19:38:24.343061775Z method=ListResourceLikes ResourceID=rd4ac47c6 took=30.986µs err=null
ts=2017-05-16T19:38:24.363022818Z method=LikeResource ResourceID=r1ee85512 UserID=u19597d1e took=10.757µs err=null
ts=2017-05-16T19:38:24.38303722Z method=ListResourceLikes ResourceID=rfc9a393a took=41.554µs err=null
ts=2017-05-16T19:38:24.40303802Z method=ListResourceLikes ResourceID=r8669fd29 took=28.115µs err=null
ts=2017-05-16T19:38:24.423045585Z method=ListResourceLikes ResourceID=r8669fd29 took=23.842µs err=null
ts=2017-05-16T19:38:20.843121594Z method=UnlikeResource ResourceID=r1ee85512 UserID=ub5e42f43 took=8.57µs err="not found"
ts=2017-05-16T19:38:20.863037026Z method=ListResourceLikes ResourceID=rfc9a393a took=27.839µs err=null
ts=2017-05-16T19:38:20.883081162Z method=ListResourceLikes ResourceID=r8669fd29 took=16.999µs err=null
Gusta server
Basic Instrumentation
Providing operational insight
● “Four golden signals” cover the essentials
○ Latency
○ Traffic
○ Errors
○ Saturation
● Similar concepts: RED and USE methods
○ RED (request-driven services): Rate, Errors, Duration
○ USE (resources): Utilization, Saturation, Errors
● Information about the service itself
● Interaction with dependencies (other services, databases, etc.)
What information should be provided?
● Direct instrumentation
○ Traffic, Latency, Errors, Saturation
○ Service specific metrics (and interaction with dependencies)
○ Prometheus client libraries provide packages to instrument HTTP requests out of the box
● Exporters
○ Utilization, Saturation
○ node_exporter provides CPU, memory and I/O utilization per host
○ wmi_exporter does the same for Windows
○ cAdvisor (Container advisor) provides similar metrics for each container
Where to get the information from?
// main.go
import "github.com/prometheus/client_golang/prometheus"
var registry = prometheus.NewRegistry()
registry.MustRegister(
	prometheus.NewGoCollector(),
	prometheus.NewProcessCollector(os.Getpid(), ""),
)
// Pass down registry when creating HTTP handlers.
h = gusta.MakeHTTPHandler(s, log.With(logger, "component", "HTTP"), registry)
Initializing Prometheus client library
var h http.Handler = listResourceLikesHandler
var method, path string = "GET", "/api/v1/likes/{id}"
requests := prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name:        "gusta_http_server_requests_total",
		Help:        "Total number of requests handled by the HTTP server.",
		ConstLabels: prometheus.Labels{"method": method, "path": path},
	},
	[]string{"code"},
)
registry.MustRegister(requests)
h = promhttp.InstrumentHandlerCounter(requests, h)
Counting HTTP requests
var h http.Handler = listResourceLikesHandler
var method, path string = "GET", "/api/v1/likes/{id}"
requestDuration := prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "gusta_http_server_request_duration_seconds",
		Help: "A histogram of latencies for requests.",
		// Bucket upper bounds in seconds, matching the le values in the exposed metrics.
		Buckets:     []float64{.00025, .0005, .001, .0025, .005, .01},
		ConstLabels: prometheus.Labels{"method": method, "path": path},
	},
	[]string{},
)
registry.MustRegister(requestDuration)
h = promhttp.InstrumentHandlerDuration(requestDuration, h)
Observing HTTP request latency
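A histogram records each observation in a bucket and exposes the buckets cumulatively, together with a running sum and count. The sketch below shows that bookkeeping with the standard library only. The `histogram` type and the bucket bounds are illustrative assumptions, not the client library's internals.

```go
package main

import (
	"fmt"
	"sort"
)

// histogram is a simplified model of a latency histogram.
type histogram struct {
	upperBounds []float64 // sorted bucket upper bounds, excluding +Inf
	counts      []uint64  // per-bucket counts; the last entry is the +Inf bucket
	sum         float64
	count       uint64
}

func newHistogram(bounds []float64) *histogram {
	b := append([]float64(nil), bounds...)
	sort.Float64s(b)
	return &histogram{upperBounds: b, counts: make([]uint64, len(b)+1)}
}

// observe records v in the first bucket whose upper bound is >= v.
func (h *histogram) observe(v float64) {
	i := sort.SearchFloat64s(h.upperBounds, v)
	h.counts[i]++
	h.sum += v
	h.count++
}

// cumulative returns the counts as they appear in the exposition format:
// every bucket includes the observations of all smaller buckets.
func (h *histogram) cumulative() []uint64 {
	out := make([]uint64, len(h.counts))
	var running uint64
	for i, c := range h.counts {
		running += c
		out[i] = running
	}
	return out
}

func main() {
	h := newHistogram([]float64{0.001, 0.01, 0.1}) // hypothetical bounds in seconds
	for _, v := range []float64{0.0005, 0.001, 0.05, 2.0} {
		h.observe(v)
	}
	fmt.Println(h.cumulative(), h.count, h.sum)
}
```

Cumulative buckets are what make it possible to drop or aggregate buckets later without invalidating the remaining ones.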
Exposing metrics
Observing the current state
● Prometheus is a pull based monitoring system
○ Instances expose their metrics on an HTTP endpoint
○ Prometheus uses service discovery or static target lists to collect the state periodically
● Centralized management
○ Prometheus decides how often to scrape instances
● Prometheus stores the data on local disk
○ In a big outage, you could run Prometheus on your laptop!
How to collect the metrics?
// main.go
// ...
http.Handle("/metrics", promhttp.HandlerFor(
	registry,
	promhttp.HandlerOpts{},
))
Exposing the metrics via HTTP
curl -s http://localhost:8080/metrics | grep requests
# HELP gusta_http_server_requests_total Total number of requests handled by the gusta HTTP server.
# TYPE gusta_http_server_requests_total counter
gusta_http_server_requests_total{code="200",method="DELETE",path="/api/v1/likes"} 3
gusta_http_server_requests_total{code="200",method="GET",path="/api/v1/likes/{id}"} 429
gusta_http_server_requests_total{code="200",method="POST",path="/api/v1/likes"} 51
gusta_http_server_requests_total{code="404",method="DELETE",path="/api/v1/likes"} 14
gusta_http_server_requests_total{code="409",method="POST",path="/api/v1/likes"} 3
Request metrics
curl -s http://localhost:8080/metrics | grep request_duration
# HELP gusta_http_server_request_duration_seconds A histogram of latencies for requests.
# TYPE gusta_http_server_request_duration_seconds histogram
...
gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.00025"} 414
gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.0005"} 423
gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.001"} 429
gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.0025"} 429
gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.005"} 429
gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.01"} 429
gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="+Inf"} 429
gusta_http_server_request_duration_seconds_sum{method="GET",path="/api/v1/likes/{id}"} 0.047897984
gusta_http_server_request_duration_seconds_count{method="GET",path="/api/v1/likes/{id}"} 429
...
Latency metrics
curl -s http://localhost:8080/metrics | grep process
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 892.78
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1024
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 23
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 9.3446144e+07
...
Out-of-the-box process metrics
Collecting metrics
Scraping all service instances
# Scrape all targets every 5 seconds by default.
global:
  scrape_interval: 5s
  evaluation_interval: 5s
scrape_configs:
  # Scrape the Prometheus server itself.
  - job_name: prometheus
    static_configs:
      - targets: [localhost:9090]
  # Scrape the Gusta service.
  - job_name: gusta
    static_configs:
      - targets: [localhost:8080]
Static configuration
scrape_configs:
  # Scrape the Gusta service using Consul.
  - job_name: consul
    consul_sd_configs:
      - server: localhost:8500
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        regex: .*,prod,.*
        action: keep
      - source_labels: [__meta_consul_service]
        target_label: job
Consul service discovery
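The two relabeling rules can be read as a small filter-and-rename function. The sketch below mimics them in plain Go, assuming the documented behavior that relabel regexes are anchored at both ends and that `__meta_consul_tags` holds the tags joined by `,` with a leading and trailing separator. `relabel` is a hypothetical helper, not Prometheus code.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// keepProd corresponds to `regex: .*,prod,.*` with implicit anchoring.
var keepProd = regexp.MustCompile(`^(?:.*,prod,.*)$`)

// relabel applies the two rules from the configuration above to one
// discovered target. It returns the resulting labels and false if the
// target is dropped by the keep rule.
func relabel(service string, tags []string) (map[string]string, bool) {
	// Tags are joined and surrounded by the separator, e.g. ",web,prod,".
	joined := "," + strings.Join(tags, ",") + ","
	if !keepProd.MatchString(joined) {
		return nil, false
	}
	// The default action (replace) copies the service name into "job".
	return map[string]string{"job": service}, true
}

func main() {
	fmt.Println(relabel("gusta", []string{"web", "prod"}))
	fmt.Println(relabel("gusta", []string{"preprod"}))
}
```

The surrounding separators are what let `,prod,` match a whole tag and not a substring such as `preprod`.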
Target overview
Simple Graph UI
Simple Graph UI
Dashboards
Human-readable metrics
Grafana example
Alerts
Actionable metrics
ALERT InstanceDown
  IF up == 0
  FOR 5m
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "Instance down for more than 5 minutes.",
    description = "{{ $labels.instance }} of job {{ $labels.job }} has been down for >= 5 minutes.",
  }
ALERT RunningOutOfFileDescriptors
  IF process_open_fds / process_max_fds * 100 > 95
  FOR 2m
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "Instance has many open file descriptors.",
    description = "{{ $labels.instance }} of job {{ $labels.job }} has {{ $value }}% open descriptors.",
  }
Alert examples
ALERT GustaHighErrorRate
  IF sum without(code, instance) (rate(gusta_http_server_requests_total{code=~"5.."}[1m]))
    / sum without(code, instance) (rate(gusta_http_server_requests_total[1m]))
    * 100 > 0.1
  FOR 2m
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "Gusta service endpoints have a high error rate.",
    description = "Gusta endpoint {{ $labels.method }} {{ $labels.path }} returns {{ $value }}% errors.",
  }
ALERT GustaHighLatency
  IF histogram_quantile(0.95, rate(gusta_http_server_request_duration_seconds_bucket[1m])) > 0.1
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "Gusta service endpoints have a high latency.",
    description = "Gusta endpoint {{ $labels.method }} {{ $labels.path }} has a 95th percentile latency of {{ $value }} seconds.",
  }
Alert examples
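The error-rate expression divides the per-second rate of 5xx responses by the per-second rate of all responses. The arithmetic can be sketched as follows. This is a simplification: the real `rate()` also extrapolates to the window boundaries, which is omitted here, and `sample`, `increase` and `rate` are hypothetical helpers operating on made-up sample values.

```go
package main

import "fmt"

// sample is one scraped counter value.
type sample struct {
	t float64 // timestamp in seconds
	v float64 // counter value
}

// increase sums the deltas between consecutive samples. A drop in value is
// treated as a counter reset, after which counting restarts from zero.
func increase(samples []sample) float64 {
	var total float64
	for i := 1; i < len(samples); i++ {
		d := samples[i].v - samples[i-1].v
		if d < 0 {
			d = samples[i].v
		}
		total += d
	}
	return total
}

// rate returns the average per-second increase over the sampled window.
func rate(samples []sample) float64 {
	if len(samples) < 2 {
		return 0
	}
	window := samples[len(samples)-1].t - samples[0].t
	if window <= 0 {
		return 0
	}
	return increase(samples) / window
}

func main() {
	// Made-up samples, scraped every 5 seconds.
	all := []sample{{0, 1000}, {5, 1050}, {10, 1100}}
	errs := []sample{{0, 2}, {5, 2}, {10, 3}}

	percent := rate(errs) / rate(all) * 100
	fmt.Printf("error rate: %.2f%% (alert threshold: 0.1%%)\n", percent)
}
```

Working with rates instead of raw counter values is what makes the expression robust against service restarts.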
ALERT FilesystemRunningFull
  IF predict_linear(node_filesystem_avail{mountpoint!="/var/lib/docker/aufs"}[6h], 24 * 60 * 60) < 0
  FOR 1h
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "Filesystem space is filling up.",
    description = "Filesystem on {{ $labels.device }} at {{ $labels.instance }} is predicted to run out of space within the next 24 hours.",
  }
Alert examples
Summary
● Monitoring is essential to run, understand and operate services.
● Prometheus
○ Client instrumentation
○ Scrape configuration
○ Querying
○ Dashboards
○ Alert rules
● Important Metrics
○ Four golden signals: Latency, Traffic, Errors, Saturation
● Best practices
Recap
● https://prometheus.io
● Talks, Articles, Videos https://www.reddit.com/r/PrometheusMonitoring/
● Our “StackOverflow” https://www.robustperception.io/blog/
● Ask the community https://prometheus.io/community/
● Google’s SRE book https://landing.google.com/sre/book/index.html
● USE method http://www.brendangregg.com/usemethod.html
● My philosophy on alerting https://goo.gl/UnvYhQ
Sources
Thank you
● High availability
○ Run two identical servers
● Scaling
○ Shard by datacenter / team / service ( / instance )
● Aggregation across Prometheus servers
○ Federation
● Retention time
○ Generic remote storage support available.
● Pull vs. Push
○ Doesn’t matter in practice. Advantages depend on use case.
● Security
○ The project focuses on being a monitoring system; securing it is left to the user.
FAQ

"Exploring the Essential Functions and Design Considerations of Spillways in ..."Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ...
 
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENT
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENTFUNCTIONAL AND NON FUNCTIONAL REQUIREMENT
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENT
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substation
 
SOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATIONSOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATION
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating System
 
『澳洲文凭』买麦考瑞大学毕业证书成绩单办理澳洲Macquarie文凭学位证书
『澳洲文凭』买麦考瑞大学毕业证书成绩单办理澳洲Macquarie文凭学位证书『澳洲文凭』买麦考瑞大学毕业证书成绩单办理澳洲Macquarie文凭学位证书
『澳洲文凭』买麦考瑞大学毕业证书成绩单办理澳洲Macquarie文凭学位证书
 
System Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingSystem Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event Scheduling
 
Levelling - Rise and fall - Height of instrument method
Levelling - Rise and fall - Height of instrument methodLevelling - Rise and fall - Height of instrument method
Levelling - Rise and fall - Height of instrument method
 
Module-1-(Building Acoustics) Noise Control (Unit-3). pdf
Module-1-(Building Acoustics) Noise Control (Unit-3). pdfModule-1-(Building Acoustics) Noise Control (Unit-3). pdf
Module-1-(Building Acoustics) Noise Control (Unit-3). pdf
 
Comprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdfComprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdf
 
signals in triangulation .. ...Surveying
signals in triangulation .. ...Surveyingsignals in triangulation .. ...Surveying
signals in triangulation .. ...Surveying
 
Designing pile caps according to ACI 318-19.pptx
Designing pile caps according to ACI 318-19.pptxDesigning pile caps according to ACI 318-19.pptx
Designing pile caps according to ACI 318-19.pptx
 
Cost estimation approach: FP to COCOMO scenario based question
Cost estimation approach: FP to COCOMO scenario based questionCost estimation approach: FP to COCOMO scenario based question
Cost estimation approach: FP to COCOMO scenario based question
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptx
 

Monitoring microservices with Prometheus

  • 1. Monitoring Microservices with Prometheus Tobias Schmidt - MicroCPH May 17, 2017 github.com/grobie @dagrobie tobidt@gmail.com
  • 3. ● Ability to observe and understand systems and their behavior. ○ Know when things go wrong ○ Understand and debug service misbehavior ○ Detect trends and act in advance ● Blackbox vs. Whitebox monitoring ○ Blackbox: Observes systems externally with periodic checks ○ Whitebox: Provides internally observed metrics ● Whitebox: Different levels of granularity ○ Logging ○ Tracing ○ Metrics Monitoring
  • 4. ● Metrics monitoring system and time series database ○ Instrumentation (client libraries and exporters) ○ Metrics collection, processing and storage ○ Querying, alerting and dashboards ○ Analysis, trending, capacity planning ○ Focused on infrastructure, not business metrics ● Key features ○ Powerful query language for metrics with label dimensions ○ Stable and simple operation ○ Built for modern dynamic deploy environments ○ Easy setup ● What it’s not ○ Logging system ○ Designed for perfect answers Prometheus
  • 5. Instrumentation case study Gusta: a simple like service
  • 6. ● Service to handle everything around liking a resource ○ List all likes on a resource ○ Create a like on a resource ○ Delete a like on a resource ● Implementation ○ Written in golang ○ Uses the gokit.io toolkit Gusta overview
  • 7. // Like represents all information of a single like. type Like struct { ResourceID string `json:"resourceID"` UserID string `json:"userID"` CreatedAt time.Time `json:"createdAt"` } // Service describes all methods provided by the gusta service. type Service interface { ListResourceLikes(resourceID string) ([]Like, error) LikeResource(resourceID, userID string) error UnlikeResource(resourceID, userID string) error } Gusta core
  • 8. // main.go var store gusta.Store store = gusta.NewMemoryStore() var s gusta.Service s = gusta.NewService(store) s = gusta.LoggingMiddleware(logger)(s) var h http.Handler h = gusta.MakeHTTPHandler(s, log.With(logger, "component", "HTTP")) http.Handle("/", h) if err := http.ListenAndServe(*httpAddr, nil); err != nil { logger.Log("exit error", err) } Gusta server
  • 9. ./gusta ts=2017-05-16T19:39:34.938108068Z transport=HTTP addr=:8080 ts=2017-05-16T19:38:24.203071341Z method=LikeResource ResourceID=r1ee85512 UserID=ue86d7a01 took=10.466µs err=null ts=2017-05-16T19:38:24.323002316Z method=ListResourceLikes ResourceID=r8669fd29 took=17.812µs err=null ts=2017-05-16T19:38:24.343061775Z method=ListResourceLikes ResourceID=rd4ac47c6 took=30.986µs err=null ts=2017-05-16T19:38:24.363022818Z method=LikeResource ResourceID=r1ee85512 UserID=u19597d1e took=10.757µs err=null ts=2017-05-16T19:38:24.38303722Z method=ListResourceLikes ResourceID=rfc9a393a took=41.554µs err=null ts=2017-05-16T19:38:24.40303802Z method=ListResourceLikes ResourceID=r8669fd29 took=28.115µs err=null ts=2017-05-16T19:38:24.423045585Z method=ListResourceLikes ResourceID=r8669fd29 took=23.842µs err=null ts=2017-05-16T19:38:20.843121594Z method=UnlikeResource ResourceID=r1ee85512 UserID=ub5e42f43 took=8.57µs err="not found" ts=2017-05-16T19:38:20.863037026Z method=ListResourceLikes ResourceID=rfc9a393a took=27.839µs err=null ts=2017-05-16T19:38:20.883081162Z method=ListResourceLikes ResourceID=r8669fd29 took=16.999µs err=null Gusta server
  • 11. ● “Four golden signals” cover the essentials ○ Latency ○ Traffic ○ Errors ○ Saturation ● Similar concepts: RED and USE methods ○ RED (for requests): Rate, Errors, Duration ○ USE (for resources): Utilization, Saturation, Errors ● Information about the service itself ● Interaction with dependencies (other services, databases, etc.) What information should be provided?
  • 12. ● Direct instrumentation ○ Traffic, Latency, Errors, Saturation ○ Service specific metrics (and interaction with dependencies) ○ Prometheus client libraries provide packages to instrument HTTP requests out of the box ● Exporters ○ Utilization, Saturation ○ node_exporter CPU, memory, IO utilization per host ○ wmi_exporter does the same for Windows ○ cAdvisor (Container advisor) provides similar metrics for each container Where to get the information from?
  • 13. // main.go import "github.com/prometheus/client_golang/prometheus" var registry = prometheus.NewRegistry() registry.MustRegister( prometheus.NewGoCollector(), prometheus.NewProcessCollector(os.Getpid(), ""), ) // Pass down registry when creating HTTP handlers. h = gusta.MakeHTTPHandler(s, log.With(logger, "component", "HTTP"), registry) Initializing Prometheus client library
  • 14. var h http.Handler = listResourceLikesHandler var method, path string = "GET", "/api/v1/likes/{id}" requests := prometheus.NewCounterVec( prometheus.CounterOpts{ Name: "gusta_http_server_requests_total", Help: "Total number of requests handled by the HTTP server.", ConstLabels: prometheus.Labels{"method": method, "path": path}, }, []string{"code"}, ) registry.MustRegister(requests) h = promhttp.InstrumentHandlerCounter(requests, h) Counting HTTP requests
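The counting middleware on this slide can be illustrated with the standard library alone. The following is a minimal sketch of the idea, not the `promhttp` implementation: the names `statusRecorder`, `codeCounter`, `instrument` and `demo` are hypothetical, and the counter is a plain map rather than a Prometheus `CounterVec`. It wraps the `http.ResponseWriter` to learn the status code and increments one counter per code.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// statusRecorder wraps a ResponseWriter and remembers the status code.
type statusRecorder struct {
	http.ResponseWriter
	code int
}

// WriteHeader records the code before passing it on.
func (r *statusRecorder) WriteHeader(code int) {
	r.code = code
	r.ResponseWriter.WriteHeader(code)
}

// codeCounter counts handled requests per status code (hypothetical helper).
type codeCounter struct {
	mu     sync.Mutex
	counts map[int]int
}

func newCodeCounter() *codeCounter {
	return &codeCounter{counts: map[int]int{}}
}

func (c *codeCounter) inc(code int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[code]++
}

func (c *codeCounter) get(code int) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[code]
}

// instrument returns a handler that counts every response of next by code.
func instrument(c *codeCounter, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Handlers that never call WriteHeader implicitly answer 200.
		rec := &statusRecorder{ResponseWriter: w, code: http.StatusOK}
		next.ServeHTTP(rec, r)
		c.inc(rec.code)
	})
}

// demo sends two successful requests and one failing request through the
// instrumented handler and returns the resulting counter.
func demo() *codeCounter {
	c := newCodeCounter()
	h := instrument(c, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Query().Get("missing") != "" {
			http.NotFound(w, r)
			return
		}
		fmt.Fprintln(w, "ok")
	}))
	for _, target := range []string{"/", "/", "/?missing=1"} {
		h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest("GET", target, nil))
	}
	return c
}

func main() {
	c := demo()
	fmt.Println("200:", c.get(200), "404:", c.get(404))
}
```

Keeping the status code as the only variable dimension mirrors the slide, where `method` and `path` are constant labels and `code` is the single variable label.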
  • 15. var h http.Handler = listResourceLikesHandler var method, path string = "GET", "/api/v1/likes/{id}" requestDuration := prometheus.NewHistogramVec( prometheus.HistogramOpts{ Name: "gusta_http_server_request_duration_seconds", Help: "A histogram of latencies for requests.", Buckets: []float64{.0025, .005, 0.01, 0.025, 0.05, 0.1}, ConstLabels: prometheus.Labels{"method": method, "path": path}, }, []string{}, ) registry.MustRegister(requestDuration) h = promhttp.InstrumentHandlerDuration(requestDuration, h) Observing HTTP request latency
  • 17. ● Prometheus is a pull-based monitoring system ○ Instances expose their metrics via an HTTP endpoint ○ Prometheus uses service discovery or static target lists to collect the state periodically ● Centralized management ○ Prometheus decides how often to scrape instances ● Prometheus stores the data on local disk ○ In a big outage, you could run Prometheus on your laptop! How to collect the metrics?
  • 18. // main.go // ... http.Handle("/metrics", promhttp.HandlerFor( registry, promhttp.HandlerOpts{}, )) Exposing the metrics via HTTP
  • 19. curl -s http://localhost:8080/metrics | grep requests # HELP gusta_http_server_requests_total Total number of requests handled by the gusta HTTP server. # TYPE gusta_http_server_requests_total counter gusta_http_server_requests_total{code="200",method="DELETE",path="/api/v1/likes"} 3 gusta_http_server_requests_total{code="200",method="GET",path="/api/v1/likes/{id}"} 429 gusta_http_server_requests_total{code="200",method="POST",path="/api/v1/likes"} 51 gusta_http_server_requests_total{code="404",method="DELETE",path="/api/v1/likes"} 14 gusta_http_server_requests_total{code="409",method="POST",path="/api/v1/likes"} 3 Request metrics
  • 20. curl -s http://localhost:8080/metrics | grep request_duration # HELP gusta_http_server_request_duration_seconds A histogram of latencies for requests. # TYPE gusta_http_server_request_duration_seconds histogram ... gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.00025"} 414 gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.0005"} 423 gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.001"} 429 gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.0025"} 429 gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.005"} 429 gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.01"} 429 gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="+Inf"} 429 gusta_http_server_request_duration_seconds_sum{method="GET",path="/api/v1/likes/{id}"} 0.047897984 gusta_http_server_request_duration_seconds_count{method="GET",path="/api/v1/likes/{id}"} 429 ... Latency metrics
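The cumulative `le` buckets above are what a quantile estimate is computed from. The sketch below shows one way to do that in plain Go: find the bucket that contains the target rank and interpolate linearly inside it, which is the approach `histogram_quantile` is generally described as taking. The type and function names are hypothetical, and the sample data assumes the first bucket shown on the slide (`le="0.00025"`) is the lowest one, which the elided `...` does not confirm.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// bucket is one cumulative bucket: count observations were <= upperBound.
type bucket struct {
	upperBound float64
	count      float64
}

// quantile estimates the q-quantile (0 <= q <= 1) from cumulative buckets.
// It sorts the slice in place and returns NaN if the buckets are unusable
// (fewer than two, no +Inf bucket, or no observations).
func quantile(q float64, buckets []bucket) float64 {
	sort.Slice(buckets, func(i, j int) bool {
		return buckets[i].upperBound < buckets[j].upperBound
	})
	n := len(buckets)
	if n < 2 || !math.IsInf(buckets[n-1].upperBound, 1) {
		return math.NaN()
	}
	total := buckets[n-1].count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	// First finite bucket whose cumulative count reaches the rank.
	b := sort.Search(n-1, func(i int) bool { return buckets[i].count >= rank })
	if b == n-1 {
		// The rank lies in the +Inf bucket; the best available answer is
		// the highest finite upper bound.
		return buckets[n-2].upperBound
	}
	start, below := 0.0, 0.0
	if b > 0 {
		start = buckets[b-1].upperBound
		below = buckets[b-1].count
	}
	inBucket := buckets[b].count - below
	if inBucket == 0 {
		return start
	}
	// Linear interpolation between the bucket's lower and upper bound.
	return start + (buckets[b].upperBound-start)*(rank-below)/inBucket
}

// slideBuckets returns the bucket counts printed on the slide.
func slideBuckets() []bucket {
	return []bucket{
		{0.00025, 414}, {0.0005, 423}, {0.001, 429}, {0.0025, 429},
		{0.005, 429}, {0.01, 429}, {math.Inf(1), 429},
	}
}

func main() {
	for _, q := range []float64{0.5, 0.95, 0.99} {
		fmt.Printf("q=%.2f estimate=%.7fs\n", q, quantile(q, slideBuckets()))
	}
}
```

Because the estimate is interpolated, its accuracy depends on how well the bucket boundaries bracket the latencies of interest; this is one reason Prometheus is described earlier as not designed for perfect answers.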
  • 21. curl -s http://localhost:8080/metrics | grep process # HELP process_cpu_seconds_total Total user and system CPU time spent in seconds. # TYPE process_cpu_seconds_total counter process_cpu_seconds_total 892.78 # HELP process_max_fds Maximum number of open file descriptors. # TYPE process_max_fds gauge process_max_fds 1024 # HELP process_open_fds Number of open file descriptors. # TYPE process_open_fds gauge process_open_fds 23 # HELP process_resident_memory_bytes Resident memory size in bytes. # TYPE process_resident_memory_bytes gauge process_resident_memory_bytes 9.3446144e+07 ... Out-of-the-box process metrics
  • 22. Collecting metrics Scraping all service instances
  • 23. # Scrape all targets every 5 seconds by default. global: scrape_interval: 5s evaluation_interval: 5s scrape_configs: # Scrape the Prometheus server itself. - job_name: prometheus static_configs: - targets: [localhost:9090] # Scrape the Gusta service. - job_name: gusta static_configs: - targets: [localhost:8080] Static configuration
  • 24. scrape_configs: # Scrape the Gusta service using Consul. - job_name: consul consul_sd_configs: - server: localhost:8500 relabel_configs: - source_labels: [__meta_consul_tags] regex: .*,prod,.* action: keep - source_labels: [__meta_consul_service] target_label: job Consul service discovery
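The `keep` rule above depends on two details that are easy to miss: relabeling regexes are matched against the whole label value, and `__meta_consul_tags` is, to my understanding, the tag list joined by commas with a separator at both ends. A small Go sketch of that matching, using hypothetical helper names and treating both details as assumptions:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// consulTagsLabel builds the label value as assumed here: tags joined by
// commas, with a leading and a trailing comma.
func consulTagsLabel(tags []string) string {
	return "," + strings.Join(tags, ",") + ","
}

// keep reports whether a target with the given source label value would
// survive a `keep` rule. The pattern is anchored at both ends.
func keep(pattern, value string) bool {
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	return re.MatchString(value)
}

func main() {
	pattern := ".*,prod,.*"
	for _, tags := range [][]string{
		{"web", "prod"},
		{"production"},
		{"staging"},
	} {
		fmt.Println(tags, keep(pattern, consulTagsLabel(tags)))
	}
}
```

The surrounding commas are what let `,prod,` match the tag `prod` exactly without also matching `production`.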
  • 31. ALERT InstanceDown IF up == 0 FOR 2m LABELS { severity = "warning" } ANNOTATIONS { summary = "Instance down for more than 2 minutes.", description = "{{ $labels.instance }} of job {{ $labels.job }} has been down for >= 2 minutes.", } ALERT RunningOutOfFileDescriptors IF process_open_fds / process_max_fds * 100 > 95 FOR 2m LABELS { severity = "warning" } ANNOTATIONS { summary = "Instance has many open file descriptors.", description = "{{ $labels.instance }} of job {{ $labels.job }} has {{ $value }}% open descriptors.", } Alert examples
  • 32. ALERT GustaHighErrorRate IF sum without(code, instance) (rate(gusta_http_server_requests_total{code=~"5.."}[1m])) / sum without(code, instance) (rate(gusta_http_server_requests_total[1m])) * 100 > 0.1 FOR 2m LABELS { severity = "critical" } ANNOTATIONS { summary = "Gusta service endpoints have a high error rate.", description = "Gusta endpoint {{ $labels.method }} {{ $labels.path }} returns {{ $value }}% errors.", } ALERT GustaHighLatency IF histogram_quantile(0.95, rate(gusta_http_server_request_duration_seconds_bucket[1m])) > 0.1 LABELS { severity = "critical" } ANNOTATIONS { summary = "Gusta service endpoints have a high latency.", description = "Gusta endpoint {{ $labels.method }} {{ $labels.path }} has a 95th percentile latency of {{ $value }} seconds.", } Alert examples
  • 33. ALERT FilesystemRunningFull IF predict_linear(node_filesystem_avail{mountpoint!="/var/lib/docker/aufs"}[6h], 24 * 60 * 60) < 0 FOR 1h LABELS { severity = "warning" } ANNOTATIONS { summary = "Filesystem space is filling up.", description = "Filesystem on {{ $labels.device }} at {{ $labels.instance }} is predicted to run out of space within the next 24 hours.", } Alert examples
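The `predict_linear` call above extrapolates a trend: it fits a straight line through the samples in the 6h window and evaluates that line 24 hours ahead. A minimal Go sketch of that idea using ordinary least squares follows; the names `sample` and `predictLinear` and the sample values are hypothetical, and time is expressed in seconds relative to the evaluation time.

```go
package main

import "fmt"

// sample is one observation; t is seconds relative to "now" (so t <= 0).
type sample struct {
	t float64
	v float64
}

// predictLinear fits v = intercept + slope*t by ordinary least squares and
// returns the fitted value at t = ahead seconds after "now". The second
// result is false if no line can be fitted.
func predictLinear(samples []sample, ahead float64) (float64, bool) {
	if len(samples) < 2 {
		return 0, false
	}
	n := float64(len(samples))
	var sumT, sumV, sumTV, sumTT float64
	for _, s := range samples {
		sumT += s.t
		sumV += s.v
		sumTV += s.t * s.v
		sumTT += s.t * s.t
	}
	denom := n*sumTT - sumT*sumT
	if denom == 0 {
		// All samples share one timestamp; the slope is undefined.
		return 0, false
	}
	slope := (n*sumTV - sumT*sumV) / denom
	intercept := (sumV - slope*sumT) / n
	return intercept + slope*ahead, true
}

func main() {
	// Hypothetical free-space readings shrinking by 2 units per second.
	window := []sample{{-300, 1600}, {-200, 1400}, {-100, 1200}, {0, 1000}}
	predicted, ok := predictLinear(window, 600)
	fmt.Println(predicted, ok)
	if ok && predicted < 0 {
		fmt.Println("filesystem predicted to run full: alert condition holds")
	}
}
```

A negative prediction means the fitted line crosses zero before the horizon, which is the condition the alert expression tests with `< 0`; the `FOR 1h` clause then filters out short-lived dips in the trend.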
  • 35. ● Monitoring is essential to run, understand and operate services. ● Prometheus ○ Client instrumentation ○ Scrape configuration ○ Querying ○ Dashboards ○ Alert rules ● Important Metrics ○ Four golden signals: Latency, Traffic, Errors, Saturation ● Best practices Recap
  • 36. ● https://prometheus.io ● Talks, Articles, Videos https://www.reddit.com/r/PrometheusMonitoring/ ● Our “StackOverflow” https://www.robustperception.io/blog/ ● Ask the community https://prometheus.io/community/ ● Google’s SRE book https://landing.google.com/sre/book/index.html ● USE method http://www.brendangregg.com/usemethod.html ● My philosophy on alerting https://goo.gl/UnvYhQ Sources
  • 37. Thank you Tobias Schmidt - MicroCPH May 17, 2017 github.com/grobie - @dagrobie
  • 38. ● High availability ○ Run two identical servers ● Scaling ○ Shard by datacenter / team / service ( / instance ) ● Aggregation across Prometheus servers ○ Federation ● Retention time ○ Generic remote storage support available. ● Pull vs. Push ○ Doesn’t matter in practice. Advantages depend on use case. ● Security ○ Focused on writing a monitoring system, left to the user. FAQ