Monitoring Microservices
with Prometheus
Tobias Schmidt - MicroCPH May 17, 2017 @dagrobie
● Ability to observe and understand systems and their behavior.
○ Know when things go wrong
○ Understand and debug service misbehavior
○ Detect trends and act in advance
● Blackbox vs. Whitebox monitoring
○ Blackbox: Observes systems externally with periodic checks
○ Whitebox: Provides internally observed metrics
● Whitebox: Different levels of granularity
○ Logging
○ Tracing
○ Metrics
Monitoring
● Metrics monitoring system and time series database
○ Instrumentation (client libraries and exporters)
○ Metrics collection, processing and storage
○ Querying, alerting and dashboards
○ Analysis, trending, capacity planning
○ Focused on infrastructure, not business metrics
● Key features
○ Powerful query language for metrics with label dimensions
○ Stable and simple operation
○ Built for modern dynamic deploy environments
○ Easy setup
● What it’s not
○ Logging system
○ Designed for perfect answers
Prometheus
Instrumentation case study
Gusta: a simple like service
● Service to handle everything around liking a resource
○ List all likes on a resource
○ Create a like on a resource
○ Delete a like on a resource
● Implementation
○ Written in Go
○ Uses the toolkit
Gusta overview
// Like represents all information of a single like.
type Like struct {
	ResourceID string    `json:"resourceID"`
	UserID     string    `json:"userID"`
	CreatedAt  time.Time `json:"createdAt"`
}

// Service describes all methods provided by the gusta service.
type Service interface {
	ListResourceLikes(resourceID string) ([]Like, error)
	LikeResource(resourceID, userID string) error
	UnlikeResource(resourceID, userID string) error
}
Gusta core
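To make the interface concrete, here is a minimal sketch of a map-backed implementation. The type `memoryService`, its constructor, and the two sentinel errors are assumptions made for illustration; the real gusta store and error values are not shown on the slides.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// Like and Service mirror the definitions from the slide above.
type Like struct {
	ResourceID string    `json:"resourceID"`
	UserID     string    `json:"userID"`
	CreatedAt  time.Time `json:"createdAt"`
}

type Service interface {
	ListResourceLikes(resourceID string) ([]Like, error)
	LikeResource(resourceID, userID string) error
	UnlikeResource(resourceID, userID string) error
}

// Hypothetical sentinel errors; the actual gusta errors may differ.
var (
	ErrNotFound      = errors.New("not found")
	ErrAlreadyExists = errors.New("already exists")
)

// memoryService is an assumed in-memory implementation of Service.
type memoryService struct {
	mu    sync.Mutex
	likes map[string]map[string]Like // resourceID -> userID -> Like
}

func newMemoryService() *memoryService {
	return &memoryService{likes: make(map[string]map[string]Like)}
}

// ListResourceLikes returns all likes of a resource in no particular order.
func (s *memoryService) ListResourceLikes(resourceID string) ([]Like, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make([]Like, 0, len(s.likes[resourceID]))
	for _, l := range s.likes[resourceID] {
		out = append(out, l)
	}
	return out, nil
}

// LikeResource stores a like and rejects duplicates.
func (s *memoryService) LikeResource(resourceID, userID string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.likes[resourceID][userID]; ok {
		return ErrAlreadyExists
	}
	if s.likes[resourceID] == nil {
		s.likes[resourceID] = make(map[string]Like)
	}
	s.likes[resourceID][userID] = Like{ResourceID: resourceID, UserID: userID, CreatedAt: time.Now()}
	return nil
}

// UnlikeResource removes a like and reports ErrNotFound if it does not exist.
func (s *memoryService) UnlikeResource(resourceID, userID string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.likes[resourceID][userID]; !ok {
		return ErrNotFound
	}
	delete(s.likes[resourceID], userID)
	return nil
}

// Compile-time check that memoryService satisfies Service.
var _ Service = (*memoryService)(nil)
```

Keeping the service behind an interface is what later allows logging and instrumentation to be layered around it without touching the business logic.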
// main.go
var store gusta.Store
store = gusta.NewMemoryStore()

var s gusta.Service
s = gusta.NewService(store)
s = gusta.LoggingMiddleware(logger)(s)

var h http.Handler
h = gusta.MakeHTTPHandler(s, log.With(logger, "component", "HTTP"))

http.Handle("/", h)
if err := http.ListenAndServe(*httpAddr, nil); err != nil {
	logger.Log("exit error", err)
}
Gusta server
ts=2017-05-16T19:39:34.938108068Z transport=HTTP addr=:8080
ts=2017-05-16T19:38:24.203071341Z method=LikeResource ResourceID=r1ee85512 UserID=ue86d7a01 took=10.466µs err=null
ts=2017-05-16T19:38:24.323002316Z method=ListResourceLikes ResourceID=r8669fd29 took=17.812µs err=null
ts=2017-05-16T19:38:24.343061775Z method=ListResourceLikes ResourceID=rd4ac47c6 took=30.986µs err=null
ts=2017-05-16T19:38:24.363022818Z method=LikeResource ResourceID=r1ee85512 UserID=u19597d1e took=10.757µs err=null
ts=2017-05-16T19:38:24.38303722Z method=ListResourceLikes ResourceID=rfc9a393a took=41.554µs err=null
ts=2017-05-16T19:38:24.40303802Z method=ListResourceLikes ResourceID=r8669fd29 took=28.115µs err=null
ts=2017-05-16T19:38:24.423045585Z method=ListResourceLikes ResourceID=r8669fd29 took=23.842µs err=null
ts=2017-05-16T19:38:20.843121594Z method=UnlikeResource ResourceID=r1ee85512 UserID=ub5e42f43 took=8.57µs err="not found"
ts=2017-05-16T19:38:20.863037026Z method=ListResourceLikes ResourceID=rfc9a393a took=27.839µs err=null
ts=2017-05-16T19:38:20.883081162Z method=ListResourceLikes ResourceID=r8669fd29 took=16.999µs err=null
Gusta server
Basic Instrumentation
Providing operational insight
● “Four golden signals” cover the essentials
○ Latency
○ Traffic
○ Errors
○ Saturation
● Similar concepts: RED and USE methods
○ RED: Request Rate, Errors, Duration
○ USE: Utilization, Saturation, Errors
● Information about the service itself
● Interaction with dependencies (other services, databases, etc.)
What information should be provided?
● Direct instrumentation
○ Traffic, Latency, Errors, Saturation
○ Service specific metrics (and interaction with dependencies)
○ Prometheus client libraries provide packages to instrument HTTP requests out of the box
● Exporters
○ Utilization, Saturation
○ node_exporter provides CPU, memory, and IO utilization per host
○ wmi_exporter does the same for Windows
○ cAdvisor (Container advisor) provides similar metrics for each container
Where to get the information from?
// main.go
import ""
var registry = prometheus.NewRegistry()
registry.MustRegister(
	prometheus.NewGoCollector(),
	prometheus.NewProcessCollector(os.Getpid(), ""),
)
// Pass down registry when creating HTTP handlers.
h = gusta.MakeHTTPHandler(s, log.With(logger, "component", "HTTP"), registry)
Initializing Prometheus client library
var h http.Handler = listResourceLikesHandler
var method, path string = "GET", "/api/v1/likes/{id}"
requests := prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name:        "gusta_http_server_requests_total",
		Help:        "Total number of requests handled by the HTTP server.",
		ConstLabels: prometheus.Labels{"method": method, "path": path},
	},
	[]string{"code"},
)
registry.MustRegister(requests)
h = promhttp.InstrumentHandlerCounter(requests, h)
Counting HTTP requests
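As a rough illustration of what a counting middleware does, the standard-library sketch below wraps a handler, records the status code it writes, and increments a per-code counter. It approximates the idea only and is not how the Prometheus client library is implemented; `codeCounter`, `statusRecorder` and `exampleCounts` are names invented for this sketch.

```go
package main

import (
	"net/http"
	"net/http/httptest"
	"strconv"
	"sync"
)

// codeCounter counts finished requests per HTTP status code.
type codeCounter struct {
	mu     sync.Mutex
	counts map[string]int
}

// statusRecorder remembers the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	code int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.code = code
	r.ResponseWriter.WriteHeader(code)
}

// instrument wraps next and increments the counter for the response code.
// A handler that never calls WriteHeader is counted as 200.
func (c *codeCounter) instrument(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, code: http.StatusOK}
		next.ServeHTTP(rec, req)
		c.mu.Lock()
		c.counts[strconv.Itoa(rec.code)]++
		c.mu.Unlock()
	})
}

// exampleCounts sends three requests through an instrumented handler and
// returns the resulting counts keyed by status code.
func exampleCounts() map[string]int {
	c := &codeCounter{counts: make(map[string]int)}
	h := c.instrument(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Query().Get("id") == "" {
			http.NotFound(w, r)
			return
		}
		w.Write([]byte("ok"))
	}))
	for _, target := range []string{"/likes?id=r1", "/likes?id=r2", "/likes"} {
		h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest("GET", target, nil))
	}
	return c.counts
}
```

The status code becomes a label value, which is why the counter on the slide declares `code` as its only variable label while method and path are constant per handler.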
var h http.Handler = listResourceLikesHandler
var method, path string = "GET", "/api/v1/likes/{id}"
requestDuration := prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:        "gusta_http_server_request_duration_seconds",
		Help:        "A histogram of latencies for requests.",
		Buckets:     []float64{.0025, .005, 0.01, 0.025, 0.05, 0.1},
		ConstLabels: prometheus.Labels{"method": method, "path": path},
	},
	[]string{},
)
registry.MustRegister(requestDuration)
h = promhttp.InstrumentHandlerDuration(requestDuration, h)
Observing HTTP request latency
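A histogram is exposed as cumulative buckets plus a sum and a count. The following sketch models that shape directly; it is a simplified stand-in rather than the client library's data structure, it is not safe for concurrent use, and the names `histogram`, `newHistogram` and `observe` are assumptions for illustration.

```go
package main

import "sort"

// histogram keeps cumulative bucket counts as they appear in the exposed metrics.
type histogram struct {
	upperBounds []float64 // sorted "le" values, excluding the implicit +Inf bucket
	counts      []uint64  // counts[i] = observations <= upperBounds[i]
	sum         float64   // sum of all observed values
	count       uint64    // total observations, i.e. the le="+Inf" bucket
}

// newHistogram copies and sorts the bucket bounds.
func newHistogram(upperBounds []float64) *histogram {
	b := append([]float64(nil), upperBounds...)
	sort.Float64s(b)
	return &histogram{upperBounds: b, counts: make([]uint64, len(b))}
}

// observe records one value in every bucket whose upper bound is >= v.
func (h *histogram) observe(v float64) {
	for i, le := range h.upperBounds {
		if v <= le {
			h.counts[i]++
		}
	}
	h.sum += v
	h.count++
}
```

Because buckets are cumulative, choosing bounds close to the latencies you care about matters: values beyond the largest bound only show up in the +Inf bucket.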
Exposing metrics
Observing the current state
● Prometheus is a pull-based monitoring system
○ Instances expose their metrics via an HTTP endpoint
○ Prometheus uses service discovery or static target lists to scrape that state periodically
● Centralized management
○ Prometheus decides how often to scrape instances
● Prometheus stores the data on local disk
○ In a big outage, you could run Prometheus on your laptop!
How to collect the metrics?
// main.go
// ...
http.Handle("/metrics", promhttp.HandlerFor(
	registry,
	promhttp.HandlerOpts{},
))
Exposing the metrics via HTTP
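What the metrics endpoint returns is plain text. The sketch below writes a single counter in the Prometheus text exposition format using only the standard library, to show what a scrape sees on the wire. The metric name `example_requests_total` and the `counter` type are hypothetical; in a real service the client library's handler generates this output.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// counter is a minimal monotonically increasing value.
type counter struct{ n uint64 }

func (c *counter) inc() { atomic.AddUint64(&c.n, 1) }

// ServeHTTP writes the counter in the text exposition format:
// a HELP line, a TYPE line, then one sample per line.
func (c *counter) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprintln(w, "# HELP example_requests_total Total number of requests handled.")
	fmt.Fprintln(w, "# TYPE example_requests_total counter")
	fmt.Fprintf(w, "example_requests_total %d\n", atomic.LoadUint64(&c.n))
}

// exampleScrape increments the counter three times and returns the body
// a scrape of the endpoint would receive.
func exampleScrape() string {
	c := &counter{}
	for i := 0; i < 3; i++ {
		c.inc()
	}
	rec := httptest.NewRecorder()
	c.ServeHTTP(rec, httptest.NewRequest("GET", "/metrics", nil))
	return rec.Body.String()
}
```

Since the endpoint only reports current state, the server stays stateless with respect to monitoring: it never needs to know who scrapes it or how often.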
curl -s http://localhost:8080/metrics | grep requests
# HELP gusta_http_server_requests_total Total number of requests handled by the gusta HTTP server.
# TYPE gusta_http_server_requests_total counter
gusta_http_server_requests_total{code="200",method="DELETE",path="/api/v1/likes"} 3
gusta_http_server_requests_total{code="200",method="GET",path="/api/v1/likes/{id}"} 429
gusta_http_server_requests_total{code="200",method="POST",path="/api/v1/likes"} 51
gusta_http_server_requests_total{code="404",method="DELETE",path="/api/v1/likes"} 14
gusta_http_server_requests_total{code="409",method="POST",path="/api/v1/likes"} 3
Request metrics
curl -s http://localhost:8080/metrics | grep request_duration
# HELP gusta_http_server_request_duration_seconds A histogram of latencies for requests.
# TYPE gusta_http_server_request_duration_seconds histogram
gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.00025"} 414
gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.0005"} 423
gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.001"} 429
gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.0025"} 429
gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.005"} 429
gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.01"} 429
gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="+Inf"} 429
gusta_http_server_request_duration_seconds_sum{method="GET",path="/api/v1/likes/{id}"} 0.047897984
gusta_http_server_request_duration_seconds_count{method="GET",path="/api/v1/likes/{id}"} 429
Latency metrics
curl -s http://localhost:8080/metrics | grep process
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 892.78
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1024
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 23
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 9.3446144e+07
Out-of-the-box process metrics
Collecting metrics
Scraping all service instances
# Scrape all targets every 5 seconds by default.
global:
  scrape_interval: 5s
  evaluation_interval: 5s
scrape_configs:
  # Scrape the Prometheus server itself.
  - job_name: prometheus
    static_configs:
      - targets: [localhost:9090]
  # Scrape the Gusta service.
  - job_name: gusta
    static_configs:
      - targets: [localhost:8080]
Static configuration
scrape_configs:
  # Scrape the Gusta service using Consul.
  - job_name: consul
    consul_sd_configs:
      - server: localhost:8500
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        regex: .*,prod,.*
        action: keep
      - source_labels: [__meta_consul_service]
        target_label: job
Consul service discovery
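The `keep` action in the relabeling rules above can be read as a filter over discovered targets. The sketch below mimics that behavior under two assumptions based on the Prometheus documentation as I understand it: source label values are joined with ";" and the regex is fully anchored. The function `keep` is a hypothetical helper, not Prometheus code, and the sample tag values assume Consul tags arrive as a comma-separated string wrapped in commas.

```go
package main

import (
	"regexp"
	"strings"
)

// keep reports whether a target survives a relabel rule with action "keep":
// the source label values are joined with ";" and must match the anchored regex.
func keep(labels map[string]string, sourceLabels []string, regex string) (bool, error) {
	re, err := regexp.Compile("^(?:" + regex + ")$")
	if err != nil {
		return false, err
	}
	values := make([]string, 0, len(sourceLabels))
	for _, name := range sourceLabels {
		values = append(values, labels[name])
	}
	return re.MatchString(strings.Join(values, ";")), nil
}
```

With the rule from the slide, a target tagged `,web,prod,` is kept while `,web,staging,` is dropped. The surrounding commas in the pattern prevent a tag such as `production` from matching by accident.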
Target overview
Simple Graph UI
Human-readable metrics
Grafana example
Actionable metrics
ALERT InstanceDown
  IF up == 0
  FOR 2m
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "Instance down for more than 2 minutes.",
    description = "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 2 minutes.",
  }
ALERT RunningOutOfFileDescriptors
  IF process_open_fds / process_max_fds * 100 > 95
  FOR 2m
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "Instance has many open file descriptors.",
    description = "{{ $labels.instance }} of job {{ $labels.job }} has {{ $value }}% open descriptors.",
  }
Alert examples
ALERT GustaHighErrorRate
  IF sum without(code, instance) (rate(gusta_http_server_requests_total{code=~"5.."}[1m]))
     / sum without(code, instance) (rate(gusta_http_server_requests_total[1m]))
     * 100 > 0.1
  FOR 2m
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "Gusta service endpoints have a high error rate.",
    description = "Gusta endpoint {{ $labels.method }} {{ $labels.path }} returns {{ $value }}% errors.",
  }
ALERT GustaHighLatency
  IF histogram_quantile(0.95, rate(gusta_http_server_request_duration_seconds_bucket[1m])) > 0.1
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "Gusta service endpoints have a high latency.",
    description = "Gusta endpoint {{ $labels.method }} {{ $labels.path }} has a 95th percentile latency of {{ $value }} seconds.",
  }
Alert examples
ALERT FilesystemRunningFull
  IF predict_linear(node_filesystem_avail{mountpoint!="/var/lib/docker/aufs"}[6h], 24 * 60 * 60) < 0
  FOR 1h
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "Filesystem space is filling up.",
    description = "Filesystem on {{ $labels.device }} at {{ $labels.instance }} is predicted to run out of space within the next 24 hours.",
  }
Alert examples
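The filesystem alert extrapolates a trend. Conceptually this is a least-squares line fitted through recent samples and evaluated at a future time. The sketch below illustrates that calculation; it is not Prometheus code, and `sample` and `predictLinear` are hypothetical names. For real Unix timestamps it would be wise to subtract a reference time first to limit floating-point error.

```go
package main

// sample is one point of a time series.
type sample struct {
	t float64 // timestamp in seconds
	v float64 // observed value, e.g. available bytes
}

// predictLinear fits a least-squares line through samples and returns the
// value that line takes `seconds` after `now`. It returns false when fewer
// than two samples exist or all samples share one timestamp.
func predictLinear(samples []sample, now, seconds float64) (float64, bool) {
	if len(samples) < 2 {
		return 0, false
	}
	n := float64(len(samples))
	var sumT, sumV float64
	for _, s := range samples {
		sumT += s.t
		sumV += s.v
	}
	meanT, meanV := sumT/n, sumV/n

	var cov, varT float64
	for _, s := range samples {
		cov += (s.t - meanT) * (s.v - meanV)
		varT += (s.t - meanT) * (s.t - meanT)
	}
	if varT == 0 {
		return 0, false
	}
	slope := cov / varT
	intercept := meanV - slope*meanT
	return slope*(now+seconds) + intercept, true
}
```

For a filesystem losing one unit of free space per second with 80 units left, the prediction 90 seconds ahead is negative, which is exactly the condition the alert checks.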
● Monitoring is essential to run, understand and operate services.
● Prometheus
○ Client instrumentation
○ Scrape configuration
○ Querying
○ Dashboards
○ Alert rules
● Important Metrics
○ Four golden signals: Latency, Traffic, Errors, Saturation
● Best practices
Recap
● Talks, Articles, Videos
● Our “StackOverflow”
● Ask the community
● Google’s SRE book
● USE method
● My philosophy on alerting
Sources
Thank you
Tobias Schmidt - MicroCPH May 17, 2017 - @dagrobie
● High availability
○ Run two identical servers
● Scaling
○ Shard by datacenter / team / service ( / instance )
● Aggregation across Prometheus servers
○ Federation
● Retention time
○ Generic remote storage support available.
● Pull vs. Push
○ Doesn’t matter in practice. Advantages depend on use case.
● Security
○ Not built in: the project focuses on being a monitoring system and leaves security to the user.
FAQ
