Monitoring microservices with Prometheus

Monitoring Microservices
with Prometheus
Tobias Schmidt - MicroCPH May 17, 2017
github.com/grobie @dagrobie tobidt@gmail.com

● Ability to observe and understand systems and their behavior.
○ Know when things go wrong
○ Understand and debug service misbehavior
○ Detect trends and act in advance
● Blackbox vs. Whitebox monitoring
○ Blackbox: Observes systems externally with periodic checks
○ Whitebox: Provides internally observed metrics
● Whitebox: Different levels of granularity
○ Logging
○ Tracing
○ Metrics
Monitoring

● Metrics monitoring system and time series database
○ Instrumentation (client libraries and exporters)
○ Metrics collection, processing and storage
○ Querying, alerting and dashboards
○ Analysis, trending, capacity planning
○ Focused on infrastructure, not business metrics
● Key features
○ Powerful query language for metrics with label dimensions
○ Stable and simple operation
○ Built for modern dynamic deploy environments
○ Easy setup
● What it’s not
○ Logging system
○ Designed for perfect answers
Prometheus

Instrumentation case study
Gusta: a simple like service

● Service to handle everything around liking a resource
○ List all liked likes on a resource
○ Create a like on a resource
○ Delete a like on a resource
● Implementation
○ Written in golang
○ Uses the gokit.io toolkit
Gusta overview

// Like represents all information of a single like.
type Like struct {
ResourceID string `json:"resourceID"`
UserID string `json:"userID"`
CreatedAt time.Time `json:"createdAt"`
}
// Service describes all methods provided by the gusta service.
type Service interface {
ListResourceLikes(resourceID string) ([]Like, error)
LikeResource(resourceID, userID string) error
UnlikeResource(resourceID, userID string) error
}
Gusta core

// main.go
var store gusta.Store
store = gusta.NewMemoryStore()
var s gusta.Service
s = gusta.NewService(store)
s = gusta.LoggingMiddleware(logger)(s)
var h http.Handler
h = gusta.MakeHTTPHandler(s, log.With(logger, "component", "HTTP"))
http.Handle("/", h)
if err := http.ListenAndServe(*httpAddr, nil); err != nil {
logger.Log("exit error", err)
}
Gusta server

./gusta
ts=2017-05-16T19:39:34.938108068Z transport=HTTP addr=:8080
ts=2017-05-16T19:38:24.203071341Z method=LikeResource ResourceID=r1ee85512 UserID=ue86d7a01 took=10.466µs err=null
ts=2017-05-16T19:38:24.323002316Z method=ListResourceLikes ResourceID=r8669fd29 took=17.812µs err=null
ts=2017-05-16T19:38:24.343061775Z method=ListResourceLikes ResourceID=rd4ac47c6 took=30.986µs err=null
ts=2017-05-16T19:38:24.363022818Z method=LikeResource ResourceID=r1ee85512 UserID=u19597d1e took=10.757µs err=null
ts=2017-05-16T19:38:24.38303722Z method=ListResourceLikes ResourceID=rfc9a393a took=41.554µs err=null
ts=2017-05-16T19:38:20.843121594Z method=UnlikeResource ResourceID=r1ee85512 UserID=ub5e42f43 took=8.57µs err="not
found"
ts=2017-05-16T19:38:20.863037026Z method=ListResourceLikes ResourceID=rfc9a393a took=27.839µs err=null
Gusta server

Basic Instrumentation
Providing operational insight

● “Four golden signals” cover the essentials
○ Latency
○ Traffic
○ Errors
○ Saturation
● Similar concepts: RED and USE methods
○ Request: Rate, Errors, Duration
○ Utilization, Saturation, Errors
● Information about the service itself
● Interaction with dependencies (other services, databases, etc.)
What information should be provided?

● Direct instrumentation
○ Traffic, Latency, Errors, Saturation
○ Service specific metrics (and interaction with dependencies)
○ Prometheus client libraries provide packages to instrument HTTP
requests out of the box
● Exporters
○ Utilization, Saturation
○ node_exporter CPU, memory, IO utilization per host
○ wmi_exporter does the same for Windows
○ cAdvisor (Container advisor) provides similar metrics for each container
Where to get the information from?

// main.go
import "github.com/prometheus/client_golang/prometheus"
var registry = prometheus.NewRegistry()
registry.MustRegister(
prometheus.NewGoCollector(),
prometheus.NewProcessCollector(os.Getpid(), ""),
)
// Pass down registry when creating HTTP handlers.
h = gusta.MakeHTTPHandler(s, log.With(logger, "component", "HTTP"), registry)
Initializing Prometheus client library

var h http.Handler = listResourceLikesHandler
var method, path string = "GET", "/api/v1/likes/{id}"
requests := prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "gusta_http_server_requests_total",
Help: "Total number of requests handled by the HTTP server.",
ConstLabels: prometheus.Labels{"method": method, "path": path},
},
[]string{"code"},
)
registry.MustRegister(requests)
h = promhttp.InstrumentHandlerCounter(requests, h)
Counting HTTP requests

var h http.Handler = listResourceLikesHandler
var method, path string = "GET", "/api/v1/likes/{id}"
requestDuration := prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "gusta_http_server_request_duration_seconds",
Help: "A histogram of latencies for requests.",
Buckets: []float64{.0025, .005, 0.01, 0.025, 0.05, 0.1},
ConstLabels: prometheus.Labels{"method": method, "path": path},
},
[]string{},
)
registry.MustRegister(requestDuration)
h = promhttp.InstrumentHandlerDuration(requestDuration, h)
Observing HTTP request latency

Exposing metrics
Observing the current state

● Prometheus is a pull based monitoring system
○ Instances expose an HTTP endpoint to expose their metrics
○ Prometheus uses service discovery or static target lists to collect the
state periodically
● Centralized management
○ Prometheus decides how often to scrape instances
● Prometheus stores the data on local disc
○ In a big outage, you could run Prometheus on your laptop!
How to collect the metrics?

// main.go
// ...
http.Handle("/metrics", promhttp.HandlerFor(
registry,
promhttp.HandlerOpts{},
))
Exposing the metrics via HTTP

curl -s http://localhost:8080/metrics | grep requests
# HELP gusta_http_server_requests_total Total number of requests handled by the gusta HTTP server.
# TYPE gusta_http_server_requests_total counter
gusta_http_server_requests_total{code="200",method="DELETE",path="/api/v1/likes"} 3
gusta_http_server_requests_total{code="200",method="GET",path="/api/v1/likes/{id}"} 429
gusta_http_server_requests_total{code="200",method="POST",path="/api/v1/likes"} 51
gusta_http_server_requests_total{code="404",method="DELETE",path="/api/v1/likes"} 14
gusta_http_server_requests_total{code="409",method="POST",path="/api/v1/likes"} 3
Request metrics

curl -s http://localhost:8080/metrics | grep request_duration
# HELP gusta_http_server_request_duration_seconds A histogram of latencies for requests.
# TYPE gusta_http_server_request_duration_seconds histogram
...
gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.00025"} 414
gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="+Inf"} 429
gusta_http_server_request_duration_seconds_sum{method="GET",path="/api/v1/likes/{id}"} 0.047897984
gusta_http_server_request_duration_seconds_count{method="GET",path="/api/v1/likes/{id}"} 429
...
Latency metrics

curl -s http://localhost:8080/metrics | grep process
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 892.78
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1024
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 23
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 9.3446144e+07
...
Out-of-the-box process metrics

Collecting metrics
Scraping all service instances

# Scrape all targets every 5 seconds by default.
global:
scrape_interval: 5s
evaluation_interval: 5s
scrape_configs:
# Scrape the Prometheus server itself.
- job_name: prometheus
static_configs:
- targets: [localhost:9090]
# Scrape the Gusta service.
- job_name: gusta
static_configs:
- targets: [localhost:8080]
Static configuration

scrape_configs:
# Scrape the Gusta service using Consul.
- job_name: consul
consul_sd_configs:
- server: localhost:8500
relabel_configs:
- source_labels: [__meta_consul_tags]
regex: .*,prod,.*
action: keep
- source_labels: [__meta_consul_service]
target_label: job
Consul service discovery

Dashboards
Human-readable metrics

ALERT InstanceDown
IF up == 0
FOR 2m
LABELS { severity = "warning" }
ANNOTATIONS {
summary = "Instance down for more than 5 minutes.",
description = "{{ $labels.instance }} of job {{ $labels.job }} has been down for >= 5 minutes.",
}
ALERT RunningOutOfFileDescriptors
IF process_open_fds / process_fds * 100 > 95
FOR 2m
ANNOTATIONS {
summary = "Instance has many open file descriptors.",
description = "{{ $labels.instance }} of job {{ $labels.job }} has {{ $value }}% open descriptors.",
}
Alert examples

ALERT GustaHighErrorRate
IF sum without(code, instance) (rate(gusta_http_server_requests_total{code=~"5.."}[1m]))
/ sum without(code, instance) (rate(gusta_http_server_requests_total[1m]))
* 100 > 0.1
FOR 2m
LABELS { severity = "critical" }
ANNOTATIONS {
summary = "Gusta service endpoints have a high error rate.",
description = "Gusta endpoint {{ $labels.method }} {{ $labels.path }} returns {{ $value }}% errors.",
}
ALERT GustaHighLatency
IF histogram_quantile(0.95, rate(gusta_http_server_request_duration_seconds_bucket[1m])) > 0.1
LABELS { severity = "critical" }
ANNOTATIONS {
summary = "Gusta service endpoints have a high latency.",
description = "Gusta endpoint {{ $labels.method }} {{ $labels.path }}
has a 95% percentile latency of {{ $value }} seconds.",
}
Alert examples

ALERT FilesystemRunningFull
IF predict_linear(node_filesystem_avail{mountpoint!="/var/lib/docker/aufs"}[6h], 24 * 60 * 60) < 0
FOR 1h
ANNOTATIONS {
summary = "Filesystem space is filling up.",
description = "Filesystem on {{ $labels.device }} at {{ $labels.instance }}
is predicted to run out of space within the next 24 hours.",
}
Alert examples

● Monitoring is essential to run, understand and operate services.
● Prometheus
○ Client instrumentation
○ Scrape configuration
○ Querying
○ Dashboards
○ Alert rules
● Important Metrics
○ Four golden signals: Latency, Traffic, Error, Saturation
● Best practices
Recap

● https://prometheus.io
● Talks, Articles, Videos https://www.reddit.com/r/PrometheusMonitoring/
● Our “StackOverflow” https://www.robustperception.io/blog/
● Ask the community https://prometheus.io/community/
● Google’s SRE book https://landing.google.com/sre/book/index.html
● USE method http://www.brendangregg.com/usemethod.html
● My philosophy on alerting https://goo.gl/UnvYhQ
Sources

Thank you
Tobias Schmidt - MicroCPH May 17, 2017
github.com/grobie - @dagrobie

● High availability
○ Run two identical servers
● Scaling
○ Shard by datacenter / team / service ( / instance )
● Aggregation across Prometheus servers
○ Federation
● Retention time
○ Generic remote storage support available.
● Pull vs. Push
○ Doesn’t matter in practice. Advantages depend on use case.
● Security
○ Focused on writing a monitoring system, left to the user.
FAQ

Monitoring microservices with Prometheus

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Monitoring microservices with Prometheus

Similar to Monitoring microservices with Prometheus (20)

More from Tobias Schmidt

More from Tobias Schmidt (7)

Recently uploaded

Recently uploaded (20)

Monitoring microservices with Prometheus