The industry is moving toward a microservices architecture, and many companies have embraced container orchestration solutions such as Kubernetes. DigitalOcean is no different. Over the past year, DigitalOcean’s Delivery team has been building a runtime platform based on Kubernetes with the goal of making shipping code easier. The system has empowered service owners to quickly and efficiently deploy and update their applications. A vital component is a white-box monitoring and alerting solution based on Prometheus and Alertmanager.
Sneha Inguva offers an overview of the system and shares problems encountered, potential solutions, and key lessons learned in the process. Sneha dives into the setup of Prometheus and Alertmanager that allows service owners to instrument their own metrics and alerts, explaining the service owner’s point of view and the internals that allow for the dynamic addition of alerts, and offers a glimpse of future modifications to the system. Join in to learn how to leverage open source tools for your monitoring and alerting needs.
9. digitalocean.com
the plan:
● the “olden” days vs. container orchestration
● docc at DigitalOcean
● prometheus + alertmanager and docc
● alerting in action: examples
● potential pitfalls
● next steps
the “olden” days:
service owners write an application
provision a server with chef or ansible
use a CI/CD pipeline, bash scripts, or other tools
to deploy and update the application on a VM
the “olden” days:
provisioning often took longer than writing the service itself
monitoring services were hard to set up
blackbox monitoring offered little insight
whitebox monitoring metrics were NOT easily queryable
specify metrics, ports, alerts in your manifest file
Which metrics endpoint should be scraped?
Which container port needs to be exposed?
Specify alerting rule, duration interval, and channel.
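A manifest answering those three questions could look roughly like the fragment below; the field names are illustrative assumptions, not docc's actual schema:

```yaml
# Illustrative only -- field names are assumptions, not docc's real schema.
metrics:
  port: 9090           # container port to expose to the Prometheus scraper
  path: /metrics       # metrics endpoint to be scraped
alerts:
  - name: HighErrorRate
    expr: rate(http_errors_total[5m]) > 0.05
    for: 10m           # duration the condition must hold before firing
    channel: "#my-team-alerts"
```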
counters: cumulative metrics that only increase
gauges: single values that can go up or down
histograms: sample observations and count them in configurable buckets
summaries: also sample observations, and can calculate
things like quantiles client-side
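The semantics of the four types can be illustrated with a toy sketch; this is not the real Prometheus client library, just a plain-Python illustration of how each type behaves:

```python
# Toy illustration of Prometheus metric-type semantics.
# NOT the real client library -- just the behavior of each type.

class Counter:
    """Cumulative: may only ever go up."""
    def __init__(self):
        self.value = 0.0
    def inc(self, amount=1.0):
        assert amount >= 0, "counters may only increase"
        self.value += amount

class Gauge:
    """A single value that can go up or down."""
    def __init__(self):
        self.value = 0.0
    def set(self, v):
        self.value = v

class Histogram:
    """Counts observations into cumulative buckets."""
    def __init__(self, buckets=(0.1, 0.5, 1.0, 5.0)):
        self.upper_bounds = sorted(buckets)
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.count, self.total = 0, 0.0
    def observe(self, v):
        self.count += 1
        self.total += v
        # cumulative buckets: every bucket whose bound >= v is incremented
        for i, bound in enumerate(self.upper_bounds):
            if v <= bound:
                self.bucket_counts[i] += 1

class Summary:
    """Keeps raw observations so quantiles can be computed client-side."""
    def __init__(self):
        self.observations = []
    def observe(self, v):
        self.observations.append(v)
    def quantile(self, q):
        xs = sorted(self.observations)
        idx = min(int(q * len(xs)), len(xs) - 1)
        return xs[idx]
```

Note the histogram only keeps bucket counts (cheap, aggregatable server-side), while the summary keeps enough state to answer quantile queries itself.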
State-based alerts
Is my service up and/or scrapeable?
absent(up{kubernetes_name="doccserver"}) or
sum(up{kubernetes_name="doccserver"}) == 0
Do I have the # of loadbalancers I expect?
sum(up{kubernetes_name="loadbalancer"}) < 3
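Wired into Prometheus, the first query above might live in a rule file like the following; the alert name, duration, and labels are illustrative, only the expression comes from the slide:

```yaml
groups:
  - name: docc-state-alerts        # group name is illustrative
    rules:
      - alert: DoccServerDown      # alert name is an assumption
        expr: >
          absent(up{kubernetes_name="doccserver"})
          or sum(up{kubernetes_name="doccserver"}) == 0
        for: 5m                    # example duration, not a recommendation
        labels:
          severity: page
        annotations:
          summary: "doccserver is down or not being scraped"
```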
Threshold alerts
Is our loadbalancer at 50% capacity in terms of sessions?
max(haproxy_frontend_current_sessions /
    haproxy_frontend_limit_sessions)
  by (kubernetes_node_name, frontend) * 100 > 50
Are 50 percent of tests taking longer than 10 minutes?
max(test_duration_seconds{quantile="0.5",result="pass"})
  by (test_name) > 600
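As a rough Python rendering of what the capacity query computes, with made-up sample data (the per-series grouping mirrors the `max(...) by (...)` aggregation):

```python
from collections import defaultdict

# Rough rendering of:
#   max(haproxy_frontend_current_sessions / haproxy_frontend_limit_sessions)
#     by (kubernetes_node_name, frontend) * 100 > 50
# Sample data is made up for illustration.

samples = [
    # (node, frontend, current_sessions, limit_sessions)
    ("node-1", "web", 610, 1000),
    ("node-1", "api", 120, 1000),
    ("node-2", "web", 200, 1000),
]

def capacity_alerts(samples, threshold_pct=50):
    worst = defaultdict(float)
    for node, frontend, current, limit in samples:
        pct = current / limit * 100          # session usage per series
        key = (node, frontend)
        worst[key] = max(worst[key], pct)    # max(...) by (node, frontend)
    return {k: pct for k, pct in worst.items() if pct > threshold_pct}

print(capacity_alerts(samples))  # only node-1/web exceeds the threshold
```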
Solution: Slack and/or PagerDuty
send only the most urgent, production-impacting alerts to PagerDuty
experiment with different PromQL queries to smooth out spiky metrics
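An Alertmanager routing tree along these lines might separate the two channels; the receiver names, the `severity: page` match label, and the channels are all made-up examples:

```yaml
# Illustrative Alertmanager routing -- names and channels are made up.
route:
  receiver: slack-default            # everything goes to Slack by default
  routes:
    - match:
        severity: page               # only the most urgent production alerts
      receiver: pagerduty-prod
receivers:
  - name: slack-default
    slack_configs:
      - channel: "#alerts"
  - name: pagerduty-prod
    pagerduty_configs:
      - service_key: "<pagerduty-service-key>"
```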
Solution: Docs and suggested alerts
extensive documentation and tutorials, accessible from the CLI
a Prometheus Slack channel for real-time help
standard alert examples
Solution: opinionated manifest file
service owners must include maintainer information
alerts themselves include descriptions and summaries with
several labels
alerts must include team-specific receivers
#2: Leverage metrics for autopilot
users trust our custom controllers and schedulers
collect metrics and build a model of resource
usage over time
adjust limits and alerts accordingly
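One way the collect-model-adjust loop could be sketched is below; the percentile-plus-headroom policy, the helper names, and the numbers are assumptions for illustration, not DigitalOcean's actual autopilot logic:

```python
import math

# Hypothetical autopilot policy: set a container's resource limit to the
# 95th percentile of observed usage plus headroom. A sketch only.

def percentile(xs, q):
    """Nearest-rank percentile of a list of samples."""
    xs = sorted(xs)
    idx = min(int(q * len(xs)), len(xs) - 1)
    return xs[idx]

def suggest_limit(usage_samples, headroom=1.5):
    """usage_samples: historical resource usage (e.g., MiB of memory)."""
    p95 = percentile(usage_samples, 0.95)
    return math.ceil(p95 * headroom)

usage = [100, 110, 105, 120, 300, 115, 108]  # made-up memory samples (MiB)
print(suggest_limit(usage))  # limit sized off the observed peak region
```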
#3: Leverage metrics for autoscaling
scale services based on resource usage, number of
connections, etc.
scale loadbalancers based on the number of frontend and
backend connections
scale the number of worker nodes based on memory and CPU
capacity metrics
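A naive version of the connection-based case might look like this sketch; the per-replica capacity and the replica bounds are made-up numbers, not docc's scheduler:

```python
import math

# Naive connection-based autoscaling sketch (illustrative numbers only).

def desired_replicas(current_connections, conns_per_replica=500,
                     min_replicas=2, max_replicas=20):
    needed = math.ceil(current_connections / conns_per_replica)
    # clamp to sane bounds so we never scale to zero or runaway
    return max(min_replicas, min(max_replicas, needed))

print(desired_replicas(1800))  # -> 4 replicas for 1800 connections at 500 each
```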