Why Does (My) Monitoring Suck?

Monitoring services is easy, right? Set up a notification that fires when a certain number passes a certain threshold, and you'll know when there's a problem. But if it's that easy, why are so many teams drowning in alerts and dreading their time on call? The reason is that we tend to monitor the wrong things: reactive alerts, metrics whose impact on our service we don't fully understand, and capacity alerts. We look at our own view of the service and fail to consider that our customers see it differently.

Come learn to let go of what does not help, and explore how to monitor for what truly matters: what the customer sees. This starts with defining our agreements with our customers, continues through building applications intelligently and instrumenting all the things, and finishes with picking the right signals out of that instrumentation to generate alerts that are actionable, not ones that introduce confusion and noise. We will also touch on capacity planning, and how it should never wake you up. You'll find it's possible to meet your service level objectives while still maximizing your sleep level objectives.

  1. Why Does (My) Monitoring Suck? Todd Palino, Senior Staff Engineer, Site Reliability, LinkedIn
  2. This Is The Only Slide You May Need a Picture Of
  3. What’s On Our List Today? • Alerting Anti-Patterns • Setting Goals • What Is Monitoring? • Designing For Success • Wrapping Up
  4. Alerting Anti-Patterns
  5. Network Operations Center • Central monitoring and alerting • Gatekeeping monitored alerts with no deep knowledge • Information overload for a moderate-sized system • Glorified telephone operators
  6. Kafka Under-Replicated Partitions • Unclear meaning • Sometimes it’s not a problem at all • Does the customer care, as long as requests are getting served? • Frequently gets ignored in the middle of the night
  7. CPU Load • A relative measure of how busy the processors are • Who cares? Processors are supposed to be busy • What’s causing it? • Might be capacity. Maybe
  8. Setting Goals
  9. Service Level Whatever • SLI – Indicator • SLO – Objective • SLT – Target • SLA – Agreement
  10. Let’s Be Smart About This • Specific • Measurable • Agreed • Realistic • Time-limited, Testable
  11. Common SLOs • Availability – is the service able to handle requests? • Latency – are requests being handled promptly? • Correctness – are the responses being returned correct?
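To make the three common SLOs concrete, here is a minimal sketch (not from the talk) of how the availability and latency SLIs behind them might be computed from a window of request records. The record fields (`status`, `latency_ms`) and the 200 ms threshold are assumptions for illustration:

```python
# Hypothetical sketch: computing availability and latency SLIs from a
# window of request records. Field names and thresholds are assumptions.

def availability(requests):
    """Fraction of requests that received a non-5xx response."""
    ok = sum(1 for r in requests if r["status"] < 500)
    return ok / len(requests)

def latency_sli(requests, threshold_ms=200):
    """Fraction of requests answered within the latency threshold."""
    fast = sum(1 for r in requests if r["latency_ms"] <= threshold_ms)
    return fast / len(requests)

window = [
    {"status": 200, "latency_ms": 87},
    {"status": 200, "latency_ms": 145},
    {"status": 503, "latency_ms": 30},
    {"status": 200, "latency_ms": 410},
]
print(availability(window))  # 0.75 -- one 5xx out of four requests
print(latency_sli(window))   # 0.75 -- one request over 200 ms
```

An SLO then becomes a target for these numbers over a time window, e.g. "availability ≥ 99.9% over 30 days".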
  12. What Is Monitoring?
  13. Monitor: observe and check the progress or quality of (something) over a period of time; keep under systematic review.
  14. So WTF is Observability? • Comes from control theory • A measure of how well internal states of a system can be inferred from knowledge of its external outputs • It’s a noun – you have this (to some extent). You can’t “do” it.
  15. What Are We Looking For?
  16. What Can We Work With? • Metrics – single numbers: counters, gauges, histograms (and summaries) • Events – structured data: log messages, tracing (a collection of events)
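The three metric shapes named on the slide can be sketched as toy classes. This is an illustration of the concepts, not a real metrics library (a production system would use something like a Prometheus client):

```python
# Toy sketches of the three metric shapes: counter, gauge, histogram.

class Counter:
    """Monotonically increasing value (e.g. total requests served)."""
    def __init__(self):
        self.value = 0
    def inc(self, n=1):
        self.value += n

class Gauge:
    """Point-in-time value that can go up or down (e.g. queue depth)."""
    def __init__(self):
        self.value = 0
    def set(self, v):
        self.value = v

class Histogram:
    """Counts observations into buckets (e.g. request latency)."""
    def __init__(self, buckets):
        self.buckets = sorted(buckets)
        self.counts = [0] * (len(self.buckets) + 1)  # last slot = +inf
    def observe(self, v):
        for i, upper in enumerate(self.buckets):
            if v <= upper:
                self.counts[i] += 1
                return
        self.counts[-1] += 1

h = Histogram(buckets=[50, 200, 1000])  # latency buckets in ms (assumed)
for latency in (12, 87, 430, 1500):
    h.observe(latency)
print(h.counts)  # [1, 1, 1, 1] -- one observation per bucket
```

Histograms matter for latency SLIs because averages hide tail behavior; bucketed counts let you compute percentiles.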
  17. Where Can We Get It? • Subjective – rich data on internal state; necessary for high observability; tons of data possible, but the utility is often questionable; beware, here be dragons! • Objective – the customer’s view of your system; think of “Down For Everyone Or Just Me?”; critical for SLO monitoring; more difficult to do, but it’s the authority on whether or not something is broken
  18. Designing For Success
  19. Build For Failure • Intelligence – rich instrumentation on every aspect • Availability – tolerate single component failures (not just N+1) • Capacity – limit resource creation and utilization
  20. Using the SLO • It’s the only thing that matters • Always measure the SLIs • Objective monitoring is best • Don’t beat the SLO • Only alert on the SLO
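One common way to "only alert on the SLO" is error-budget burn-rate alerting: a 99.9% SLO allows a fixed fraction of failures, and a human is paged only when that budget is being consumed fast enough to threaten the SLO. A minimal sketch, with the SLO and page threshold as assumed parameters:

```python
# Hypothetical sketch of error-budget burn-rate alerting.
# slo=0.999 and burn_threshold=10.0 are assumptions for illustration.

def error_budget_burn_rate(failed, total, slo=0.999):
    """How fast the window consumes budget: 1.0 = exactly on budget."""
    if total == 0:
        return 0.0
    observed_error_rate = failed / total
    allowed_error_rate = 1 - slo
    return observed_error_rate / allowed_error_rate

def should_page(failed, total, slo=0.999, burn_threshold=10.0):
    """Page only on fast burn; slow burn can wait for business hours."""
    return error_budget_burn_rate(failed, total, slo) >= burn_threshold

print(should_page(failed=5, total=10_000))    # False: burn rate 0.5
print(should_page(failed=200, total=10_000))  # True: burn rate 20
```

This is what makes the alert actionable: it fires only when the customer-facing promise is genuinely at risk, not whenever any internal number twitches.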
  21. ONLY??? • SLO alerts find unknown-unknowns • Known-unknowns and unknown-knowns must only exist transiently • A known-known should not require a human; automate responses to known issues • For everything else: if you have a 100% signal it can be an alert, but if it doesn’t impact the SLO, does it need to wake you up?
  22. What About Capacity? • Use Quotas – assure no single user can quickly overrun capacity • Report & Review – frequently enough to respond to trend changes • Act Promptly – never ignore or put off expansion work
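The per-user quota idea is often implemented as a token bucket: each user gets a steady refill rate plus a bounded burst, so no single user can suddenly overrun shared capacity. An illustrative sketch (the rate and burst numbers are assumptions, not from the talk):

```python
# Illustrative token-bucket quota: a steady rate plus a bounded burst.

class TokenBucket:
    def __init__(self, rate, burst):
        self.rate = rate        # tokens added per second
        self.capacity = burst   # maximum burst size
        self.tokens = burst
        self.last = 0.0

    def allow(self, now):
        """Refill based on elapsed time, then try to spend one token."""
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, burst=10)  # 5 req/s steady, bursts of 10
granted = sum(bucket.allow(now=0.0) for _ in range(20))
print(granted)  # 10 -- the burst is served, the rest are throttled
```

Because the bucket throttles gradually rather than failing outright, a noisy user degrades only their own experience while capacity alerts stay quiet.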
  23. Wrapping Up
  24. What Should I Do Next? • Define Your SLOs – talk to your customers and agree on what they can expect; add objective monitoring for these expectations • Clean Up Alerts – inventory alerts and eliminate any that are not a clear signal; add alerts for the SLOs you have agreed on; implement quotas, if needed, to assure capacity isn’t suddenly overrun • Add Instrumentation – switch to structured logging; for distributed systems, consider adding request tracing; but make sure you don’t hold this extra information for longer than it’s needed for debugging
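The "switch to structured logging" step means emitting log lines as structured records (e.g. JSON) instead of free text, so they can be parsed, filtered, and joined to traces later. A minimal sketch; the field names (`trace_id`, `event`, etc.) are illustrative assumptions:

```python
# Sketch of structured logging: emit each log line as a JSON object.
# Field names are illustrative, not a prescribed schema.

import json
import time

def log_event(event, **fields):
    record = {"ts": time.time(), "event": event, **fields}
    print(json.dumps(record))  # one parseable JSON object per line
    return record

rec = log_event("request_served",
                trace_id="abc123",  # ties this line to a request trace
                status=200,
                latency_ms=87)
```

Once every line carries a `trace_id`-style field, request tracing in a distributed system becomes a matter of collecting the events that share an identifier.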
  25. More Resources • Finding Me – @bonkoif • Code Yellow – how we help overburdened teams: Usenix LISA (10/29 – Nashville), “Code Yellow: Helping Top-Heavy Teams the Smart Way” • SRE – what the culture looks like at LinkedIn: “Building SRE”; “Every Day is Monday in Operations” • Kafka – deep dive on monitoring for Apache Kafka
  26. Questions?