1. Put some SRE
in your microservices
Hard-won lessons from the world of SRE…
can be applied to the nascent world of microservices.
2. The many faces of
Theo Schlossnagle
@postwait
CEO Circonus
3. The nature of the problem
Software Sucks
Once you’ve run software at scale,
you have a deep understanding of
how it is all tied together with
loose string and hope.
4. All software will fail, but
good software
fails well
• Consider the phrase:
“have you used X in anger?”
5. Never undervalue grace in failure.
Rule 𝛌1. Crash landings should be both
fast and controlled.
6. What it means to
fail quickly & safely
• The scope of failure should
collapse completely.
• The time to failure should be
measured in small multiples of
normal service time
• Nothing outside the scope of
failure should be impacted.
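One way to keep time-to-failure within a small multiple of normal service time is a hard deadline on every call. The sketch below is only an illustration: `NORMAL_SERVICE_TIME_S`, `TIMEOUT_MULTIPLE`, and `call_with_deadline` are hypothetical names and values, not something from the talk.

```python
import concurrent.futures
import time

NORMAL_SERVICE_TIME_S = 0.05  # assumed typical latency of the dependency
TIMEOUT_MULTIPLE = 3          # assumed "small multiple" of normal service time


def call_with_deadline(fn, *args):
    """Run fn in a worker thread; stop waiting once the deadline passes.

    Raises concurrent.futures.TimeoutError when the deadline is exceeded.
    The caller fails fast, but the worker thread is not killed: it keeps
    running in the background until fn returns on its own.
    """
    deadline_s = NORMAL_SERVICE_TIME_S * TIMEOUT_MULTIPLE
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return future.result(timeout=deadline_s)
    finally:
        # Do not block on the worker; that would defeat the deadline.
        pool.shutdown(wait=False)
```

With these assumed numbers, a caller waits at most about 150 ms before it sees a failure, instead of hanging for as long as the slow dependency does.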
https://www.youtube.com/watch?v=5SL1A2d2e7M
8. Pragmatic analysis is required to
understand failure’s
true nature
• Post-mortem analysis is critical
• Stack traces
• Forensic logs
• Images (cores, dumps, etc.)
9. The difference between a shock and electrocution is real.
Rule 𝛌3. Use circuit breakers.
10. Circuit breakers are designed to
avoid
cascading failure
• it’s not all about you,
especially with microservices
• protect yourselves and others
• circuit breakers of many types
• timing
• queue depth
• concurrency
http://melissaomarkham.com
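A minimal circuit breaker sketch, to make the idea concrete. It trips on consecutive failures, which is an assumption made for brevity; a breaker keyed on timing, queue depth, or concurrency would swap in a different trip check. `CircuitBreaker`, `CircuitOpenError`, and the default thresholds are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=5.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_after_s = reset_after_s  # cool-down before a trial call
        self.clock = clock                  # injectable so tests can control time
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Open: fail immediately and spare the struggling dependency.
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call re-opens at once; otherwise open at the limit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` rather than piling more load onto a failing service, which is what keeps one failure from cascading.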
11. You cannot understand what you cannot measure.
Rule 𝛌4. Behavior is complex.
Understand it.
12. Don’t measure to assess availability;
measure to understand
Build robust models of behavior
Understand performance changes
Don’t use averages
Don’t use percentiles alone
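A small sketch of why a mean or a single percentile can mislead while a histogram does not. The function name, the bucket edges, and the sample latencies are all made up for illustration.

```python
import bisect
import math
import statistics


def summarize(latencies_ms, bucket_edges_ms=(10, 50, 100, 500)):
    """Return the mean, a nearest-rank p99, and per-bucket counts.

    Buckets are <=10, (10, 50], (50, 100], (100, 500], >500 ms by default.
    """
    ordered = sorted(latencies_ms)
    p99_index = max(0, math.ceil(0.99 * len(ordered)) - 1)
    counts = [0] * (len(bucket_edges_ms) + 1)
    for value in ordered:
        # bisect_left finds the first edge that is >= value.
        counts[bisect.bisect_left(bucket_edges_ms, value)] += 1
    return {
        "mean": statistics.fmean(ordered),
        "p99": ordered[p99_index],
        "histogram": counts,
    }


# Two invented workloads with the same p99 but very different behavior.
mostly_fast = [5] * 90 + [1000] * 10
half_slow = [5] * 50 + [400] * 40 + [1000] * 10
```

Both samples report a p99 of 1000 ms, yet the histograms differ: `[90, 0, 0, 0, 10]` versus `[50, 0, 0, 40, 10]`. The mean of the first sample is 104.5 ms, a latency that no request in it actually experienced.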
14. It’s easy to demand perfection; it’s also stupid.
Rule 𝛌5. Have a failure budget.
15. Avoiding failure is simply impossible;
expect and manage
failure
• use failure budgets
• set expectations reasonably
• define and reward successes on
improvement and competency,
not just uptime.
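The arithmetic behind a failure budget is simple enough to sketch. The helper below assumes a request-based target; the function name and the example numbers are hypothetical, not figures from the talk.

```python
def failure_budget(slo_target, total_requests, failed_requests):
    """Compute how much of a request-based failure budget has been used.

    slo_target is the fraction of requests expected to succeed, e.g. 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed = (1.0 - slo_target) * total_requests  # failures the target permits
    remaining = allowed - failed_requests          # negative means overspent
    consumed = failed_requests / allowed if allowed else float("inf")
    return {"allowed": allowed, "remaining": remaining, "consumed": consumed}
```

For example, a 99.9% target over 1,000,000 requests permits roughly 1,000 failures; after 250 failures about a quarter of the budget is spent, which leaves room for planned risk rather than demanding zero failures.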
16. Justice should be blind; operations should not.
Rule 𝛌6. Instrumentation &
Observability have no equals.
17. For every “I wonder what X is right now?”
in production,
you must have answers
DTrace
eBPF
Instrument code for observability
https://www.pinterest.com/pin/441775044670412234/
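As a rough illustration of instrumenting code for observability, the sketch below wraps a function so that call counts, error counts, and recent raw latencies can be read at any moment. It is an in-process toy, not a substitute for DTrace or eBPF; `instrumented`, `snapshot`, and the 1000-sample window are hypothetical choices.

```python
import collections
import functools
import threading
import time

_lock = threading.Lock()
_stats = {}


def instrumented(name):
    """Decorator: record calls, errors, and recent latencies under `name`."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            failed = False
            try:
                return fn(*args, **kwargs)
            except Exception:
                failed = True
                raise
            finally:
                elapsed = time.perf_counter() - start
                with _lock:
                    entry = _stats.setdefault(name, {
                        "calls": 0,
                        "errors": 0,
                        # Keep raw samples (bounded), not just an average.
                        "latencies_s": collections.deque(maxlen=1000),
                    })
                    entry["calls"] += 1
                    if failed:
                        entry["errors"] += 1
                    entry["latencies_s"].append(elapsed)
        return wrapper
    return decorate


def snapshot():
    """Answer "what is X right now?" with a copy of the current counters."""
    with _lock:
        return {
            name: {
                "calls": entry["calls"],
                "errors": entry["errors"],
                "latencies_s": list(entry["latencies_s"]),
            }
            for name, entry in _stats.items()
        }
```

Decorating a handler with `@instrumented("lookup")` and exposing `snapshot()` on a debug endpoint would let an operator inspect live behavior without redeploying.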