This document dissects the anatomy of cascading failures in distributed systems. It describes common triggering conditions, such as planned changes, traffic fluctuations, resource starvation, and crashes, then explains how an initial failure can spread through load redistribution, retry amplification, latency creep, and resource contention during recovery. Finally, it presents strategies for improving resilience: robust architecture, chaos engineering, retry policies, throttling, circuit breaking, fallbacks, and choosing the right tools.
6. Triggering Conditions - Change
● New Rollouts
● Planned Changes
● Traffic Drains
● Turndowns
public String getCountry(String userId) {
  try {
    // Try to get the latest country to avoid stale info.
    // The update method will write the userInfo to the DB.
    UserInfo userInfo = userInfoService.update(userId);
    ...
    return userInfo.getCountry();
  } catch (Exception e) {
    // Default to the cache if the service is down.
    return getCountryFromCache(userId);
  }
}
The Anatomy of a Cascading Failure
12. Resource Starvation - Dependencies Between Resources
Poorly tuned garbage collection
→ slow requests
→ increased CPU due to GC
→ more in-progress requests
→ more RAM consumed by queuing
→ less RAM for caching
→ lower cache hit rate
→ more requests to the backend
→ 🔥🔥🔥
The Anatomy of a Cascading Failure
20. Architecture - Orchestration vs Choreography
Strategies for Improving Resilience
[Diagram: Orchestration] The Signup service calls the User service ("Create user"), the Account service ("Create Account"), and the Card service ("Ship card") directly.
[Diagram: Choreography] The Signup service publishes a "User signup" event; the User, Account, and Card services subscribe to it and react independently.
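To make the choreography side concrete, here is a minimal sketch assuming a hypothetical in-process EventBus (a stand-in for a real broker such as Kafka or Pub/Sub); the service names follow the diagram above.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical in-process event bus standing in for a real broker.
class EventBus {
    private final List<Consumer<String>> subscribers = new ArrayList<>();

    void subscribe(Consumer<String> handler) { subscribers.add(handler); }

    void publish(String event) { subscribers.forEach(s -> s.accept(event)); }
}

public class SignupChoreography {
    public static void main(String[] args) {
        EventBus bus = new EventBus();

        // Each downstream service subscribes independently; the signup
        // service does not know (or care) who reacts to the event.
        bus.subscribe(e -> System.out.println("User service: create user for " + e));
        bus.subscribe(e -> System.out.println("Account service: create account for " + e));
        bus.subscribe(e -> System.out.println("Card service: ship card for " + e));

        // The signup service publishes a single "user signup" event.
        bus.publish("user-42");
    }
}

Because the publisher has no direct dependency on its consumers, a slow or failing Card service cannot stall signups, which is one reason pub/sub helps under growing traffic.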
21. Capacity Planning - Do I need it?
Strategies for Improving Resilience
[Flowchart] Are you operating at Google scale?
● No (you likely don’t) → go to the next slide.
● Yes → OK, but it will be costly and imprecise.
22. Capacity Planning - More important things
● Automate provisioning and deployment (🐄🐄🐄 not 🐕🐈🐹: cattle, not pets)
● Auto-scaling and auto-healing
● Robust architecture in the face of growing traffic (pub/sub helps)
● Agree on SLIs and SLOs and monitor them closely
Strategies for Improving Resilience
23. Chaos Testing
“Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
(Principles of Chaos Engineering)
Strategies for Improving Resilience
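As a toy illustration of the idea (not any particular chaos tool), a wrapper like the following can inject failures or extra latency into calls with a configurable probability, so you can observe how the rest of the system copes:

import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

// Toy fault injector: with some probability, throw or add latency
// instead of just calling the real dependency. Illustrative only.
public class ChaosWrapper {
    static <T> T withChaos(Supplier<T> call, double failRate, long extraLatencyMs) {
        double roll = ThreadLocalRandom.current().nextDouble();
        if (roll < failRate) {
            throw new RuntimeException("chaos: injected failure");
        }
        if (roll < failRate * 2) {
            try {
                Thread.sleep(extraLatencyMs); // injected latency
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return call.get();
    }

    public static void main(String[] args) {
        String result = withChaos(() -> "real response", 0.1, 500);
        System.out.println(result);
    }
}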
24. Retrying - What should I retry?
What makes a request retriable?
● ⚠ idempotency is a must
● 🚫 never retry a GET with side-effects
● ✅ keep requests stateless if you can
Should you retry timeouts?
● Stay tuned for the next slides
Strategies for Improving Resilience
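A minimal sketch of the rule above, assuming a hypothetical callWithRetry helper: only operations the caller has explicitly declared idempotent are retried; everything else gets a single attempt.

import java.util.function.Supplier;

public class SafeRetry {
    // Retry only operations the caller has declared idempotent;
    // a GET with side-effects must not be retried blindly.
    static <T> T callWithRetry(Supplier<T> op, boolean idempotent, int maxAttempts) {
        RuntimeException last = null;
        int attempts = idempotent ? maxAttempts : 1;
        for (int i = 0; i < attempts; i++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                last = e;
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        String v = callWithRetry(() -> "ok", true, 3);
        System.out.println(v);
    }
}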
26. Retrying - Retry Budgets
Per-request retry budget
● Each request is retried at most 3x
Per-client retry budget
● Retries make up at most 10% of total requests to the upstream
● If > 10% of requests are failing => the upstream is likely unhealthy
Strategies for Improving Resilience
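A sketch combining both budgets, with hypothetical names (RetryBudget, call): each request gets at most 3 attempts, and retries across the whole client are capped at 10% of total requests sent to the upstream.

import java.util.concurrent.atomic.AtomicLong;

// Sketch of the two budgets from the slide: at most 3 tries per request,
// and retries capped at 10% of all requests sent to the upstream.
public class RetryBudget {
    private final AtomicLong totalRequests = new AtomicLong();
    private final AtomicLong totalRetries = new AtomicLong();

    boolean mayRetry() {
        // Per-client budget: retries must stay under 10% of total traffic.
        return totalRetries.get() < totalRequests.get() * 0.10;
    }

    String call(java.util.function.Supplier<String> op) {
        for (int attempt = 1; attempt <= 3; attempt++) { // per-request budget: 3x
            totalRequests.incrementAndGet();
            try {
                return op.get();
            } catch (RuntimeException e) {
                if (attempt == 3 || !mayRetry()) throw e;
                totalRetries.incrementAndGet();
            }
        }
        throw new IllegalStateException("unreachable");
    }
}

When the upstream is broadly unhealthy, the per-client budget runs out quickly and failures surface to callers instead of being amplified into a retry storm.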
27. Throttling - Setting Timeouts
Strategies for Improving Resilience
[Diagram] Service A calls Service B with a 3s timeout; Service B calls Service C with a 2s timeout and Service D with a 5s timeout. A downstream call running longer than 2s already puts the overall 3s budget at risk, and work on D can outlive the caller's deadline.
⚠ Timeout early and be disciplined when setting timeouts ⚠
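As a concrete example of timing out early, java.net.http lets you set both a connect timeout and a per-request deadline; the endpoint below is hypothetical.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class TimeoutExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(1)) // fail fast on connect
                .build();

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://service-c.internal/info")) // hypothetical endpoint
                .timeout(Duration.ofSeconds(2)) // overall deadline for this call
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}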
28. Throttling - Complexity Creep
Strategies for Improving Resilience
⚠ Propagate timeouts and avoid nesting ⚠
[Diagram] A nested call chain: Service A calls Service B (7s timeout), Service B calls Service C (5s timeout) and Service D (3s timeout), and Service E and Service F sit further down with 3s and 2s timeouts. Each level picks its own number, and keeping them consistent gets harder with every hop.
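One way to propagate a timeout instead of nesting them is to pass a single absolute deadline down the chain and derive each hop's budget from the time remaining; a minimal sketch with hypothetical service names:

import java.time.Duration;
import java.time.Instant;

// Sketch: pass one absolute deadline down the chain and derive each
// hop's timeout from the time remaining, instead of nesting fixed timeouts.
public class DeadlinePropagation {
    static String handle(Instant deadline) {
        return callDownstream("service-c", deadline);
    }

    static String callDownstream(String name, Instant deadline) {
        Duration remaining = Duration.between(Instant.now(), deadline);
        if (remaining.isNegative() || remaining.isZero()) {
            throw new RuntimeException("deadline exceeded before calling " + name);
        }
        // Use `remaining` as this hop's timeout (e.g. request.timeout(remaining)).
        return name + " called with " + remaining.toMillis() + "ms budget";
    }

    public static void main(String[] args) {
        // The edge sets a single overall deadline (e.g. 7s) for the whole request.
        Instant deadline = Instant.now().plusSeconds(7);
        System.out.println(handle(deadline));
    }
}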
29. Throttling - Circular Dependencies
Strategies for Improving Resilience
⚠ Avoid circular dependencies at all cost ⚠
[Diagram] Service A calls Service B, Service B calls Service C, and Service C calls back into Service A, closing the loop, with 5s, 5s, 3s, and 2s timeouts along the cycle.
30. Throttling - Rate Limiting
Prevent clients from overloading the upstream by setting per-client limits:
● requests from one calling service can use up to x CPU seconds per time interval on the upstream
● anything above that will be throttled
● these metrics are aggregated across all instances of the calling service and the upstream
If this is too complicated => limit based on RPS per customer and endpoint
Strategies for Improving Resilience
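If you go the simpler RPS-based route, a per-client token bucket is enough; a minimal sketch with hypothetical names (PerClientRateLimiter, allow):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal per-client token bucket: each client gets `ratePerSec` requests
// per second plus some burst headroom; anything above is throttled.
public class PerClientRateLimiter {
    private static class Bucket {
        double tokens;
        long lastRefillNanos;
        Bucket(double tokens, long now) { this.tokens = tokens; this.lastRefillNanos = now; }
    }

    private final double ratePerSec;
    private final double burst;
    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();

    PerClientRateLimiter(double ratePerSec, double burst) {
        this.ratePerSec = ratePerSec;
        this.burst = burst;
    }

    synchronized boolean allow(String clientId) {
        long now = System.nanoTime();
        Bucket b = buckets.computeIfAbsent(clientId, id -> new Bucket(burst, now));
        double refill = (now - b.lastRefillNanos) / 1e9 * ratePerSec;
        b.tokens = Math.min(burst, b.tokens + refill);
        b.lastRefillNanos = now;
        if (b.tokens >= 1.0) {
            b.tokens -= 1.0;
            return true;   // admit the request
        }
        return false;      // throttle
    }
}

Note this counts requests, not CPU seconds; the CPU-based variant from the slide needs the aggregated usage metrics described above.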
31. Throttling - Circuit Breaking
Strategies for Improving Resilience
[State diagram] Closed → Open when failures reach the threshold (failures under the threshold stay in Closed). Open → Half Open after the reset timeout; while Open, calls fail fast and raise "circuit open". Half Open → Closed on success; Half Open → Open on failure.
[Sequence diagram] Service A calls Service B through a circuit breaker; repeated timeouts trip the circuit, and further calls are rejected immediately with "circuit open".
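A compact sketch of the state machine above; the failure threshold and reset timeout values are illustrative, not prescriptive.

import java.util.function.Supplier;

// Sketch of the Closed -> Open -> Half-Open state machine from the diagram.
public class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAtMillis = 0;
    private final int failureThreshold = 5;        // illustrative
    private final long resetTimeoutMillis = 10_000; // illustrative

    synchronized <T> T call(Supplier<T> op) {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAtMillis >= resetTimeoutMillis) {
                state = State.HALF_OPEN; // probe the upstream again
            } else {
                throw new RuntimeException("circuit open"); // fail fast
            }
        }
        try {
            T result = op.get();
            // Success: close the circuit and reset the failure count.
            state = State.CLOSED;
            failures = 0;
            return result;
        } catch (RuntimeException e) {
            failures++;
            if (state == State.HALF_OPEN || failures >= failureThreshold) {
                state = State.OPEN; // trip the circuit
                openedAtMillis = System.currentTimeMillis();
            }
            throw e;
        }
    }
}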
33. Fallbacks and Rejection
● Cache
● Dead letter queues for writes
● Return a hard-coded value
● Empty response (“Fail Silent”)
All of these change the user experience.
⚠ Make sure to discuss these with your product owners ⚠
Strategies for Improving Resilience
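A sketch of chaining the fallbacks listed above (primary call, then cache, then a hard-coded value); the names and the "US" default are purely illustrative.

import java.util.Optional;
import java.util.function.Supplier;

// Sketch of ordered fallbacks: primary call, then cache, then a
// hard-coded default. "Fail silent" would instead return empty.
public class FallbackChain {
    static String getCountry(String userId,
                             Supplier<String> primary,
                             Supplier<Optional<String>> cache) {
        try {
            return primary.get();
        } catch (RuntimeException primaryFailure) {
            // Hypothetical default, agreed with product owners.
            return cache.get().orElse("US");
        }
    }

    public static void main(String[] args) {
        String country = getCountry("user-42",
                () -> { throw new RuntimeException("service down"); },
                Optional::empty);
        System.out.println(country); // falls back to "US"
    }
}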
36. Your knobs may vary
Choosing the right tools
⚠ Never sacrifice observability ⚠
[Comparison] Libraries and Frameworks vs Side-car proxies, weighed along:
● Simplicity of operation
● Simplicity of testing
● Simplicity of enforcement
● Simplicity of configuration
● Polyglot support
● Predictability of results