This document dissects the anatomy of cascading failures in distributed systems. It describes common triggering conditions, such as planned changes, traffic fluctuations, resource starvation, and crashes, then explains how an initial failure can spread through load redistribution, retry amplification, latency creep, and resource contention during recovery. Finally, it presents strategies for improving resilience: robust architecture, chaos engineering, retry policies, throttling, circuit breaking, fallbacks, and choosing the right tools.
6. Triggering Conditions - Change
● New Rollouts
● Planned Changes
● Traffic Drains
● Turndowns
public String getCountry(String userId) {
  try {
    // Try to get the latest country to avoid stale info.
    // The update method will write the userInfo to the DB.
    UserInfo userInfo = userInfoService.update(userId);
    ...
    return userInfo.getCountry();
  } catch (Exception e) {
    // Default to the cache if the service is down.
    return getCountryFromCache(userId);
  }
}
The Anatomy of a Cascading Failure
12. Resource Starvation - Dependencies Between Resources
Poorly tuned garbage collection
→ slow requests
→ increased CPU due to GC
→ more in-progress requests
→ more RAM consumed by queuing
→ less RAM for caching
→ lower cache hit rate
→ more requests to the backend
→ 🔥🔥🔥
The Anatomy of a Cascading Failure
20. Architecture - Orchestration vs Choreography
Strategies for Improving Resilience
[Diagram: Orchestration] The Signup service calls the User service ("Create user"), the Account service ("Create Account"), and the Card service ("Ship card") directly.
[Diagram: Choreography] The Signup service publishes a "User signup" event; the User, Account, and Card services subscribe to it and react independently.
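To make the choreography side concrete, here is a minimal sketch assuming a hypothetical in-process EventBus (a stand-in for a real broker such as Kafka or Pub/Sub); the service names follow the diagram above.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical in-process event bus standing in for a real broker.
class EventBus {
    private final List<Consumer<String>> subscribers = new ArrayList<>();

    void subscribe(Consumer<String> handler) { subscribers.add(handler); }

    void publish(String event) { subscribers.forEach(s -> s.accept(event)); }
}

public class SignupChoreography {
    public static void main(String[] args) {
        EventBus bus = new EventBus();

        // Each downstream service subscribes independently; the signup
        // service does not know (or care) who reacts to the event.
        bus.subscribe(e -> System.out.println("User service: create user for " + e));
        bus.subscribe(e -> System.out.println("Account service: create account for " + e));
        bus.subscribe(e -> System.out.println("Card service: ship card for " + e));

        // The signup service publishes a single "user signup" event.
        bus.publish("user-42");
    }
}

Because the publisher has no direct dependency on its consumers, a slow or failing Card service cannot stall signups, which is one reason pub/sub helps under growing traffic.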
21. Capacity Planning - Do I need it?
Strategies for Improving Resilience
[Flowchart] Are you operating at Google scale?
● No (you likely don’t) → go to the next slide.
● Yes → OK, but it will be costly and imprecise.
22. Capacity Planning - More important things
● Automate provisioning and deployment (🐄🐄🐄 not 🐕🐈🐹: cattle, not pets)
● Auto-scaling and auto-healing
● Robust architecture in the face of growing traffic (pub/sub helps)
● Agree on SLIs and SLOs and monitor them closely
Strategies for Improving Resilience
23. Chaos Testing
“Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
(Principles of Chaos Engineering)
Strategies for Improving Resilience
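As a toy illustration of the idea (not any particular chaos tool), a wrapper like the following can inject failures or extra latency into calls with a configurable probability, so you can observe how the rest of the system copes:

import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

// Toy fault injector: with some probability, throw or add latency
// instead of just calling the real dependency. Illustrative only.
public class ChaosWrapper {
    static <T> T withChaos(Supplier<T> call, double failRate, long extraLatencyMs) {
        double roll = ThreadLocalRandom.current().nextDouble();
        if (roll < failRate) {
            throw new RuntimeException("chaos: injected failure");
        }
        if (roll < failRate * 2) {
            try {
                Thread.sleep(extraLatencyMs); // injected latency
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return call.get();
    }

    public static void main(String[] args) {
        String result = withChaos(() -> "real response", 0.1, 500);
        System.out.println(result);
    }
}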
24. Retrying - What should I retry?
What makes a request retriable?
● ⚠ idempotency is a must
● 🚫 never retry a GET with side-effects
● ✅ keep requests stateless if you can
Should you retry timeouts?
● Stay tuned for the next slides
Strategies for Improving Resilience
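A minimal sketch of the rule above, assuming a hypothetical callWithRetry helper: only operations the caller has explicitly declared idempotent are retried; everything else gets a single attempt.

import java.util.function.Supplier;

public class SafeRetry {
    // Retry only operations the caller has declared idempotent;
    // a GET with side-effects must not be retried blindly.
    static <T> T callWithRetry(Supplier<T> op, boolean idempotent, int maxAttempts) {
        RuntimeException last = null;
        int attempts = idempotent ? maxAttempts : 1;
        for (int i = 0; i < attempts; i++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                last = e;
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        String v = callWithRetry(() -> "ok", true, 3);
        System.out.println(v);
    }
}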
26. Retrying - Retry Budgets
Per-request retry budget
● Each request is retried at most 3x
Per-client retry budget
● Retries make up at most 10% of total requests to the upstream
● If > 10% of requests are failing => the upstream is likely unhealthy
Strategies for Improving Resilience
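A sketch combining both budgets, with hypothetical names (RetryBudget, call): each request gets at most 3 attempts, and retries across the whole client are capped at 10% of total requests sent to the upstream.

import java.util.concurrent.atomic.AtomicLong;

// Sketch of the two budgets from the slide: at most 3 tries per request,
// and retries capped at 10% of all requests sent to the upstream.
public class RetryBudget {
    private final AtomicLong totalRequests = new AtomicLong();
    private final AtomicLong totalRetries = new AtomicLong();

    boolean mayRetry() {
        // Per-client budget: retries must stay under 10% of total traffic.
        return totalRetries.get() < totalRequests.get() * 0.10;
    }

    String call(java.util.function.Supplier<String> op) {
        for (int attempt = 1; attempt <= 3; attempt++) { // per-request budget: 3x
            totalRequests.incrementAndGet();
            try {
                return op.get();
            } catch (RuntimeException e) {
                if (attempt == 3 || !mayRetry()) throw e;
                totalRetries.incrementAndGet();
            }
        }
        throw new IllegalStateException("unreachable");
    }
}

When the upstream is broadly unhealthy, the per-client budget runs out quickly and failures surface to callers instead of being amplified into a retry storm.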
27. Throttling - Setting Timeouts
Strategies for Improving Resilience
[Diagram] Service A calls Service B with a 3s timeout; Service B calls Service C with a 2s timeout and Service D with a 5s timeout. A downstream call running longer than 2s already puts the overall 3s budget at risk, and work on D can outlive the caller's deadline.
⚠ Timeout early and be disciplined when setting timeouts ⚠
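As a concrete example of timing out early, java.net.http lets you set both a connect timeout and a per-request deadline; the endpoint below is hypothetical.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class TimeoutExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(1)) // fail fast on connect
                .build();

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://service-c.internal/info")) // hypothetical endpoint
                .timeout(Duration.ofSeconds(2)) // overall deadline for this call
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}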
28. Throttling - Complexity Creep
Strategies for Improving Resilience
⚠ Propagate timeouts and avoid nesting ⚠
[Diagram] A nested call chain: Service A calls Service B (7s timeout), Service B calls Service C (5s timeout) and Service D (3s timeout), and Service E and Service F sit further down with 3s and 2s timeouts. Each level picks its own number, and keeping them consistent gets harder with every hop.
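One way to propagate a timeout instead of nesting them is to pass a single absolute deadline down the chain and derive each hop's budget from the time remaining; a minimal sketch with hypothetical service names:

import java.time.Duration;
import java.time.Instant;

// Sketch: pass one absolute deadline down the chain and derive each
// hop's timeout from the time remaining, instead of nesting fixed timeouts.
public class DeadlinePropagation {
    static String handle(Instant deadline) {
        return callDownstream("service-c", deadline);
    }

    static String callDownstream(String name, Instant deadline) {
        Duration remaining = Duration.between(Instant.now(), deadline);
        if (remaining.isNegative() || remaining.isZero()) {
            throw new RuntimeException("deadline exceeded before calling " + name);
        }
        // Use `remaining` as this hop's timeout (e.g. request.timeout(remaining)).
        return name + " called with " + remaining.toMillis() + "ms budget";
    }

    public static void main(String[] args) {
        // The edge sets a single overall deadline (e.g. 7s) for the whole request.
        Instant deadline = Instant.now().plusSeconds(7);
        System.out.println(handle(deadline));
    }
}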
29. Throttling - Circular Dependencies
Strategies for Improving Resilience
⚠ Avoid circular dependencies at all cost ⚠
[Diagram] Service A calls Service B, Service B calls Service C, and Service C calls back into Service A, closing the loop, with 5s, 5s, 3s, and 2s timeouts along the cycle.
30. Throttling - Rate Limiting
Prevent clients from overloading the upstream by setting per-client limits:
● requests from one calling service can use up to x CPU seconds per time interval on the upstream
● anything above that will be throttled
● these metrics are aggregated across all instances of the calling service and the upstream
If this is too complicated => limit based on RPS per customer and endpoint
Strategies for Improving Resilience
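If you go the simpler RPS-based route, a per-client token bucket is enough; a minimal sketch with hypothetical names (PerClientRateLimiter, allow):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal per-client token bucket: each client gets `ratePerSec` requests
// per second plus some burst headroom; anything above is throttled.
public class PerClientRateLimiter {
    private static class Bucket {
        double tokens;
        long lastRefillNanos;
        Bucket(double tokens, long now) { this.tokens = tokens; this.lastRefillNanos = now; }
    }

    private final double ratePerSec;
    private final double burst;
    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();

    PerClientRateLimiter(double ratePerSec, double burst) {
        this.ratePerSec = ratePerSec;
        this.burst = burst;
    }

    synchronized boolean allow(String clientId) {
        long now = System.nanoTime();
        Bucket b = buckets.computeIfAbsent(clientId, id -> new Bucket(burst, now));
        double refill = (now - b.lastRefillNanos) / 1e9 * ratePerSec;
        b.tokens = Math.min(burst, b.tokens + refill);
        b.lastRefillNanos = now;
        if (b.tokens >= 1.0) {
            b.tokens -= 1.0;
            return true;   // admit the request
        }
        return false;      // throttle
    }
}

Note this counts requests, not CPU seconds; the CPU-based variant from the slide needs the aggregated usage metrics described above.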
31. Throttling - Circuit Breaking
Strategies for Improving Resilience
[State diagram] Closed → Open when failures reach the threshold (failures under the threshold stay in Closed). Open → Half Open after the reset timeout; while Open, calls fail fast and raise "circuit open". Half Open → Closed on success; Half Open → Open on failure.
[Sequence diagram] Service A calls Service B through a circuit breaker; repeated timeouts trip the circuit, and further calls are rejected immediately with "circuit open".
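A compact sketch of the state machine above; the failure threshold and reset timeout values are illustrative, not prescriptive.

import java.util.function.Supplier;

// Sketch of the Closed -> Open -> Half-Open state machine from the diagram.
public class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAtMillis = 0;
    private final int failureThreshold = 5;        // illustrative
    private final long resetTimeoutMillis = 10_000; // illustrative

    synchronized <T> T call(Supplier<T> op) {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAtMillis >= resetTimeoutMillis) {
                state = State.HALF_OPEN; // probe the upstream again
            } else {
                throw new RuntimeException("circuit open"); // fail fast
            }
        }
        try {
            T result = op.get();
            // Success: close the circuit and reset the failure count.
            state = State.CLOSED;
            failures = 0;
            return result;
        } catch (RuntimeException e) {
            failures++;
            if (state == State.HALF_OPEN || failures >= failureThreshold) {
                state = State.OPEN; // trip the circuit
                openedAtMillis = System.currentTimeMillis();
            }
            throw e;
        }
    }
}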
33. Fallbacks and Rejection
● Cache
● Dead letter queues for writes
● Return a hard-coded value
● Empty response (“Fail Silent”)
All of these change the user experience.
⚠ Make sure to discuss these with your product owners ⚠
Strategies for Improving Resilience
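A sketch of chaining the fallbacks listed above (primary call, then cache, then a hard-coded value); the names and the "US" default are purely illustrative.

import java.util.Optional;
import java.util.function.Supplier;

// Sketch of ordered fallbacks: primary call, then cache, then a
// hard-coded default. "Fail silent" would instead return empty.
public class FallbackChain {
    static String getCountry(String userId,
                             Supplier<String> primary,
                             Supplier<Optional<String>> cache) {
        try {
            return primary.get();
        } catch (RuntimeException primaryFailure) {
            // Hypothetical default, agreed with product owners.
            return cache.get().orElse("US");
        }
    }

    public static void main(String[] args) {
        String country = getCountry("user-42",
                () -> { throw new RuntimeException("service down"); },
                Optional::empty);
        System.out.println(country); // falls back to "US"
    }
}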
36. Your knobs may vary
Choosing the right tools
⚠ Never sacrifice observability ⚠
[Comparison] Libraries and Frameworks vs Side-car proxies, weighed along:
● Simplicity of operation
● Simplicity of testing
● Simplicity of enforcement
● Simplicity of configuration
● Polyglot support
● Predictability of results