The Anatomy of a
Cascading Failure
Rareș Mușină, Tech Lead @N26
@r3sm4n
Triggering Conditions - What happened?
Triggering Conditions - Change
New Rollouts
Planned Changes
Traffic Drains
Turndowns
public String getCountry(String userId) {
    try {
        // Try to get latest country to avoid stale info
        // The update method will write the userInfo to DB
        UserInfo userInfo = userInfoService.update(userId);
        ...
        return userInfo.getCountry();
    } catch (Exception e) {
        // Default to cache if service is down
        return getCountryFromCache(userId);
    }
}
Triggering Conditions - Throttling
Triggering Conditions - Entropy
Burstiness (e.g. scheduled tasks)
DDoSes
Instance Death (gee, thanks Spotinst)
Organic Growth
Request profile changes
Resource Starvation - Vespene Gas Is Finite
CPU
Memory
Network
Disk space
Threads
File descriptors
...
Resource Starvation - Dependencies Between Resources
Poorly tuned Garbage Collection
Slow requests
Increased CPU due to GC
More in-progress requests
More RAM due to queuing
Less RAM for caching
Lower cache hit rate
More requests to backend
🔥🔥🔥
Server Overload/Meltdown/Crash/Unavailability
:(
CPU/Memory maxed out
Health checks returning 5xx
Endpoints returning 5xx
Timeouts
Increased load on other instances
Cascading Failures - Load Redistribution
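[Diagram] Before: each of two ELBs spreads a load of 600 across instances A and B (500 + 100 from one ELB, 350 + 250 from the other).
After B fails: instance A takes the full 600 from each ELB.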
Cascading Failures - Retry Amplification
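A rough way to see why retries amplify load: if every layer in a call chain retries a failed call independently, the attempts multiply at each hop. The sketch below is an illustration under that assumption only (the class and method names are hypothetical, not from the talk); it computes the worst-case number of attempts that can reach the deepest service for a single user request.

```java
/** Worst-case retry fan-out in a chain of services (illustrative sketch). */
final class RetryAmplification {
    private RetryAmplification() {}

    /**
     * Each layer makes 1 original call plus retriesPerLayer retries,
     * so the deepest service can see (retriesPerLayer + 1)^layers attempts.
     */
    static long worstCaseAttempts(int retriesPerLayer, int layers) {
        if (retriesPerLayer < 0 || layers < 0) {
            throw new IllegalArgumentException("arguments must be non-negative");
        }
        long attempts = 1;
        for (int i = 0; i < layers; i++) {
            // multiplyExact throws instead of silently overflowing
            attempts = Math.multiplyExact(attempts, retriesPerLayer + 1L);
        }
        return attempts;
    }
}
```

With 3 retries per layer across 3 layers, one user request can turn into up to 64 attempts at the bottom of the stack.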
Cascading Failures - Latency Creep
Cascading Failures - Latency Propagation
[Diagram: call graph of Service A, Service B, Service C, Service D, and Service E]
Cascading Failures - Resource Contention During Recovery
Strategies for
Improving Resilience
Architecture - Orchestration vs Choreography
Orchestration: the Signup service calls the User service (Create user), the Account service (Create Account), and the Card service (Ship card).
Choreography: the Signup service publishes a User signup event; the User service, Account service, and Card service subscribe to it.
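The choreography variant can be sketched in a few lines. This is a minimal in-process stand-in for a message broker, assuming the event carries only a user ID; the class name `SignupEvents` and its methods are hypothetical and only illustrate that the publisher does not know its subscribers.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

/** In-process stand-in for a broker topic carrying "user signup" events. */
final class SignupEvents {
    private final List<Consumer<String>> subscribers = new CopyOnWriteArrayList<>();

    /** Called by each downstream service (user, account, card) at startup. */
    void subscribe(Consumer<String> handler) {
        subscribers.add(handler);
    }

    /**
     * Called by the signup service, which does not know who is listening.
     * Returns how many handlers processed the event successfully.
     */
    int publish(String userId) {
        int delivered = 0;
        for (Consumer<String> handler : subscribers) {
            try {
                handler.accept(userId);
                delivered++;
            } catch (RuntimeException e) {
                // A failing subscriber must not block the others or the publisher.
                // A real broker would redeliver or dead-letter the event instead.
            }
        }
        return delivered;
    }
}
```

In the orchestration variant the signup service would instead call each service directly and would have to handle each failure itself, which is where the coupling comes from.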
Capacity Planning - Do I need it?
Are you operating at Google scale?
No: You likely don’t. Go to next slide.
Yes: OK, but it will be costly and imprecise.
Capacity Planning - More important things
Automate provisioning and deployment (🐄🐄🐄 not 🐕🐈🐹)
Auto-scaling and auto-healing
Robust architecture in the face of growing traffic (pub/sub helps)
Agree on SLIs and SLOs and monitor them closely
Chaos Testing
“Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
Principles of Chaos Engineering
Retrying - What should I retry?
What makes a request retriable?
● ⚠ idempotency
● 🚫 GET with side-effects
● ✅ stateless if you can
Should you retry timeouts?
● Stay tuned to the next slides
Retrying - Backing Off With Jitter
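A minimal sketch of exponential backoff with jitter, assuming the "full jitter" variant (a uniformly random delay between zero and an exponentially growing ceiling). The class name, the base delay, and the cap are hypothetical parameters chosen for illustration; the slide does not prescribe a specific formula.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Exponential backoff with full jitter (illustrative sketch). */
final class JitteredBackoff {
    private final long baseMillis;
    private final long capMillis;

    JitteredBackoff(long baseMillis, long capMillis) {
        if (baseMillis <= 0 || capMillis < baseMillis || capMillis == Long.MAX_VALUE) {
            throw new IllegalArgumentException("need 0 < base <= cap < Long.MAX_VALUE");
        }
        this.baseMillis = baseMillis;
        this.capMillis = capMillis;
    }

    /** Un-jittered ceiling for a zero-based attempt: min(cap, base * 2^attempt). */
    long ceilingMillis(int attempt) {
        int shift = Math.min(Math.max(attempt, 0), 62);
        // Compare before shifting so base * 2^attempt can never overflow.
        return baseMillis > (capMillis >> shift) ? capMillis : baseMillis << shift;
    }

    /** Random delay in [0, ceiling] so that clients do not retry in lockstep. */
    long delayMillis(int attempt) {
        return ThreadLocalRandom.current().nextLong(ceilingMillis(attempt) + 1);
    }
}
```

The randomness is the point: without it, clients that failed together retry together and hit the recovering service in synchronized waves.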
Retrying - Retry Budgets
Per-request retry budget
● Each request retried at most 3x
Per-client retry budget
● Retry requests = at most 10% total requests to upstream
● If > 10% of requests are failing => upstream is likely unhealthy
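The two budgets above can be combined in one small guard. This is a sketch under simplifying assumptions: the counters cover the client's whole lifetime rather than a sliding time window, and "total requests" counts first attempts only. `RetryBudget` and its method names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Per-request and per-client retry budget (illustrative sketch). */
final class RetryBudget {
    private final int maxRetriesPerRequest;
    private final int maxRetryPercent;
    private final AtomicLong requests = new AtomicLong();
    private final AtomicLong retries = new AtomicLong();

    RetryBudget(int maxRetriesPerRequest, int maxRetryPercent) {
        this.maxRetriesPerRequest = maxRetriesPerRequest;
        this.maxRetryPercent = maxRetryPercent;
    }

    /** Call once for every first attempt sent to the upstream. */
    void recordRequest() {
        requests.incrementAndGet();
    }

    /** Returns true if one more retry is allowed for a request that already retried n times. */
    boolean tryAcquireRetry(int retriesAlreadyMade) {
        if (retriesAlreadyMade >= maxRetriesPerRequest) {
            return false; // per-request budget exhausted
        }
        while (true) {
            long used = retries.get();
            long allowed = requests.get() * maxRetryPercent / 100;
            if (used >= allowed) {
                return false; // per-client budget exhausted: upstream is likely unhealthy
            }
            if (retries.compareAndSet(used, used + 1)) {
                return true;
            }
        }
    }
}
```

With `new RetryBudget(3, 10)` and 20 recorded requests, two retries are granted and the third is refused.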
Throttling - Setting Timeouts
[Diagram: Service A, Service B, Service C, and Service D, with timeouts of 3s, 2s, and 5s on the calls between them; one call takes > 2s]
⚠ Timeout early and be disciplined when setting timeouts ⚠
Throttling - Complexity Creep
⚠ Propagate timeouts and avoid nesting ⚠
[Diagram: Services A through F calling one another, with nested timeouts of 7s, 5s, 3s, 3s, and 2s]
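One way to propagate a timeout instead of nesting independent ones is to fix a single deadline at the edge and let every hop derive its own timeout from the time that is left. The sketch below assumes this deadline-passing approach; the `Deadline` class is hypothetical, and how the remaining budget travels between services (for example in a request header) is left out.

```java
import java.time.Duration;

/** A request-scoped deadline that downstream calls derive their timeouts from (sketch). */
final class Deadline {
    private final long expiresAtNanos;

    private Deadline(long expiresAtNanos) {
        this.expiresAtNanos = expiresAtNanos;
    }

    /** Created once, where the request enters the system. */
    static Deadline after(Duration budget) {
        return new Deadline(System.nanoTime() + budget.toNanos());
    }

    /** Time left before the caller gives up; never negative. */
    Duration remaining() {
        long left = expiresAtNanos - System.nanoTime();
        return left > 0 ? Duration.ofNanos(left) : Duration.ZERO;
    }

    boolean isExpired() {
        return remaining().isZero();
    }

    /** Timeout for the next downstream call: the local cap, but never more than what is left. */
    Duration timeoutFor(Duration perCallCap) {
        Duration left = remaining();
        return left.compareTo(perCallCap) < 0 ? left : perCallCap;
    }
}
```

A service that receives an already expired deadline can reject the call immediately instead of doing work nobody is waiting for.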
Throttling - Circular Dependencies
⚠ Avoid circular dependencies at all cost ⚠
[Diagram: Service A, Service B, and Service C calling one another in a cycle, with timeouts of 5s, 5s, 2s, and 3s]
Throttling - Rate Limiting
Avoid overload by clients and set per-client limits:
● requests from one calling service can use up to x CPU seconds/time interval on the upstream
● anything above that will be throttled
● these metrics are aggregated across all instances of a calling service and upstream
If this is too complicated => limit based on RPS/customer/endpoint
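The simpler RPS-based variant can be sketched with one token bucket per client. This is an illustration, not the talk's implementation: the token bucket algorithm, the class name, and the `ratePerSecond` and `burst` parameters are assumptions, and the state lives in a single process rather than being aggregated across instances.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Per-client requests-per-second limit using token buckets (illustrative sketch). */
final class PerClientRateLimiter {
    private static final class Bucket {
        double tokens;
        long lastRefillNanos;

        Bucket(double tokens, long nowNanos) {
            this.tokens = tokens;
            this.lastRefillNanos = nowNanos;
        }
    }

    private final double ratePerSecond;
    private final double burst;
    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();

    PerClientRateLimiter(double ratePerSecond, double burst) {
        this.ratePerSecond = ratePerSecond;
        this.burst = burst;
    }

    boolean tryAcquire(String clientId) {
        return tryAcquire(clientId, System.nanoTime());
    }

    /** Overload with an explicit clock reading, which keeps the sketch testable. */
    boolean tryAcquire(String clientId, long nowNanos) {
        Bucket bucket = buckets.computeIfAbsent(clientId, id -> new Bucket(burst, nowNanos));
        synchronized (bucket) {
            // Refill in proportion to elapsed time, up to the burst size.
            double elapsedSeconds = (nowNanos - bucket.lastRefillNanos) / 1e9;
            bucket.tokens = Math.min(burst, bucket.tokens + elapsedSeconds * ratePerSecond);
            bucket.lastRefillNanos = nowNanos;
            if (bucket.tokens >= 1.0) {
                bucket.tokens -= 1.0;
                return true;
            }
            return false; // over the limit: throttle this request
        }
    }
}
```

Keying the buckets by customer or endpoint instead of calling service only changes what is passed as `clientId`.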
Throttling - Circuit Breaking
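State transitions:
Closed → Closed: success, or fail (under threshold)
Closed → Open: fail (threshold reached)
Open → Open: call / raise circuit open
Open → Half Open: reset timeout
Half Open → Open: fail
Half Open → Closed: success
[Diagram: Service A calls Service B through a Circuit Breaker; after repeated timeouts the breaker trips the circuit, and further calls fail immediately with circuit open]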
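The three states and their transitions can be sketched as a small state machine. This is a simplified illustration, assuming the threshold counts consecutive failures and that any number of trial calls may pass while half open; the class and method names are hypothetical, and a production library would add metrics and finer control.

```java
/** Closed / Open / Half Open circuit breaker (illustrative sketch). */
final class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long resetTimeoutMillis;
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAtMillis = 0;

    CircuitBreaker(int failureThreshold, long resetTimeoutMillis) {
        this.failureThreshold = failureThreshold;
        this.resetTimeoutMillis = resetTimeoutMillis;
    }

    /** Returns false while the circuit is open, so the caller can fail fast. */
    synchronized boolean allowRequest(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAtMillis >= resetTimeoutMillis) {
            state = State.HALF_OPEN; // reset timeout elapsed: let a trial call through
        }
        return state != State.OPEN;
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    synchronized void recordFailure(long nowMillis) {
        consecutiveFailures++;
        // A failed trial call, or reaching the threshold, (re)opens the circuit.
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAtMillis = nowMillis;
        }
    }

    synchronized State state() {
        return state;
    }
}
```

Callers check `allowRequest` before each call and report the outcome afterwards; timeouts count as failures.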
Throttling - Adaptive Concurrency Limits
gradient = (lag_noload/lag_actual)
newLimit = currentLimit × gradient + queueSize
[Diagram: Queue and Concurrency]
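As a worked example of the formula above, the sketch below applies one update step. It implements only the two lines from the slide; the class and parameter names are hypothetical, and real implementations typically also smooth the measurements and bound the result.

```java
/** One update step of a gradient-based concurrency limit (illustrative sketch). */
final class GradientLimit {
    private GradientLimit() {}

    /**
     * gradient = lag_noload / lag_actual
     * newLimit = currentLimit * gradient + queueSize
     */
    static double nextLimit(double currentLimit, double noLoadLatencyMillis,
                            double actualLatencyMillis, double queueSize) {
        if (noLoadLatencyMillis <= 0 || actualLatencyMillis <= 0) {
            throw new IllegalArgumentException("latencies must be positive");
        }
        double gradient = noLoadLatencyMillis / actualLatencyMillis;
        return currentLimit * gradient + queueSize;
    }
}
```

If latency doubles from 10 ms to 20 ms, the gradient is 0.5, so a limit of 100 with a queue allowance of 4 shrinks to 54. If latency stays at its no-load value, the limit grows by the queue allowance to 104.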
Fallbacks and Rejection
Cache
Dead letter queues for writes
Return hard-coded value
Empty Response (“Fail Silent”)
User experience
⚠ Make sure to discuss these with your product owners ⚠
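For reads, the fallback options above can be chained from most to least accurate. The sketch below is a minimal illustration: `Fallbacks.firstAvailable` is a hypothetical helper, and which sources to chain (live call, cache, hard-coded value) is a product decision rather than something the code can settle.

```java
import java.util.Optional;
import java.util.function.Supplier;

/** Ordered read fallbacks: live call, then cache, then default, then empty (sketch). */
final class Fallbacks {
    private Fallbacks() {}

    /**
     * Tries each source in order and returns the first non-null value.
     * An empty result corresponds to the "fail silent" option.
     */
    @SafeVarargs
    static <T> Optional<T> firstAvailable(Supplier<? extends T>... sources) {
        for (Supplier<? extends T> source : sources) {
            try {
                T value = source.get();
                if (value != null) {
                    return Optional.of(value);
                }
            } catch (RuntimeException e) {
                // This source failed; fall through to the next one.
            }
        }
        return Optional.empty();
    }
}
```

Writes need a different treatment, such as the dead letter queue listed above, because silently dropping or defaulting a write loses data.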
Choosing the right tools
Tooling landscape
Libraries and Frameworks Side-car proxies
🚃🚃🚃 Hype 🚃🚃🚃
Your knobs may vary
⚠ Never sacrifice observability ⚠
Side-car proxies vs. Libraries and Frameworks
Simplicity of operation
Simplicity of testing
Simplicity of enforcement
Simplicity of configuration
Polyglot
Predictability of results
Adoption Considerations - ⚠ ⚠ ⚠
Ask Away
Thank you!
🙇 🙇 🙇
