© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:INVENT
Chaos Engineering at Netflix Scale
Nora Jones, Senior Chaos Engineer
@nora_js
DEV334
November 29, 2017
KNOWN WAYS TO INCREASE CONFIDENCE
UNIT TESTS
Input Output
Component A
INTEGRATION TESTS
Input Output
COMPONENT A
COMPONENT B
Service C Service D
CHAOS ENGINEERING
NEW WAY TO INCREASE CONFIDENCE
CHAOS EXPERIMENTS
Service C Service D
WHY IS THERE A FEAR OF CHAOS WHEN
IT’S INEVITABLE?
THE “IT DOESN’T APPLY TO ME” FALLACY
IT APPLIES TO YOU MORE
FORCES OF CHAOS
FORCE 0: SOCIALIZATION & MONITORING
FORCE 0: SOCIALIZATION
Acknowledge complexity
Define the steady state
“I WORK AT A STARTUP, THERE IS NO
STEADY STATE”
TIPS FOR DEFINING STEADY STATE
• Start with non-critical services
• Start in a staging environment, if
possible
• Only include services that want to be
Chaos’ed
“THE ARMIES OF CHAOS
ARE COMING!”
FORCE 0: SOCIALIZATION
Part of your job as a Chaos Engineer is to
understand the customer and their needs
FORCE 0: MONITORING
WHAT ARE YOUR KEY BUSINESS METRICS?
FORCE 0: MONITORING
SPS: NETFLIX’S KEY BUSINESS METRIC
FORCE 0: MONITORING
DON’T LOSE SIGHT OF YOUR COMPANY’S
CUSTOMERS
FORCE 1: GRACEFUL RESTARTS AND
DEGRADATION
FORCE 2: TARGETED CHAOS
FORCE 3: CAN WE CAUSE A
CASCADING FAILURE?
NOT IF THIS FAILS, BUT
WHEN IT FAILS
LATENCY MONKEY: “A LEARNING
OPPORTUNITY”
FORCE 4: FAILURE INJECTION
SAMPLE FAILURE INJECTION LIBRARY
TYPES OF CHAOS FAILURES
Criteria & API
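A failure injection library along these lines can be thought of as a wrapper that checks each call against some criteria and, on a match, injects a failure. The block below is a minimal sketch of that idea, not the library shown in the talk. The names `FailureType`, `Criteria`, `InjectionRule`, and `call_with_injection` are hypothetical, and the two failure types (fail and delay) and the customer-based criteria are assumptions, since the slides do not list them.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class FailureType(Enum):
    FAIL = "fail"    # raise an error instead of calling the dependency
    DELAY = "delay"  # add latency, then call the dependency


class InjectedFailure(Exception):
    """Raised when a rule deliberately fails a call."""


@dataclass(frozen=True)
class Criteria:
    # Hypothetical scope: which dependency and which customers are affected.
    target: str
    customer_ids: frozenset

    def matches(self, target: str, context: Dict[str, str]) -> bool:
        return target == self.target and context.get("customer_id") in self.customer_ids


@dataclass(frozen=True)
class InjectionRule:
    criteria: Criteria
    failure: FailureType
    delay_seconds: float = 0.0


def call_with_injection(rule: InjectionRule,
                        target: str,
                        context: Dict[str, str],
                        func: Callable[[], object],
                        sleep: Callable[[float], None] = time.sleep):
    """Call func(), injecting the rule's failure first if the criteria match."""
    if rule.criteria.matches(target, context):
        if rule.failure is FailureType.FAIL:
            raise InjectedFailure(f"injected failure for {target}")
        sleep(rule.delay_seconds)  # DELAY: add latency, then fall through to the call
    return func()
```

Keeping the criteria narrow (here, an explicit set of customer IDs) is one way to limit how many requests an experiment can touch.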
NETFLIX FAILURE INJECTION POINTS
HYSTRIX
NETFLIX FAILURE INJECTION
[Diagram labels: Service A, Service B, Routing, Failure injection service, Injection points]
FORCE 5: CHAOS AUTOMATION PLATFORM
[Diagram: Service A, Service B, Routing; traffic split 100%]
[Diagram: Service A, Service A Control, Service B, Routing; traffic split 98% / 1%]
[Diagram: Service A, Service A Control, Service A Experiment, Service B, Routing; traffic split 98% / 1% / 1%]
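The 98% / 1% / 1% split can be illustrated with a small routing function. This is a sketch under assumptions, not ChAP's routing code: it reads the diagram as 98% of traffic going to the regular Service A cluster and 1% each to the control and experiment clusters, and it assumes requests are bucketed by a stable hash of a key such as a customer ID. The names `SPLIT`, `bucket_for`, and `route` are hypothetical.

```python
import hashlib

# Percent of traffic per cluster, as read from the slide: 98% / 1% / 1%.
SPLIT = (("main", 98), ("control", 1), ("experiment", 1))


def bucket_for(request_key: str) -> int:
    """Map a key to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(request_key: str) -> str:
    """Pick the cluster whose cumulative percentage range contains the key's bucket."""
    bucket = bucket_for(request_key)
    upper = 0
    for cluster, percent in SPLIT:
        upper += percent
        if bucket < upper:
            return cluster
    raise AssertionError("split percentages must sum to 100")
```

Hashing on a stable key keeps a given caller in the same cluster for the whole experiment, and giving control and experiment the same share of traffic makes their metrics directly comparable.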
FORCE 0: MONITORING
SPS: NETFLIX’S KEY BUSINESS METRIC
ChAP MONITORING
[Chart: time axis from 10:27 to 10:48; the second view is marked SHORTED]
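Reading SHORTED as the experiment being stopped early, the monitoring step can be sketched as a comparison between the control and experiment clusters on the key business metric. The block below is an illustration under assumptions, not ChAP's actual analysis: the function name `should_short`, the 5% threshold, and the minimum sample count are all hypothetical.

```python
from typing import Sequence


def should_short(control_sps: Sequence[float],
                 experiment_sps: Sequence[float],
                 max_relative_drop: float = 0.05,
                 min_samples: int = 3) -> bool:
    """Return True when the experiment's mean SPS trails control by more than the allowed drop."""
    if len(control_sps) < min_samples or len(experiment_sps) < min_samples:
        return False  # not enough data to judge yet
    control_mean = sum(control_sps) / len(control_sps)
    experiment_mean = sum(experiment_sps) / len(experiment_sps)
    if control_mean <= 0:
        return False  # no baseline signal to compare against
    drop = (control_mean - experiment_mean) / control_mean
    return drop > max_relative_drop
```

A check like this would run on every monitoring interval, so an experiment that hurts customers ends within minutes instead of running to completion.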
ChAP GOAL: CHAOS ALL THE THINGS AND
RUN ALL THE TIME
FORCE 6: WHAT’S NEXT?
NETFLIX FAILURE INJECTION POINTS
HYSTRIX
AUTOMATE EXPERIMENT CREATION AND
PRIORITIZATION
ChAP’S MONOCLE
CRITICALITY SCORE
RPS Stats Range bucket * number of retries * number of Hystrix Commands = Criticality Score
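The formula above can be worked through in a few lines. This is a sketch under assumptions: the slide does not say how RPS is bucketed, so the boundaries in `RPS_BUCKET_BOUNDS` are hypothetical, as are the function names.

```python
from bisect import bisect_right

# Hypothetical RPS bucket boundaries; the slide does not give the real ranges.
RPS_BUCKET_BOUNDS = (10, 100, 1000, 10000)


def rps_bucket(rps: float) -> int:
    """Return a 1-based bucket index: higher RPS lands in a higher bucket."""
    return bisect_right(RPS_BUCKET_BOUNDS, rps) + 1


def criticality_score(rps: float, retries: int, hystrix_commands: int) -> int:
    """RPS bucket * number of retries * number of Hystrix commands, as on the slide."""
    return rps_bucket(rps) * retries * hystrix_commands


# Example: 500 RPS falls in bucket 3 here, so 3 * 2 retries * 4 commands = 24.
example_score = criticality_score(rps=500, retries=2, hystrix_commands=4)
```

Sorting dependencies by this score, highest first, gives one possible order in which to run experiments. Under this literal reading, a dependency with zero retries scores zero.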
FORCES OF CHAOS
0. Socialization and Monitoring
1. Graceful Restarts and Degradation
2. Targeted Chaos
3. Causing a Cascading Failure
4. Failure Injection
5. Chaos Automation Platform
RECORD CHAOS SUCCESS STORIES
“We ran a Chaos Experiment that verifies
that our fallback path works and it
successfully caught an issue in the
fallback path before it resulted in an
availability incident”
“While failing calls, we discovered an
increase in requests for the experiment
cluster (even though fallbacks were
successful)…”
“…this likely means whoever was
consuming the fallback was retrying the
call, causing an increase in requests.”
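The retry effect described in this story can be put into rough numbers. The sketch below assumes a simple model that is not from the talk: each attempt fails independently with the same probability, and the consumer retries every failed attempt up to a fixed limit. The function name `requests_with_retries` is hypothetical.

```python
def requests_with_retries(base_requests: int, failure_rate: float, max_retries: int) -> float:
    """Expected total requests when every failed attempt is retried, up to max_retries times."""
    # Attempt k (0-based) only happens if the previous k attempts all failed.
    expected_attempts = sum(failure_rate ** attempt for attempt in range(max_retries + 1))
    return base_requests * expected_attempts
```

With all calls failing and two retries allowed, 100 original requests become 300, which is the kind of increase an experiment cluster would show even when fallbacks succeed.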
TAKEAWAYS
• Everyone can and should be doing
Chaos Engineering
• The road to chaos is a learning
opportunity
• Safety is critical; involve your business
CHAOS DOESN’T CAUSE PROBLEMS.
IT REVEALS THEM.
THANK YOU!
Nora Jones
@nora_js
Performing Chaos at Netflix Scale - DEV334 - re:Invent 2017