Performing Chaos at Netflix Scale - DEV334 - re:Invent 2017
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:INVENT
Chaos Engineering at Netflix Scale
Nora Jones, Senior Chaos Engineer
@nora_js
DEV334
November 29, 2017
KNOWN WAYS TO INCREASE CONFIDENCE
UNIT TESTS
[Diagram: Input → Component A → Output]
INTEGRATION TESTS
[Diagram: Input, Component A, Component B, Output, Service C, Service D]
CHAOS ENGINEERING
NEW WAY TO INCREASE CONFIDENCE
CHAOS EXPERIMENTS
[Diagram: Service C, Service D]
WHY IS THERE A FEAR OF CHAOS WHEN IT’S INEVITABLE?
THE “IT DOESN’T APPLY TO ME” FALLACY
IT APPLIES TO YOU MORE
FORCES OF CHAOS
FORCE 0: SOCIALIZATION & MONITORING
FORCE 0: SOCIALIZATION
• Acknowledge complexity
• Define the steady state
“I WORK AT A STARTUP, THERE IS NO STEADY STATE”
TIPS FOR DEFINING STEADY STATE
• Start with non-critical services
• Start in a staging environment, if possible
• Only include services that want to be Chaos’ed
“THE ARMIES OF CHAOS ARE COMING!”
FORCE 0: SOCIALIZATION
Part of your job as a Chaos Engineer is to understand the customer and their needs
FORCE 0: MONITORING
WHAT ARE YOUR KEY BUSINESS METRICS?
FORCE 0: MONITORING
SPS (STREAM STARTS PER SECOND): NETFLIX’S KEY BUSINESS METRIC
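One way to turn a key business metric into a steady-state definition is to derive a tolerance band from recent healthy samples and check new observations against it. The sketch below is illustrative only: the function names, the sample values, and the "mean plus or minus k standard deviations" rule are assumptions, not something the talk specifies.

```python
from statistics import mean, stdev

def steady_state_band(baseline, k=3.0):
    """Return (low, high) bounds: baseline mean +/- k sample standard deviations.

    `baseline` is a list of metric samples taken while the system was healthy.
    The choice of k is an assumption; tune it to your own metric's noise.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    return mu - k * sigma, mu + k * sigma

def within_steady_state(observed, band):
    """True if an observed metric value falls inside the steady-state band."""
    low, high = band
    return low <= observed <= high

# Hypothetical per-second samples of a business metric such as SPS.
baseline_sps = [1000, 1010, 990, 1005, 995]
band = steady_state_band(baseline_sps)

print(within_steady_state(1002, band))  # small fluctuation
print(within_steady_state(900, band))   # large drop
```

A real metric with daily or weekly seasonality would need a baseline taken from a comparable time window rather than a flat list of samples.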
FORCE 0: MONITORING
DON’T LOSE SIGHT OF YOUR COMPANY’S CUSTOMERS
FORCE 1: GRACEFUL RESTARTS AND DEGRADATION
FORCE 2: TARGETED CHAOS
FORCE 3: CAN WE CAUSE A CASCADING FAILURE?
NOT IF THIS FAILS, BUT WHEN IT FAILS
LATENCY MONKEY: “A LEARNING OPPORTUNITY”
37. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
FORCE 4: FAILURE INJECTION
38. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
SAMPLE FAILURE INJECTION LIBRARY
NETFLIX FAILURE INJECTION POINTS
HYSTRIX
NETFLIX FAILURE INJECTION
[Diagram: Service A, Service B, Routing, Failure Injection Service, Injection Points]
FORCE 5: CHAOS AUTOMATION PLATFORM
[Diagram: Routing, Service A, Service B; traffic share: 100%]
[Diagram: Routing, Service A, Service A Control, Service B; traffic shares: 98%, 1%]
[Diagram: Routing, Service A, Service A Control, Service A Experiment, Service B; traffic shares: 98%, 1%, 1%]
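The 98% / 1% / 1% split in the diagram can be approximated by hashing a stable customer identifier into buckets, so that each customer lands consistently in the baseline, control, or experiment cluster. This is a sketch under assumptions: the function name, the hashing scheme, and the identifiers are illustrative and are not taken from ChAP itself.

```python
import hashlib

def assign_cluster(customer_id: str, experiment_id: str, pct_each: float = 1.0) -> str:
    """Deterministically assign a customer to 'control', 'experiment', or 'baseline'.

    `pct_each` is the percentage of traffic sent to each of the control and
    experiment clusters; everything else stays on the baseline cluster.
    """
    digest = hashlib.sha256(f"{experiment_id}:{customer_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10000  # bucket in 0..9999
    slice_size = int(pct_each * 100)                    # 1% -> 100 buckets
    if bucket < slice_size:
        return "control"
    if bucket < 2 * slice_size:
        return "experiment"
    return "baseline"

# Rough check of the split over hypothetical customer ids.
counts = {"control": 0, "experiment": 0, "baseline": 0}
for i in range(20000):
    counts[assign_cluster(f"customer-{i}", "exp-42")] += 1
print(counts)
```

Keeping the control cluster the same size as the experiment cluster makes the two directly comparable on the key business metric, which is the point of routing a matching 1% to an unmodified copy of Service A.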
FORCE 0: MONITORING
SPS: NETFLIX’S KEY BUSINESS METRIC
ChAP GOAL: CHAOS ALL THE THINGS AND RUN ALL THE TIME
FORCE 6: WHAT’S NEXT?
NETFLIX FAILURE INJECTION POINTS
HYSTRIX
AUTOMATE EXPERIMENT CREATION AND PRIORITIZATION
CRITICALITY SCORE
RPS Stats Range bucket * number of retries * number of Hystrix Commands = Criticality Score
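The criticality score formula can be sketched as a small function used to rank services for experiment prioritization. The slide does not say how RPS is mapped to a "Stats Range bucket", so the bucket boundaries below are an assumption, as are the service names and numbers.

```python
def rps_bucket(rps, bounds=(10, 100, 1000, 10000)):
    """Map requests per second to a 1-based bucket index.

    The boundaries are hypothetical; the talk does not list the real ranges.
    """
    bucket = 1
    for bound in bounds:
        if rps >= bound:
            bucket += 1
    return bucket

def criticality_score(rps, retries, hystrix_commands):
    """RPS bucket * number of retries * number of Hystrix commands."""
    # Note: a service configured with zero retries scores 0 under this formula.
    return rps_bucket(rps) * retries * hystrix_commands

# Hypothetical services: (rps, retries, hystrix_commands)
services = {
    "A": (500, 2, 4),
    "B": (5, 1, 3),
    "C": (20000, 3, 2),
}
ranked = sorted(services, key=lambda name: criticality_score(*services[name]), reverse=True)
print(ranked)  # → ['C', 'A', 'B']
```

Services at the top of the ranking would be the first candidates for automatically generated experiments.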
FORCES OF CHAOS
0. Socialization and Monitoring
1. Graceful Restarts and Degradation
2. Targeted Chaos
3. Causing a Cascading Failure
4. Failure Injection
5. Chaos Automation Platform
RECORD CHAOS SUCCESS STORIES
“We ran a Chaos Experiment that verifies that our fallback path works and it successfully caught an issue in the fallback path before it resulted in an availability incident”
“While failing calls, we discovered an increase in requests for the experiment cluster (even though fallbacks were successful)…”
“…this likely means whoever was consuming the fallback was retrying the call, causing an increase in requests.”
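The effect described in the quote can be illustrated with simple arithmetic: if a consumer treats a fallback response as a failure and retries, every failed attempt generates another request. The model below is an assumption made for illustration (a fixed failure rate applied to every attempt, and a fixed retry limit); it is not a measurement from the experiment.

```python
def requests_with_retries(original_requests, failure_rate, max_retries):
    """Expected total requests when each failed attempt is retried.

    Assumes every attempt fails (or returns a fallback) with probability
    `failure_rate` and that consumers retry up to `max_retries` times.
    """
    total = 0.0
    attempts = float(original_requests)
    for _ in range(max_retries + 1):
        total += attempts
        attempts *= failure_rate  # only failed attempts are retried
    return total

print(requests_with_retries(100, 1.0, 2))  # every call fails: 100 + 100 + 100
print(requests_with_retries(100, 0.5, 2))  # half fail each time: 100 + 50 + 25
print(requests_with_retries(100, 0.0, 2))  # nothing fails: 100
```

With all calls failing and two retries allowed, the experiment cluster would see three times its normal request volume, which matches the kind of increase the quote reports.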
TAKEAWAYS
• Everyone can and should be doing Chaos Engineering
• The road to chaos is a learning opportunity
• Safety is critical; involve your business
CHAOS DOESN’T CAUSE PROBLEMS.
IT REVEALS THEM.
THANK YOU!
Nora Jones
@nora_js