Complex distributed systems fail. They fail more frequently, and in different ways, as they scale and evolve over time. In this session, you will learn how Netflix embraces failure to provide high service availability. Netflix discusses its motivations for inducing failure in production, the mechanics of how it does so, and the lessons learned along the way. Come hear about the Failure Injection Testing (FIT) framework and suite of tools that Netflix created and currently uses to induce controlled system failures, helping to discover vulnerabilities, resolve them, and improve the resiliency of its cloud environment.
2. •~50 million members, ~50 countries
•> 1 billion hours per month
•> 1000 device types
•3 AWS regions, hundreds of services
•Hundreds of thousands of requests/second
•CDN serves petabytes of data at terabits/second
Netflix Ecosystem (diagram): service partners, static content served via Akamai, the Netflix CDN, and the AWS/Netflix control plane, all reached over the Internet.
5. Failures can happen any time
•Disks fail
•Power outages
•Natural disasters
•Software bugs
•Human error
6. We design for failure
•Exception handling
•Fault tolerance and isolation
•Fallbacks and degraded experiences
•Auto-scaling clusters
•Redundancy
7. Testing for failure is hard
•Web-scale traffic
•Massive, changing data sets
•Complex interactions and request patterns
•Asynchronous, concurrent requests
•Complete and partial failure modes
Constant innovation and change
8. What if we regularly inject failures into our systems under controlled circumstances?
10. Blast Radius
•Unit of isolation
•Scope of an outage
•Scope of a chaos exercise
Instance → Zone → Region → Global
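The blast-radius levels above form a strict ordering, so a chaos exercise can be gated against an approved scope. A minimal sketch of that gate (the enum and guard are illustrative, not Netflix tooling):

```python
from enum import IntEnum

class BlastRadius(IntEnum):
    """Units of isolation, smallest to largest."""
    INSTANCE = 1
    ZONE = 2
    REGION = 3
    GLOBAL = 4

def within_scope(exercise: BlastRadius, approved: BlastRadius) -> bool:
    """A chaos exercise may run only if its blast radius fits the approved scope."""
    return exercise <= approved
```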
11. An Instance Fails (diagram): an edge cluster fans out to clusters A–D; a single instance in one cluster fails while the rest of the fleet keeps serving.
12. Chaos Monkey
•Monkey loose in your DC
•Run during business hours
•What we learned
–Auto-replacement works
–State is problematic
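In spirit, Chaos Monkey picks a random instance per cluster and terminates it; auto-replacement should absorb the loss. A hypothetical sketch (the function name and `terminate` callback are illustrative, not Chaos Monkey's actual API):

```python
import random

def unleash_monkey(clusters, terminate, rng=None):
    """For each cluster, pick one random instance and terminate it.

    clusters:  mapping of cluster name -> list of instance ids
    terminate: callback that kills an instance (e.g. a cloud API call)
    """
    rng = rng or random.Random()
    victims = []
    for name, instances in clusters.items():
        if instances:  # skip empty clusters
            victim = rng.choice(instances)
            terminate(victim)
            victims.append((name, victim))
    return victims
```

Running it during business hours, as the slide suggests, means engineers are at their desks when a termination surfaces a weakness.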
13. A State of Xen – Chaos Monkey & Cassandra
Out of our 2700+ Cassandra nodes
•218 rebooted
•22 did not reboot successfully
•Automation replaced failed nodes
•0 downtime due to reboot
15. Chaos Gorilla
Simulate an Availability Zone outage
•3-zone configuration
•Eliminate one zone
•Ensure that the others can handle the load and nothing breaks
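The capacity requirement behind a zone evacuation is simple arithmetic: with traffic spread evenly over n zones, the n−1 survivors must each absorb 1/(n−1) extra load. A quick check (illustrative, not Netflix capacity tooling):

```python
def survivor_headroom(zones: int) -> float:
    """Factor by which each surviving zone's load grows when one of
    `zones` evenly loaded zones is eliminated."""
    if zones < 2:
        raise ValueError("need at least two zones to survive an outage")
    return zones / (zones - 1)

# In a 3-zone configuration, each survivor must handle 1.5x its normal load.
```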
16. Challenges
•Rapidly shifting traffic
–LBs must expire connections quickly
–Lingering connections to caches must be addressed
•Service configuration
–Not all clusters auto-scaled or pinned
–Services not configured for cross-zone calls
–Mismatched timeouts: fallbacks prevented fail-over
18. Chaos Kong (diagram): customer devices are geo-routed to a region; within each region, regional load balancers feed Zuul (traffic shaping/routing) and three availability zones (AZ1–AZ3), each with replicated data. Chaos Kong simulates the loss of an entire region, shifting customer traffic to the surviving region.
19. Challenges
●Rapidly shifting traffic
○Auto-scaling configuration
○Static configuration/pinning
○Instance start time
○Cache fill time
20. Challenges
●Service Configuration
○Timeout configurations
○Fallbacks fail or don’t provide the desired experience
●No minimal (critical) stack
○Any service may be critical!
22. Latency Monkey
Services slow down and fail: simulate latent/failed service calls
•Inject arbitrary latency and errors at the service level
•Observe for effects
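Latency Monkey's effect can be approximated by wrapping a service call so it is delayed, or failed, with configured probability. A hedged sketch (names are illustrative; the real injection happens inside Netflix's IPC layer, not in a Python wrapper):

```python
import random
import time

def inject_faults(call, delay_s=0.0, error_rate=0.0, rng=None):
    """Wrap a service call, adding fixed latency and probabilistic errors."""
    rng = rng or random.Random()

    def faulty(*args, **kwargs):
        if delay_s > 0:
            time.sleep(delay_s)           # injected latency
        if rng.random() < error_rate:     # injected error
            raise RuntimeError("injected service failure")
        return call(*args, **kwargs)

    return faulty
```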
24. Challenges
•Startup resiliency is an issue
•Service owners don’t know all dependencies
•Fallbacks can fail too
•Second order effects not easily tested
•Dependencies are in constant flux
•Latency Monkey tests function and scale
–Not a staged approach
–Lots of opt-outs
31. FIT Details
●Common Simulation Syntax
●Single Simulation Interface
●Transported via HTTP request header
32. Integrating Failure (diagram): a request from Service A to Service B passes through FIT injection points in each service's filter and Ribbon client layers – at ClientSend on the caller and ServerRcv on the callee – with the failure context propagated on the request:
[sendRequestHeader] >> fit.failure: 1|fit.Serializer|2|
[[{"name":"failSocial",
   "whitelist":false,
   "injectionPoints":["SocialService"]},{}]],
{"Id":"252c403b-7e34-4c0b-a28a-3606fcc38768"}]]
33. Failure Scenarios
●Set of injection points to fail
●Defined based on
○Past outages
○Specific dependency interactions
○Whitelist of a set of critical services
○Dynamic tracing of dependencies
34. FIT Insights: Salp
●Distributed tracing inspired by Dapper paper
●Provides insight into dependencies
●Helps define & visualize scenarios
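Dynamic tracing makes scenario definition mechanical: the services observed on a request path become candidate injection points. A toy sketch of that derivation (the span format is invented for illustration, not Salp's data model):

```python
def scenario_from_trace(spans, exclude=()):
    """Collect candidate injection points from distributed-trace spans.

    spans:   iterable of dicts with a "service" key, as a tracer might emit
    exclude: services to leave out (e.g. a whitelist of critical services)
    """
    return sorted({s["service"] for s in spans} - set(exclude))
```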
35. Dialing Up Failure
Functional validation
●Isolated synthetic transactions
○Set of devices
Validation at scale
●Dial up customer traffic – % based
●Simulation of full service failure
Chaos!
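Dialing failure up by a percentage of customer traffic works best when assignment is sticky, so a customer stays in or out of the experiment as the dial moves. One common technique is hashing the customer id into a fixed bucket space (a generic sketch, not Netflix's sampler):

```python
import hashlib

def in_failure_dial(customer_id: str, percent: float) -> bool:
    """Deterministically place `percent`% of customers into the failure test.

    Hashing gives each customer a stable bucket in [0, 10000); raising
    `percent` only adds customers to the experiment, never swaps them.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10000
    return bucket < percent * 100
```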
38. Take-aways
•Don’t wait for random failures
–Cause failure to validate resiliency
–Remove uncertainty by forcing failures regularly
–Better to fail at 2pm than 2am
•Test design assumptions by stressing them
Embrace Failure
39. The Simian Army is part of the Netflix open source cloud platform
http://netflix.github.com
40. Netflix talks at re:Invent

Talk | Time | Title
BDT-403 | Wednesday, 2:15pm | Next Generation Big Data Platform at Netflix
PFC-306 | Wednesday, 3:30pm | Performance Tuning Amazon EC2
DEV-309 | Wednesday, 3:30pm | From Asgard to Zuul: How Netflix’s Proven Open Source Tools Can Accelerate and Scale Your Services
ARC-317 | Wednesday, 4:30pm | Maintaining a Resilient Front-Door at Massive Scale
PFC-304 | Wednesday, 4:30pm | Effective Interprocess Communications in the Cloud: The Pros and Cons of Microservices Architectures
ENT-209 | Wednesday, 4:30pm | Cloud Migration, Dev-Ops and Distributed Systems
APP-310 | Friday, 9:00am | Scheduling Using Apache Mesos in the Cloud
41. Please give us your feedback on this session.
Complete session evaluations and earn re:Invent swag.
http://bit.ly/awsevals
Josh Evans
jevans@netflix.com
@josh_evans_nflx
Naresh Gopalani
ngopalani@netflix.com