Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2h3bAvP.
Haley Tucker discusses how other systems can affect Netflix's services, and strategies for protecting those services so that they keep working even when things go wrong. Filmed at qconsf.com.
Haley Tucker works on the Playback Features team at Netflix, responsible for ensuring that customers receive the best possible viewing experience every time they click play. Her services fill a key role in enabling Netflix to stream amazing content to 65M+ members on 1000+ devices.
Watch the video with slide synchronization on InfoQ.com:
https://www.infoq.com/presentations/netflix-resilience
Presented at QCon San Francisco
www.qconsf.com
5. "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."
--Leslie Lamport
11. Notes on Distributed Systems for Young Bloods
"Distributed systems are different because they fail often."
--Jeff Hodges
12. TABLE OF CONTENTS
CHAPTER 1: THE WEIRD DATA IN THE CATALOG
• Metadata impacts on availability
CHAPTER 2: THE VANISHING OF CRITICAL SERVICES
• Crashing services and cascading failures
CHAPTER 3: THE THROTTLE
• Latency spikes and the impact of fallbacks
FORCES AT WORK
14. Whoops, something went wrong…
Netflix Streaming Error
We’re having trouble playing this title right now. Please try again later or select a different title.
26. Canary, CC BY 2.0, Steve P2008 2014, Flickr
PREVENTION
CANARIES
27. TRADITIONAL CANARY
[Diagram labels: Canary (New Code), Baseline (Old Code), Traffic, Video Metadata Service, Amazon S3, Netflix Services, Source System]
35. "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."
--Leslie Lamport
36. LOG DATA
[Diagram labels: Devices, Proxy/Routing, Edge Service, Netflix Playback Service, Log Data Service, Cassandra, Traffic]
40. CASCADING FAILURE
[Diagram labels: Devices, Proxy/Routing, Edge Service, Netflix Playback Service, Log Data Service, Cassandra, Traffic]
49. It's Electric, CC BY ND 2.0, Alan Hochberg 2008, Flickr
MITIGATION
CIRCUIT BREAKERS
51. Wrecking Ball in Building, CC BY 2.0, Jason Eppink 2008, Flickr
FAILURE TESTING
52. FAILURE TESTING
[Diagram labels: Devices, Proxy/Routing, Playback Service, Log Data Service, Cassandra, Traffic]
Automating Chaos Experiments in Production, by Ali Basiri
Applying Failure Testing Research @Netflix, by Kolton Andrus and Peter Alvaro
53. Manage resource constraints by reducing surface area.
Leverage circuit breakers and rigorously test failures.
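
The circuit-breaker idea on this slide can be sketched in a few lines. This is a minimal illustration, not the implementation Netflix uses: the class name CircuitBreaker, the parameters max_failures and reset_after, and the injectable clock are all assumptions made for the example. The breaker stops calling a failing dependency after a number of consecutive failures, serves a fallback instead, and allows a trial call again once a cooldown has passed.

```python
import time


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after consecutive failures,
    then allows a trial call once a cooldown has elapsed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cooldown in seconds
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def is_open(self):
        if self.opened_at is None:
            return False
        # Once the cooldown has elapsed, report closed so the next call
        # acts as a trial request against the dependency.
        return self.clock() - self.opened_at < self.reset_after

    def call(self, primary, fallback):
        # While open, skip the dependency entirely and answer from the fallback.
        if self.is_open():
            return fallback()
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote call, for example breaker.call(fetch_log_config, lambda: DEFAULT_CONFIG), where both names are placeholders. The fallback passed in should be cheap, such as a constant or a cached value, so that an open circuit does not shift load onto another dependency.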
73. KEY TAKEAWAYS
CHAPTER 1: THE WEIRD DATA IN THE CATALOG
• Verify consistency prior to applying state changes.
• One tool is a data canary.
CHAPTER 2: THE VANISHING OF CRITICAL SERVICES
• Manage resource constraints by reducing surface area.
• Leverage circuit breakers and rigorously test failures.
CHAPTER 3: THE THROTTLE
• No heavy fallbacks!! Fallbacks should be light and fast.
• Shard your application based on operational characteristics.
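
The Chapter 1 takeaway, verifying consistency before applying a state change, can be sketched as a check that runs on a candidate metadata snapshot before it replaces the one currently being served. Everything here is an assumption made for illustration: the snapshot shape (a dict of video id to record), the required fields title and streams, and the 10% shrink threshold. It does not model a canary fleet; a data canary would run checks like these on a dedicated instance before the rest of the fleet picks up the new data.

```python
def validate_snapshot(current, candidate, max_shrink=0.10,
                      required=("title", "streams")):
    """Return a list of problems found in candidate; empty means it looks safe.

    The checks are hypothetical examples: a sudden drop in catalog size and
    records with missing or empty required fields.
    """
    problems = []
    if current and len(candidate) < len(current) * (1 - max_shrink):
        problems.append(
            f"catalog shrank from {len(current)} to {len(candidate)} entries")
    for video_id, record in candidate.items():
        missing = [field for field in required if not record.get(field)]
        if missing:
            problems.append(f"{video_id}: missing {', '.join(missing)}")
    return problems


def apply_if_consistent(current, candidate):
    """Swap in candidate only when validation passes; otherwise keep current."""
    problems = validate_snapshot(current, candidate)
    if problems:
        return current, problems  # keep serving the known-good data
    return candidate, []
```

The design choice is that a rejected snapshot leaves the service on its last known-good state and returns the reasons, so bad data produces an alert rather than a playback error.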