At Netflix, we provide a Java-based API that supports the content discovery, sign-up, and playback experience on thousands of device types that millions use around the world every day. As our user base and traffic have grown by leaps and bounds, we are continuously evolving this API to enable the best user experience. In this talk, I will give an overview of how and why the Netflix API has evolved to where it is today and where we plan to take it in the future. I will discuss how we make our system resilient against failures using tools such as Hystrix and FIT, while keeping it flexible and nimble enough to support continuous A/B testing.
5. Scale
❏ Netflix accounts for 37% of peak downstream traffic in the US and almost 7% of upstream
❏ 75 million subscribers worldwide and growing
Source: http://www.sandvine.com/news/global_broadband_trends.asp
6. Netflix API
❏ Architecture
❏ Resiliency
❏ Developer velocity
❏ Tooling and DevOps
❏ Current and future directions
API
10. What is the API used for?
Examples:
❏ Discovery
❏ Recommendations
❏ Movie metadata
❏ Ratings
❏ Sign-up and Profiles
❏ Playback
❏ Bookmarks
❏ DRM
❏ A/B testing
13. Hystrix Primer
❏ Protection from and control over latency and failure from dependencies
❏ Stop cascading failures in a complex distributed system
❏ Fall back and gracefully degrade
❏ Fail fast and rapidly recover
https://github.com/Netflix/Hystrix
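The ideas on this slide (bounded latency, fallback, fail fast, recover) can be sketched in plain Java. This is a minimal illustration and does not use Hystrix's own API; the class name `FallbackSketch`, the `Breaker` helper, the thresholds, and the sample return values are all assumptions made up for the example.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class FallbackSketch {
    /** Hypothetical breaker: open after `threshold` consecutive failures, for `resetMs`. */
    static final class Breaker {
        private final int threshold;
        private final long resetMs;
        private int consecutiveFailures = 0;
        private long lastFailureAt = 0;

        Breaker(int threshold, long resetMs) {
            this.threshold = threshold;
            this.resetMs = resetMs;
        }

        synchronized boolean isOpen() {
            // After resetMs has passed, a trial call is let through again (recovery).
            return consecutiveFailures >= threshold
                    && System.currentTimeMillis() - lastFailureAt < resetMs;
        }

        synchronized void recordSuccess() { consecutiveFailures = 0; }

        synchronized void recordFailure() {
            consecutiveFailures++;
            lastFailureAt = System.currentTimeMillis();
        }
    }

    // Dependency calls run on a separate pool of daemon threads so the caller can time out.
    private static final ExecutorService POOL = Executors.newFixedThreadPool(4, r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    /** Calls the dependency with a timeout; returns the fallback on failure or open breaker. */
    static <T> T call(Breaker breaker, Supplier<T> dependency, long timeoutMs, T fallback) {
        if (breaker.isOpen()) {
            return fallback; // fail fast: do not touch the struggling dependency
        }
        Callable<T> task = dependency::get;
        Future<T> future = POOL.submit(task);
        try {
            T result = future.get(timeoutMs, TimeUnit.MILLISECONDS);
            breaker.recordSuccess();
            return result;
        } catch (TimeoutException | ExecutionException e) {
            future.cancel(true);
            breaker.recordFailure();
            return fallback; // degrade gracefully
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            breaker.recordFailure();
            return fallback;
        }
    }

    public static void main(String[] args) {
        Breaker breaker = new Breaker(2, 60_000);
        Supplier<String> healthy = () -> "personalized rows";
        Supplier<String> broken = () -> { throw new IllegalStateException("dependency down"); };

        System.out.println(call(breaker, healthy, 200, "default rows"));
        System.out.println(call(breaker, broken, 200, "default rows"));
        System.out.println(call(breaker, broken, 200, "default rows"));
        // Two consecutive failures opened the breaker, so even a healthy call is short-circuited.
        System.out.println(call(breaker, healthy, 200, "default rows"));
    }
}
```

The per-dependency thread pool and timeout keep a slow dependency from tying up request threads, which is what stops one failure from cascading to the rest of the system.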
19. More automated failure testing
Goal: Find groups of service calls that are needed for success.
http://techblog.netflix.com/2016/01/automated-failure-testing.html
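To make the goal concrete, here is a naive brute-force sketch, not the approach described in the linked post: it injects failures into every combination of service calls and reports the minimal sets whose joint failure breaks a request, which is the flip side of the groups needed for success. The service names and the `requestSucceeds` predicate are hypothetical stand-ins for a real request under fault injection.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;

public class FailureSearchSketch {
    /**
     * Returns the minimal sets of services whose joint failure makes the request fail.
     * Exhaustive search, so only practical for a small number of services.
     */
    static List<Set<String>> minimalBreakingSets(List<String> services,
                                                 Predicate<Set<String>> requestSucceeds) {
        List<Set<String>> breaking = new ArrayList<>();
        int n = services.size();
        // Try smaller failure sets first so every set we record is minimal.
        for (int size = 1; size <= n; size++) {
            for (int mask = 1; mask < (1 << n); mask++) {
                if (Integer.bitCount(mask) != size) {
                    continue;
                }
                Set<String> failed = new TreeSet<>();
                for (int i = 0; i < n; i++) {
                    if ((mask & (1 << i)) != 0) {
                        failed.add(services.get(i));
                    }
                }
                // Skip supersets of a set already known to break the request.
                boolean redundant = false;
                for (Set<String> known : breaking) {
                    if (failed.containsAll(known)) {
                        redundant = true;
                        break;
                    }
                }
                if (!redundant && !requestSucceeds.test(failed)) {
                    breaking.add(failed);
                }
            }
        }
        return breaking;
    }

    public static void main(String[] args) {
        List<String> services = Arrays.asList("bookmarks", "ratings", "recs", "recsFallback");
        // Hypothetical request: needs bookmarks, plus at least one of recs / recsFallback;
        // ratings is optional.
        Predicate<Set<String>> succeeds = failed ->
                !failed.contains("bookmarks")
                        && !(failed.contains("recs") && failed.contains("recsFallback"));
        System.out.println(minimalBreakingSets(services, succeeds));
        // prints [[bookmarks], [recs, recsFallback]]
    }
}
```

The exhaustive search grows exponentially with the number of services, which is why automated approaches that prune the search space matter at the scale of a real call graph.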
21. Autoscaling & Capacity Management
❏ Red: traffic for current week (x-axis)
❏ Black: traffic for previous week for comparison
❏ What happened on February 7? The Super Bowl!
32. Impact on velocity and collaboration
❏ UI (script) changes can happen independently
❏ Script changes can be pushed to running servers, so decoupled from API push schedule
❏ Decoupling leads to greater developer velocity
34. Run 1% of your traffic on the new code and see how it does
35. So you’ve run a canary. Now what?
Compare control vs. canary on:
❏ Errors and response codes: 2xx, 4xx, 5xx
❏ Latency
❏ Network
❏ Busy threads
❏ Load, memory consumption
❏ ...
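One simple way to answer "now what?" is to compare each canary metric against the control and flag anything that got noticeably worse. The sketch below assumes every metric is "lower is better" and uses a single relative tolerance; the metric names, values, and the 10% threshold are illustrative assumptions, not Netflix's actual canary analysis.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CanaryCheckSketch {
    /**
     * Returns the names of metrics where the canary exceeds the control by more than
     * the given relative tolerance (0.10 means 10% worse). Lower is assumed to be better.
     */
    static List<String> regressions(Map<String, Double> control,
                                    Map<String, Double> canary,
                                    double tolerance) {
        List<String> bad = new ArrayList<>();
        for (Map.Entry<String, Double> e : control.entrySet()) {
            Double canaryValue = canary.get(e.getKey());
            if (canaryValue == null) {
                continue; // metric missing on the canary; not compared in this sketch
            }
            double limit = e.getValue() * (1.0 + tolerance);
            if (canaryValue > limit) {
                bad.add(e.getKey());
            }
        }
        Collections.sort(bad);
        return bad;
    }

    public static void main(String[] args) {
        // Hypothetical measurements taken over the same time window.
        Map<String, Double> control = new LinkedHashMap<>();
        control.put("errorRate5xx", 0.010);
        control.put("latencyP99Ms", 250.0);
        control.put("busyThreads", 40.0);

        Map<String, Double> canary = new LinkedHashMap<>();
        canary.put("errorRate5xx", 0.0105);
        canary.put("latencyP99Ms", 320.0);
        canary.put("busyThreads", 41.0);

        List<String> bad = regressions(control, canary, 0.10);
        System.out.println(bad.isEmpty() ? "canary looks healthy" : "regressions: " + bad);
    }
}
```

Comparing against a control running the old code on the same traffic, rather than against fixed thresholds, keeps the check meaningful when traffic itself changes during the run.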
49. What we’ve grown to
● > 900 active endpoints
● ~60 direct dependencies
● 78 thread pools
● 1000+ threads
● high memory usage
50. Script isolation & Node.js
❏ Groovy scripts run as part of the API process
❏ UI teams would like to use other languages (in particular Node.js)
var response = model.get("todos[0..2]['name','done']");
[Diagram labels: UI/device scripts (node); Falcor; API remote service layer; Client libs; Services]
51. Thin client libraries
❏ Fat client libraries have business logic and multiple dependencies
❏ Move business logic and dependencies to services
[Diagram labels: UI/device scripts (node); Falcor; API remote service layer; Thin client libs; Services]
52. Remove metadata from API servers
❏ Metadata takes up significant memory in API servers
❏ Challenge: reduce chattiness to the metadata service
[Diagram labels: Metadata Service; UI/device scripts (node); Falcor; API remote service layer; Thin client libs; Services]