SlideShare a Scribd company logo
1 of 79
How we learned to stop worrying and start deploying
The Netflix API service
Sangeeta Narayanan
@sangeetan
http://www.linkedin.com/in/sangeetanarayanan
http://bit.ly/1wq2kkN
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide
synchronization on InfoQ.com!
http://www.infoq.com/presentations
/deploy-netflix-api-service
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon San Francisco
www.qconsf.com
Netflix started out as a DVD rental by mail service in the US.
Introduced on-demand video streaming over the internet in 2007
Global Streaming for Movies and TV Shows
Started expanding the streaming service into international markets a few years after launching in the US
High Quality Original Content
Late 2011/2012 marked a major new strategic focus with foray into the world of original programming
Shows like HoC & Orange have been received with high acclaim; as evidenced by recent Emmy wins. Strategy is to expand internationally and pursue high
quality content to drive engagement and acquisition.
Over 50 Million Subscribers
Over 40 Countries
Global expansion, high quality originals and personalized content have fueled rapid subscriber growth.
> 34% of Peak Downstream Traffic in North
America
Over 2 billion streaming hours a month
Netflix now accounts for over 1/3rd of downstream internet traffic in NA at peak. This number has been in the news a lot lately!
Our members can choose to enjoy our service on over 1000 device types.
Personalized User Experience
Edge Engineering operates the services that are the entry point to the personalized discovery and streaming experience for our members.
This is an extremely high level view of how the Netflix Discovery experience is rendered. API is the internet facing service that all devices connect to to
provide the user experience. The API in turn consumes data from several middle-tier services, applies business logic on top of it as needed and provides
an abstraction layer for devices to interact with. The API in effect, acts as a broker of metadata between services and devices. Put another way, almost all
product functionality flows through the API.
http://goo.gl/VhokZV
Role of API
Enable rapid innovation
Conduit for metadata between Devices and
Services
Implements business logic
Scale with business growth
Maintain resiliency
Going back in time…
http://bit.ly/1yOWEjr
Looking at the motivations behind our move towards CD
PM: When can I get my feature?
Us: 2 -4 weeks
PM: When can I get my feature?
Us: 2 -4 weeks - ish…
PM: When can I get my feature?
Us: 2 -4 weeks - ish…
PM: When can I get my feature?
IF all goes well…
We were lacking confidence in our delivery process
2 week release cycle
Not Quite!
API was becoming a bottleneck where functionality would get delayed.
!
Stop being the bottleneck!
http://bit.ly/1zmYbAy
We had a simple goal.
What’s not
working?
Heavy weight Code Management
3 long lived branches with code in varying states of release readiness. Lots of manual tracking, merging and co-ordination.
Slow, non-repeatable builds
Constantly Changing Dependencies
Dependency management was hard and contributed to slow, unpredictable builds.
Slow, unreliable tests
Low coverage
Manual on-device testing
Lots of manual testing - on device too!
Manual
deployments
Push Lead!
Life of push on-call was not fun.
Requirements for new system
On-Demand, Rapid Feature Delivery
Intuitive and painless
Easy recovery from errors
Insight and Communication
Balance between Agility & Stability
R
elease
R
elease
R
elease
Patch
PatchPatch
2 week Releases + Ad-Hoc Patches
http://bit.ly/1E6a9yn
M
R
M
R
IR
IR
IR
IR
3 week Major Releases + Weekly Incremental Releases
Major releases (MR) every three weeks - dates shared outside the team
Weekly Incremental releases (IR) in between; two IRs per MR cycle
Automate SCM Tasks
Eliminated Code Freeze. Engineers were responsible for managing their commits.
Automated code merge tasks
Automated Dependency Validation
Dependency Management was creating a lot of churn in our cycle. We built a separate pipeline that resolved the dependency tree, validated it by running
a series of tests and then committed the resolved graph to source. All development is based off that known good set of dependencies until the next run
of that pipeline.
Test Strategy
Increasing
confidence
Worked out a test strategy so effort could be applied at the appropriate level of testing. The idea was to build a series of tests that acted as gates and as
code made its way up the pyramid, our confidence in it would increase.
Test Runtimes 60%
No False Positives
Eliminating non-determinism and shortening runtime is a fundamental requirement. The point to note is that this is an ongoing process; you need to stay
on top of this!
Improved Result Reporting
In keeping with the goal of making the system simple and intuitive, we added detailed insights into test results so anyone could quickly root cause failures
and act on them.
Automated Deployments
Internal Environments
Using Asgard API
Connected to builds
Driven from CI Server
By now, we were operating multiple internal environments and the company was getting ready to bring a new AWS region online. We automated
deployments to all those environments.
Pipelines
And now, we had ourselves a pipeline! In fact, we had 3 - one for each long lived branch.
• Multiple
deployments/day
!
• Multiple internal
environments
!
• Multiple AWS
regions
http://bit.ly/13qrIfw
A big milestone for the team.
Team Cohesion
!
• Shared ownership - no silos
• Increased partner satisfaction
• Greater productivity
Equally, if not more important was the change in the team dynamic. There was increased cohesion as people got comfortable with the self-service model
and the idea of sharing ownership.
http://bit.ly/1xJQqjD
Aiming Higher
Faster, Better, All the way!
Shorter Feedback Loop
Increased Confidence
Richer Insight & Communication
Build
Bake
Test
Deploy
Build
BakeTest
Deploy
Increase velocity: Developer workflow
NEtflix BUild LAnguage plugin for Gradle that provides specific functionality for the Netflix environment
!
Branching Strategy
Modeled after github-flow
!
Automated Pull Request Processing
!
Automated Patch Branching
Single long-lived branch
Always deploy-able
Feature branches
More, Better, Faster & to Prod
Shorter Feedback Loop
Increased Confidence
Richer Insight & Communication
Automated Canary
Analysis
Aggregate Health Score
>1500 metrics
Configurable
Multiple regions
Old$Code$(Baseline)$ New$Code$(Canary)$
~1%$Traffic$
Automated Canary Analysis is the arguably the most important tool in our toolkit. We started out small, comparing simple metrics. Then expanded it to
make it a system that generates a health score based on comparisons across 1000s of metrics.
Canary reports are generated at periodic intervals and emailed to the team. They are also available off the dashboard. The report shows an overall confidence score of
the readiness of that build. This one didn’t do very well.
Details of the problematic metrics that contributed to the poor canary score.
Developer Canaries
(dynamically provisioned)
Dependency Validation Canary
Not intended for deployment
Not deployable; failed tests
Deployed
Hands Free Production Deployments
http://bit.ly/1wQ8fPQ
Red/Black Deployments
Old Code
Production Traffic
Old Code New Code
Production Traffic
Old Code New Code
Production Traffic
We can see an outage in real time - the no. of 5XX errors & latency spiked during the incident. This data is being streamed by hundreds of servers, aggregated using Turbine and streamed to the dashboard.
Feature Rollback
Dynamic configuration using Archaius allows features to be toggled dynamically. If newly introduced feature proves to be problematic, turning it off is an easy way to
restore system health. Archaius is a set of config mgmt APIs based on Apache Common Config lib. This allows configuration changes to be propagated in a matter of
minutes; at runtime without requiring app downtime. Configuration properties are multi-dimensional and context aware so their scope can be applied to a specific
context e.g. env = Test/Staging/Production or region=us-east/us-west/eu-west etc.
Full Rollback
In the event that a newly deployed version of the software proves to be problematic, the system can be rolled back to the previous version. The old cluster is kept alive
for a few hours so the automation knows what to roll back to. Because of our extensive use of autoscaling, provisioning the clusters accurately is tricky; and having to do
it manually across three regions would make rollbacks slow and leave them to prone to error. Even though rollbacks are rare, the cost of getting it wrong is too high.
Old Code New Code
Re-enable
traffic
Production Traffic
Old Code New Code
Production Traffic
More, Better, Faster & to Prod
Shorter Feedback Loop
Increased Confidence
Richer Insight & Communication
Dashboard shows the status of current and upcoming deployments, builds and associated artifacts. Diff reports for source and libraries help identify
contents of the build.
Different views of the data are available. This is a build that passed through all the stages successfully; including the canary.
Committers and On-Calls are notified when a build is scheduled for deployment so they can be available if needed.
API Server Deployment Velocity
#Deployments
#Rollbacks
Deployments & Rollbacks/week
since Oct 2012
The current state is that we deploy ~3 times/week on average. Additionally, deployments can be triggered on demand.
Take-aways
Build Agility into Architecture
Embrace Change; Don’t Fight it!
Failure is inevitable
Insight is key
• Examples of building agility - cloud-native, loosely coupled microservices, distributed data stores, dynamic configuration using Archaius provides
ability to effect changes in the behavior of our deployed services dynamically
• Embrace change - dependencies will change; everyone is evolving and moving fast. Best to get ahead of it rather than try and fight it
• Failure is inevitable; re-assess balance of investment between preventing failure and rapid recovery + impact mitigation
• Insight - not just operational(that’s a given!), but engineering too. It becomes key when responsibility is shared.
What’s Next
“Tiered” Canary Analysis
Failure Injection Testing
Throughput Trending
http://goo.gl/zjiV6W
Culture
http://goo.gl/7l7xkM
Good architectural practices, automation & tooling and deep insight into our systems allow us to operate resilient systems and go fast at scale. But the key piece that
brings it all together and completes the picture is our culture.
Freedom and Responsibility
Culture is based on the principles of Freedom and Responsibility.
Employees have the freedom to make decisions and act on them as it pertains to their daily activities. The counterbalance is the responsibility they assume for the
implications of their actions. Management’s job is to set the appropriate context so employees have all the information they need to make the right decisions and
judgement calls. This fosters a blameless culture where people feel empowered to take risks.
http://netflix.github.io/
http://techblog.netflix.com
Visit our github site and techblog for information and details about interesting topics related to distributed systems,
How we learned to stop worrying and start deploying
The Netflix API service
Sangeeta Narayanan
@sangeetan
http://www.linkedin.com/in/sangeetanarayanan
http://bit.ly/1wq2kkN
Thank You!
Watch the video with slide synchronization on
InfoQ.com!
http://www.infoq.com/presentations/deploy-
netflix-api-service

More Related Content

More from C4Media

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoC4Media
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileC4Media
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020C4Media
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsC4Media
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No KeeperC4Media
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like OwnersC4Media
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaC4Media
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideC4Media
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 

More from C4Media (20)

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live Video
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy Mobile
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java Applications
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 

Recently uploaded

So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 

Recently uploaded (20)

So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 

How We Learned to Stop Worrying and Start Deploying the Netflix API Service

  • 1. How we learned to stop worrying and start deploying The Netflix API service Sangeeta Narayanan @sangeetan http://www.linkedin.com/in/sangeetanarayanan http://bit.ly/1wq2kkN
  • 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /deploy-netflix-api-service
  • 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon San Francisco www.qconsf.com
  • 4. Netflix started out as a DVD rental by mail service in the US.
  • 5. Introduced on-demand video streaming over the internet in 2007
  • 6. Global Streaming for Movies and TV Shows Started expanding the streaming service into international markets a few years after launching in the US
  • 7. High Quality Original Content Late 2011/2012 marked a major new strategic focus with foray into the world of original programming
  • 8. Shows like HoC & Orange have been received with high acclaim; as evidenced by recent Emmy wins. Strategy is to expand internationally and pursue high quality content to drive engagement and acquisition.
  • 9. Over 50 Million Subscribers Over 40 Countries Global expansion, high quality originals and personalized content have fueled rapid subscriber growth.
  • 10. > 34% of Peak Downstream Traffic in North America Over 2 billion streaming hours a month Netflix now accounts for over 1/3rd of downstream internet traffic in NA at peak. This number has been in the news a lot lately!
  • 11. Our members can choose to enjoy our service on over 1000 device types.
  • 12. Personalized User Experience Edge Engineering operates the services that are the entry point to the personalized discovery and streaming experience for our members.
  • 13. This is an extremely high level view of how the Netflix Discovery experience is rendered. API is the internet facing service that all devices connect to to provide the user experience. The API in turn consumes data from several middle-tier services, applies business logic on top of it as needed and provides an abstraction layer for devices to interact with. The API in effect, acts as a broker of metadata between services and devices. Put another way, almost all product functionality flows through the API.
  • 14. http://goo.gl/VhokZV Role of API Enable rapid innovation Conduit for metadata between Devices and Services Implements business logic Scale with business growth Maintain resiliency
  • 15. Going back in time… http://bit.ly/1yOWEjr Looking at the motivations behind our move towards CD
  • 16. PM: When can I get my feature?
  • 17. Us: 2 -4 weeks PM: When can I get my feature?
  • 18. Us: 2 -4 weeks - ish… PM: When can I get my feature?
  • 19. Us: 2 -4 weeks - ish… PM: When can I get my feature? IF all goes well… We were lacking confidence in our delivery process
  • 20. 2 week release cycle
  • 22. API was becoming a bottleneck where functionality would get delayed.
  • 23. ! Stop being the bottleneck! http://bit.ly/1zmYbAy We had a simple goal.
  • 25. Heavy weight Code Management 3 long lived branches with code in varying states of release readiness. Lots of manual tracking, merging and co-ordination.
  • 27. Constantly Changing Dependencies Dependency management was hard and contributed to slow, unpredictable builds.
  • 28. Slow, unreliable tests Low coverage Manual on-device testing Lots of manual testing - on device too!
  • 30. Push Lead! Life of push on-call was not fun.
  • 31. Requirements for new system On-Demand, Rapid Feature Delivery Intuitive and painless Easy recovery from errors Insight and Communication Balance between Agility & Stability
  • 32. R elease R elease R elease Patch PatchPatch 2 week Releases + Ad-Hoc Patches http://bit.ly/1E6a9yn
  • 33. M R M R IR IR IR IR 3 week Major Releases + Weekly Incremental Releases Major releases (MR) every three weeks - dates shared outside the team Weekly Incremental releases (IR) in between; two IRs per MR cycle
  • 34. Automate SCM Tasks Eliminated Code Freeze. Engineers were responsible for managing their commits. Automated code merge tasks
  • 35. Automated Dependency Validation Dependency Management was creating a lot of churn in our cycle. We built a separate pipeline that resolved the dependency tree, validated it by running a series of tests and then committed the resolved graph to source. All development is based off that known good set of dependencies until the next run of that pipeline.
  • 36. Test Strategy Increasing confidence Worked out a test strategy so effort could be applied at the appropriate level of testing. The idea was to build a series of tests that acted as gates and as code made its way up the pyramid, our confidence in it would increase.
  • 37. Test Runtimes 60% No False Positives Eliminating non-determinism and shortening runtime is a fundamental requirement. The point to note is that this is an ongoing process; you need to stay on top of this!
  • 38. Improved Result Reporting In keeping with the goal of making the system simple and intuitive, we added detailed insights into test results so anyone could quickly root cause failures and act on them.
  • 39. Automated Deployments Internal Environments Using Asgard API Connected to builds Driven from CI Server By now, we were operating multiple internal environments and the company was getting ready to bring a new AWS region online. We automated deployments to all those environments.
  • 40. Pipelines And now, we had ourselves a pipeline! In fact, we had 3 - one for each long lived branch.
  • 41. • Multiple deployments/day ! • Multiple internal environments ! • Multiple AWS regions http://bit.ly/13qrIfw A big milestone for the team.
  • 42. Team Cohesion ! • Shared ownership - no silos • Increased partner satisfaction • Greater productivity Equally, if not more important was the change in the team dynamic. There was increased cohesion as people got comfortable with the self-service model and the idea of sharing ownership.
  • 44. Faster, Better, All the way! Shorter Feedback Loop Increased Confidence Richer Insight & Communication
  • 45. Build Bake Test Deploy Build BakeTest Deploy Increase velocity: Developer workflow NEtflix BUild LAnguage plugin for Gradle that provides specific functionality for the Netflix environment
  • 46. ! Branching Strategy Modeled after github-flow ! Automated Pull Request Processing ! Automated Patch Branching
  • 47. Single long-lived branch Always deploy-able Feature branches
  • 48. More, Better, Faster & to Prod Shorter Feedback Loop Increased Confidence Richer Insight & Communication
  • 49. Automated Canary Analysis Aggregate Health Score >1500 metrics Configurable Multiple regions Old$Code$(Baseline)$ New$Code$(Canary)$ ~1%$Traffic$ Automated Canary Analysis is the arguably the most important tool in our toolkit. We started out small, comparing simple metrics. Then expanded it to make it a system that generates a health score based on comparisons across 1000s of metrics.
  • 50. Canary reports are generated at periodic intervals and emailed to the team. They are also available off the dashboard. The report shows an overall confidence score of the readiness of that build. This one didn’t do very well.
  • 51. Details of the problematic metrics that contributed to the poor canary score.
  • 54. Not intended for deployment Not deployable; failed tests Deployed
  • 55. Hands Free Production Deployments http://bit.ly/1wQ8fPQ
  • 58. Old Code New Code Production Traffic
  • 59. Old Code New Code Production Traffic
  • 60.
  • 61. We can see an outage in real time - the no. of 5XX errors & latency spiked during the incident. This data is being streamed by hundreds of servers, aggregated using Turbine and streamed to the dashboard.
  • 62. Feature Rollback Dynamic configuration using Archaius allows features to be toggled dynamically. If newly introduced feature proves to be problematic, turning it off is an easy way to restore system health. Archaius is a set of config mgmt APIs based on Apache Common Config lib. This allows configuration changes to be propagated in a matter of minutes; at runtime without requiring app downtime. Configuration properties are multi-dimensional and context aware so their scope can be applied to a specific context e.g. env = Test/Staging/Production or region=us-east/us-west/eu-west etc.
  • 63. Full Rollback In the event that a newly deployed version of the software proves to be problematic, the system can be rolled back to the previous version. The old cluster is kept alive for a few hours so the automation knows what to roll back to. Because of our extensive use of autoscaling, provisioning the clusters accurately is tricky; and having to do it manually across three regions would make rollbacks slow and leave them to prone to error. Even though rollbacks are rare, the cost of getting it wrong is too high.
  • 64. Old Code New Code Re-enable traffic Production Traffic
  • 65. Old Code New Code Production Traffic
  • 66. More, Better, Faster & to Prod Shorter Feedback Loop Increased Confidence Richer Insight & Communication
  • 67. Dashboard shows the status of current and upcoming deployments, builds and associated artifacts. Diff reports for source and libraries help identify contents of the build.
  • 68. Different views of the data are available. This is a build that passed through all the stages successfully; including the canary.
  • 69. Committers and On-Calls are notified when a build is scheduled for deployment so they can be available if needed.
  • 70.
  • 71. API Server Deployment Velocity #Deployments #Rollbacks Deployments & Rollbacks/week since Oct 2012 The current state is that we deploy ~3 times/week on average. Additionally, deployments can be triggered on demand.
  • 72. Take-aways Build Agility into Architecture Embrace Change; Don’t Fight it! Failure is inevitable Insight is key • Examples of building agility - cloud-native, loosely coupled microservices, distributed data stores, dynamic configuration using Archaius provides ability to effect changes in the behavior of our deployed services dynamically • Embrace change - dependencies will change; everyone is evolving and moving fast. Best to get ahead of it rather than try and fight it • Failure is inevitable; re-assess balance of investment between preventing failure and rapid recovery + impact mitigation • Insight - not just operational(that’s a given!), but engineering too. It becomes key when responsibility is shared.
  • 73. What’s Next “Tiered” Canary Analysis Failure Injection Testing Throughput Trending http://goo.gl/zjiV6W
  • 74. Culture http://goo.gl/7l7xkM Good architectural practices, automation & tooling and deep insight into our systems allow us to operate resilient systems and go fast at scale. But the key piece that brings it all together and completes the picture is our culture.
  • 75. Freedom and Responsibility Culture is based on the principles of Freedom and Responsibility.
  • 76. Employees have the freedom to make decisions and act on them as it pertains to their daily activities. The counterbalance is the responsibility they assume for the implications of their actions. Management’s job is to set the appropriate context so employees have all the information they need to make the right decisions and judgement calls. This fosters a blameless culture where people feel empowered to take risks.
  • 77. http://netflix.github.io/ http://techblog.netflix.com Visit our github site and techblog for information and details about interesting topics related to distributed systems,
  • 78. How we learned to stop worrying and start deploying The Netflix API service Sangeeta Narayanan @sangeetan http://www.linkedin.com/in/sangeetanarayanan http://bit.ly/1wq2kkN Thank You!
  • 79. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/deploy- netflix-api-service