Operational
Automation
Helping Netflix developers
sleep at night!
Jean-Sebastien Jeannotte / Sayli Karmarkar
Jean-Sebastien Jeannotte – JS
Senior Software Engineer
Platform Automation Engineering
jjeannotte@netflix.com
@jsjeannotte
http://www.linkedin.com/in/jsjeannotte
Sayli Karmarkar
Senior Software Engineer
Platform Automation Engineering
skarmarkar@netflix.com
@HikerTechy
https://www.linkedin.com/in/saylikarmarkar
Speakers
AWS
Re:Boot
September 2014, Every AZ
C*
Priam
C*
Priam
C*
Priam
Atlas
Our Stack in 2014
Atlas Dashboard
Healthcheck Script
Every
30 min
Disappearing
instance?
Launch new
instance
Is the C* ring
healthy?
Are all instances
healthy?
Can we fix
automatically?
Replace bad
instance
First failure?
Sleep for X
minutes and
retry
First failure?
Is there an
offline
maintenance?
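The healthcheck flow on this slide can be sketched as a single decision function. This is a minimal illustration, not the original Jenkins-scheduled script: the `ClusterState` fields, the branch ordering, and the final "page on-call" fallback are assumptions, since the flattened flowchart does not preserve its arrows.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClusterState:
    """Hypothetical snapshot of one C* cluster; all fields are illustrative."""
    missing_instances: List[str] = field(default_factory=list)
    unhealthy_instances: List[str] = field(default_factory=list)
    ring_healthy: bool = True
    offline_maintenance: bool = False
    auto_fixable: bool = True


def healthcheck(state: ClusterState, first_failure: bool) -> List[str]:
    """Return the actions one healthcheck pass would take, in order."""
    actions: List[str] = []

    # Disappearing instance? Launch a replacement.
    for instance in state.missing_instances:
        actions.append(f"launch replacement for {instance}")

    # Ring healthy and all instances healthy: nothing else to do.
    if state.ring_healthy and not state.unhealthy_instances:
        return actions or ["ok"]

    # Planned offline maintenance explains the failure.
    if state.offline_maintenance:
        return actions + ["skip: offline maintenance"]

    # A first failure may be transient, so wait and retry before acting.
    if first_failure:
        return actions + ["sleep and retry"]

    # Repeated failure: replace bad instances if that is known to be safe.
    if state.auto_fixable and state.unhealthy_instances:
        return actions + [f"replace {i}" for i in state.unhealthy_instances]

    # Assumed fallback: anything else goes to a human.
    return actions + ["page on-call"]
```

Returning a list of action names instead of performing them keeps the sketch easy to test; a real pass would call the cloud and Cassandra tooling at each step.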
AWS
Re:Boot
September 2014, Every AZ
How Did the Healthcheck Script Handle It?
Every
30 min
Disappearing
instance?
Launch new
instance
Is the C* ring
healthy?
Are all instances
healthy?
Can we fix
automatically?
Replace bad
instance
First failure?
Sleep for X
minutes and
retry
First failure?
Is there an
offline
maintenance?
Let’s Take a Step Back
Engineer
Wakes up
Logs in
and ACK
Checks
runbook
Studies
the alert
Fixes the
problem
Runs
diagnostics
PagerDuty
Alert
2:00 AM 2:02 AM 2:07 AM 2:10 AM 2:15 AM 2:20 AM 2:30 AM
On-call, Without Automation
Non-Automated On-Call Pain Points
MTTR
Productivity
New Direction
Failure / Alert Automation
Automation using Building Blocks
Integrations with Netflix Ecosystem
Platform as a Service
Event-driven Automation Platform
How Are Others Approaching This Problem?
Evaluation
Winston
+
Inbound Integrations
+
Outbound Integrations
...
As a Service
SQS queue
Atlas SQS Sensor
Poll
RabbitMQ
Atlas Alert Trigger
Stackstorm Action Runners
Action A Action B Workflow C
Rules Engine
Rule Definitions
MongoDB
Replica Set
Winston Deployment
Atlas Telemetry
Platform
...
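The deployment above follows a sensor, trigger, rule, action pipeline. The sketch below imitates that shape in plain Python, with an in-memory queue standing in for the SQS queue. It does not use the Stackstorm API; the alert fields, rule names, and the `rule` decorator are all hypothetical, and the message bus (RabbitMQ) and persistence (MongoDB) layers are left out.

```python
import queue
from typing import Callable, Dict, List

# Stand-in for the SQS queue that receives Atlas alerts (hypothetical).
alert_queue: "queue.Queue[dict]" = queue.Queue()

# Rule definitions: alert name -> action to run. Names are illustrative.
Action = Callable[[dict], str]
rules: Dict[str, Action] = {}


def rule(alert_name: str) -> Callable[[Action], Action]:
    """Register an action as the handler for one alert name."""
    def register(action: Action) -> Action:
        rules[alert_name] = action
        return action
    return register


@rule("cassandra_disk_space")
def diagnose_disk(alert: dict) -> str:
    return f"diagnostics started for {alert['instance']}"


def poll_once() -> List[str]:
    """Drain the queue once, as a sensor would, and dispatch each alert."""
    results: List[str] = []
    while True:
        try:
            alert = alert_queue.get_nowait()
        except queue.Empty:
            return results
        action = rules.get(alert["name"])
        # No matching rule: fall back to paging a human.
        results.append(action(alert) if action else f"page on-call for {alert['name']}")
```

Keeping rules in a lookup table separate from the actions mirrors the slide's split between rule definitions and action runners: a team can add a rule without touching the polling code.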
Cassandra Monitoring with Winston
C*
Priam
C*
Priam
C*
Priam
Atlas
Engineer
Wakes up
Logs in
and ACK
Checks
runbook
Studies
the alert
Fixes the
problem
Runs
diagnostics
PagerDuty
Alert
2:00 AM 2:02 AM 2:07 AM 2:10 AM 2:15 AM 2:20 AM 2:30 AM
On-call, Without Automation
False
Positive
Winston
2:00 AM
2:05 AM
2:05 AM
2:15 AM
Assisted
Diagnostics
Fixed the
problem
On-call With Winston
Runbook patterns
False Positive
Assisted Diagnostics
Auto Remediation
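Two of these patterns can be shown with the "broker offline" alert described in the speaker notes: a false-positive check followed by an auto-remediation attempt. This is a sketch under stated assumptions; the function name, its parameters, and the returned strings are hypothetical, and `restart_broker` is a caller-supplied callable instead of a real Kafka operation.

```python
from typing import Callable


def handle_broker_offline(
    instance_running: bool,
    disk_failed: bool,
    restart_broker: Callable[[], bool],
) -> str:
    """Decide the outcome of a 'broker offline' alert."""
    # False positive: the instance is gone, so an offline broker is expected.
    if not instance_running:
        return "resolved: false positive"
    # A failed disk cannot be fixed by a restart; hand over to a human.
    if disk_failed:
        return "page on-call: disk failure"
    # Auto-remediation: try a restart and page only if it does not help.
    if restart_broker():
        return "resolved: broker restarted"
    return "page on-call: restart failed"
```

For example, `handle_broker_offline(True, False, lambda: True)` models the case where the instance is still up, the disk is fine, and the restart succeeds, so nobody is paged.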
● Product
○ Reduced MTTR (Mean Time To Recover)
○ Safety - Reduce risk of human errors
○ Capture operational knowledge as code
● People
○ Reduced pager fatigue for developers
○ Increase in productivity
○ Morale
Impact
Stackstorm Docs - http://docs.stackstorm.com/
Stackstorm Slack Channel - https://stackstorm-community.slack.com/
Netflix OpenSource: https://netflix.github.io/
Check out our https://jobs.netflix.com page for current openings
… no more questions
Netflix Winston meetup presentation 2015-11-18

Editor's Notes

  1. We focus on providing a common automation platform for Netflix Teams.
  2. Who runs a service on AWS? Amazon EC2 is hosted in multiple locations worldwide, organized into regions and Availability Zones. Each region is a separate geographic area containing multiple isolated Availability Zones. Why Re:Boot? A Xen security issue meant rebooting a lot of instances in all the Availability Zones. Why is that a big deal? For stateless services, it isn't. For stateful services such as C*, it is. Missing the 50M party in L.A. was also at stake.
  3. Denial: that can't be true. Anger: yep, it's confirmed. Bargaining: we tried convincing AWS to delay. Depression: they said no; the risk was too high. Acceptance: what now? Actually, it's easy to accept because of the Simian Army.
  4. Has anyone heard of the Simian Army? It is a suite of tools for keeping your cloud operating in top form: Janitor Monkey, Security Monkey, Coffee Monkey. Chaos Monkey, the first member, is a resiliency tool that helps ensure your applications can tolerate random instance failures. Netflix EMBRACES chaos. We love it so much that we generate it, in prod. We run it on most Netflix services, even on C*.
  5. CDE has Chaos Monkey enabled on our C* clusters: at most one node per day, during business hours. The Cassandra team's healthcheck system detects the missing instance and replaces it. Going back to our stages of grief, this made acceptance easier: we test for this, and our automation can take it.
  6. What did our stack look like at the time? A bunch of Python and shell scripts, with Jenkins as the job scheduler (healthchecks, node replacements, repairs, upgrades, etc.). On the C* nodes: C* plus Priam. Atlas is already a very powerful metrics and alerting tool, and our metric systems add non-C* metrics (app metrics, for example) that help with correlation.
  7. Atlas is a very powerful time-series metrics and alerting tool, and it is open source.
  8. Simplified view of the healthcheck flow, covering assisted diagnostics and auto-remediation. Auto-remediation is supported for disappearing instances and for replacing an instance with bad I/O.
  9. How did the healthcheck behave during Re:Boot?
  10. Two behaviours. An instance that reboots is a false positive (a transient issue). An instance that reboots but fails the AWS healthcheck and is terminated triggers auto-remediation.
  11. 218 C* nodes were rebooted. 22 nodes didn't start and were automatically terminated by the AWS internal healthcheck. Our healthcheck identified the missing nodes and automatically remediated the issue. Zero downtime. The L.A. party was awesome.
  12. Take the learnings from CDE, abstract them, and see how to apply them to other teams. Increase in scope: how can we maximize impact? Our main focus was to apply the three learnings (false positive, assisted diagnostics, auto-remediation) to help on-call engineers sleep at night by improving on-call automation. Why is it such a big deal? First, you need to understand the DevOps model at Netflix: everyone is on-call, and every team manages its own service. This means a lot of on-call engineers doing on-call operations. So what does it look like to be on-call when there is no or limited automation?
  13. On-call before Winston: long MTTR (Mean Time To Recover).
  14. Operational knowledge lives in documents that are hard to maintain. Risk of human error. Pager fatigue and morale. High MTTR (Mean Time To Recover). Impact on productivity.
  15. JS: Hand over to Sayli, who will cover how the new system helps alleviate these pain points. Sayli: To reduce the pain points we were facing, we started thinking about a new approach. A quality of a good engineer is learning not only from failures but also from successes, like our AWS reboot success story. We survived the reboot using a system that automatically diagnosed and fixed a known failure scenario. The problem was that it wasn't designed to be extended to other failure scenarios or used by other teams. Proactive automation. The idea of reactive, automatic troubleshooting and remediation is highly useful, especially in operations. With this expanded charter for our team, we focused on the key features of a system that would solve these problems for us, and the answer was event-driven automation.
  16. (2) Instead of autonomous systems, the ability to share building blocks within a single service or across multiple services. (3) For example, the sophisticated telemetry system Atlas, the CI platform (Spinnaker), and Jenkins. (4) Last but not least, service owners can focus on the automation and not the platform: make it self-serve.
  17. The problem space is not unique to Netflix. We started working on an initial design of our own, an internal POC. We looked at Facebook (FBAR), LinkedIn (Nurse), and Dropbox (Naoru), which helped us see how they approached the problem. We also came across this meetup group: 400 auto-remediators.
  18. Now that we knew WHAT the requirements were, we worked on figuring out HOW. We evaluated building the platform from scratch, adopting an existing solution, or mixing and matching: using some existing components and building others. We started with our own POC, then decided to go with Stackstorm, an event-driven automation platform for integration and automation across services and tools. The use cases Stackstorm was targeting, facilitated troubleshooting (event handling) and automated remediation, fit right into what we were looking for. It is open source, the code quality is good, and the team is great to collaborate and code with. We had great discussions about their use cases, approach, and adoption challenges, which helped us validate the benefits.
  19. Stackstorm gave us an event-driven automation platform and building blocks. What about integration with the Netflix ecosystem?
  20. Any Pulp Fiction fans in the audience?
  21. On-call before Winston: long MTTR (Mean Time To Recover).
  22. And now with Winston. Winston gets the alert and uses its rules engine to decide what the right action is. The action then analyses the issue, and if it is identified as a false positive, there is no need to page the on-call. Another use case: Winston identifies that it can fix the issue itself. When it does, again, there is no need to page the on-call. The last use case, the one we want you to focus on, is assisted diagnostics. While the on-call is being paged, Winston runs a series of predefined diagnostics and prepares a report, so that when the on-call logs in, they have comprehensive information such as the Discovery status, the list of recent exceptions or errors, and any other relevant context to help them make a decision faster.
  23. Let’s look at some real-life scenarios. Is there anybody who doesn’t know what a runbook is? A runbook is a compilation of routine procedures and steps that a sysadmin or on-call person goes through to diagnose and remediate a failure. Runbooks generally have three broad steps. Remove false positives: an expected scenario that can safely be ignored. Diagnostics: collect troubleshooting information. Remediation: fix the problem. Now let’s see some examples of how Winston can assist in these steps.
  24. The first example is a false positive. Data Pipeline team: broker offline. But the instance was terminated by AWS, so it is expected that the broker is offline. Issue resolved; no need to page the on-call.
  25. Another example, this time assisted diagnostics for Cassandra: a disk space issue. Winston gives context on the size of the actual C* data, checks whether a repair or compaction is running (both temporarily increase disk usage), and tries some auto-remediation by cleaning up old snapshots. If usage is still above the threshold, it pages the on-call. In this case the on-call doesn’t have to clean up snapshots, since Winston already did, and can focus on other, unknown root causes. Faster TTR.
  26. Last example: auto-remediation. For the Data Pipeline team, the broker is offline again, as in the first example, but this time the EC2 instance is still running, so it is not a false positive. Winston checks whether there is a disk failure and, if not, tries to restart the Kafka broker. It succeeded: the broker is back online. Resolved, without paging the on-call.
  27. Add resources for Stackstorm, the Slack channel, and happy hour. The Stackstorm guys are here.
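The Cassandra disk-space walkthrough in note 25 can be sketched as a small runbook function that gathers context, attempts one remediation, and pages only if usage stays high. Everything here is illustrative: the `DiskReport` fields, the 80% threshold, and the assumption that all snapshots are old enough to delete are not from the talk.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DiskReport:
    """Hypothetical per-node numbers; field names are illustrative."""
    capacity_gb: float
    data_gb: float
    snapshots_gb: float
    repair_or_compaction_running: bool

    def usage(self) -> float:
        """Fraction of the disk taken by data plus snapshots."""
        return (self.data_gb + self.snapshots_gb) / self.capacity_gb


def disk_space_runbook(report: DiskReport, threshold: float = 0.8) -> Tuple[str, List[str]]:
    """Return (outcome, notes gathered for the on-call)."""
    # Assisted diagnostics: collect context before touching anything.
    notes = [
        f"C* data size: {report.data_gb:.0f} GB",
        f"repair or compaction running: {report.repair_or_compaction_running}",
    ]
    # Auto-remediation: this sketch assumes every snapshot is safe to drop.
    freed = report.snapshots_gb
    report.snapshots_gb = 0.0
    notes.append(f"cleared snapshots, freed {freed:.0f} GB")

    if report.usage() <= threshold:
        return "resolved", notes
    # Still above threshold: page, but with the context already collected.
    return "page on-call", notes
```

The notes list is returned in both outcomes, so even when a page goes out the on-call starts from the collected context instead of from scratch.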