
Winston - Netflix's event driven auto remediation and diagnostics tool

This was a slide deck on Winston presented at a meetup on auto remediation and diagnostics: https://www.meetup.com/Auto-Remediation-and-Event-Driven-Automation/events/234628846/

Published in: Software

  1. Winston
     Diagnostic and Remediation Engineering (DaRE)
     Vinay Shah & Jean-Sebastien Jeannotte
  2. Topics
     ● Introduction
     ● Internals - How it works?
     ● Demo - See it in action!
     ● Learnings and challenges
     ● Metrics & Road ahead
     ● Additional resources
  3. Introduction
  4. Landscape
     ● Operational load vs. new features
     ● Scale and Growth
     ● Availability
  5. Diagram labels: Application or Service; Monitoring; Alerting; PagerDuty, Email, Winston
  6. Business goals
     ● Reduce MTTR
     ● Reduce risk of human errors
     ● Reduce pager fatigue, provide tier 1 support
     ● Don’t worry about infrastructure, focus on your business logic
     ● Best practice for runbook lifecycle management
  7. Winston is an event driven runbook automation platform. It is designed to host and execute runbooks in response to operational events.
  8. Internals
  9. How is it deployed?
  10. Execution Flow
  11. Winston Studio
     ● One stop portal for all things Winston
     ● Supports Create, Read, Update, Delete, Execute and Diagnose functionality
     ● Implements best practices
       ○ Compliance/Auditing
       ○ Persistence
       ○ Security (Authentication/Authorization)
     ● Self serve & scalable
  12. Terminology
     ● Pack: A group of related automations, typically organized around a discrete service or product
     ● Action: Set of steps to help with diagnostics or remediations, written as code
     ● Event & event source: External services that are the source of events that trigger a runbook
  13. Demo
  14. Winston Studio DEMO
  15. Sample use cases
     ● False positives
       ○ Cassandra ring health
     ● Diagnostics - correlation could point towards causation, e.g.:
       ○ Querying Chronos events
       ○ Querying dependencies upstream and downstream for anomalous behaviour
     ● Remediation
       ○ Clean up disk space
       ○ Restart Kafka process
  16. Learnings & challenges
  17. Common patterns
  18. Insights
     ● Usage
       ○ Culture of automating the manual and repeatable
       ○ Noisy signals become more interesting
       ○ The lesser the control, the more the opportunity
     ● Product
       ○ Safety is crucial
       ○ Usability is important
       ○ Resiliency
  19. Recommendations to get started
     ● Don’t reinvent the wheel
     ● Start simple and iterate
     ● Allow experimentation
     ● Pay special attention to the usability of your product
     ● Push for changing the culture - usage will follow
     ● Talk to us/others who have gone through some of the pains and learnings
  20. Metrics and Road ahead
  21. The road ahead
     ● Adoption. Adoption. Adoption.
     ● Usability
       ○ Polyglot support (Groovy based actions)
       ○ Deeper Integrations
     ● Safety
       ○ Resource isolation (Containers)
       ○ Rate limiting
  22. Links & resources
     ● Introducing Winston: http://techblog.netflix.com/2016/08/introducing-winston-event-driven.html
     ● Stackstorm: https://docs.stackstorm.com/
     ● Reach out: vshah@netflix.com or jjeannotte@netflix.com
     ● We are hiring: Senior Software Engineer - https://jobs.netflix.com/jobs/860752
  23. Thank you.
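
The terminology on slide 12 (pack, action, event and event source) and the "clean up disk space" use case on slide 15 can be illustrated with a small sketch. This is not Winston's or StackStorm's API: every class, function, and event name below (`Event`, `Pack`, `on`, `handle`, `disk_space_low`, and so on) is a hypothetical stand-in, and the actions only return a description of what they would do rather than touching a real host.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# All names here are hypothetical illustrations of the deck's terminology;
# none of them come from Winston or StackStorm.


@dataclass
class Event:
    """An operational event raised by an external event source."""
    source: str                      # e.g. an alerting system
    name: str                        # e.g. "disk_space_low"
    payload: Dict[str, str] = field(default_factory=dict)


# An action is a set of diagnostic or remediation steps written as code.
Action = Callable[[Event], str]


@dataclass
class Pack:
    """A group of related automations around one service or product."""
    name: str
    rules: Dict[str, List[Action]] = field(default_factory=dict)

    def on(self, event_name: str, action: Action) -> None:
        """Register an action to run when an event with this name arrives."""
        self.rules.setdefault(event_name, []).append(action)

    def handle(self, event: Event) -> List[str]:
        """Run every action registered for the event, in registration order."""
        return [action(event) for action in self.rules.get(event.name, [])]


def diagnose_disk(event: Event) -> str:
    host = event.payload.get("host", "unknown")
    return f"diagnose: would inspect disk usage on {host}"


def clean_up_disk(event: Event) -> str:
    host = event.payload.get("host", "unknown")
    return f"remediate: would clean up disk space on {host}"


def build_demo_pack() -> Pack:
    """Assemble a pack that diagnoses first, then remediates."""
    pack = Pack(name="demo_service")
    pack.on("disk_space_low", diagnose_disk)
    pack.on("disk_space_low", clean_up_disk)
    return pack


if __name__ == "__main__":
    demo = build_demo_pack()
    alert = Event(source="alerting", name="disk_space_low",
                  payload={"host": "node-1"})
    for line in demo.handle(alert):
        print(line)
```

Running the diagnostic action before the remediation mirrors the deck's split between diagnostics and remediation; an event with no registered action simply yields an empty result, which keeps unknown events from triggering anything.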
