Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1O6hOj9.
Roy Rapoport shares some of the lessons Netflix learned building a monitoring system, the challenges, pitfalls and opportunities encountered along the way. Filmed at qconlondon.com.
Roy Rapoport manages the Insight Engineering group at Netflix, responsible for building Netflix's Operational Insight platforms, including cloud telemetry, alerting, and real-time analytics. He originally joined Netflix as part of its datacenter-based IT/Ops group, and prior to transferring over to Product Engineering, was managing Service Delivery for IT/Ops.
Netflix Built Its Own Monitoring System - and Why You Probably Shouldn't
1. Netflix Built Its Own Monitoring System
(And You Probably Shouldn’t)
Roy Rapoport
rsr@netflix.com @royrapoport
6 March 2015
2. InfoQ.com: News & Community Site
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/netflix-monitoring-system
3. Presented at QCon London
www.qconlondon.com
4. Not So Much About Telemetry
• I ♥ telemetry
• Architecture track Open Space,
11:30AM, Fleming 3rd Floor
6. Agenda
• Introductions
• On Judgment
• Your Problem
• Your (no, really) Solution
• Mitigation and Anecdotes
• (Not) building your own monitoring system
7. Introductions: Me
• About 23 years in technology
• Systems engineering, networking, software development, QA, release management
• Time at Netflix: 2076 days (5y:8m:7d)
• At Netflix:
• Systems Engineering, Service Delivery in IT
• Troubleshooter and Builder of Python Things in Product Engineering
• Now: Engineering Manager, Insight Engineering
8. Introductions: Netflix
• Optimize speed of innovation
• Constrain availability
• Cost is what it is
• Hire smart people, get out of their way
• Anti-process bias
“Freedom and Responsibility”
10. You Have a Problem
(Your job would likely be boring otherwise)
• Are you the first
• To have it?
• To care?
• Are you sure?
One that looks nice
And not too expensive
11. You Have a Problem
(Your job would likely be boring otherwise)
• You’re not the first, or only
• Good news!
• Then what?
12. Adventures in IT-Land
• (import disclaimer)
• Not developers
• Cautious about ongoing support load
• Not well-trusted
21. A Question of Trust
• Technical: I don’t trust your product
• Organizational: I don’t trust you
22. I Don’t Trust You
To Care About Me as a Customer
• You’re selling me something
• I’m not your only customer
• I’m not an important customer
• You don’t care about your
customers
23. I Don’t Trust You
To build a good product
• Past performance …
• “Good for me”
• Because you said so, that’s why!
24. I Don’t Trust You
To build it fast enough
• Unpredictable velocity
• When best-case is too slow
• Or maybe ever (OSS)
40. Composability: Example
Deployments and Automated Canary Analysis at Netflix
[Diagram: Edge Systems Deployment Automation Platform; Edge Systems Canary Analysis; API; API; Mainline Deployment Automation Platform]
41. Composability: Example
Deployments and Automated Canary Analysis at Netflix
[Diagram: Edge Systems Deployment Automation Platform; Edge Systems Canary Analysis; API; Email; Insight Engineering Canary Analysis; Mainline Deployment Automation Platform]
42. Composability: Example
Deployments and Automated Canary Analysis at Netflix
[Diagram: Edge Systems Deployment Automation Platform; Edge Systems Canary Analysis; API; Insight Engineering Canary Analysis; Mainline Deployment Automation Platform]
43. Composability: Example
Deployments and Automated Canary Analysis at Netflix
[Diagram: Edge Systems Deployment Automation Platform; Edge Systems Canary Analysis; Insight Engineering Canary Analysis; Mainline Deployment Automation Platform]
44. Composability: Example
Deployments and Automated Canary Analysis at Netflix
[Diagram: Edge Systems Deployment Automation Platform; Insight Engineering Canary Analysis; Mainline Deployment Automation Platform]
45. Composability: Example
Deployments and Automated Canary Analysis at Netflix
[Diagram: Edge Systems Deployment Automation Platform; Insight Engineering Canary Analysis; Mainline Deployment Automation Platform]
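One way to read the composability slides above: when a deployment platform talks to canary analysis only through an API, the analysis component can be replaced or shared without touching the platform. The Python sketch below illustrates that idea only. Every name in it (CanaryAnalyzer, RatioAnalyzer, DeploymentPlatform, the owner strings, the 95-point pass score) is hypothetical, and the scoring rule is a toy assumption; the slides do not describe Netflix's actual API or algorithm.

```python
from dataclasses import dataclass
from typing import Protocol


class CanaryAnalyzer(Protocol):
    """Hypothetical contract: compare error rates and return a 0-100 score."""

    def score(self, baseline_errors: float, canary_errors: float) -> float:
        ...


@dataclass
class RatioAnalyzer:
    """Toy analyzer (an assumption, not Netflix's algorithm)."""

    owner: str

    def score(self, baseline_errors: float, canary_errors: float) -> float:
        # A canary that is no worse than the baseline gets full marks.
        if canary_errors <= baseline_errors:
            return 100.0
        # Otherwise the score shrinks in proportion to the extra errors.
        return 100.0 * baseline_errors / canary_errors


@dataclass
class DeploymentPlatform:
    """Hypothetical platform that depends only on the analyzer contract."""

    name: str
    analyzer: CanaryAnalyzer
    pass_score: float = 95.0

    def should_promote(self, baseline_errors: float, canary_errors: float) -> bool:
        return self.analyzer.score(baseline_errors, canary_errors) >= self.pass_score


shared = RatioAnalyzer(owner="insight-engineering")
edge = DeploymentPlatform("edge", RatioAnalyzer(owner="edge-systems"))
mainline = DeploymentPlatform("mainline", shared)

# Swapping in the shared analyzer requires no change to the platform code.
edge.analyzer = shared
```

Because DeploymentPlatform depends only on the score method, pointing both platforms at one shared analyzer is a one-line change, which is the property the build sequence on these slides appears to illustrate.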
51. The Grand Example
Netflix’s Monitoring Platform
• Prior system owned by IT
• No great OSS products
• Ridiculous scale
52. The Grand Example
Netflix’s Monitoring Platform
• Prior system owned by IT
• No great OSS products
• Ridiculous scale
• Seriously, how hard can it be?
53. The Grand Example
Netflix’s Monitoring Platform
• Took longer than expected
• Ongoing maintenance
• UI only recent priority
54. The Grand Example
Netflix’s Monitoring Platform
• Scales efficientlyish
• Impedance match with dev lifestyle
• Nicely pluggable*
• Aggressivish OSS efforts
* Ask me about Real-Time Analytics!
55. The Grand Example
Netflix’s Monitoring Platform
• Still the right solution
• Worried about Sunk Cost Fallacy
• Most shouldn’t do this
56. Can You Repeat That?
Or: What’s Your Point?
Or: I was Tweeting. Did I miss something?
• What’s important to you?
• Is this a technical decision? Really?
• Honest and non-judgmental
• Any mitigation?
• Don’t build your own monitoring system. Seriously.
57. Name This Group
• United States
• Europe
• China
• Russia
• India
• Japan
• Blue Origin
• SpaceX
• Virgin Galactic