Netflix Built Its Own
Monitoring System
(And You Probably Shouldn’t)
Roy Rapoport
rsr@netflix.com @royrapoport
6 March 2015
Watch the video with slide synchronization on InfoQ.com:
http://www.infoq.com/presentations/netflix-monitoring-system
Presented at QCon London
www.qconlondon.com
Not So Much About Telemetry
• I ♥ telemetry
• Architecture track Open Space,
11:30AM, Fleming 3rd Floor
The Knights
Who Say
NIH
Agenda
• Introductions
• On Judgment
• Your Problem
• Your (no, really) Solution
• Mitigation and Anecdotes
• (Not) building your own monitoring system
Introductions: Me
• About 23 years in technology
• Systems engineering, networking, software development, QA, release management
• Time at Netflix: 2076 days (5y:8m:7d)
• At Netflix:
• Systems Engineering, Service Delivery in IT
• Troubleshooter and Builder of Python Things in Product Engineering
• Now: Engineering Manager, Insight Engineering
Introductions: Netflix
• Optimize speed of innovation
• Constrain availability
• Cost is what it is
• Hire smart people, get out of their way
• Anti-process bias
“Freedom and Responsibility”
Judgment
You Have a Problem
(Your job would likely be boring otherwise)
• Are you the first
• To have it?
• To care?
• Are you sure?
One that looks nice
And not too expensive
You Have a Problem
(Your job would likely be boring otherwise)
• You’re not the first, or only
• Good news!
• Then what?
Adventures in IT-Land
• (import disclaimer)
• Not developers
• Cautious about ongoing support load
• Not well-trusted
Adventures in IT-Land
A Little Bit of …
• Time, courage, knowledge, pride
• Cynicism, hubris, fear
Technical Reasons for Rejection
(Or: It’s Not You, It’s … Actually, It’s You)
• Financial Cost
• Technical incompatibility
Overqualified!
• https://www.flickr.com/photos/54945394@N00
A Moment for Pedantry
Or: Requirements for “Not Invented Here”
The Knights
Who Say
IbPWAU
A Question of Trust
• Technical: I don’t trust your product
• Organizational: I don’t trust you
I Don’t Trust You
To Care About Me as a Customer
• You’re selling me something
• I’m not your only customer
• I’m not an important customer
• You don’t care about your customers
I Don’t Trust You
To build a good product
• Past performance …
• “Good for me”
• Because you said so, that’s why!
I Don’t Trust You
To build it fast enough
• Unpredictable velocity
• When best-case is too slow
• Or maybe ever (OSS)
What Now?
Eventual Consistency
• Fork n’ merge
• THE model for OSS
• Works better for incremental changes
• Requires alignment of goals
Eventual Consistency
No Fork Required
• Start With a New Idea
• Eventually merge concepts
Eventual Consistency Example
Mainline
Cloud Orchestration
2011 2013
Insight Engineering
CD Automation
2014
Mainline
CD Automation
2015
Eventual Consistency Example
Mainline
Cloud Orchestration
2011 2013 2014
Mainline
CD Automation
2015
Insight Engineering
CD Automation
Composability
• Want this anyway
• Map scope to options’ scopes
Composability: Example
Netflix’s Atlas Telemetry Platform
Global Query
Endpoint
Regional Query
Endpoint
Regional Query
Endpoint
Regional Query
Endpoint
Regional Query
Endpoint
Regional
Boundary
Composability: Example
Netflix’s Atlas Telemetry Platform
Global Query
Endpoint
Regional Query
Endpoint
Regional Query
Endpoint
Regional Query
Endpoint
Regional Query
Endpoint
Memory
Epic
Cloudwatch
Composability: Example
Netflix’s Atlas Telemetry Platform
Global Query
Endpoint
Regional Query
Endpoint
Regional Query
Endpoint
Regional Query
Endpoint
Regional Query
Endpoint
Memory
Cloudwatch
Composability: Example
Netflix’s Atlas Telemetry Platform
Global Query
Endpoint
Regional Query
Endpoint
Regional Query
Endpoint
Regional Query
Endpoint
Regional Query
Endpoint
Memory
Cloudwatch
OpenTSDB
InfluxDB
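One reading of the Atlas diagram above, as a minimal Python sketch: a global query endpoint fans a query out to regional query endpoints, and each regional endpoint asks whatever backends are plugged into it. This is not Atlas's actual API. The class names, the `query` method, the region labels, and the dict-backed `MemoryBackend` are all assumptions made for illustration.

```python
from typing import Dict, List, Protocol


class Backend(Protocol):
    """Hypothetical interface: anything that can return datapoints for a metric."""

    def query(self, metric: str) -> List[float]: ...


class MemoryBackend:
    """Illustrative stand-in for an in-memory store; keeps datapoints in a dict."""

    def __init__(self, data: Dict[str, List[float]]) -> None:
        self._data = data

    def query(self, metric: str) -> List[float]:
        # Unknown metrics yield an empty list rather than an error.
        return list(self._data.get(metric, []))


class RegionalEndpoint:
    """Asks every backend in one region and concatenates their datapoints."""

    def __init__(self, region: str, backends: List[Backend]) -> None:
        self.region = region
        self.backends = backends

    def query(self, metric: str) -> List[float]:
        points: List[float] = []
        for backend in self.backends:
            points.extend(backend.query(metric))
        return points


class GlobalEndpoint:
    """Fans a query out to each regional endpoint and keys the results by region."""

    def __init__(self, regions: List[RegionalEndpoint]) -> None:
        self.regions = regions

    def query(self, metric: str) -> Dict[str, List[float]]:
        return {r.region: r.query(metric) for r in self.regions}


# Usage: two regions, one of them with two backends plugged in.
us = RegionalEndpoint("us-east-1", [MemoryBackend({"rps": [1.0, 2.0]}),
                                    MemoryBackend({"rps": [3.0]})])
eu = RegionalEndpoint("eu-west-1", [MemoryBackend({})])
result = GlobalEndpoint([us, eu]).query("rps")
```

Under these assumptions, adding another store (the slide names Cloudwatch, OpenTSDB and InfluxDB) would mean writing one more class with a `query` method. The regional and global endpoints would not change.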
Composability: Example
Deployments and Automated Canary Analysis at Netflix
Edge Systems
Deployment
Automation Platform
Edge Systems
Canary Analysis
API
API
Mainline
Deployment
Automation Platform
Composability: Example
Deployments and Automated Canary Analysis at Netflix
Edge Systems
Deployment
Automation Platform
Edge Systems
Canary Analysis
API
Email
Insight Engineering
Canary Analysis
Mainline
Deployment
Automation Platform
Composability: Example
Deployments and Automated Canary Analysis at Netflix
Edge Systems
Deployment
Automation Platform
Edge Systems
Canary Analysis
API
Insight Engineering
Canary Analysis
Mainline
Deployment
Automation Platform
Composability: Example
Deployments and Automated Canary Analysis at Netflix
Edge Systems
Deployment
Automation Platform
Edge Systems
Canary Analysis
Insight Engineering
Canary Analysis
Mainline
Deployment
Automation Platform
Composability: Example
Deployments and Automated Canary Analysis at Netflix
Edge Systems
Deployment
Automation Platform
Insight Engineering
Canary Analysis
Mainline
Deployment
Automation Platform
Composability: Example
Deployments and Automated Canary Analysis at Netflix
Insight Engineering
Canary Analysis
Mainline
Deployment
Automation Platform
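The canary slides above show two deployment platforms ending up on one shared canary analysis. A hedged Python sketch of why a narrow interface makes that swap cheap: the platform depends only on a callable that scores a canary against a baseline, so the analyzer behind it can be replaced. The scoring rule, the `tolerance` and `pass_score` parameters, and all names here are invented for illustration and are not Netflix's canary analysis.

```python
from typing import Callable, Dict

# Hypothetical contract: (baseline metrics, canary metrics) -> score from 0 to 100.
CanaryAnalyzer = Callable[[Dict[str, float], Dict[str, float]], float]


def simple_analyzer(baseline: Dict[str, float],
                    canary: Dict[str, float],
                    tolerance: float = 0.1) -> float:
    """Percentage of shared metrics where the canary is within `tolerance` of baseline."""
    shared = [m for m in baseline if m in canary]
    if not shared:
        return 0.0
    ok = sum(
        1 for m in shared
        if abs(canary[m] - baseline[m]) <= tolerance * abs(baseline[m])
    )
    return 100.0 * ok / len(shared)


class DeploymentPlatform:
    """Promotes a canary only if the plugged-in analyzer scores it high enough."""

    def __init__(self, name: str, analyzer: CanaryAnalyzer,
                 pass_score: float = 80.0) -> None:
        self.name = name
        self.analyzer = analyzer
        self.pass_score = pass_score

    def should_promote(self, baseline: Dict[str, float],
                       canary: Dict[str, float]) -> bool:
        return self.analyzer(baseline, canary) >= self.pass_score


# Usage: two platforms share one analyzer; either could be handed a different one.
edge = DeploymentPlatform("edge", simple_analyzer)
mainline = DeploymentPlatform("mainline", simple_analyzer)
baseline = {"latency": 100.0, "errors": 10.0}
good = edge.should_promote(baseline, {"latency": 105.0, "errors": 10.5})
bad = mainline.should_promote(baseline, {"latency": 105.0, "errors": 20.0})
```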
One More Reason
“Think of the glory. Think of your reputation. Think how great it'll look on your next resume.”
- Lois McMaster Bujold
Judgment
The Grand Example
Netflix’s Monitoring Platform
• Prior system owned by IT
• No great OSS products
• Ridiculous scale
• Seriously, how hard can it be?
The Grand Example
Netflix’s Monitoring Platform
• Took longer than expected
• Ongoing maintenance
• UI only recent priority
The Grand Example
Netflix’s Monitoring Platform
• Scales efficientlyish
• impedance match with dev lifestyle
• Nicely pluggable*
• Aggressivish OSS efforts
* Ask me about Real-Time Analytics!
The Grand Example
Netflix’s Monitoring Platform
• Still the right solution
• Worried about Sunk Cost Fallacy
• Most shouldn’t do this
Can You Repeat That?
Or: What’s Your Point?
Or: I was Tweeting. Did I miss something?
• What’s important to you?
• Is this a technical decision? Really?
• Honest and non-judgmental
• Any mitigation?
• Don’t build your own monitoring system. Seriously.
Name This Group
• United States
• Europe
• China
• Russia
• India
• Japan
• Blue Origin
• SpaceX
• Virgin Galactic
11:30am Frasier Room (3rd Floor)
@royrapoport
rsr@netflix.com

  • 41. Composability: Example Deployments and Automated Canary Analysis at Netflix Edge Systems Deployment Automation Platform Edge Systems Canary Analysis API Email Insight Engineering Canary Analysis Mainline Deployment Automation Platform
  • 42. Composability: Example Deployments and Automated Canary Analysis at Netflix Edge Systems Deployment Automation Platform Edge Systems Canary Analysis API Insight Engineering Canary Analysis Mainline Deployment Automation Platform
  • 43. Composability: Example Deployments and Automated Canary Analysis at Netflix Edge Systems Deployment Automation Platform Edge Systems Canary Analysis Insight Engineering Canary Analysis Mainline Deployment Automation Platform
  • 44. Composability: Example Deployments and Automated Canary Analysis at Netflix Edge Systems Deployment Automation Platform Insight Engineering Canary Analysis Mainline Deployment Automation Platform
  • 45. Composability: Example Deployments and Automated Canary Analysis at Netflix Edge Systems Deployment Automation Platform Insight Engineering Canary Analysis Mainline Deployment Automation Platform
  • 46. Composability: Example Deployments and Automated Canary Analysis at Netflix Insight Engineering Canary Analysis Mainline Deployment Automation Platform
  • 47. One More Reason “Think of the glory. Think of your reputation. Think how great it'll look on your next resume.” - Lois McMaster Bujold
  • 49. The Grand Example Netflix’s Monitoring Platform • Prior system owned by IT
  • 50. The Grand Example Netflix’s Monitoring Platform • Prior system owned by IT • No great OSS products
  • 51. The Grand Example Netflix’s Monitoring Platform • Prior system owned by IT • No great OSS products • Ridiculous scale
  • 52. The Grand Example Netflix’s Monitoring Platform • Prior system owned by IT • No great OSS products • Ridiculous scale • Seriously, how hard can it be?
  • 53. The Grand Example Netflix’s Monitoring Platform • Took longer than expected • Ongoing maintenance • UI only recent priority
  • 54. The Grand Example Netflix’s Monitoring Platform • Scales efficientlyish • Impedance match with dev lifestyle • Nicely pluggable* • Aggressivish OSS efforts * Ask me about Real-Time Analytics!
  • 55. The Grand Example Netflix’s Monitoring Platform • Still the right solution • Worried about Sunk Cost Fallacy • Most shouldn’t do this
  • 56. Can You Repeat That? Or: What’s Your Point? Or: I was Tweeting. Did I miss something? • What’s important to you? • Is this a technical decision? Really? • Honest and non-judgmental • Any mitigation? • Don’t build your own monitoring system. Seriously.
  • 57. Name This Group • United States • Europe • China • Russia • India • Japan • Blue Origin • SpaceX • Virgin Galactic
  • 58. 11:30am Frasier Room (3rd Floor) @royrapoport rsr@netflix.com
  • 59. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/netflix-monitoring-system
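
The Atlas composability slides (35 to 39) show a global query endpoint sitting in front of regional query endpoints, with stores such as memory, CloudWatch, OpenTSDB and InfluxDB behind them. The slides give only the box names, so the following is a minimal sketch of that layering idea and not Atlas's actual API. The interface, class names and region names are all hypothetical, and simple concatenation stands in for whatever aggregation a real system would do.

```python
from typing import Dict, List, Protocol


class QueryBackend(Protocol):
    """Hypothetical interface: anything that can answer a metric query."""

    def query(self, metric: str) -> List[float]: ...


class InMemoryBackend:
    """Toy stand-in for a store such as the 'Memory' box on the slides."""

    def __init__(self, data: Dict[str, List[float]]) -> None:
        self._data = data

    def query(self, metric: str) -> List[float]:
        # Unknown metrics yield an empty result rather than an error.
        return list(self._data.get(metric, []))


class FanOutEndpoint:
    """Query endpoint that merges results from the backends below it.

    It exposes the same query() method as its backends, so a regional
    endpoint can be plugged in under a global one unchanged.
    """

    def __init__(self, backends: List[QueryBackend]) -> None:
        self._backends = backends

    def query(self, metric: str) -> List[float]:
        results: List[float] = []
        for backend in self._backends:
            # Concatenation is a placeholder for real aggregation.
            results.extend(backend.query(metric))
        return results


# Hypothetical regions, each with one in-memory store.
us_east = FanOutEndpoint([InMemoryBackend({"requests": [1.0, 2.0]})])
eu_west = FanOutEndpoint([InMemoryBackend({"requests": [3.0]})])

# The global endpoint treats regional endpoints as ordinary backends.
global_endpoint = FanOutEndpoint([us_east, eu_west])
```

Because every layer shares one interface, adding another store or another region in this sketch means appending one more object to a list, which is the property the "Map scope to options’ scopes" bullet on slide 34 points at.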