Why Does (My) Monitoring Suck?
Todd Palino
Senior Staff Engineer, Site Reliability
LinkedIn
This Is The Only Slide You May Need a Picture Of
https://slideshare.net/ToddPalino
What’s On Our List Today?
Alerting Anti-Patterns
Setting Goals
What Is Monitoring?
Designing For Success
Wrapping Up
Alerting Anti-Patterns
Network Operations Center
• Central monitoring and alerting
• Gatekeeping monitored alerts with no deep knowledge
• Information overload for a moderate-sized system
• Glorified telephone operators
Kafka Under-Replicated Partitions
• Unclear meaning
• Sometimes it’s not a problem at all
• Does the customer care as long as requests are getting served?
• Frequently gets ignored in the middle of the night
CPU Load
• Relative measure of how busy the processors are
• Who cares? Processors are supposed to be busy
• What’s causing it?
• Might be capacity. Maybe
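The speaker notes make the point that a raw load number means little without knowing the core count (a load of 20 is very different on 16 cores than on 24). As a small illustration that is not from the talk, here is a minimal Python sketch that normalizes the load average by the number of CPUs; it only shows why the raw number is a poor alert signal, not that a normalized one should page you.

```python
# A minimal sketch: raw load average means little without the core count.
# Uses only the standard library; os.getloadavg() works on Linux and macOS.
import os

def normalized_load():
    one_min, five_min, fifteen_min = os.getloadavg()
    cores = os.cpu_count() or 1
    # A per-core load near or above 1.0 suggests saturation; the same raw
    # number of 20 looks alarming on 16 cores and far less so on 24+.
    return {
        "1m": one_min / cores,
        "5m": five_min / cores,
        "15m": fifteen_min / cores,
    }

if __name__ == "__main__":
    print(normalized_load())
```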
Setting Goals
Service Level Whatever
• SLI – Service Level Indicator
• SLO – Service Level Objective
• SLT – Service Level Target
• SLA – Service Level Agreement
Let’s Be Smart About This
• Specific
• Measurable
• Agreed
• Realistic
• Time-limited, Testable
Common SLOs
• Availability – Is the service able to handle requests?
• Latency – Are requests being handled promptly?
• Correctness – Are the responses being returned correct?
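To make these three goals concrete, here is a minimal sketch (not from the talk) of how they might be written down as data with a compliance check; the targets, window, and measured values are placeholders.

```python
# A minimal sketch of the three common SLOs expressed as data plus a
# compliance check. The metric values below are placeholders; in practice
# they would come from objective (customer-side) measurements.
from dataclasses import dataclass

@dataclass
class SLO:
    name: str
    description: str
    target: float           # threshold for the SLI
    higher_is_better: bool   # availability/correctness vs. latency
    window: str              # time window the SLI is measured over

    def met(self, sli_value: float) -> bool:
        if self.higher_is_better:
            return sli_value >= self.target
        return sli_value <= self.target

slos = [
    SLO("availability", "fraction of probes answered", 0.999, True, "30d"),
    SLO("latency_p99", "99th percentile response time (ms)", 1000.0, False, "30d"),
    SLO("correctness", "fraction of responses with the right answer", 0.999, True, "30d"),
]

# Hypothetical measured SLI values for the same window:
measured = {"availability": 0.9995, "latency_p99": 820.0, "correctness": 0.9999}

for slo in slos:
    print(slo.name, "OK" if slo.met(measured[slo.name]) else "VIOLATED")
```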
What Is Monitoring?
Monitor: Observe and check the progress or quality of (something) over a period of time; keep under systematic review.
So WTF is Observability?
• Comes from control theory
• A measure of how well internal states of a system can be inferred from knowledge of its external outputs
• It’s a noun – you have this (to some extent). You can’t “do” it.
What Are We Looking For?
What Can We Work With?
Metrics – single numbers
• Counters
• Gauges
• Histograms (and Summaries)
Events – structured data
• Log messages
• Tracing (collection of events)
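As an illustration of the three metric shapes, here is a short sketch using the prometheus_client Python library; the talk does not prescribe a library, and the metric names, labels, and port here are purely illustrative.

```python
# A sketch of the three metric shapes (counter, gauge, histogram) using the
# prometheus_client library; names and the endpoint are illustrative.
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled", ["endpoint"])
IN_FLIGHT = Gauge("app_requests_in_flight", "Requests currently being handled")
LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

def handle_request():
    REQUESTS.labels(endpoint="/search").inc()   # counter: only ever increases
    with IN_FLIGHT.track_inprogress():          # gauge: goes up and down
        with LATENCY.time():                    # histogram: bucketed latencies
            time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for scraping
    while True:
        handle_request()
```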
Where Can We Get It?
Subjective
• Rich data on internal state
• Necessary for high observability
• Tons of data possible, but the utility is often questionable
• Beware! Here be dragons!
Objective
• Customer view of your system
• Think of “Down For Everyone Or Just Me?”
• Critical for SLO monitoring
• More difficult to do, but it’s the authority on whether or not something is broken
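A minimal sketch of what objective (black-box) monitoring can look like: probe the service from the outside, the way a customer would. The URL, timeout, and result fields are placeholders rather than anything from the talk.

```python
# A sketch of an objective probe: measure availability and latency from
# outside the service, as a customer would. URL and thresholds are placeholders.
import time
import urllib.request

def probe(url: str, timeout: float = 2.0) -> dict:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 300
    except OSError:  # covers URLError, connection failures, and timeouts
        ok = False
    latency = time.monotonic() - start
    return {"available": ok, "latency_s": latency}

if __name__ == "__main__":
    # Feed results like this into SLI time series rather than alerting directly.
    print(probe("https://example.com/health"))
```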
Designing For Success
Build For Failure
• Intelligence – rich instrumentation on every aspect
• Availability – tolerate single component failures (not just N+1)
• Capacity – limit resource creation and utilization
Using the SLO
It’s the only thing that matters
• Always measure the SLIs
• Objective monitoring is best
• Don’t beat the SLO
• Only alert on the SLO
ONLY???
• SLO alerts find unknown-unknowns
• Known-unknowns and unknown-knowns must only exist transiently
• A known-known should not require a human. Automate responses to known issues
• For all else, if you have a 100% signal it can be an alert. But if it doesn’t impact the SLO, does it need to wake you up?
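One common way to “only alert on the SLO” is to page on error-budget consumption rather than on individual component metrics. The sketch below is an illustration under that assumption; the target, threshold, and counter values are placeholders.

```python
# A minimal sketch of alerting on the SLO itself: page only when too much of
# the error budget for the window has been consumed. The counter values are
# placeholders for whatever your metrics store returns.
def error_budget_alert(good: int, total: int, slo_target: float = 0.999,
                       burn_threshold: float = 0.5) -> bool:
    """Return True if more than `burn_threshold` of the error budget is gone."""
    if total == 0:
        return False
    allowed_bad = (1.0 - slo_target) * total   # budget for this window
    actual_bad = total - good
    return actual_bad > burn_threshold * allowed_bad

# Example: 1,000,000 requests in the window at a 99.9% target allows 1,000
# bad responses; 600 bad responses crosses the 50% burn threshold.
print(error_budget_alert(good=999_400, total=1_000_000))  # True: 600 > 500
```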
What About Capacity?
• Use Quotas – assure no single user can quickly overrun capacity
• Report & Review – frequently enough to respond to trend changes
• Act Promptly – never ignore or put off expansion work
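The deck points at quotas (for Kafka, byte-rate and retention limits) as the way to keep any single user from overrunning capacity. As a generic illustration only, here is a token-bucket sketch of a per-client quota; the rates and the in-memory bookkeeping are illustrative, and real systems such as Kafka enforce this server-side for you.

```python
# A generic token-bucket sketch of a per-client quota, limiting how quickly
# any single caller can consume capacity. Rates and the in-memory store are
# illustrative, not a real quota implementation.
import time

class TokenBucket:
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst          # tokens/sec and bucket size
        self.tokens, self.last = burst, time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

buckets = {}  # one bucket per client id

def admit(client_id: str, cost: float = 1.0) -> bool:
    bucket = buckets.setdefault(client_id, TokenBucket(rate=100.0, burst=200.0))
    return bucket.allow(cost)
```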
Wrapping Up
What Should I Do Next?
Define Your SLOs
• Talk to your customers and agree on what they can expect
• Add objective monitoring for these expectations
Clean Up Alerts
• Inventory alerts and eliminate any that are not a clear signal
• Add alerts for the SLOs that you have agreed on
• Implement quotas, if needed, to assure capacity isn’t suddenly overrun
Add Instrumentation
• Switch to structured logging
• For distributed systems, consider adding request tracing
• But make sure you don’t hold this extra information for longer than it’s needed for debugging
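For the “switch to structured logging” item, here is a minimal standard-library sketch in which every log line is a single JSON object aimed at machines first; the field names are illustrative, and libraries such as structlog can do the same job with less code.

```python
# A minimal structured-logging sketch using only the standard library: each
# log line is one JSON object. Field names are illustrative.
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": round(time.time(), 3),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge any structured fields attached via the `extra` mechanism.
        payload.update(getattr(record, "extra_fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("frontend")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("request handled",
         extra={"extra_fields": {"endpoint": "/search", "status": 200, "latency_ms": 42}})
```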
More Resources
• Finding Me
• linkedin.com/in/toddpalino
• slideshare.net/ToddPalino
• @bonkoif
• Code Yellow – How we help overburdened teams
• devops.com/code-yellow-when-operations-isnt-perfect
• Usenix LISA (10/29 – Nashville) “Code Yellow: Helping Top-Heavy Teams the Smart Way”
• SRE – What does the culture look like at LinkedIn
• “Building SRE” usenix.org/conference/srecon18asia/presentation/palino
• Every Day is Monday in Operations - everydayismondayinoperations.com
• Kafka – Deep dive on monitoring for Apache Kafka
• confluent.io/kafka-summit-london18/urp-excuse-you-the-three-metrics-you-have-to-know
Questions?
Editor's Notes
  1. Before we get started, I want to give everyone a chance to snap this. After this talk is over, by the end of the day, I will post the slide deck up on my SlideShare along with the powerpoint original (with slide notes). So anything you want to be available to reference will be there for you.
  2. Now that we’ve got that out of the way, what are we going to talk about? I’m going to start with examining a few anti-patterns in alerting that I’ve seen at LinkedIn, and how they’ve caused us pain. We’ll then talk a little about establishing what our application goals should look like. I’ll talk about the current state of the monitoring world we have to work with (along with a buzzword or two), and exactly what we are trying to get out of good monitoring. Once we have that established, we can talk about how we should design our stacks so that we get the information we need without killing ourselves. Lastly, we’ll quickly review what we should all be doing right now to make our lives easier.
  3. When it comes to alerting, we make a lot of mistakes. Still. We set up far too many alerts, on information that we only partially understand, and we don’t even know how to act on it. This leads to us ignoring the things that we set up to ping us, and that muddies the signal even more. We often can’t see the size of the problem until we’re working through multiple sleepless nights, and at that point all we have the energy for is digging ourselves out of the current crisis. So what are some of the worst ones that I have seen at LinkedIn?
  4. Probably the best example of signals going to the wrong place is the Network Operations Center. Our Site Operations Center is comprised of some really sharp engineers who are tasked with both coordinating large incidents, but also helping to monitor the growth and overall site health. Like every other engineer, they build tools and processes around this. But back when we still called them the NOC, the role was vastly different. At the time, we didn’t have a good way to get alerts to individual engineers. So the NOC would have alerting dashboards they would monitor with hundreds upon hundreds of alerts. These would have to be onboarded to the NOC, making them the gatekeeper for what was and wasn’t important enough to have someone monitoring 24x7. To make it worse, most of the runbooks were “Call the SRE”, so they couldn’t even help to resolve most problems. This left us with a bunch of engineers who had to restrict what information was sent to them, without deep knowledge of the systems being monitored. And while they also had responsibilities around monitoring the growth metrics, they were often working as little more than a switchboard operator – waking engineers up at night, taking orders for who to escalate to. Thankfully, this is much better now – while the SOC still helps with escalation, it’s a minor part of their role. Alerts are now managed by the individual SRE teams, and the notifications go to those teams, not a central team that has to dispatch them.
  5. Another problem, one that I am still constantly fighting, is that of unclear monitoring signals. In Apache Kafka, the “under-replicated partitions” metric is often used as the gold standard of what to monitor and alert on. I know, because I helped to write the definitive guide for Kafka and I told everyone that it was the gold standard of what to monitor. Now I’m stuck with the unenviable task of having to tell everyone how wrong I was and why. The problem is that this metric doesn’t provide a clear signal as to what is wrong. This graph usually means that one of the Kafka brokers is offline, but not always – it could also mean that it’s up but replication is broken in a variety of ways. The next graph is usually a problem with a broker talking to a single other broker for replication. But it could also be a single hot topic or partition that’s unbalanced. This last graph, nobody has any idea what it means. And sometimes it’s not a problem at all – when you need to move partitions around to balance a cluster, they are naturally under-replicated while data is being moved. So what happens? The Kafka team has an alert that goes off if this metric is more than 10. Why 10? Because normally we don’t move more than 10 partitions at a time, so that will mask a known false alert. But the alert often gets ignored when it goes off. There’s a separate signal for a broker that is down, and it’s often a sign of a capacity problem that can’t be immediately resolved (more on that later). In addition, the customers don’t really care that the cluster is under-replicated (although sometimes they do). So it’s not a crisis. Which raises the question: why is it an alert in the first place?
  6. It’s much like this other lovely metric, CPU load. I would guess that we have all seen this, or some variant like CPU utilization, as an “important metric” in our alerts. Despite the fact that most engineers I have had the pleasure of working with over the years don’t understand what it is a measure of without asking Google. And even when you understand the measurement (hint – it’s a relative measure of how busy the processors on a system are, averaged over some period of time), you then need to understand what is a good number and what is a bad number. For example, is a CPU load of 20 bad? If you have 16 CPUs in your system the answer is very different than if you have 24. Even then, it’s hard to say if it’s bad. There’s a question I heard someone ask at one point. If you run an email server, and the CPU utilization is at 99%, what do you do? The answer is that if the emails are getting delivered properly, you go get a drink. The CPU is supposed to be busy – it’s there to do work for you. If the work is happening properly, what do you care about how busy it is at any point in time. The one caveat is that you might have a capacity concern building, but this is a lousy way to measure it. And even if you do, what are you going to do about it in the middle of the night? Complicating this, it’s a measure of the overall system. Any given system has hundreds of processes running. Which one is causing the CPU to be high? Maybe it’s an opportunistic process that will back off if the CPU is demanded by something else. Yet another reason why this shouldn’t wake you up.
  7. If we shouldn’t be alerting on things that aren’t clear, or don’t matter to our users. Or if we want to make sure that the signals we provide don’t swamp us (or a NOC) with irrelevant information, what should we do? The answer is that we have to start with defining what our goals are for the application. The only goals that matter are what our customers expect – they don’t care if the CPU is a little hot. They also don’t care, unless we have made promises around internal replication, whether or not a given thing is under-replicated. They care about the agreements that we have made with them, so we have to define them before we know what we should measure, track, and alert on.
  8. Most everyone here will be familiar with the term “service level agreement”. However, I would also guess that most of us (or our management) use it somewhat incorrectly, especially when discussing parts of an internal distributed architecture (and not external customers). There are actually several terms that you should know, and encourage the proper use of, so that we have clear communication on what our goals are. All of these start with “service level”, which indicates that we’re talking about items that define how we deliver the services that we’re responsible for – what the level of performance is. The first is “service level indicator”, abbreviated SLI. The SLI is the measurement that will form the basis for any goal. For example, if we want to track a service’s error rate as a goal, then the metric that tells us what the error rate is will be an SLI. If we want to track latency, we may have several SLIs – we might want to use both the 50th and 99th percentiles, for example (more on that later). Next we have “service level objective”, or SLO. This is the term that more people know; however, ITIL has decided to deprecate the use of this term in their documentation and use “service level target” (SLT) instead. They have the same meaning, however. The SLO defines the specific measurements of your SLI that are considered good and bad. So if your goal is to have the error rate be less than 1%, and you’re going to track that over the course of a day, your SLO could be “service error rate < 1% averaged over 24 hours”, a combination of an SLI and a specific target for it, including the timeframe that it is measured over. On top of all this is the “service level agreement”, or SLA. The SLA comprises the entire agreement with the customer, which includes not only the SLO, but also consequences for what happens if the SLO is missed, and how customer support is provided. It is a contract, and if you’re running a system with internal customers you probably don’t have one of these (even if you have SLOs already). With terminology defined, how do we define what the SLOs should be for a given service?
  9. We’re going to talk about another thing that we’ve probably all heard about, SMART goals. There are a number of variants on what the letters S, M, A, R, and T are supposed to stand for here, so we’re going to use a set that will apply well to talking about SLOs. The end result should be an understanding of how to pick SLOs that won’t bite us later. First, they need to be specific. It’s not sufficient to say we’re going to measure latency. We need to specify exactly what metric we are going to use and how it is measured. For something like latency, it’s important to know if we’re using a histogram so we can specify the right components (like 99th percentile). We also need exact target measurements and the time periods that they are measured over. Next, and this may seem like it should go without saying, they need to be measurable. We can’t have a goal of making the customer happy – we don’t have a way to measure that. Even more reasonable things, like error rates and latencies, don’t work as goals if we don’t have a way to measure them properly. It’s also worth noting here that where we measure these things from matters. What our server says the latency and error rate is will probably differ from what our clients think. So it’s important to say where the measurement is made from. The SLOs also have to be agreed on between us and our customers. This is pretty simple – I can’t specify alone what the SLOs for my service are, because I need to know what’s required and acceptable to my customers. Neither can they say what the SLOs are without my involvement. This is because the goals also need to be realistic. If you have a service that performs a complex calculation which takes a minimum of 10ms of CPU time (in terms of the wall clock), then it’s not reasonable to have a latency SLO at 5ms. If you can’t agree on the goals, specifically if they’re not realistic, it’s a sign you need to have a deep discussion with your customers about what they’re trying to accomplish. Lastly, we’re going to have two “T”s – time-limited and testable. Testable is very similar to measurable, but we can also use it to say that we should be able to test for compliance with the SLOs before we release new code. That might be through pre-release performance testing, canaries, or some other mechanism for minimizing the impact of bad code. Time-limited means that we need to define the time period over which we are measuring the SLO. If we’re talking about availability of the service, then four nines measured over a day is very different than when it is measured over a week. The first one means that we can only have less than 9 seconds of downtime every day, whereas the latter means we can have a single event where we are down for a minute in a week.
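The availability arithmetic at the end of that note is easy to check. A tiny sketch, added here purely for illustration:

```python
# A tiny sketch of the arithmetic in the note above: the downtime a given
# availability target allows depends on the window it is measured over.
def downtime_budget_seconds(availability: float, window_seconds: int) -> float:
    return (1.0 - availability) * window_seconds

DAY, WEEK = 86_400, 7 * 86_400
print(downtime_budget_seconds(0.9999, DAY))   # four nines per day  -> ~8.6 s
print(downtime_budget_seconds(0.9999, WEEK))  # four nines per week -> ~60.5 s
```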
  10. So what are the SLOs you should be looking at? It’s hard for me to tell you what your customers want you to agree to for any given service, but they’re usually going to be one of these three things: availability, latency, or correctness. Availability is simply whether or not your service is up and running and able to handle requests. It doesn’t (usually) mean that every component is up, just that a customer can send a request and get a response. Usually this is expressed in terms of how many nines you have. Latency is how long it takes to service a request. This may be overall for your service, or if you have endpoints that behave differently, it might be per endpoint. You also may have SLOs that cover different percentiles for the same value. For example, you might have a 100ms response time at the 50th percentile, and a 1 second response time at the 99th percentile. Correctness is the term I use instead of error rate. This is because an error is just one type of bad response. Giving a wrong answer is usually worse than returning an error code. All together, these are the things customers usually care about: can I make requests, do I get responses in a reasonable time period, and are the responses correct?
  11. Now we can talk about what we have to work with to make sure that we hit those goals. The entire discipline is called monitoring, but what is it really?
  12. The Oxford definition of “to monitor” is: Observe and check the progress or quality of (something) over a period of time; keep under systematic review. This matches up with what we know – monitoring is about watching and measuring the status of our services.
  13. If that’s the case, then what the heck is this term that keeps coming up, observability? Well, what I’m gonna say may sound indelicate… There are some who use this term to indicate a magical new discipline that will solve all your problems, an alternative to monitoring, by giving you a view into what your applications, or your distributed systems, are doing. The reality? What they’re talking about is a specific kind of monitoring that we’ll get to in a minute. Observability is a term from control theory. The definition there is “a measure of how well internal states of a system can be inferred from knowledge of its external outputs.” Sounds an awful lot like monitoring, doesn’t it? There’s a good reason for that – they’re not at odds with each other. Observability is a noun – it’s something you have, or don’t. It’s not something you can “do”. Monitoring is a verb – it’s something you do. In many cases, observability in this world is more of a measure of how good your monitoring is. Image Copyright © 2002 Viacom International Inc
  14. OK, enough about observability. What are we looking for with our monitoring? Where are we trying to get to? We’re going to call this the Rumsfeld Quadrant. On the side, we have detection of problems – there are known issues that we are able to detect, and unknown issues that we are not able to detect. Along the top, we have our response to those problems – there are issues where we know how to respond to them, how to fix them, and we have issues that we don’t know how to fix. For issues that we can detect and that we know how to respond to, these are the known-knowns. This is the realm of good monitoring – crisp signals, and runbooks to address them. These are things that should be automated as well – we know how to fix them, so why spend engineer hours on doing this? If we have problems that we can detect, but we don’t know how to respond to, these are known-unknowns: our active incidents that we’re working on. For problems we can’t detect, but we know how to fix, these are unknown-knowns. These are monitoring gaps that need to be resolved. The last category is the unknown-unknowns – problems that we don’t know about, and we don’t know how to fix. These are tweets about your service being broken – you’re hearing about it from customers, and not from your own system. And you’re being spun up to quickly figure out what the heck broke. By necessity, these need to rapidly migrate to either known-unknowns or unknown-knowns, depending on whether you can fix it or detect it faster. Eventually, all of these issues need to migrate to known-knowns. If they don’t, you’re stuck in reactive work all the time.
  15. So we know what we’re looking for. How can we get there? What data do we have to work with? Monitoring generally breaks down into two types of things: metrics, and events. Metrics are single numbers, whereas events are structured data. By the way, when you hear people talk about “observability”, they’re talking about events. Metrics are what we make our graphs out of. Counters are pretty simple – constantly increasing integers. Total number of requests, total number of errors. Gauges are numbers (either integers or floats) that fluctuate up and down, like a speedometer. This could be requests per second, or network utilization, or CPU utilization. A third type of data is histograms. This is bucketed data, typically represented as percentiles. The 50th percentile is the median, the 99th percentile means that 99% of values are less than the metric value. We often use histograms for metrics like latencies, so that we can not only see the average but also what the worst offenders are. For events, the one we probably all have already is log messages. Of course, they’re probably not as structured as we’d like them to be – most of us are using plain strings, even if they’re in a well-defined format like Apache HTTPD logs. Everyone, if they’re not already, should be moving towards true structured logging in a format like JSON (there are other options, but JSON is the most common). The other type of event data, specifically relevant for distributed systems, is request tracing. This is actually a collection of events, where each event is a discrete call made as part of the initial request. For example, a user makes a request at service A. Service A then calls services B and C to get results, and assembles a response to the user. You would have 3 events in a trace, at minimum – one each for the initial user call to A, the call from A to B, and the call from A to C. These events will have rich data about the requests: the caller, the endpoint called, the status of the call, the time the call took, and possibly much more information. This list isn’t exhaustive, by any means. But these are the most common types of monitoring data we have to work with. Metrics are usually where your alerts come from – they let you know that there is a problem. Events are usually where you get your detail for debugging from – they help you figure out why you have a problem. This is not an “either/or” – use both.
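The trace example in that note (a user calls service A, which calls B and C) can be pictured as three structured events sharing a trace ID. The sketch below is an illustration only; the field names are not any particular tracing system's schema.

```python
# A minimal sketch of the trace described above (user -> A, A -> B, A -> C)
# as structured events. IDs and field names are illustrative.
import json
import uuid

trace_id = uuid.uuid4().hex
root_span = uuid.uuid4().hex

events = [
    {"trace_id": trace_id, "span_id": root_span, "parent_id": None,
     "caller": "user", "service": "A", "endpoint": "/profile",
     "status": 200, "duration_ms": 74},
    {"trace_id": trace_id, "span_id": uuid.uuid4().hex, "parent_id": root_span,
     "caller": "A", "service": "B", "endpoint": "/identity",
     "status": 200, "duration_ms": 21},
    {"trace_id": trace_id, "span_id": uuid.uuid4().hex, "parent_id": root_span,
     "caller": "A", "service": "C", "endpoint": "/connections",
     "status": 200, "duration_ms": 35},
]

for event in events:
    print(json.dumps(event))  # emit each span as a structured event
```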
  16. We also have a choice on where we can get a lot of our data from. Specifically, I will talk about subjective measurements, and objective measurements. Subjective measurements are things that we measure about ourselves. “I am a very handsome presenter” is a subjective statement. Objective measurements are things that other people measure about us. “This guy has no idea what he’s talking about” is an objective statement. Subjective monitoring has the ability to give us very rich data on the internal state of a system, because we are instrumenting the system directly. These types of data are absolutely necessary for high observability. However, we need to be careful here because while there is lots of data that we can make available, not all of it is going to be useful. Do you really care about the byte size of the representation of a specific array in your code? You might, but it’s probably not going to be a very useful piece of information. Additionally, it might not be the best way to measure something. If you measure the error rate for requests at the server, it won’t include all the requests that don’t make it to the server at all. On the other side, we have objective monitoring. This provides a view of your system that is more like what your customers see. You can think of a service like “Down for everyone or just me?” – it makes a measurement against your system from the outside, and takes into account something other than what the system thinks about itself. This type of data is critically important for monitoring our SLOs, since when it comes to an agreement with our customers, they don’t really care if the service is down because the service itself is broken or if the network getting to the service is broken. It’s definitely harder to do, but it’s also much more of an authority on whether or not your service is working. Again, most of the time you need both of these. You need objective monitoring to know whether or not there is a problem, and you need subjective monitoring to really dig into the detail of why.
  17. With a shared understanding of what we can do, we can now move into how we handle our services to make sure that we have enough data to understand them, but avoid killing ourselves with noisy alerts and information overload. In large part, this means we need to design our services from the ground up to be successful.
  18. This means building them knowing that they are going to fail. Anyone who says they have a service that will never fail is a liar – everything fails, even if it means the failure is outside of your direct control. If we know we’re going to fail, what do we need to do in advance to make it as easy as possible for us to manage that? First, the code needs to provide us with appropriate intelligence about what is going on – we need high observability. We want detailed instrumentation on the operations of our entire system, including both metrics and events, so that we can debug problems easily. We also need to make sure that any SLIs that we have identified are included in there as well. Next, we need to build availability into the architecture. This means that we will strive to tolerate the failure of any single component without affecting the overall availability of the system. And this doesn’t mean N+1, which is what most people think of when it comes to availability. With the Kafka infrastructure at LinkedIn, we have previously used a replication factor of 2 – this means that we can lose a single broker, and the cluster will continue to operate. However, we can only lose one broker, and this means that whenever we had a hardware failure, we needed to spin up the on call engineer immediately to get that system back up and running, because a second failure would mean that we were down. This isn’t really being able to tolerate a single failure - it just meant that we had a very small window in which it wasn’t impacting customers. The third part is that we need to manage the capacity of our services so that we don’t get surprised by a sudden surge. For storage systems like Kafka, this means limiting the creation of resources, and how much storage and processing those resources can use by default. For non-storage things, this might mean quotas on request rates from upstream callers, or dynamic allocation of servers to accommodate surges in traffic. Either way, capacity problems cannot surprise us because we don’t have the ability to magically make new resources appear. And if you do, that should be automatic.
  19. When we’re setting up the alerting on all that rich instrumentation, the SLO is the thing to look at. We’ve hit some of these points already, but it’s worth a recap: We always have to measure the SLIs that have been defined. If we don’t, you may as well not bother having SLOs at all, because you don’t know if you’re hitting them or not. What’s worse is if you rely on your customers to monitor the SLOs. As a service owner, I never want to hear about any problem from my customers – I should always be the first to know about an issue, and I should be informing them. Not the other way around. When it comes to the SLIs, the best monitoring is objective monitoring. Since the SLO is an agreement with the customer, it is almost always measured from the customer’s point of view, not ours. Our monitoring should match that. We can rely on subjective monitoring to provide detailed data when we’re debugging a problem, but generally not for the SLOs themselves. Once we’ve got our SLOs defined, we should not try to beat them. At least not by very much, since we probably always want a little bit of buffer. But if you agree in your SLOs on 90% uptime, but you’re actually delivering 99% uptime all the time, guess what? Your customers probably have become accustomed to that, and they’re going to start making noise if the uptime is 98% even though that’s well within the SLO on paper. This is a good time to have maintenance windows where the service is offline even if it doesn’t need to be. Or artificial latency that can be dynamically adjusted if you have a downstream problem. It may seem counterintuitive to not be the best we can be here, but you’re going to sleep a lot better if you deliver what you promised. Lastly, only set up alerts on the SLO.
  20. Wait, you say. I must have misheard you. I thought you said that I shouldn’t alert on anything except the SLOs, and not these dozens of other metrics that indicate problems! Yes, that’s exactly what I said. And here is why. Think back to the Rumsfeld Quadrant. Alerts on the SLOs will find the unknown-unknowns – the problems that we can’t currently detect and our customers are going to tell us about. The known-unknowns, which indicate an active incident that we’re debugging how to fix, and the unknown-knowns, where we have a monitoring gap, by definition must only exist in our systems transiently. They have to transition to known-knowns if we’re going to avoid spending our entire day in reactive work. The known-knowns have a known detection and a known response – that should not require a human being to handle them. Those responses should be automated. But those are the only things we have. Monitoring signals used for alerting either tell us about a problem clearly, or they don’t – there is no useful grey area. And if they don’t tell us about a problem clearly, we’re setting ourselves up for failure if we use them. So if you have another signal that is 100% clear, that can potentially be used as an alert. But if you have a problem that doesn’t impact the SLO, why does that need to wake you up? That problem is either a known-known, in which case it can be automated away, or it’s a known-unknown, which someone had better be working on turning into a known-known. Now, I will admit that what I have described is the ideal world. But it is of the utmost importance that we have the ideal state in mind, and make sure that everything we do moves us towards that ideal state, not away from it.
  21. The one gotcha in all this is the capacity of our service. That’s probably not an SLO, but it does need to be monitored and managed correctly. Now, I’m specifically talking about systems that don’t have dynamic capacity changes available to them. If you’re running in a public cloud, and your service is well designed for scalability, you can probably react to capacity changes by spinning up new instances. Automate that. For the rest of us, and those of us who are running storage systems that require special handling, there are a few things we can do. First, as I mentioned earlier, use quotas to make sure that by default, your users are not able to overrun your available capacity. In Kafka, for example, this means using message retention by bytes on disk, as well as restricting the inbound and outbound bytes per second rates. Your customers should have resource limits in place so that if they want to significantly change their call patterns, they need to interact with you somehow to make sure you know about it in advance, and can plan for it. Once you have limits in place, you need to report on the capacity of your system and make sure you’re reviewing those reports frequently. Maybe you can automate this entire process, and have the reporting system put in hardware orders automatically. Maybe not. But either way, this is a process that is out-of-band from your normal alerting because hardware doesn’t magically appear when you snap your fingers. And don’t ever ignore those reports, or put off the expansion work that’s required. That’s really only asking to have problems that you could have taken care of proactively.
  22. If we can manage to do these things - design our systems well, monitor for the SLOs, and manage capacity proactively – we set ourselves up for success. The SLOs will let us know about the unknown-unknowns, and the detailed metrics and events will provide what we need to fix those new problems. As someone who wants to move towards the ideal state of automating their troubles away and cleaning up the noise of bad alerts, where do you start?
  23. The first thing you want to do is define your SLOs. Have a conversation with your customers, whether internal or external, and come to an agreement on what your service can and will provide for them. Then add some objective monitoring so that you can stick to those agreements. Next, you need to work on cleaning up your alerts. Take a look at all of them, and eliminate anything that doesn’t have a clear signal to start with. At least put them in quarantine and see if you can stop waking yourself up with them. Add new alerts for the SLOs that you now have. And make sure you have quotas in place to avoid surprises. Lastly, beef up the instrumentation for your services, and manage the monitoring data appropriately. If you’re not there already, for the love of all that is good please switch to structured logging. The larger your systems get, the more you realize that all of your data (like logs) should be targeted at automated processing, not humans being able to read it. Also think about adding request tracing if it seems appropriate. This is really important for distributed systems, but it’s also useful for tracing a request within a single service. It can be as simple as logging detailed request data, or more complex like emitting messages to a system like Kafka for stream processing. Ultimately, the more instrumentation you add, the more you need to make sure you’re only holding onto what you need. Make sure that monitoring data, including tracing, is only retained for as long as it’s important for debugging. This will help keep you in the good graces of the teams that are responsible for your monitoring storage, as well.
  24. I’ve also got a few resources for you. Due to monitoring overload and capacity problems, among other things, our Kafka team spent the first half of the year in a state we call Code Yellow. This is a way for us to put the brakes on and fix critical problems with a service or a team. If you’d like to learn more about that, I’ve written a blog post on what Code Yellow is. Michael Kehoe and I will be at LISA in a couple of weeks to talk about that and other Code Yellows at LinkedIn. If you’re looking for more information about SRE at LinkedIn, I’ve spoken on that several times, including at SREcon Asia. There’s also an excellent series written by one of my colleagues, Ben Purgason, along with our former head of engineering, David Henke, called “Every Day is Monday in Operations”. And if you’re interested in learning more about Apache Kafka, there are many resources available and previous talks I have done. When it comes to monitoring, there’s a talk that I’ve given that echoes a lot of what you’ve heard here, applied to Kafka. You can view the talk from Kafka Summit London earlier this year. I’ll also be giving this talk next week at Kafka Summit SF, if you happen to be going. And as always, if you have any questions you can feel free to connect with me and ask. I’m on LinkedIn, of course, and you can find me on Twitter as well.