Why Does (My) Monitoring Suck?


Monitoring services is easy, right? Set up a notification that goes out when a certain number increases past a certain threshold to let you know that there’s a problem. But if that’s the case, why are so many teams drowning in alerts and dreading their time on call? The reason is that we tend to monitor the wrong things: reactive alerts, metrics whose impact on our service we don’t completely understand, and capacity alerts. We look at our own view of the service and fail to consider that our customers have a different view. Come learn to let go of what does not help, and explore how to monitor for what truly matters: what the customer sees. This starts with defining our agreements with our customers, continues through building applications intelligently and instrumenting all the things, and finishes with picking the right signals out of that instrumentation to generate alerts that are actionable, not ones that introduce confusion and noise. We will also touch on capacity planning, and how it should never wake you up. You’ll find it’s possible to assure that you meet your service level objectives while still maximizing your sleep level objectives.


Why Does (My) Monitoring Suck?
Todd Palino
Senior Staff Engineer, Site Reliability
LinkedIn

This Is The Only Slide You May Need a Picture Of
https://slideshare.net/ToddPalino

What’s On Our List Today?
Alerting Anti-Patterns
Setting Goals
What Is Monitoring?
Designing For Success
Wrapping Up

Network Operations Center
• Central monitoring and alerting
• Gatekeeping monitored alerts with no deep knowledge
• Information overload for a moderate-sized system
• Glorified telephone operators

Kafka Under-Replicated Partitions
• Unclear meaning
• Sometimes it’s not a problem at all
• Does the customer care as long as requests are getting served?
• Frequently gets ignored in the middle of the night

CPU Load
• Relative measure of how busy the processors are
• Who cares? Processors are supposed to be busy
• What’s causing it?
• Might be capacity. Maybe

Service Level Whatever
• SLI: Service Level Indicator
• SLO: Service Level Objective
• SLT: Service Level Target
• SLA: Service Level Agreement

Let’s Be Smart About This
• Specific
• Measurable
• Agreed
• Realistic
• Time-limited, Testable

Common SLOs
• Availability: Is the service able to handle requests?
• Latency: Are requests being handled promptly?
• Correctness: Are the responses being returned correct?
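To make the availability and latency questions concrete, here is a minimal Python sketch of how the two SLIs might be computed from a window of requests and compared with an objective. The `Request` record, the function names, the 300 ms threshold and the 99.9% objective are all assumptions made for illustration; none of them come from the talk.

```python
from dataclasses import dataclass


@dataclass
class Request:
    ok: bool           # was the request handled successfully?
    latency_ms: float  # time taken to respond, in milliseconds


def availability_sli(requests):
    """Fraction of requests that were handled successfully."""
    if not requests:
        return 1.0  # assumption: no traffic counts as meeting the objective
    return sum(r.ok for r in requests) / len(requests)


def latency_sli(requests, threshold_ms):
    """Fraction of requests answered within threshold_ms."""
    if not requests:
        return 1.0
    return sum(r.latency_ms <= threshold_ms for r in requests) / len(requests)


def meets_slo(sli, objective):
    """The SLO is the value the measured SLI is expected to reach."""
    return sli >= objective


# Hypothetical sample window of four requests.
window = [
    Request(True, 120),
    Request(True, 80),
    Request(False, 900),
    Request(True, 310),
]
print(availability_sli(window))                     # 0.75
print(latency_sli(window, threshold_ms=300))        # 0.5
print(meets_slo(availability_sli(window), 0.999))   # False
```

Correctness is left out of the sketch because checking it depends entirely on what a correct response means for the service in question.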

Monitor
Observe and check the progress or quality of (something) over a period of time; keep under systematic review.

So WTF is Observability?
• Comes from control theory
• A measure of how well internal states of a system can be inferred from knowledge of its external outputs
• It’s a noun – you have this (to some extent). You can’t “do” it.

What Can We Work With?
Metrics: single numbers
• Counters
• Gauges
• Histograms (and Summaries)
Events: structured data
• Log messages
• Tracing (collection of events)
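The three metric types differ mainly in what operations they allow. The Python sketch below is a hand-rolled illustration of that difference, assuming made-up class names and bucket bounds; it is not the API of any particular metrics library.

```python
import bisect


class Counter:
    """A value that only ever goes up, e.g. total requests served."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters never decrease")
        self.value += amount


class Gauge:
    """A value that can go up or down, e.g. current queue depth."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations into buckets by upper bound, e.g. latencies."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        # One count per upper bound, plus a final overflow bucket.
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, value):
        # bisect_left finds the first bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1


requests_served = Counter()
requests_served.inc()
requests_served.inc()

queue_depth = Gauge()
queue_depth.set(7)

latency_ms = Histogram([100, 500])
for sample in (20, 100, 250, 900):
    latency_ms.observe(sample)

print(requests_served.value)  # 2
print(queue_depth.value)      # 7
print(latency_ms.counts)      # [2, 1, 1]
```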

Where Can We Get It?
Subjective
• Rich data on internal state
• Necessary for high observability
• Tons of data possible, but the utility is often questionable
• Beware! Here be dragons!
Objective
• Customer view of your system
• Think of “Down For Everyone Or Just Me?”
• Critical for SLO monitoring
• More difficult to do, but it’s the authority on whether or not something is broken
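An objective measurement has to be taken from outside the service, the way a customer would reach it. Below is a minimal Python sketch of such a probe using only the standard library. The function names, the 2-second timeout and the `service.example` URL are assumptions for illustration, and the stand-in fetchers exist only so the sketch runs without a network connection.

```python
import time
import urllib.request


def http_fetch(url, timeout_s):
    """Fetch url the way a customer would; return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return response.status


def probe(url, fetch=http_fetch, timeout_s=2.0):
    """Make one request from outside the service and record what happened."""
    start = time.monotonic()
    try:
        status = fetch(url, timeout_s)
        ok = 200 <= status < 300
    except Exception:
        # Timeouts, refused connections and HTTP errors all look the same
        # to the customer: the request failed.
        ok = False
    latency_ms = (time.monotonic() - start) * 1000.0
    return {"url": url, "ok": ok, "latency_ms": latency_ms}


# Stand-in fetchers (hypothetical) so the sketch runs offline.
def healthy(url, timeout_s):
    return 200


def broken(url, timeout_s):
    raise TimeoutError("no response")


print(probe("https://service.example/health", fetch=healthy)["ok"])  # True
print(probe("https://service.example/health", fetch=broken)["ok"])   # False
```

Each result carries both a success flag and a latency, so the same probe can feed an availability SLI and a latency SLI.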

Build For Failure
• Intelligence: Rich instrumentation on every aspect
• Availability: Tolerate single component failures (not just N+1)
• Capacity: Limit resource creation and utilization

Using the SLO
It’s the only thing that matters
• Always measure the SLIs
• Objective monitoring is best
• Don’t beat the SLO
• Only alert on the SLO
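One possible reading of "only alert on the SLO" is sketched below in Python: keep a rolling window of objective probe results and page only when the measured SLI drops below the objective. The class name, the window size and the 0.9 objective are assumptions for illustration, as is the choice to wait for a full window before paging.

```python
from collections import deque


class SloAlert:
    """Tracks the last `window` probe results and flags an SLO breach."""

    def __init__(self, objective, window):
        self.objective = objective
        self.results = deque(maxlen=window)  # oldest results fall off

    def record(self, ok):
        self.results.append(bool(ok))

    def sli(self):
        if not self.results:
            return 1.0
        return sum(self.results) / len(self.results)

    def should_page(self):
        # Wait for a full window so one early failure does not page anyone.
        full = len(self.results) == self.results.maxlen
        return full and self.sli() < self.objective


alert = SloAlert(objective=0.9, window=10)
for ok in [True] * 9 + [False]:
    alert.record(ok)
print(alert.sli(), alert.should_page())  # 0.9 False

alert.record(False)  # a second failure inside the same window
print(alert.sli(), alert.should_page())  # 0.8 True
```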

ONLY???
• SLO alerts find unknown-unknowns
• Known-unknowns and unknown-knowns must only exist transiently
• A known-known should not require a human. Automate responses to known issues
• For all else, if you have a 100% signal it can be an alert. But if it doesn’t impact the SLO, does it need to wake you up?
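The decision these bullets describe can be written down as a small routing function. The Python sketch below is one interpretation, not something stated in the talk: the `Action` names are invented, and sending everything that neither affects the SLO nor has an automated response to a working-hours review is an assumption.

```python
from enum import Enum


class Action(Enum):
    PAGE = "page a human now"
    AUTOMATE = "run the automated response"
    REVIEW = "review during working hours"


def route(impacts_slo, has_automated_response):
    """Decide what a signal should trigger."""
    if impacts_slo:
        # The customer is affected, so someone needs to know now.
        return Action.PAGE
    if has_automated_response:
        # A known issue with a known fix should not need a human.
        return Action.AUTOMATE
    # Assumption: everything else can wait until people are awake.
    return Action.REVIEW


print(route(impacts_slo=True, has_automated_response=False))   # Action.PAGE
print(route(impacts_slo=False, has_automated_response=True))   # Action.AUTOMATE
print(route(impacts_slo=False, has_automated_response=False))  # Action.REVIEW
```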

What About Capacity?
• Use Quotas: Assure no single user can quickly overrun capacity
• Report & Review: Frequently enough to respond to trend changes
• Act Promptly: Never ignore or put off expansion work
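A quota can take many forms; the talk does not prescribe one. As an illustration, the Python sketch below uses a per-user token bucket, which is a swapped-in technique chosen for brevity. The class name, the rate and the burst size are assumptions, and the clock is passed in explicitly so the example is deterministic (in practice it might come from `time.monotonic()`).

```python
class UserQuota:
    """Per-user token bucket: bounds how fast one user can consume capacity."""

    def __init__(self, rate_per_s, burst):
        self.rate_per_s = rate_per_s  # tokens refilled per second
        self.burst = burst            # maximum tokens a user can hold
        self.buckets = {}             # user -> (tokens, time last seen)

    def allow(self, user, now, cost=1.0):
        tokens, last = self.buckets.get(user, (self.burst, now))
        # Refill for the time elapsed, never beyond the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate_per_s)
        allowed = tokens >= cost
        if allowed:
            tokens -= cost
        self.buckets[user] = (tokens, now)
        return allowed


quota = UserQuota(rate_per_s=1.0, burst=2)
print(quota.allow("alice", now=0.0))  # True
print(quota.allow("alice", now=0.0))  # True
print(quota.allow("alice", now=0.0))  # False: alice used her burst
print(quota.allow("bob", now=0.0))    # True: bob is unaffected
print(quota.allow("alice", now=1.0))  # True: one token has refilled
```

Because each user has an independent bucket, one noisy user runs out of tokens without reducing what is available to anyone else.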

What Should I Do Next?
Define Your SLOs
• Talk to your customers and agree on what they can expect
• Add objective monitoring for these expectations
Clean Up Alerts
• Inventory alerts and eliminate any that are not a clear signal
• Add alerts for the SLOs that you have agreed on
• Implement quotas, if needed, to assure capacity isn’t suddenly overrun
Add Instrumentation
• Switch to structured logging
• For distributed systems, consider adding request tracing
• But make sure you don’t hold this extra information for longer than it’s needed for debugging
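As a sketch of what switching to structured logging can look like, the Python example below emits each log message as one JSON object using only the standard `logging` and `json` modules. The `JsonFormatter` class, the `fields` key, the `orders` logger name and the field names are assumptions for illustration. A shared `request_id` field is one simple way to tie related events together, though it is not a full tracing system.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        event = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record;
        # this sketch expects them under a single "fields" key.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event, sort_keys=True)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("orders")  # hypothetical service name
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info(
    "request served",
    extra={"fields": {"request_id": "abc123", "status": 200, "latency_ms": 42}},
)
```

Every field is then directly queryable by whatever system collects the logs, instead of having to be parsed back out of free text.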

More Resources
• Finding Me
  • linkedin.com/in/toddpalino
  • slideshare.net/ToddPalino
  • @bonkoif
• Code Yellow – How we help overburdened teams
  • devops.com/code-yellow-when-operations-isnt-perfect
  • Usenix LISA (10/29 – Nashville) “Code Yellow: Helping Top-Heavy Teams the Smart Way”
• SRE – What does the culture look like at LinkedIn
  • “Building SRE” usenix.org/conference/srecon18asia/presentation/palino
  • Every Day is Monday in Operations - everydayismondayinoperations.com
• Kafka – Deep dive on monitoring for Apache Kafka
  • confluent.io/kafka-summit-london18/urp-excuse-you-the-three-metrics-you-have-to-know

Why Does (My) Monitoring Suck?

1. Why Does (My) Monitoring Suck? Todd Palino Senior Staff Engineer, Site Reliability LinkedIn

2. This Is The Only Slide You May Need a Picture Of https://slideshare.net/ToddPalino

3. What’s On Our List Today? Alerting Anti-Patterns Setting Goals What Is Monitoring? Designing For Success Wrapping Up

4. Alerting Anti-Patterns

5. Network Operations Center • Central monitoring and alerting • Gatekeeping monitored alerts with no deep knowledge • Information overload for a moderate sized system • Glorified telephone operators

6. Kafka Under-Replicated Partitions • Unclear meaning • Sometimes it’s not a problem at all • Does the customer care as long as requests are getting served? • Frequently gets ignored in the middle of the night

7. CPU Load • Relative measure of how busy the processors are • Who cares? Processors are supposed to be busy • What’s causing it? • Might be capacity. Maybe

8. Setting Goals

9. Service Level Whatever • SLI: Indicator • SLO: Objective • SLT: Target • SLA: Agreement

10. Let’s Be Smart About This • Specific • Measurable • Agreed • Realistic • Time-limited, Testable

11. Common SLOs • Availability: Is the service able to handle requests? • Latency: Are requests being handled promptly? • Correctness: Are the responses being returned correct?

12. What Is Monitoring?

13. Monitor: Observe and check the progress or quality of (something) over a period of time; keep under systematic review.

14. So WTF is Observability? • Comes from control theory • A measure of how well internal states of a system can be inferred from knowledge of its external outputs • It’s a noun – you have this (to some extent). You can’t “do” it.

15. What Are We Looking For?

16. What Can We Work With? Metrics (single numbers): • Counters • Gauges • Histograms (and Summaries) Events (structured data): • Log messages • Tracing (collection of events)

17. Where Can We Get It? Subjective: • Rich data on internal state • Necessary for high observability • Tons of data possible, but the utility is often questionable • Beware! Here be dragons! Objective: • Customer view of your system • Think of “Down For Everyone Or Just Me?” • Critical for SLO monitoring • More difficult to do, but it’s the authority on whether or not something is broken

18. Designing For Success

19. Build For Failure • Intelligence: Rich instrumentation on every aspect • Availability: Tolerate single component failures (not just N+1) • Capacity: Limit resource creation and utilization

20. Using the SLO: It’s the only thing that matters • Always measure the SLIs • Objective monitoring is best • Don’t beat the SLO • Only alert on the SLO

21. ONLY??? • SLO alerts find unknown-unknowns • Known-unknowns and unknown-knowns must only exist transiently • A known-known should not require a human. Automate responses to known issues • For all else, if you have a 100% signal it can be an alert. But if it doesn’t impact the SLO, does it need to wake you up?

22. What About Capacity? • Use Quotas: Assure no single user can quickly overrun capacity • Report & Review: Frequently enough to respond to trend changes • Act Promptly: Never ignore or put off expansion work

23. Wrapping Up

24. What Should I Do Next? Define Your SLOs: • Talk to your customers and agree on what they can expect • Add objective monitoring for these expectations Clean Up Alerts: • Inventory alerts and eliminate any that are not a clear signal • Add alerts for the SLOs that you have agreed on • Implement quotas, if needed, to assure capacity isn’t suddenly overrun Add Instrumentation: • Switch to structured logging • For distributed systems, consider adding request tracing • But make sure you don’t hold this extra information for longer than it’s needed for debugging

25. More Resources • Finding Me • linkedin.com/in/toddpalino • slideshare.net/ToddPalino • @bonkoif • Code Yellow – How we help overburdened teams • devops.com/code-yellow-when-operations-isnt-perfect • Usenix LISA (10/29 – Nashville) “Code Yellow: Helping Top-Heavy Teams the Smart Way” • SRE – What does the culture look like at LinkedIn • “Building SRE” usenix.org/conference/srecon18asia/presentation/palino • Every Day is Monday in Operations - everydayismondayinoperations.com • Kafka – Deep dive on monitoring for Apache Kafka • confluent.io/kafka-summit-london18/urp-excuse-you-the-three-metrics-you-have-to-know

26. Questions?

Editor's Notes

Before we get started, I want to give everyone a chance to snap this. After this talk is over, by the end of the day, I will post the slide deck on my SlideShare along with the original PowerPoint (with slide notes). So anything you want to reference later will be there for you.
Now that we’ve got that out of the way, what are we going to talk about? I’m going to start with examining a few anti-patterns in alerting that I’ve seen at LinkedIn, and how they’ve caused us pain. We’ll then talk a little about establishing what our application goals should look like. I’ll talk about the current state of the monitoring world we have to work with (along with a buzzword or two), and exactly what we are trying to get out of good monitoring. Once we have that established, we can talk about how we should design our stacks so that we get the information we need without killing ourselves. Lastly, we’ll quickly review what we should all be doing right now to make our lives easier.
When it comes to alerting, we make a lot of mistakes. Still. We set up far too many alerts, on information that we only partially understand, and we don’t even know how to act on it. This leads to us ignoring the things that we set up to ping us, and that muddies the signal even more. We often can’t see the size of the problem until we’re working through multiple sleepless nights, and at that point all we have the energy for is digging ourselves out of the current crisis. So what are some of the worst ones that I have seen at LinkedIn?
Probably the best example of signals going to the wrong place is the Network Operations Center. Our Site Operations Center is made up of some really sharp engineers who are tasked both with coordinating large incidents and with helping to monitor the growth and overall health of the site. Like every other engineer, they build tools and processes around this. But back when we still called them the NOC, the role was vastly different. At the time, we didn’t have a good way to get alerts to individual engineers. So the NOC would have alerting dashboards they would monitor with hundreds upon hundreds of alerts. These would have to be onboarded to the NOC, making them the gatekeeper for what was and wasn’t important enough to have someone monitoring 24x7. To make it worse, most of the runbooks were “Call the SRE”, so they couldn’t even help to resolve most problems. This left us with a bunch of engineers who had to restrict what information was sent to them, without deep knowledge of the systems being monitored. And while they also had responsibilities around monitoring the growth metrics, they were often working as little more than switchboard operators – waking engineers up at night and taking orders for who to escalate to. Thankfully, this is much better now – while the SOC still helps with escalation, it’s a minor part of their role. Alerts are now managed by the individual SRE teams, and the notifications go to those teams, not a central team that has to dispatch them.
Another problem, one that I am still constantly fighting, is that of unclear monitoring signals. In Apache Kafka, the “under-replicated partitions” metric is often used as the gold standard of what to monitor and alert on. I know, because I helped to write the definitive guide for Kafka and I told everyone that it was the gold standard of what to monitor. Now I’m stuck with the unenviable task of having to tell everyone how wrong I was and why. The problem is that this metric doesn’t provide a clear signal as to what is wrong. This graph usually means that one of the Kafka brokers is offline, but not always – it could also mean that it’s up but replication is broken in a variety of ways. The next graph is usually a problem with a broker talking to a single other broker for replication. But it could also be a single hot topic or partition that’s unbalanced. This last graph, nobody has any idea what it means. And sometimes it’s not a problem at all – when you need to move partitions around to balance a cluster, they are naturally under-replicated while data is being moved. So what happens? The Kafka team has an alert that goes off if this metric is more than 10. Why 10? Because normally we don’t move more than 10 partitions at a time, so that will mask a known false alert. But the alert often gets ignored when it goes off. There’s a separate signal for a broker that is down, and under-replication is often a sign of a capacity problem that can’t be immediately resolved (more on that later). In addition, the customers don’t really care that the cluster is under-replicated (although sometimes they do). So it’s not a crisis. Which raises the question, why is it an alert in the first place?
It’s much like this other lovely metric, CPU load. I would guess that we have all seen this, or some variant like CPU utilization, as an “important metric” in our alerts, despite the fact that most engineers I have had the pleasure of working with over the years don’t understand what it is a measure of without asking Google. And even when you understand the measurement (hint – it’s a relative measure of how busy the processors on a system are, averaged over some period of time), you then need to understand what is a good number and what is a bad number. For example, is a CPU load of 20 bad? If you have 16 CPUs in your system the answer is very different than if you have 24. Even then, it’s hard to say if it’s bad. There’s a question I heard someone ask at one point. If you run an email server, and the CPU utilization is at 99%, what do you do? The answer is that if the emails are getting delivered properly, you go get a drink. The CPU is supposed to be busy – it’s there to do work for you. If the work is happening properly, why do you care how busy it is at any point in time? The one caveat is that you might have a capacity concern building, but this is a lousy way to measure it. And even if you do, what are you going to do about it in the middle of the night? Complicating this, it’s a measure of the overall system. Any given system has hundreds of processes running. Which one is causing the CPU to be high? Maybe it’s an opportunistic process that will back off if the CPU is demanded by something else. Yet another reason why this shouldn’t wake you up.
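A minimal sketch of the "is a load of 20 bad?" arithmetic, assuming the load average is simply divided by the CPU count. The function name is hypothetical, and even the normalized number says nothing about whether the work is getting done, which is the point of the example above.

```python
def load_per_cpu(load_average: float, cpu_count: int) -> float:
    """Return the load average divided by the number of CPUs."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load_average / cpu_count


# The same load average of 20 reads very differently on 16 and 24 CPUs.
for cpus in (16, 24):
    ratio = load_per_cpu(20.0, cpus)
    state = "more runnable work than CPUs" if ratio > 1.0 else "CPUs not fully subscribed"
    print(f"load 20 on {cpus} CPUs -> {ratio:.2f} per CPU ({state})")
```

A ratio above 1.0 only says that work is queuing for a CPU at that moment. It does not say which process is responsible or whether any customer noticed.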
If we shouldn’t be alerting on things that aren’t clear or that don’t matter to our users, and if we want to make sure that the signals we provide don’t swamp us (or a NOC) with irrelevant information, what should we do? The answer is that we have to start with defining what our goals are for the application. The only goals that matter are what our customers expect – they don’t care if the CPU is a little hot. They also don’t care, unless we have made promises around internal replication, whether or not a given thing is under-replicated. They care about the agreements that we have made with them, so we have to define them before we know what we should measure, track, and alert on.
Most everyone here will be familiar with the term “service level agreement”. However, I would also guess that most of us (or our management) use it somewhat incorrectly, especially when discussing parts of an internal distributed architecture (and not external customers). There are actually several terms that you should know and use properly, so that we have clear communication on what our goals are. All of these start with “service level”, which indicates that we’re talking about items that define how we deliver the services that we’re responsible for – what the level of performance is. The first is “service level indicator”, abbreviated SLI. The SLI is the measurement that will form the basis for any goal. For example, if we want to track a service’s error rate as a goal, then the metric that tells us what the error rate is will be an SLI. If we want to track latency, we may have several SLIs – we might want to use both the 50th and 99th percentiles, for example (more on that later). Next we have “service level objective”, or SLO. This is the term that more people know; however, ITIL has decided to deprecate the use of this term in their documentation and use “service level target” (SLT) instead. They have the same meaning. The SLO specifies which measurements of your SLI are considered good and which are considered bad. So if your goal is to have the error rate be less than 1%, and you’re going to track that over the course of a day, your SLO could be “service error rate < 1% averaged over 24 hours”, a combination of an SLI and a specific target for it, including the timeframe that it is measured over. On top of all this is the “service level agreement”, or SLA. The SLA comprises the entire agreement with the customer, which includes not only the SLO, but also the consequences if the SLO is missed and how customer support is provided.
It is a contract, and if you’re running a system with internal customers you probably don’t have one of these (even if you have SLOs already). With terminology defined, how do we define what the SLOs should be for a given service?
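As a rough sketch of how the example SLO “service error rate < 1% averaged over 24 hours” combines an SLI with a target and a timeframe, the snippet below evaluates request counts collected over the window. The class and field names are hypothetical, and treating a window with no traffic as compliant is an assumption rather than a rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorRateSLO:
    """An SLO: a target for one SLI (error rate) over a stated window."""
    max_error_rate: float  # target, e.g. 0.01 for "< 1%"
    window_hours: int      # timeframe the SLI is averaged over

    def is_met(self, total_requests: int, failed_requests: int) -> bool:
        """Check the SLI (failed / total over the window) against the target."""
        if total_requests == 0:
            return True  # assumption: no traffic in the window counts as compliant
        return failed_requests / total_requests < self.max_error_rate


slo = ErrorRateSLO(max_error_rate=0.01, window_hours=24)
print(slo.is_met(total_requests=1_000_000, failed_requests=4_200))   # 0.42% error rate
print(slo.is_met(total_requests=1_000_000, failed_requests=15_000))  # 1.5% error rate
```

The counts passed to `is_met` are the SLI. The dataclass fields are the objective. An SLA would add things code cannot express, such as consequences and support terms.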
We’re going to talk about another thing that we’ve probably all heard about, SMART goals. There are a number of variants on what the letters S, M, A, R, and T are supposed to stand for here, so we’re going to use a set that will apply well to talking about SLOs. The end result should be an understanding of how to pick SLOs that won’t bite us later. First, they need to be specific. It’s not sufficient to say we’re going to measure latency. We need to specify exactly what metric we are going to use and how it is measured. For something like latency, it’s important to know if we’re using a histogram so we can specify the right components (like 99th percentile). We also need exact target measurements and the time periods that they are measured over. Next, and this may seem like it should go without saying, they need to be measurable. We can’t have a goal of making the customer happy – we don’t have a way to measure that. Even more reasonable things, like error rates and latencies, are no use as goals if we don’t have a way to measure them properly. It’s also worth noting here that where we measure these things from matters. What our server says the latency and error rate are will probably differ from what our clients think. So it’s important to say where the measurement is made from. The SLOs also have to be agreed on between us and our customers. This is pretty simple – I can’t specify alone what the SLOs for my service are, because I need to know what’s required and acceptable to my customers. Neither can they say what the SLOs are without my involvement. This is because the goals also need to be realistic. If you have a service that performs a complex calculation which takes a minimum of 10ms of CPU time (in terms of the wall clock), then it’s not reasonable to have a latency SLO at 5ms. If you can’t agree on the goals, specifically if they’re not realistic, it’s a sign you need to have a deep discussion with your customers about what they’re trying to accomplish.
Lastly, we’re going to have two “T”s – time-limited and testable. Testable is very similar to measurable, but we can also use it to say that we should be able to test for compliance with the SLOs before we release new code. That might be through pre-release performance testing, canaries, or some other mechanism for minimizing the impact of bad code. Time-limited means that we need to define the time period over which we are measuring the SLO. If we’re talking about availability of the service, then four nines measured over a day is very different than when it is measured over a week. The first one means that we can only have less than 9 seconds of downtime every day, whereas the latter means we can have a single event where we are down for a minute in a week.
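The four-nines comparison can be checked with a few lines of arithmetic. This is only a sketch; the function and constant names are hypothetical.

```python
def allowed_downtime_seconds(availability: float, window_seconds: int) -> float:
    """Downtime budget: the fraction of the window that may be unavailable."""
    return (1.0 - availability) * window_seconds


DAY_SECONDS = 24 * 60 * 60
WEEK_SECONDS = 7 * DAY_SECONDS

# Four nines (99.99%) measured over a day versus over a week.
print(f"per day:  {allowed_downtime_seconds(0.9999, DAY_SECONDS):.2f} s")
print(f"per week: {allowed_downtime_seconds(0.9999, WEEK_SECONDS):.2f} s")
```

The daily window allows about 8.64 seconds, and the weekly window allows about 60.48 seconds that can be spent in a single event. This is why the measurement period has to be part of the SLO.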
So what are the SLOs you should be looking at? It’s hard for me to tell you what your customers want you to agree to for any given service, but they’re usually going to be one of these three things: availability, latency, or correctness. Availability is simply whether or not your service is up and running and able to handle requests. It doesn’t (usually) mean that every component is up, just that a customer can send a request and get a response. Usually this is expressed in terms of how many nines you have. Latency is how long it takes to service a request. This may be overall for your service, or if you have endpoints that behave differently, it might be per endpoint. You also may have SLOs that cover different percentiles for the same value. For example, you might have a 100ms response time at the 50th percentile, and a 1 second response time at the 99th percentile. Correctness is the term I use instead of error rate. This is because an error is just one type of bad response. Giving a wrong answer is usually worse than returning an error code. All together, these are the things customers usually care about: can I make requests, do I get responses in a reasonable time period, and are the responses correct?
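A minimal sketch of checking the example latency SLO (100ms at the 50th percentile, 1 second at the 99th) against raw samples. It assumes a simple nearest-rank percentile over made-up data. A real system would more likely read these values from a histogram metric, and the helper name is hypothetical.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Illustrative (made-up) latencies in milliseconds: mostly fast, with a slow tail.
latencies_ms = [40.0] * 60 + [90.0] * 30 + [400.0] * 9 + [2500.0] * 1

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
slo_met = p50 <= 100.0 and p99 <= 1000.0
print(f"p50={p50}ms p99={p99}ms slo_met={slo_met}")
```

Using two percentiles lets the objective cover both the typical request and the slow tail. A single average would hide the one 2.5 second request entirely.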
Now we can talk about what we have to work with to make sure that we hit those goals. The entire discipline is called monitoring, but what is it really?
The Oxford definition of “to monitor” is: Observe and check the progress or quality of (something) over a period of time; keep under systematic review. This matches up with what we know – monitoring is about watching and measuring the status of our services.
If that’s the case, then what the heck is this term that keeps coming up, observability? Well, what I’m gonna say may sound indelicate… There are some who use this term to indicate a magical new discipline that will solve all your problems, an alternative to monitoring, by giving you a view into what your applications, or your distributed systems, are doing. The reality? What they’re talking about is a specific kind of monitoring that we’ll get to in a minute. Observability is a term from control theory. The definition there is “a measure of how well internal states of a system can be inferred from knowledge of its external outputs.” Sounds an awful lot like monitoring, doesn’t it? There’s a good reason for that – they’re not at odds with each other. Observability is a noun – it’s something you have, or don’t. It’s not something you can “do”. Monitoring is a verb – it’s something you do. In many cases, observability in this world is more of a measure of how good your monitoring is. Image Copyright © 2002 Viacom International Inc
OK, enough about observability. What are we looking for with our monitoring? Where are we trying to get to? We’re going to call this the Rumsfeld Quadrant. On the side, we have detection of problems – there are known issues that we are able to detect, and unknown issues that we are not able to detect. Along the top, we have our response to those problems – there are issues where we know how to respond to them, how to fix them, and we have issues that we don’t know how to fix. For issues that we can detect and that we know how to respond to, these are the known-knowns. This is the realm of good monitoring – crisp signals, and runbooks to address them. These are things that should be automated as well – we know how to fix them, so why spend engineer hours on doing this? If we have problems that we can detect, but we don’t know how to respond to, these are known-unknowns: our active incidents that we’re working on. For problems we can’t detect, but we know how to fix, these are unknown-knowns. These are monitoring gaps that need to be resolved. The last category is the unknown-unknowns – problems that we don’t know about, and we don’t know how to fix. These are tweets about your service being broken – you’re hearing about it from customers, and not from your own system. And you’re being spun up to quickly figure out what the heck broke. By necessity, these need to rapidly migrate to either known-unknowns or unknown-knowns, depending on whether you can fix it or detect it faster. Eventually, all of these issues need to migrate to known-knowns. If they don’t, you’re stuck in reactive work all the time.
So we know what we’re looking for. How can we get there? What data do we have to work with? Monitoring generally breaks down to two types of things: metrics, and events. Metrics are single numbers, whereas events are structured data. By the way, when you hear people talk about “observability”, they’re talking about events. Metrics are what we make our graphs out of. Counters are pretty simple – constantly increasing integers. Total number of requests, total number of errors. Gauges are numbers (either integers or floats) that fluctuate up and down, like a speedometer. This could be requests per second, or network utilization, or CPU utilization. A third type of data is histograms. This is bucketed data, typically represented as percentiles. The 50th percentile is the median, the 99th percentile means that 99% of values are less than the metric value. We often use histograms for metrics like latencies, so that we can not only see the average but also what the worst offenders are. For events, the one we probably all have already is log messages. Of course, they’re probably not as structured as we’d like them to be – most of us are using plain strings, even if they’re in a well-defined format like Apache HTTPD logs. Everyone, if they’re not already, should be moving towards true structured logging in a format like JSON (there are other options, but JSON is the most common). The other type of event data, specifically relevant for distributed systems, is request tracing. This is actually a collection of events, where each event is a discrete call made as part of the initial request. For example, a user makes a request at service A. Service A then calls services B and C to get results, and assembles a response to the user. You would have 3 events in a trace, at minimum – one each for the initial user call to A, the call from A to B, and the call from A to C. 
These events will have rich data about the requests: the caller, the endpoint called, the status of the call, the time the call took, and possibly much more information. This list isn’t exhaustive, by any means. But these are the most common types of monitoring data we have to work with. Metrics are usually where your alerts come from – they let you know that there is a problem. Events are usually where you get your detail for debugging from – they help you figure out why you have a problem. This is not an “either/or” – use both.
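Here is a small sketch of what the trace from that example might look like as structured events, written out as one JSON object per line. Everything specific here is an assumption for illustration: the field names, the endpoints, the durations, and the use of a random trace ID to tie the events together. Real tracing systems define their own schemas.

```python
import json
import uuid


def make_event(trace_id, caller, callee, endpoint, status, duration_ms):
    """One discrete call within a request, as a structured event (hypothetical fields)."""
    return {
        "trace_id": trace_id,
        "caller": caller,
        "callee": callee,
        "endpoint": endpoint,
        "status": status,
        "duration_ms": duration_ms,
    }


def build_trace():
    """A user calls service A, and A calls services B and C: three events, one trace."""
    trace_id = uuid.uuid4().hex
    return [
        make_event(trace_id, "user", "A", "/report", 200, 42.0),
        make_event(trace_id, "A", "B", "/lookup", 200, 12.5),
        make_event(trace_id, "A", "C", "/score", 200, 20.1),
    ]


trace = build_trace()
lines = [json.dumps(event, sort_keys=True) for event in trace]
for line in lines:
    print(line)
```

Because every line is valid JSON with the same keys, an automated process can group events by `trace_id` and reconstruct the request without parsing free-form strings.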
We also have a choice on where we can get a lot of our data from. Specifically, I will talk about subjective measurements, and objective measurements. Subjective measurements are things that we measure about ourselves. “I am a very handsome presenter” is a subjective statement. Objective measurements are things that other people measure about us. “This guy has no idea what he’s talking about” is an objective statement. Subjective monitoring has the ability to give us very rich data on the internal state of a system, because we are instrumenting the system directly. These types of data are absolutely necessary for high observability. However, we need to be careful here because while there is lots of data that we can make available, not all of it is going to be useful. Do you really care about the byte size of the representation of a specific array in your code? You might, but it’s probably not going to be a very useful piece of information. Additionally, it might not be the best way to measure something. If you measure the error rate for requests at the server, it won’t include all the requests that don’t make it to the server at all. On the other side, we have objective monitoring. This provides a view of your system that is more like what your customers see. You can think of a service like “Down for everyone or just me?” – it makes a measurement against your system from the outside, and takes into account something other than what the system thinks about itself. This type of data is critically important for monitoring our SLOs, since when it comes to an agreement with our customers, they don’t really care if the service is down because the service itself is broken or if the network getting to the service is broken. It’s definitely harder to do, but it’s also much more of an authority on whether or not your service is working. Again, most of the time you need both of these. 
You need objective monitoring to know whether or not there is a problem, and you need subjective monitoring to really dig into the detail of why.
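A quick worked example of why the two views disagree, using made-up numbers. Assume clients send 1,000 requests, 50 of them never reach the server (say, a network problem), and the server returns errors for 10 of the requests it does see.

```python
def error_rate(failed: int, total: int) -> float:
    """Fraction of requests that failed; 0.0 when there were no requests."""
    return failed / total if total else 0.0


# All numbers below are invented for illustration.
sent_by_clients = 1000
dropped_before_server = 50                    # never seen by the server at all
server_seen = sent_by_clients - dropped_before_server
server_errors = 10

# Subjective view: what the server measures about itself.
server_view = error_rate(server_errors, server_seen)

# Objective view: what the clients actually experienced.
client_view = error_rate(server_errors + dropped_before_server, sent_by_clients)

print(f"server sees {server_view:.2%} errors")
print(f"clients see {client_view:.2%} errors")
```

The server reports roughly 1% errors while the customers are living with 6%. If the SLO is an agreement about what customers experience, only the second number can tell you whether you are meeting it.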
With a shared understanding of what we can do, we can now move into how we handle our services to make sure that we have enough data to understand them, but avoid killing ourselves with noisy alerts and information overload. In large part, this means we need to design our services from the ground up to be successful.
This means building them knowing that they are going to fail. Anyone who says they have a service that will never fail is a liar – everything fails, even if it means the failure is outside of your direct control. If we know we’re going to fail, what do we need to do in advance to make it as easy as possible for us to manage that? First, the code needs to provide us with appropriate intelligence about what is going on – we need high observability. We want detailed instrumentation on the operations of our entire system, including both metrics and events, so that we can debug problems easily. We also need to make sure that any SLIs that we have identified are included in there as well. Next, we need to build availability into the architecture. This means that we will strive to tolerate the failure of any single component without affecting the overall availability of the system. And this doesn’t mean N+1, which is what most people think of when it comes to availability. With the Kafka infrastructure at LinkedIn, we have previously used a replication factor of 2 – this means that we can lose a single broker, and the cluster will continue to operate. However, we can only lose one broker, and this means that whenever we had a hardware failure, we needed to spin up the on call engineer immediately to get that system back up and running, because a second failure would mean that we were down. This isn’t really being able to tolerate a single failure - it just meant that we had a very small window in which it wasn’t impacting customers. The third part is that we need to manage the capacity of our services so that we don’t get surprised by a sudden surge. For storage systems like Kafka, this means limiting the creation of resources, and how much storage and processing those resources can use by default. For non-storage things, this might mean quotas on request rates from upstream callers, or dynamic allocation of servers to accommodate surges in traffic. 
Either way, capacity problems cannot surprise us because we don’t have the ability to magically make new resources appear. And if you do, that should be automatic.
When we’re setting up the alerting on all that rich instrumentation, the SLO is the thing to look at. We’ve hit some of these points already, but it’s worth a recap: We always have to measure the SLIs that have been defined. If we don’t, you may as well not bother having SLOs at all, because you don’t know if you’re hitting them or not. What’s worse is if you rely on your customers to monitor the SLOs. As a service owner, I never want to hear about any problem from my customers – I should always be the first to know about an issue, and I should be informing them. Not the other way around. When it comes to the SLIs, the best monitoring is objective monitoring. Since the SLO is an agreement with the customer, it is almost always measured from the customer’s point of view, not ours. Our monitoring should match that. We can rely on subjective monitoring to provide detailed data when we’re debugging a problem, but generally not for the SLOs themselves. Once we’ve got our SLOs defined, we should not try to beat them. At least not by very much, since we probably always want a little bit of buffer. But if you agree in your SLOs on 90% uptime, but you’re actually delivering 99% uptime all the time, guess what? Your customers probably have become accustomed to that, and they’re going to start making noise if the uptime is 98% even though that’s well within the SLO on paper. This is a good time to have maintenance windows where the service is offline even if it doesn’t need to be. Or artificial latency that can be dynamically adjusted if you have a downstream problem. It may seem counterintuitive to not be the best we can be here, but you’re going to sleep a lot better if you deliver what you promised. Lastly, only set up alerts on the SLO.
Wait, you say. I must have misheard you. I thought you said that I shouldn’t alert on anything except the SLOs, and not these dozens of other metrics that indicate problems! Yes, that’s exactly what I said. And here is why. Think back to the Rumsfeld Quadrant. Alerts on the SLOs will find the unknown-unknowns – the problems that we can’t currently detect and our customers are going to tell us about. The known-unknowns, which indicate an active incident that we’re debugging how to fix, and the unknown-knowns, where we have a monitoring gap, by definition must only exist in our systems transiently. They have to transition to known-knowns if we’re going to avoid spending our entire day in reactive work. The known-knowns have a known detection and a known response – that should not require a human being to handle them. Those responses should be automated. But those are the only things we have. Monitoring signals used for alerting either tell us about a problem clearly, or they don’t – there is no useful grey area. And if they don’t tell us about a problem clearly, we’re setting ourselves up for failure if we use them. So if you have another signal that is 100% clear, that can potentially be used as an alert. But if you have a problem that doesn’t impact the SLO, why does that need to wake you up? That problem is either a known-known, in which case it can be automated away, or it’s a known-unknown, which someone had better be working on turning into a known-known. Now, I will admit that what I have described is the ideal world. But it is of the utmost importance that we have the ideal state in mind, and make sure that everything we do moves us towards that ideal state, not away from it.
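One way to picture that ideal state is as a routing decision for every signal that fires. The sketch below is only an illustration of the argument, not a description of any real alerting system: the `Signal` fields and the "page", "automate", and "ticket" outcomes are names I have made up.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    name: str
    breached: bool
    is_slo: bool               # measures an SLO agreed with the customer
    automated_response: bool   # a known-known whose fix has been automated


def route(signal: Signal) -> str:
    """Decide what happens when a signal fires (hypothetical outcomes)."""
    if not signal.breached:
        return "none"
    if signal.is_slo:
        return "page"       # customer impact: wake a human
    if signal.automated_response:
        return "automate"   # known-known: run the fix, no human needed
    return "ticket"         # needs work during the day, not a page at night


signals = [
    Signal("availability SLO", breached=True, is_slo=True, automated_response=False),
    Signal("disk failed on one broker", breached=True, is_slo=False, automated_response=True),
    Signal("under-replicated partitions", breached=True, is_slo=False, automated_response=False),
    Signal("latency SLO", breached=False, is_slo=True, automated_response=False),
]

decisions = {s.name: route(s) for s in signals}
for name, decision in decisions.items():
    print(f"{name}: {decision}")
```

Only one of the three breached signals pages anyone. The other two still get handled, just not by waking someone up.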
The one gotcha in all this is the capacity of our service. That’s probably not an SLO, but it does need to be monitored and managed correctly. Now, I’m specifically talking about systems that don’t have dynamic capacity changes available to them. If you’re running in a public cloud, and your service is well designed for scalability, you can probably react to capacity changes by spinning up new instances. Automate that. For the rest of us, and those of us who are running storage systems that require special handling, there are a few things we can do. First, as I mentioned earlier, use quotas to make sure that by default, your users are not able to overrun your available capacity. In Kafka, for example, this means using message retention by bytes on disk, as well as restricting the inbound and outbound bytes per second rates. Your customers should have resource limits in place so that if they want to significantly change their call patterns, they need to interact with you somehow to make sure you know about it in advance, and can plan for it. Once you have limits in place, you need to report on the capacity of your system and make sure you’re reviewing those reports frequently. Maybe you can automate this entire process, and have the reporting system put in hardware orders automatically. Maybe not. But either way, this is a process that is out-of-band from your normal alerting because hardware doesn’t magically appear when you snap your fingers. And don’t ever ignore those reports, or put off the expansion work that’s required. That’s really only asking to have problems that you could have taken care of proactively.
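As a sketch of what a capacity report might compute, the example below projects how many days of headroom remain and compares that to how long it takes to get new hardware. The linear growth assumption, the function names, and all of the numbers are hypothetical; real growth is rarely this tidy, which is one more reason to review the report frequently.

```python
def days_until_full(capacity, used, daily_growth):
    """Days until capacity is exhausted, assuming linear growth.

    Returns None when usage is flat or shrinking.
    """
    if daily_growth <= 0:
        return None
    remaining = capacity - used
    return max(0.0, remaining / daily_growth)


def needs_order(days_left, lead_time_days):
    """True when new hardware would not arrive before we run out."""
    return days_left is not None and days_left <= lead_time_days


# Made-up cluster: 100 TB total, 70 TB used, growing 0.5 TB per day.
days_left = days_until_full(capacity=100, used=70, daily_growth=0.5)

# Made-up procurement lead time of 90 days.
order_now = needs_order(days_left, lead_time_days=90)

print(f"days of headroom: {days_left:.0f}")
print(f"order hardware now: {order_now}")
```

With 60 days of headroom and a 90 day lead time, the order is already late. That is a report finding to act on during working hours, not an alert that fires at 3am when the disks fill.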
If we can manage to do these things - design our systems well, monitor for the SLOs, and manage capacity proactively – we set ourselves up for success. The SLOs will let us know about the unknown-unknowns, and the detailed metrics and events will provide what we need to fix those new problems. As someone who wants to move towards the ideal state of automating their troubles away and cleaning up the noise of bad alerts, where do you start?
The first thing you want to do is define your SLOs. Have a conversation with your customers, whether internal or external, and come to an agreement on what your service can and will provide for them. Then add some objective monitoring so that you can stick to those agreements. Next, you need to work on cleaning up your alerts. Take a look at all of them, and eliminate anything that doesn’t have a clear signal to start with. At least put them in quarantine and see if you can stop waking yourself up with them. Add new alerts for the SLOs that you now have. And make sure you have quotas in place to avoid surprises. Lastly, beef up the instrumentation for your services, and manage the monitoring data appropriately. If you’re not there already, for the love of all that is good please switch to structured logging. The larger your systems get, the more you realize that all of your data (like logs) should be targeted at automated processing, not humans being able to read it. Also think about adding request tracing if it seems appropriate. This is really important for distributed systems, but it’s also useful for tracing a request within a single service. It can be as simple as logging detailed request data, or more complex like emitting messages to a system like Kafka for stream processing. Ultimately, the more instrumentation you add, the more you need to make sure you’re only holding onto what you need. Make sure that monitoring data, including tracing, is only retained for as long as it’s important for debugging. This will help keep you in the good graces of the teams that are responsible for your monitoring storage, as well.
I’ve also got a few resources for you. Due to monitoring overload and capacity problems, among other things, our Kafka team spent the first half of the year in a state we call Code Yellow. This is a way for us to put the brakes on and fix critical problems with a service or a team. If you’d like to learn more about that, I’ve written a blog post on what Code Yellow is. Michael Kehoe and I will be at LISA in a couple weeks to talk about that and other Code Yellows at LinkedIn. If you’re looking for more information about SRE at LinkedIn, I’ve spoken on that several times including at SREcon Asia this past. There’s also an excellent series written by one of my colleagues, Ben Purgason along with our former head of engineering David Henke called “Every Day is Monday in Operations”. And if you’re interested in learning more about Apache Kafka, there are many resources available and previous talks I have done. When it comes to monitoring, there’s a talk that I’ve given that echoes a lot of what you’ve heard here, applied to Kafka. You can view the talk from Kafka Summit London earlier this year. I’ll also be giving this talk next week at Kafka Summit SF, if you happen to be going. And as always, if you have any questions you can feel free to connect with me and ask. I’m on LinkedIn, of course, and you can find me on Twitter as well.

Editor's Notes