1
Metrics Are Not Enough
Gwen Shapira, Product Manager
@gwenshap
Monitoring Apache Kafka and Streaming Applications
2
Monitoring Distributed Systems is hard
“Google SRE team with 10–12 members
typically has one or sometimes two members
whose primary assignment is to build and maintain
monitoring systems for their service.”
https://www.oreilly.com/ideas/monitoring-distributed-systems
3
Apache Kafka is a distributed system and has many components
4
Many Moving Parts to Watch
• Producers
• Consumers
• Consumer Groups
• Brokers
• Controller
• Zookeeper
• Topics
• Partitions
• Messages
• …
5
And many metrics to monitor
• Broker throughput
• Topic throughput
• Disk utilization
• Unclean leader elections
• Network pool usage
• Request pool usage
• Request latencies – 30 request types, 5 phases each
• Topic partition status counts: online, under-replicated, offline
• Log flush rates
• ZK disconnects
• Garbage collection pauses
• Message delivery
• Consumer groups reading from topics
• …
6
Every Service that uses Kafka is a Distributed System
[Diagram: Orders Service, Stock Service, Fulfilment Service, Fraud Detection Service, Mobile App, Kafka]
7
It is all CRITICAL to your business
• Real-time applications mean very little room for errors
• Is Kafka available and performing well? You need to know before your users do.
• You must detect and act on small problems before they escalate
• The business cares a lot about accuracy and SLAs
• It is 8:05am, does the dashboard reflect the status of the system up to 8am?
• Continuously improve performance
• Monitor Kafka cluster performance
• Identify and act on leading indicators of future problems
• Quick triage – can you identify likely causes of a problem quickly and effectively?
8
So you may need a bit of help
• Operators must have visibility into the health of the Kafka cluster
• The business must have visibility into completeness and latency of message delivery
• Everyone needs to focus on the most meaningful metrics
9
Types of monitoring
• Tailing logs
• OS metrics
• Kafka / Client metrics
• Tracing applications
• Event level sampling
• APM – Application performance from user perspective
• …
11
Monitor System Health of Your Cluster
12
The basics
• Whatever else you do: Check that the broker process is running
• External agent
• Or alert on stale metrics
• Don’t alert on everything. Fewer, high-level alerts are better.
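The "alert on stale metrics" option can be sketched as a freshness check: if a broker has not reported any metric for a while, treat it as possibly down. This is a minimal sketch, not a definitive implementation. The function name, the `last_report_ts` mapping, and the 120-second cutoff are all assumptions chosen for illustration.

```python
import time


def stale_brokers(last_report_ts, now=None, max_age_s=120.0):
    """Return ids of brokers whose newest metric sample is older than max_age_s.

    last_report_ts maps a broker id to the Unix timestamp (seconds) of the
    last metric sample received from it. Both the mapping and the default
    cutoff are hypothetical; pick a cutoff that suits your scrape interval.
    """
    now = time.time() if now is None else now
    return sorted(b for b, ts in last_report_ts.items() if now - ts > max_age_s)


# Hypothetical data: broker 3 stopped reporting five minutes ago.
now = 1_000_000.0
last_seen = {1: now - 10, 2: now - 45, 3: now - 300}
print(stale_brokers(last_seen, now=now))
```

A single alert on "broker has gone silent" covers a crashed process, a hung JVM, and a broken metrics pipeline, which fits the advice to prefer fewer, high-level alerts.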
13
First Things First
14
Under-replicated partitions
• If you can monitor just one thing…
• Is it a specific broker?
• Cluster wide:
• Out of resources
• Imbalance
• Broker:
• Hardware
• Noisy neighbor
• Configuration
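The "is it a specific broker?" question can be sketched in code: for each partition, the replicas that are assigned but missing from the in-sync replica set (ISR) are the ones lagging, and counting them per broker shows whether one broker or many are involved. The input layout (a list of dicts with `replicas` and `isr`) and the rule "one broker means broker-specific" are simplifying assumptions for illustration, not how any particular tool decides.

```python
from collections import Counter


def out_of_sync_by_broker(partitions):
    """Count, per broker, the replicas it hosts that are missing from the ISR."""
    missing = Counter()
    for p in partitions:
        for broker in set(p["replicas"]) - set(p["isr"]):
            missing[broker] += 1
    return missing


def diagnose(partitions):
    """Rough triage: no lagging replicas, one lagging broker, or several."""
    missing = out_of_sync_by_broker(partitions)
    if not missing:
        return "healthy"
    if len(missing) == 1:
        (broker,) = missing
        return f"broker-specific: broker {broker}"
    return "cluster-wide: " + ", ".join(f"broker {b}" for b in sorted(missing))


# Hypothetical cluster state: every under-replicated partition is missing broker 2.
state = [
    {"topic": "orders", "partition": 0, "replicas": [1, 2, 3], "isr": [1, 3]},
    {"topic": "orders", "partition": 1, "replicas": [2, 3, 1], "isr": [3, 1]},
    {"topic": "stock", "partition": 0, "replicas": [3, 1], "isr": [3, 1]},
]
print(diagnose(state))
```

If a single broker shows up, look at its hardware, neighbors, and configuration; if many do, suspect resources or imbalance across the cluster.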
15
Drill Down into Broker and Topic: Do we see a problem right here?
16
Check partition placement - is the issue specific to one broker?
17
Don’t watch the dashboard
• Control Center detects anomalous events in monitoring data
• Users can define triggers
• Control Center performs customizable actions when triggers occur
• When troubleshooting Kafka issues, users can view previous alerts and historical message delivery data at the time the alert occurred
18
Capacity Planning – Be Proactive
• Capacity planning ensures that your cluster can continue to meet business demands
• Control Center provides indicators if a cluster may need more brokers
• Key metrics that indicate a cluster is near capacity:
• CPU
• Network and thread pool usage
• Request latencies
• Network utilization - Throughput, per broker and per cluster
• Disk utilization - Disk space used by all log segments, per broker
19
Multi-Cluster Deployments
• Monitor all clusters in one place
20
Monitor End to End Message Delivery
21
Are You Meeting SLAs?
• Stream monitoring helps you determine if all messages are delivered end-to-end in a timely manner
• This is important for several reasons:
• Ensure producers and consumers are not losing messages
• Check if consumers are consuming more than expected
• Verify low latency for real-time applications
• Identify slow consumers
22
How to monitor?
The infamous LinkedIn “Audit”:
• Count messages when they are produced
• Count messages when they are consumed
• Check timestamps when they are consumed
• Compare the results
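The four audit steps above can be sketched as a small reconciliation over time windows. This is only an illustration of the idea, not LinkedIn's or Control Center's implementation. It assumes every message carries its produce timestamp in milliseconds and that counts are grouped into one-minute windows by that timestamp; all function names are hypothetical.

```python
from collections import Counter


def bucket(ts_ms, window_ms=60_000):
    """Start of the time window that a timestamp falls into."""
    return ts_ms - ts_ms % window_ms


def audit(produced_ts, consumed, window_ms=60_000):
    """Compare produced and consumed counts per window.

    produced_ts: produce timestamps (ms), one per message sent.
    consumed: (produce_ts_ms, consume_ts_ms) pairs, one per message read.
    Returns {window_start: (produced, consumed, max_latency_ms)}.
    """
    produced = Counter(bucket(t, window_ms) for t in produced_ts)
    consumed_counts = Counter()
    max_latency = {}
    for p_ts, c_ts in consumed:
        w = bucket(p_ts, window_ms)
        consumed_counts[w] += 1
        max_latency[w] = max(max_latency.get(w, 0), c_ts - p_ts)
    windows = sorted(set(produced) | set(consumed_counts))
    return {w: (produced[w], consumed_counts[w], max_latency.get(w)) for w in windows}


def status(produced, consumed):
    """Label a window by comparing the two counts."""
    if consumed < produced:
        return "under-consumed"
    if consumed > produced:
        return "over-consumed"
    return "ok"


# Hypothetical data: one message is never read, another is read twice.
sent = [1_000, 2_000, 3_000, 61_000, 62_000]
read = [(1_000, 1_200), (2_000, 2_900), (61_000, 61_100),
        (62_000, 62_400), (62_000, 62_450)]
for window, (p, c, latency) in audit(sent, read).items():
    print(window, p, c, latency, status(p, c))
```

Grouping by produce timestamp rather than arrival time means a late message is still counted against the window it belongs to, so a slow consumer shows up as latency rather than as a false loss.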
23
Message delivery metrics
Streaming message delivery metrics are available:
• Aggregate
• Per-consumer group
• Per-topic
24
Under Consumption
• Reasons for under consumption:
• Producers not handling errors and retries correctly
• Misbehaving consumers, for example a consumer that did not follow the shutdown sequence
• Real-time apps intentionally skipping messages
• Red bars indicate some messages were not consumed
• Herringbone pattern can indicate error in measurement
• Usually improper shutdown of client
25
Over Consumption
• Reasons for over consumption
• Consumers may be processing a set of messages more than once, which may affect their applications
• Consumption bars are higher than the expected consumption lines
• Latency may be higher
26
Slow Consumers
• Identify consumers and consumer groups that are not keeping up with data production
• Use the per-consumer and per-consumer group metrics
• Compare a slow, lagging consumer (left) to a good consumer (right)
• The slow consumer (left) is processing all the messages, but with high latency
• Slow consumers may also process fewer messages in a given time window, so monitor "Expected consumption" (the top line)
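A complementary signal for "not keeping up" is consumer lag. The slides do not name it, so treat this as a swapped-in technique. For each partition, lag is the log end offset minus the group's committed offset, and a group whose total lag keeps growing is falling behind. The sketch below assumes the offsets have already been fetched into plain dictionaries keyed by (topic, partition); the function names are hypothetical.

```python
def partition_lag(end_offsets, committed_offsets):
    """Lag per partition: log end offset minus the group's committed offset."""
    return {tp: end_offsets[tp] - committed_offsets[tp] for tp in end_offsets}


def is_falling_behind(total_lag_samples):
    """True if total lag grew between every pair of consecutive samples."""
    pairs = zip(total_lag_samples, total_lag_samples[1:])
    return len(total_lag_samples) >= 2 and all(b > a for a, b in pairs)


# Hypothetical offsets for one consumer group on a two-partition topic.
end = {("orders", 0): 1_500, ("orders", 1): 1_480}
committed = {("orders", 0): 1_450, ("orders", 1): 900}
lag = partition_lag(end, committed)
print(lag, sum(lag.values()))

# Hypothetical total-lag samples taken a minute apart.
print(is_falling_behind([120, 340, 630]))
```

A large but stable lag may only mean the consumer started late; it is steady growth that says the consumer cannot keep up with production.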
27
Optimize Performance
28
Identify Performance Bottlenecks
• Real-time applications require high throughput or low latency
• Need to baseline where you are
• Monitor for changes to get ahead of the problem
• You may need to identify performance bottlenecks
• Break down the times for the end-to-end dataflow to see where streams spend the most processing time
• The key metrics to look at include:
• Request latencies
• Network pool usage
• Request pool usage
29
Produce and Fetch Request Latencies
Break down produce and fetch latencies through the entire request lifecycle
Request latency values can be shown at the median, 95th, 99th, or 99.9th percentile
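To make the percentile views concrete, here is a minimal sketch that computes them from raw latency samples using the nearest-rank method. Monitoring tools typically use streaming approximations instead, so this is an illustration of what the numbers mean rather than how any product computes them. The sample values are made up.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical produce latencies in milliseconds: mostly fast, a few slow.
latencies_ms = [3, 4, 4, 5, 5, 5, 6, 6, 7, 8, 9, 12, 15, 22, 40, 95, 180, 400]
for pct in (50, 95, 99, 99.9):
    print(pct, percentile(latencies_ms, pct))
```

The median describes a typical request, while the high percentiles expose the slow tail that an average would hide.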
30
Request Latencies Explained (1)
• Total request latency (center)
• Total time of an entire request lifecycle, from the broker point of view
• Request queue
• The time the request is in the request queue waiting for an IO thread
• A high value can indicate there are not enough IO threads or CPU is a bottleneck
• Also check: What are those IO threads doing?
• Request local
• The time the request is being processed locally by the leader
• A high value can imply slow disk so monitor broker disk IO
31
Request Latencies Explained (2)
• Response remote
• The time the request is waiting on other brokers
• Higher times are expected on high-reliability or high-throughput systems
• A high value can indicate a slow network connection, or the consumer is caught up to the end of the log
• Response queue
• The time the request is in the response queue waiting for a network thread
• A high value can imply there are not enough network threads
• Response send
• The time the request is being sent back to the consumer
• A high value can imply the CPU or network is a bottleneck
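The breakdown on the last two slides can be read as a lookup: find the phase that dominates total latency, then check the likely cause listed for it. The sketch below encodes that reading. The dictionary keys are hypothetical names for the five phases, and the hint strings paraphrase the slides; nothing here is an API of Kafka or Control Center.

```python
# Hypothetical phase names mapped to the likely causes given on the slides.
HINTS = {
    "request_queue": "not enough IO threads, or CPU is a bottleneck",
    "request_local": "possibly a slow disk; monitor broker disk IO",
    "response_remote": "waiting on other brokers: slow network, or the "
                       "consumer is caught up to the end of the log",
    "response_queue": "not enough network threads",
    "response_send": "CPU or network is a bottleneck",
}


def dominant_phase(breakdown_ms):
    """Return (phase, share_of_total, hint) for the slowest phase."""
    phase = max(HINTS, key=lambda p: breakdown_ms.get(p, 0.0))
    total = sum(breakdown_ms.get(p, 0.0) for p in HINTS)
    share = breakdown_ms.get(phase, 0.0) / total if total else 0.0
    return phase, share, HINTS[phase]


# Hypothetical 99th percentile breakdown for produce requests, in milliseconds.
produce_p99 = {
    "request_queue": 2.0,
    "request_local": 40.0,
    "response_remote": 5.0,
    "response_queue": 1.0,
    "response_send": 2.0,
}
phase, share, hint = dominant_phase(produce_p99)
print(f"{phase}: {share:.0%} of total; {hint}")
```

The hint is a starting point for triage, not a diagnosis: a high value in one phase narrows where to look next.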
32
Network and Request Handler Threads
• Network pool usage
• Average network pool capacity usage across all brokers, i.e. the fraction of time the network processor threads are not idle
• If network pool usage is above 70%, isolate the bottleneck with the request latency breakdown
• Consider increasing the broker configuration parameter num.network.threads, especially if the Response queue metric is high and you have spare resources
• Request pool usage
• Average request handler capacity usage across all brokers, i.e. the fraction of time the request handler threads are not idle
• If request pool usage is above 70%, isolate the bottleneck with the request latency breakdown
• Consider increasing the broker configuration parameter num.io.threads, especially if the Request queue metric is high
• Why are all your handlers busy? Check GC, access patterns and disk IO
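The 70% rule can be sketched as a simple check. It assumes the inputs are average idle ratios between 0 and 1, which is how Kafka brokers report NetworkProcessorAvgIdlePercent and RequestHandlerAvgIdlePercent, so usage is one minus the idle ratio. The function names and message wording are hypothetical.

```python
def pool_usage(avg_idle_ratio):
    """Usage is the fraction of time the threads are not idle."""
    return 1.0 - avg_idle_ratio


def advise(network_idle, handler_idle, threshold=0.70):
    """Return next steps for any pool whose usage exceeds the threshold."""
    advice = []
    if pool_usage(network_idle) > threshold:
        advice.append("network pool busy: check Response queue latency, "
                      "then consider raising num.network.threads")
    if pool_usage(handler_idle) > threshold:
        advice.append("request pool busy: check Request queue latency, "
                      "then consider raising num.io.threads")
    return advice


# Hypothetical readings: network threads 55% idle, request handlers 20% idle.
for line in advise(network_idle=0.55, handler_idle=0.20):
    print(line)
```

Adding threads only helps when the machine has spare CPU, so look at what the busy threads are doing (GC, access patterns, disk IO) before changing the configuration.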
33
Summary
34
A few things to remember…
• Monitor Kafka
• Work with your developers to monitor critical applications end-to-end
• More data is better: Metrics + logs + OS + APM + …
• But fewer alerts are better
• Alert on what’s important – Under-Replicated Partitions is a good start
• DON’T JUST FIDDLE WITH STUFF
• AND DON’T RESTART KAFKA FOR LOLS
• If you don’t know what you are doing, it is ok. There’s support (and Cloud) for that.
35
And as you start your Production Kafka Journey…
Plan
Validate
Deploy
Observe
Analyze
36
Thank You!

More Related Content

What's hot

Kafka At Scale in the Cloud
Kafka At Scale in the CloudKafka At Scale in the Cloud
Kafka At Scale in the Cloudconfluent
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache KafkaJeff Holoman
 
MySQL For Oracle Developers
MySQL For Oracle DevelopersMySQL For Oracle Developers
MySQL For Oracle DevelopersRonald Bradford
 
Rabbitmq & Kafka Presentation
Rabbitmq & Kafka PresentationRabbitmq & Kafka Presentation
Rabbitmq & Kafka PresentationEmre Gündoğdu
 
Nfs protocol sequence_diagram
Nfs protocol sequence_diagramNfs protocol sequence_diagram
Nfs protocol sequence_diagramRoberto Castro
 
Streaming Data and Stream Processing with Apache Kafka
Streaming Data and Stream Processing with Apache KafkaStreaming Data and Stream Processing with Apache Kafka
Streaming Data and Stream Processing with Apache Kafkaconfluent
 
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 20190-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019confluent
 
Kafka Tutorial - introduction to the Kafka streaming platform
Kafka Tutorial - introduction to the Kafka streaming platformKafka Tutorial - introduction to the Kafka streaming platform
Kafka Tutorial - introduction to the Kafka streaming platformJean-Paul Azar
 
Apache Kafka at LinkedIn
Apache Kafka at LinkedInApache Kafka at LinkedIn
Apache Kafka at LinkedInGuozhang Wang
 
Building Real-time Pipelines with FLaNK_ A Case Study with Transit Data
Building Real-time Pipelines with FLaNK_ A Case Study with Transit DataBuilding Real-time Pipelines with FLaNK_ A Case Study with Transit Data
Building Real-time Pipelines with FLaNK_ A Case Study with Transit DataTimothy Spann
 
From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...
From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...
From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...confluent
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Jean-Paul Azar
 
Tuning kafka pipelines
Tuning kafka pipelinesTuning kafka pipelines
Tuning kafka pipelinesSumant Tambe
 
Kafka Overview
Kafka OverviewKafka Overview
Kafka Overviewiamtodor
 
Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...
Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...
Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...NoSQLmatters
 
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)Kai Wähner
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoophadooparchbook
 

What's hot (20)

Kafka At Scale in the Cloud
Kafka At Scale in the CloudKafka At Scale in the Cloud
Kafka At Scale in the Cloud
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
MySQL For Oracle Developers
MySQL For Oracle DevelopersMySQL For Oracle Developers
MySQL For Oracle Developers
 
Rabbitmq & Kafka Presentation
Rabbitmq & Kafka PresentationRabbitmq & Kafka Presentation
Rabbitmq & Kafka Presentation
 
Nfs protocol sequence_diagram
Nfs protocol sequence_diagramNfs protocol sequence_diagram
Nfs protocol sequence_diagram
 
Streaming Data and Stream Processing with Apache Kafka
Streaming Data and Stream Processing with Apache KafkaStreaming Data and Stream Processing with Apache Kafka
Streaming Data and Stream Processing with Apache Kafka
 
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 20190-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
 
Mongodb replication
Mongodb replicationMongodb replication
Mongodb replication
 
Parquet overview
Parquet overviewParquet overview
Parquet overview
 
Kafka Tutorial - introduction to the Kafka streaming platform
Kafka Tutorial - introduction to the Kafka streaming platformKafka Tutorial - introduction to the Kafka streaming platform
Kafka Tutorial - introduction to the Kafka streaming platform
 
Apache Kafka at LinkedIn
Apache Kafka at LinkedInApache Kafka at LinkedIn
Apache Kafka at LinkedIn
 
Building Real-time Pipelines with FLaNK_ A Case Study with Transit Data
Building Real-time Pipelines with FLaNK_ A Case Study with Transit DataBuilding Real-time Pipelines with FLaNK_ A Case Study with Transit Data
Building Real-time Pipelines with FLaNK_ A Case Study with Transit Data
 
From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...
From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...
From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Tuning kafka pipelines
Tuning kafka pipelinesTuning kafka pipelines
Tuning kafka pipelines
 
Kafka Overview
Kafka OverviewKafka Overview
Kafka Overview
 
Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...
Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...
Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...
 
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoop
 

Similar to Metrics Are Not Enough: Monitoring Apache Kafka and Streaming Applications

Monitoring Apache Kafka
Monitoring Apache KafkaMonitoring Apache Kafka
Monitoring Apache Kafkaconfluent
 
Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)
Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)
Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)Ontico
 
OnPrem Monitoring.pdf
OnPrem Monitoring.pdfOnPrem Monitoring.pdf
OnPrem Monitoring.pdfTarekHamdi8
 
Application Performance Management
Application Performance ManagementApplication Performance Management
Application Performance ManagementNoriaki Tatsumi
 
Resilience Planning & How the Empire Strikes Back
Resilience Planning & How the Empire Strikes BackResilience Planning & How the Empire Strikes Back
Resilience Planning & How the Empire Strikes BackC4Media
 
URP? Excuse You! The Three Kafka Metrics You Need to Know
URP? Excuse You! The Three Kafka Metrics You Need to KnowURP? Excuse You! The Three Kafka Metrics You Need to Know
URP? Excuse You! The Three Kafka Metrics You Need to KnowTodd Palino
 
URP? Excuse You! The Three Metrics You Have to Know
URP? Excuse You! The Three Metrics You Have to Know URP? Excuse You! The Three Metrics You Have to Know
URP? Excuse You! The Three Metrics You Have to Know confluent
 
Multi Layer Monitoring V1
Multi Layer Monitoring V1Multi Layer Monitoring V1
Multi Layer Monitoring V1Lahav Savir
 
Performance tuning Grails applications SpringOne 2GX 2014
Performance tuning Grails applications SpringOne 2GX 2014Performance tuning Grails applications SpringOne 2GX 2014
Performance tuning Grails applications SpringOne 2GX 2014Lari Hotari
 
Citi Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and PerformanceCiti Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and Performanceconfluent
 
Introduction to dev ops
Introduction to dev opsIntroduction to dev ops
Introduction to dev opsLen Bass
 
Production Ready Microservices at Scale
Production Ready Microservices at ScaleProduction Ready Microservices at Scale
Production Ready Microservices at ScaleRajeev Bharshetty
 
Visibility-from web application interface to the database
Visibility-from web application interface to the databaseVisibility-from web application interface to the database
Visibility-from web application interface to the databaseManageEngine, Zoho Corporation
 
Adding Real-time Features to PHP Applications
Adding Real-time Features to PHP ApplicationsAdding Real-time Features to PHP Applications
Adding Real-time Features to PHP ApplicationsRonny López
 
Tokyo AK Meetup Speedtest - Share.pdf
Tokyo AK Meetup Speedtest - Share.pdfTokyo AK Meetup Speedtest - Share.pdf
Tokyo AK Meetup Speedtest - Share.pdfssuser2ae721
 
Nagios Conference 2007 | Enterprise Application Monitoring with Nagios by Jam...
Nagios Conference 2007 | Enterprise Application Monitoring with Nagios by Jam...Nagios Conference 2007 | Enterprise Application Monitoring with Nagios by Jam...
Nagios Conference 2007 | Enterprise Application Monitoring with Nagios by Jam...NETWAYS
 
Fixing Domino Server Sickness
Fixing Domino Server SicknessFixing Domino Server Sickness
Fixing Domino Server SicknessGabriella Davis
 

Similar to Metrics Are Not Enough: Monitoring Apache Kafka and Streaming Applications (20)

Monitoring Apache Kafka
Monitoring Apache KafkaMonitoring Apache Kafka
Monitoring Apache Kafka
 
Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)
Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)
Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)
 
OnPrem Monitoring.pdf
OnPrem Monitoring.pdfOnPrem Monitoring.pdf
OnPrem Monitoring.pdf
 
Application Performance Management
Application Performance ManagementApplication Performance Management
Application Performance Management
 
Kafka at scale facebook israel
Kafka at scale   facebook israelKafka at scale   facebook israel
Kafka at scale facebook israel
 
Resilience Planning & How the Empire Strikes Back
Resilience Planning & How the Empire Strikes BackResilience Planning & How the Empire Strikes Back
Resilience Planning & How the Empire Strikes Back
 
URP? Excuse You! The Three Kafka Metrics You Need to Know
URP? Excuse You! The Three Kafka Metrics You Need to KnowURP? Excuse You! The Three Kafka Metrics You Need to Know
URP? Excuse You! The Three Kafka Metrics You Need to Know
 
URP? Excuse You! The Three Metrics You Have to Know
URP? Excuse You! The Three Metrics You Have to Know URP? Excuse You! The Three Metrics You Have to Know
URP? Excuse You! The Three Metrics You Have to Know
 
Multi Layer Monitoring V1
Multi Layer Monitoring V1Multi Layer Monitoring V1
Multi Layer Monitoring V1
 
ADF Performance Monitor
ADF Performance MonitorADF Performance Monitor
ADF Performance Monitor
 
10135 b 11
10135 b 1110135 b 11
10135 b 11
 
Performance tuning Grails applications SpringOne 2GX 2014
Performance tuning Grails applications SpringOne 2GX 2014Performance tuning Grails applications SpringOne 2GX 2014
Performance tuning Grails applications SpringOne 2GX 2014
 
Citi Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and PerformanceCiti Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and Performance
 
Introduction to dev ops
Introduction to dev opsIntroduction to dev ops
Introduction to dev ops
 
Production Ready Microservices at Scale
Production Ready Microservices at ScaleProduction Ready Microservices at Scale
Production Ready Microservices at Scale
 
Visibility-from web application interface to the database
Visibility-from web application interface to the databaseVisibility-from web application interface to the database
Visibility-from web application interface to the database
 
Adding Real-time Features to PHP Applications
Adding Real-time Features to PHP ApplicationsAdding Real-time Features to PHP Applications
Adding Real-time Features to PHP Applications
 
Tokyo AK Meetup Speedtest - Share.pdf
Tokyo AK Meetup Speedtest - Share.pdfTokyo AK Meetup Speedtest - Share.pdf
Tokyo AK Meetup Speedtest - Share.pdf
 
Nagios Conference 2007 | Enterprise Application Monitoring with Nagios by Jam...
Nagios Conference 2007 | Enterprise Application Monitoring with Nagios by Jam...Nagios Conference 2007 | Enterprise Application Monitoring with Nagios by Jam...
Nagios Conference 2007 | Enterprise Application Monitoring with Nagios by Jam...
 
Fixing Domino Server Sickness
Fixing Domino Server SicknessFixing Domino Server Sickness
Fixing Domino Server Sickness
 

More from confluent

Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
Santander Stream Processing with Apache Flink
Santander Stream Processing with Apache FlinkSantander Stream Processing with Apache Flink
Santander Stream Processing with Apache Flinkconfluent
 
Unlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insightsUnlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insightsconfluent
 
Workshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con FlinkWorkshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con Flinkconfluent
 
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...confluent
 
AWS Immersion Day Mapfre - Confluent
AWS Immersion Day Mapfre   -   ConfluentAWS Immersion Day Mapfre   -   Confluent
AWS Immersion Day Mapfre - Confluentconfluent
 
Eventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalkEventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalkconfluent
 
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent CloudQ&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent Cloudconfluent
 
Citi TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep DiveCiti TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep Diveconfluent
 
Build real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with ConfluentBuild real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with Confluentconfluent
 
Q&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service MeshQ&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service Meshconfluent
 
Citi Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka MicroservicesCiti Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka Microservicesconfluent
 
Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3confluent
 
Citi Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging ModernizationCiti Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging Modernizationconfluent
 
Citi Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time dataCiti Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time dataconfluent
 
Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2confluent
 
Data In Motion Paris 2023
Data In Motion Paris 2023Data In Motion Paris 2023
Data In Motion Paris 2023confluent
 
Confluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with SynthesisConfluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with Synthesisconfluent
 
The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023confluent
 
The Playful Bond Between REST And Data Streams
The Playful Bond Between REST And Data StreamsThe Playful Bond Between REST And Data Streams
The Playful Bond Between REST And Data Streamsconfluent
 

More from confluent (20)

Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Santander Stream Processing with Apache Flink
Santander Stream Processing with Apache FlinkSantander Stream Processing with Apache Flink
Santander Stream Processing with Apache Flink
 
Unlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insightsUnlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insights
 
Workshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con FlinkWorkshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con Flink
 
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
 
AWS Immersion Day Mapfre - Confluent
AWS Immersion Day Mapfre   -   ConfluentAWS Immersion Day Mapfre   -   Confluent
AWS Immersion Day Mapfre - Confluent
 
Eventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalkEventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalk
 
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent CloudQ&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
 
Citi TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep DiveCiti TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep Dive
 
Build real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with ConfluentBuild real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with Confluent
 
Q&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service MeshQ&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service Mesh
 
Citi Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka MicroservicesCiti Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka Microservices
 
Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3
 
Citi Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging ModernizationCiti Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging Modernization
 
Citi Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time dataCiti Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time data
 
Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2
 
Data In Motion Paris 2023
Data In Motion Paris 2023Data In Motion Paris 2023
Data In Motion Paris 2023
 
Confluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with SynthesisConfluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with Synthesis
 
The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023
 
The Playful Bond Between REST And Data Streams
The Playful Bond Between REST And Data StreamsThe Playful Bond Between REST And Data Streams
The Playful Bond Between REST And Data Streams
 

Metrics Are Not Enough: Monitoring Apache Kafka and Streaming Applications

  • 1. Metrics Are Not Enough: Monitoring Apache Kafka and Streaming Applications
      • Gwen Shapira, Product Manager, @gwenshap
  • 2. Monitoring Distributed Systems is hard
      • “Google SRE team with 10–12 members typically has one or sometimes two members whose primary assignment is to build and maintain monitoring systems for their service.”
      • https://www.oreilly.com/ideas/monitoring-distributed-systems
  • 3. Apache Kafka is a distributed system and has many components
  • 4. Many Moving Parts to Watch
      • Producers
      • Consumers
      • Consumer Groups
      • Brokers
      • Controller
      • Zookeeper
      • Topics
      • Partitions
      • Messages
      • …
  • 5. And many metrics to monitor
      • Broker throughput
      • Topic throughput
      • Disk utilization
      • Unclean leader elections
      • Network pool usage
      • Request pool usage
      • Request latencies – 30 request types, 5 phases each
      • Topic partition status counts: online, under replicated, offline
      • Log flush rates
      • ZK disconnects
      • Garbage collection pauses
      • Message delivery
      • Consumer groups reading from topics
      • …
  • 6. Every Service that uses Kafka is a Distributed System
      • Orders Service
      • Stock Service
      • Fulfilment Service
      • Fraud Detection Service
      • Mobile App
      • Kafka
  • 7. It is all CRITICAL to your business
      • Real-time applications mean very little room for errors
      • Is Kafka available and performing well? You need to know before your users do.
      • You must detect and act on small problems before they escalate
      • The business cares a lot about accuracy and SLAs
      • It is 8:05am, does the dashboard reflect the status of the system up to 8am?
      • Continuously improve performance
      • Monitor Kafka cluster performance
      • Identify and act on leading indicators of future problems
      • Quick triage – can you identify likely causes of a problem quickly and effectively?
  • 8. So you may need a bit of help
      • Operators must have visibility into the health of the Kafka cluster
      • The business must have visibility into completeness and latency of message delivery
      • Everyone needs to focus on the most meaningful metrics
  • 9. Types of monitoring
      • Tailing logs
      • OS metrics
      • Kafka / Client metrics
      • Tracing applications
      • Event level sampling
      • APM – Application performance from user perspective
      • …
  • 11. Monitor System Health of Your Cluster
  • 12. The basics
      • Whatever else you do: Check that the broker process is running
          • External agent
          • Or alert on stale metrics
      • Don’t alert on everything. Fewer, high level alerts are better.
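The "alert on stale metrics" idea from the slide above can be sketched as follows. This is a minimal illustration, not part of the talk: the function name `stale_brokers`, the 60-second threshold, and the shape of the `last_seen` mapping (broker id to the Unix timestamp of its newest metric sample) are all assumptions.

```python
import time
from typing import Dict, List, Optional


def stale_brokers(last_seen: Dict[int, float],
                  max_age_s: float = 60.0,
                  now: Optional[float] = None) -> List[int]:
    """Return the ids of brokers whose newest metric sample is too old.

    last_seen maps a broker id to the Unix timestamp of the last metric
    sample received from it (hypothetical input; in practice it would come
    from whatever metrics store is in use). A broker that has stopped
    reporting is treated as possibly down, even if no explicit error
    was ever emitted.
    """
    if now is None:
        now = time.time()
    return sorted(b for b, ts in last_seen.items() if now - ts > max_age_s)
```

A single alert on a non-empty result is in keeping with the advice to prefer fewer, high level alerts.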
  • 14. Under-replicated partitions
      • If you can monitor just one thing…
      • Is it a specific broker?
      • Cluster wide:
          • Out of resources
          • Imbalance
      • Broker:
          • Hardware
          • Noisy neighbor
          • Configuration
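The "is it a specific broker?" question above can be approximated in code. The sketch below is an assumption-laden illustration: `triage_urp` is a hypothetical helper, and its input is a list of `(replicas, isr)` pairs, one per partition, such as could be read from topic metadata. A broker missing from the in-sync replica set of every under-replicated partition is flagged as a suspect; otherwise the problem is treated as cluster wide.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def triage_urp(partitions: Iterable[Tuple[Sequence[int], Sequence[int]]]
               ) -> Tuple[str, List[int]]:
    """Classify under-replication as 'healthy', 'broker' or 'cluster'.

    Each element is (assigned replica broker ids, in-sync replica broker ids).
    """
    lagging = Counter()  # broker id -> number of partitions it is lagging on
    urp = 0              # number of under-replicated partitions
    for replicas, isr in partitions:
        missing = set(replicas) - set(isr)
        if missing:
            urp += 1
            lagging.update(missing)
    if urp == 0:
        return "healthy", []
    # Brokers absent from the ISR of every under-replicated partition.
    suspects = sorted(b for b, n in lagging.items() if n == urp)
    if suspects:
        return "broker", suspects
    return "cluster", sorted(lagging)
```

A "broker" result points toward the hardware, noisy neighbor, or configuration checks; a "cluster" result points toward resource exhaustion or imbalance.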
  • 15. Drill Down into Broker and Topic: Do we see a problem right here?
  • 16. Check partition placement - is the issue specific to one broker?
  • 17. Don’t watch the dashboard
      • Control Center detects anomalous events in monitoring data
      • Users can define triggers
      • Control Center performs customizable actions when triggers occur
      • When troubleshooting Kafka issues, users can view previous alerts and historical message delivery data at the time the alert occurred
  • 18. Capacity Planning – Be Proactive
      • Capacity planning ensures that your cluster can continue to meet business demands
      • Control Center provides indicators if a cluster may need more brokers
      • Key metrics that indicate a cluster is near capacity:
          • CPU
          • Network and thread pool usage
          • Request latencies
          • Network utilization - Throughput, per broker and per cluster
          • Disk utilization - Disk space used by all log segments, per broker
  • 19. Multi-Cluster Deployments
      • Monitor all clusters in one place
  • 20. Monitor End to End Message Delivery
  • 21. Are You Meeting SLAs?
      • Stream monitoring helps you determine if all messages are delivered end-to-end in a timely manner
      • This is important for several reasons:
          • Ensure producers and consumers are not losing messages
          • Check if consumers are consuming more than expected
          • Verify low latency for real-time applications
          • Identify slow consumers
  • 22. How to monitor? The infamous LinkedIn “Audit”:
      • Count messages when they are produced
      • Count messages when they are consumed
      • Check timestamps when they are consumed
      • Compare the results
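The four audit steps above can be sketched in a few lines. This is an illustration under stated assumptions rather than LinkedIn's or Control Center's implementation: messages are grouped into one-minute windows by their produce timestamp, and the names `bucket` and `audit` as well as the input shapes are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def bucket(ts_ms: int, window_ms: int = 60_000) -> int:
    """Start of the time window that a timestamp (in ms) falls into."""
    return ts_ms - ts_ms % window_ms


def audit(produced_ts: Iterable[int],
          consumed: Iterable[Tuple[int, int]],
          window_ms: int = 60_000) -> Dict[int, Tuple[int, int, int, str]]:
    """Compare produced and consumed counts per time window.

    produced_ts: message timestamps recorded when messages are produced.
    consumed:    (message timestamp, time the consumer saw it) pairs.
    Returns {window start: (produced, consumed, max latency ms, status)}.
    """
    produced = Counter(bucket(t, window_ms) for t in produced_ts)
    consumed_count = Counter()
    max_latency: Dict[int, int] = {}
    for msg_ts, seen_ts in consumed:
        b = bucket(msg_ts, window_ms)
        consumed_count[b] += 1
        max_latency[b] = max(max_latency.get(b, 0), seen_ts - msg_ts)
    report = {}
    for b in sorted(set(produced) | set(consumed_count)):
        p, c = produced[b], consumed_count[b]
        status = "ok" if p == c else ("under" if c < p else "over")
        report[b] = (p, c, max_latency.get(b, 0), status)
    return report
```

Bucketing by the message's own timestamp, rather than by arrival time, is what lets the produced and consumed counts for the same window be compared directly.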
  • 23. Message delivery metrics
      • Streaming message delivery metrics are available:
          • Aggregate
          • Per-consumer group
          • Per-topic
  • 24. Under Consumption
      • Reasons for under consumption:
          • Producers not handling errors and retries correctly
          • Misbehaving consumers, perhaps the consumer did not follow shutdown sequence
          • Real-time apps intentionally skipping messages
      • Red bars indicate some messages were not consumed
      • Herringbone pattern can indicate error in measurement
          • Usually improper shutdown of client
  • 25. Over Consumption
      • Reasons for over consumption:
          • Consumers may be processing a set of messages more than once, which may have impact on their applications
      • Consumption bars are higher than the expected consumption lines
      • Latency may be higher
  • 26. Slow Consumers
      • Identify consumers and consumer groups that are not keeping up with data production
      • Use the per-consumer and per-consumer group metrics
      • Compare a slow, lagging consumer (left) to a good consumer (right)
      • The slow consumer (left) is processing all the messages, but with high latency
      • Slow consumers may also process fewer messages in a given time window, so monitor "Expected consumption" (the top line)
  • 28. Identify Performance Bottlenecks
      • Real-time applications require high throughput or low latency
      • Need to baseline where you are
      • Monitor for changes to get ahead of the problem
      • You may need to identify performance bottlenecks
      • Break down the times for the end-to-end dataflow to give you pointers where streams are taking the most processing time
      • The key metrics to look at include:
          • Request latencies
          • Network pool usage
          • Request pool usage
  • 29. Produce and Fetch Request Latencies
      • Break down produce and fetch latencies through the entire request lifecycle
      • Request latency values can be shown at the median, 95th, 99th, or 99.9th percentile
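To show why the higher percentiles matter alongside the median, here is a small sketch using the nearest-rank method. The `percentile` helper and the sample latencies are illustrative assumptions; Kafka's own metrics are computed differently (from sampled histograms), so treat this only as a way to read the numbers.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in ms: mostly fast, with two slow outliers.
latencies_ms = [10] * 98 + [500, 900]
summary = {p: percentile(latencies_ms, p) for p in (50, 95, 99, 99.9)}
```

With this sample the median and 95th percentile both stay at 10 ms, while the 99th and 99.9th percentiles expose the slow requests that an average or median would hide.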
  • 30. Request Latencies Explained (1)
      • Total request latency (center)
          • Total time of an entire request lifecycle, from the broker point of view
      • Request queue
          • The time the request is in the request queue waiting for an IO thread
          • A high value can indicate there are not enough IO threads or CPU is a bottleneck
          • Also check: What are those IO threads doing?
      • Request local
          • The time the request is being processed locally by the leader
          • A high value can imply slow disk so monitor broker disk IO
  • 31. Request Latencies Explained (2)
      • Response remote
          • The time the request is waiting on other brokers
          • Higher times are expected on high-reliability or high-throughput systems
          • A high value can indicate a slow network connection, or the consumer is caught up to the end of the log
      • Response queue
          • The time the request is in the response queue waiting for a network thread
          • A high value can imply there are not enough network threads
      • Response send
          • The time the request is being sent back to the consumer
          • A high value can imply the CPU or network is a bottleneck
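The five phases described on the two slides above lend themselves to a simple triage rule: find the phase that accounts for the largest share of the total and look up the likely cause. The sketch below assumes per-phase latencies are already available in milliseconds; the dictionary keys and the `dominant_phase` helper are hypothetical names, and the hints are paraphrased from the slides.

```python
from typing import Dict, Tuple

# Likely causes per phase, paraphrased from the slide text.
HINTS = {
    "request_queue": "not enough IO threads, or CPU is a bottleneck",
    "request_local": "slow disk; monitor broker disk IO",
    "response_remote": "slow network, or the consumer is caught up to the end of the log",
    "response_queue": "not enough network threads",
    "response_send": "CPU or network is a bottleneck",
}


def dominant_phase(phases_ms: Dict[str, float]) -> Tuple[str, float, str]:
    """Return (phase, share of total latency, hint) for the slowest phase."""
    unknown = set(phases_ms) - set(HINTS)
    if unknown:
        raise ValueError(f"unknown phases: {sorted(unknown)}")
    total = sum(phases_ms.values())
    if total <= 0:
        raise ValueError("total latency must be positive")
    name = max(phases_ms, key=phases_ms.get)
    return name, phases_ms[name] / total, HINTS[name]
```

For example, a request that spends 40 ms of a 50 ms total in `request_local` would be reported with a share of 0.8 and the slow-disk hint.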
  • 32. Network and Request Handler Threads
      • Network pool usage
          • Average network pool capacity usage across all brokers, i.e. the fraction of time the network processor threads are not idle
          • If network pool usage is above 70%, isolate bottleneck with the request latency breakdown
          • Consider increasing the broker configuration parameter num.network.threads, especially if Response queue metric is high and you have resources
      • Request pool usage
          • Average request handler capacity usage across all brokers, i.e. the fraction of time the request handler threads are not idle
          • If request pool usage is above 70%, isolate bottleneck with the request latency breakdown
          • Consider increasing the broker configuration parameter num.io.threads, especially if Request queue metric is high
          • Why are all your handlers busy? Check GC, access patterns and disk IO
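The 70% rule of thumb above can be written as a small check. This is a sketch under assumptions: `pool_advice` and `usage_from_idle` are hypothetical helpers, usage is expressed as a fraction between 0 and 1, and the idle-percent inputs are assumed to come from the broker's average idle metrics for network processors and request handlers.

```python
from typing import List


def usage_from_idle(avg_idle_fraction: float) -> float:
    """Pool usage is the fraction of time the threads are NOT idle."""
    return 1.0 - avg_idle_fraction


def pool_advice(network_pool_usage: float,
                request_pool_usage: float,
                threshold: float = 0.70) -> List[str]:
    """Return follow-up actions when a thread pool is busier than the threshold."""
    advice = []
    if network_pool_usage > threshold:
        advice.append("network pool busy: check the Response queue latency; "
                      "consider raising num.network.threads")
    if request_pool_usage > threshold:
        advice.append("request pool busy: check the Request queue latency; "
                      "consider raising num.io.threads")
    return advice
```

Adding threads only helps when the broker has spare CPU, so the advice strings point back at the request latency breakdown before suggesting a configuration change.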
  • 34. Few things to remember…
      • Monitor Kafka
      • Work with your developers to monitor critical applications end-to-end
      • More data is better: Metrics + logs + OS + APM + …
      • But fewer alerts are better
      • Alert on what’s important – Under-Replicated Partitions is a good start
      • DON’T JUST FIDDLE WITH STUFF
      • AND DON’T RESTART KAFKA FOR LOLS
      • If you don’t know what you are doing, it is ok. There’s support (and Cloud) for that.
  • 35. And as you start your Production Kafka Journey…
      • Plan
      • Validate
      • Deploy
      • Observe
      • Analyze