Troubleshooting Kafka Common Issues
Copyright 2020, Confluent, Inc. All rights reserved. This document may not be reproduced in any manner without the express written permission of Confluent, Inc.
Who am I?
● My name is BADAI AQRANDISTA
● I work for Confluent
● My job title is Technical Support Engineer
● My work is basically to “troubleshoot Kafka issues and fix them”
Kafka Overview
PRODUCER CONSUMER
Common Issues
● Under Replicated Partitions
● Consumer Rebalancing
Under Replicated Partitions (URP)
● Kafka organises its data into “topics”
● Each “topic” has one or more “partitions”
● Each “partition” has one or more “replicas”
● Each “replica” is hosted by one Kafka broker
● One “replica” acts as the “Leader” and the other “replicas” act as “Followers”
● All produce and consume requests go to the “Leader” replica
● “Followers” replicate data from the “Leader” continuously
● When a “Follower” is in sync with the “Leader”, the “Follower” is listed in the “In-Sync Replicas” (ISR)
● When the ISR count is less than the “replica” count, the “partition” is called an “Under Replicated Partition”
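The definition above can be sketched in a few lines of Python. This is only an illustration: the `Partition` record, the topic name "orders" and the sample assignment are hypothetical and are not part of any Kafka client API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Partition:
    """Hypothetical view of one partition's replica assignment."""
    topic: str
    number: int
    replicas: List[int]  # broker ids hosting a replica
    isr: List[int]       # broker ids currently in sync with the leader


def is_under_replicated(p: Partition) -> bool:
    # A partition is under replicated when fewer replicas are in sync
    # than are assigned to it.
    return len(p.isr) < len(p.replicas)


# Made-up sample data: partition 1 has lost broker 1 from its ISR.
partitions = [
    Partition("orders", 0, replicas=[1, 2, 3], isr=[1, 2, 3]),
    Partition("orders", 1, replicas=[2, 3, 1], isr=[2, 3]),
]

urp = [(p.topic, p.number) for p in partitions if is_under_replicated(p)]
print(urp)
```

On a real cluster the same comparison can be made by eye from the Replicas and Isr columns printed by `kafka-topics --describe`.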
Under Replicated Partitions (URP)
● JMX Metrics:
○ “kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions”
○ “kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount”
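The two metrics count different conditions. A rough sketch of the difference, assuming “min.insync.replicas” is 2 and using a made-up dictionary of partition states (the layout and names are hypothetical, not how the broker stores this information):

```python
MIN_INSYNC_REPLICAS = 2  # assumed topic/broker setting for this sketch

# Hypothetical partition states: name -> (assigned replicas, in-sync replicas)
state = {
    "p0": ([1, 2, 3], [1, 2, 3]),  # healthy
    "p1": ([1, 2, 3], [2, 3]),     # under replicated, still meets min ISR
    "p2": ([1, 2, 3], [3]),        # under replicated and below min ISR
}

# Counted by UnderReplicatedPartitions: ISR smaller than the replica list.
under_replicated = sorted(
    name for name, (replicas, isr) in state.items() if len(isr) < len(replicas)
)

# Counted by UnderMinIsrPartitionCount: ISR smaller than min.insync.replicas.
under_min_isr = sorted(
    name for name, (_, isr) in state.items() if len(isr) < MIN_INSYNC_REPLICAS
)

print(under_replicated)
print(under_min_isr)
```

Only the partitions in the second list reject writes from producers that use “acks=all”; a partition that is merely under replicated keeps accepting them.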
Under Replicated Partitions (URP)
Broker 1 Broker 2 Broker 3
P0 P0 P0
P1 P1 P1
Temporary URP - Definition and Symptoms
● URP that disappears by itself
● Can cause producer failure with NotEnoughReplicasException if:
○ Replication factor is 3
○ Topic or broker configuration “min.insync.replicas” is set to 2
○ “retries” is set to 0 - defaults to 2147483647 since AK 2.1.0
● Transient issue and should not affect client application
● Broker log contains a group of “Shrinking ISR” messages and then a group of “Expanding ISR” messages
[2020-11-18 19:19:13,751] INFO [Partition _confluent-metrics-4 broker=2] Shrinking ISR from 2,1,3 to 2,3. Leader: (highWatermark: 14459, endOffset: 14460). Out of sync replicas: (brokerId: 1, endOffset: 14459). (kafka.cluster.Partition)
[2020-11-18 19:19:13,754] INFO [Partition _confluent-metrics-4 broker=2] ISR updated to [2,3] and zkVersion updated to [7] (kafka.cluster.Partition)
[2020-11-18 19:19:13,786] INFO [Partition _confluent-metrics-4 broker=2] Expanding ISR from 2,3 to 2,3,1 (kafka.cluster.Partition)
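One way to tell a temporary URP from one that has not recovered is to check, per partition, whether the last ISR message in the broker log was a shrink or an expand. The sketch below is a minimal example and not a Confluent tool: the regular expression is an assumption based on the message format shown in this deck, the function name is hypothetical, and the sample lines are abbreviated copies of the deck's log messages.

```python
import re

# Pattern assumed from the "Shrinking ISR" / "Expanding ISR" messages above.
ISR_EVENT = re.compile(
    r"\[Partition (?P<partition>\S+) broker=\d+\] "
    r"(?P<event>Shrinking|Expanding) ISR from (?P<old>[\d,]+) to (?P<new>[\d,]+)"
)


def still_shrunk(log_lines):
    """Return the partitions whose most recent ISR event was a shrink."""
    last_event = {}
    for line in log_lines:
        match = ISR_EVENT.search(line)
        if match:
            last_event[match["partition"]] = match["event"]
    return sorted(p for p, event in last_event.items() if event == "Shrinking")


# Abbreviated sample lines in the broker log format.
log_lines = [
    "[2020-11-18 19:19:13,252] INFO [Partition _confluent-metrics-7 broker=2] "
    "Shrinking ISR from 2,1,3 to 2,3. (kafka.cluster.Partition)",
    "[2020-11-18 19:19:13,751] INFO [Partition _confluent-metrics-4 broker=2] "
    "Shrinking ISR from 2,1,3 to 2,3. (kafka.cluster.Partition)",
    "[2020-11-18 19:19:13,786] INFO [Partition _confluent-metrics-4 broker=2] "
    "Expanding ISR from 2,3 to 2,3,1 (kafka.cluster.Partition)",
]

print(still_shrunk(log_lines))
```

Partitions that shrink and later expand drop out of the result; anything left in the list is still under replicated as far as this log excerpt shows.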
Temporary URP - Causes and Solutions
● Intermittent network latency between brokers
○ Confirm by doing ping for 24 hours: “ping -c 86400 {broker}”
○ Increase broker config: “replica.lag.time.max.ms” from 10000 (default in < AK 2.5) to
30000 (default in >= AK 2.5.0)
○ Related configs: “offsets.commit.timeout.ms” and “zookeeper.session.timeout.ms”
● Produce rate > Replication rate
○ “kafka.network:type=RequestMetrics,name=RequestsPerSec,request={Produce|FetchConsumer|FetchFollower}”
○ “kafka.network:type=RequestMetrics,name=RequestBytes,request={Produce|FetchConsumer|FetchFollower}”
○ Caused by high throughput producers using “acks=0” or “acks=1”
○ Set producer config: “acks=all” - This will force producers to align their throughput with replication throughput
○ Increase broker config: “num.replica.fetchers” from 1 (default) to 2 or 3
○ Use a faster host
Permanent URP - Definition and Symptoms
● URP that requires broker restart to resolve
● URP count stays non-zero continuously and sometimes increases
● Can cause producer failure with NotEnoughReplicasException if:
○ Replication factor is 3
○ Topic or broker configuration “min.insync.replicas” is set to 2
○ “retries” is set to 0 - defaults to 2147483647 since AK 2.1.0
● Should not affect client application except in the above case
● Broker log contains a group of “Shrinking ISR” messages without any “Expanding ISR” messages
[2020-11-18 19:19:13,252] INFO [Partition _confluent-metrics-7 broker=2] Shrinking ISR from 2,1,3 to 2,3. Leader: (highWatermark: 14373, endOffset: 14374). Out of sync replicas: (brokerId: 1, endOffset: 14373). (kafka.cluster.Partition)
[2020-11-18 19:19:13,257] INFO [Partition _confluent-metrics-7 broker=2] ISR updated to [2,3] and zkVersion updated to [7] (kafka.cluster.Partition)
Permanent URP - Causes and Solutions
● Disk issue (e.g. disk full or h/w failure)
○ Confirm at the disk level (with “df” or other commands)
○ If the disk is full, you can:
■ Remove old segment files on follower replicas
■ Delete files outside Kafka’s “log.dirs”
■ Expand the disk/filesystems
○ If the disk failed, you can:
■ Replace the disk
○ Data loss is possible if:
■ Replication factor is 1
■ Producer uses “acks=1” or “acks=0”
○ Solution: After fixing the disk issue, restart the broker.
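The disk-level check can also be scripted. The sketch below uses only the Python standard library; the directory list and the 90% threshold are assumptions for illustration, so substitute the directories from the broker's “log.dirs” setting.

```python
import shutil


def disk_usage_percent(path: str) -> float:
    """Return the percentage of used space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


# Hypothetical log directories; replace with the values of "log.dirs"
# from the broker's server.properties.
LOG_DIRS = ["."]
THRESHOLD_PERCENT = 90.0  # assumed alert threshold

for log_dir in LOG_DIRS:
    used = disk_usage_percent(log_dir)
    status = "WARNING" if used >= THRESHOLD_PERCENT else "ok"
    print(f"{log_dir}: {used:.1f}% used ({status})")
```

The figure can differ slightly from what “df” prints, because “df” accounts for blocks reserved by the filesystem.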
Permanent URP - Causes and Solutions
● ReplicaFetcherThread crashes due to a bug
○ KAFKA-6649
■ Error message: org.apache.kafka.common.errors.OffsetOutOfRangeException: Cannot increment the log start offset to 2098535 of partition testtopic-84 since it is larger than the high watermark -1
■ Error message (KAFKA-7635): kafka.common.UnexpectedAppendOffsetException: Unexpected offset in append to topic.a-0. First offset 3389 is less than the next offset 3395. First 10 offsets in append: List(3389, 3390, 3391, 3392, 3393, 3394, 3395, 3396, 3397, 3398), last offset in append: 4945. Log start offset = 3353
■ Fixed in AK 1.1.0, with the fix for KAFKA-3978
○ KAFKA-8255
■ Error message: org.apache.kafka.common.errors.OffsetOutOfRangeException: Cannot increment the log start offset to 4808819 of partition __consumer_offsets-46 since it is larger than the high watermark 18761
■ Fixed in AK 2.2.1, with the fix for KAFKA-8306
Permanent URP - Causes and Solutions
● ReplicaFetcherThread crashes due to a bug
○ Solution: Depends on the bug. But in general, deleting the follower replica’s segment files fixes the issue.
○ If the leader goes offline, you may need to accept some data loss by enabling “unclean.leader.election.enable=true” at the topic level and moving the controller
● Broker down
○ If a broker is down, you will see non-zero URP
○ Solution: Start the broker
Consumer Rebalancing
● KafkaConsumer goes through the following steps:
○ Join the consumer group and subscribe to one or more topics
○ Get topic partitions assigned; the partitions are distributed among all KafkaConsumer instances in the group
○ Start consuming by calling “poll”
Consumer Rebalancing
● When any of the following happens, the topic partitions have to be reassigned among the KafkaConsumer instances in the group at that time; this is called “Rebalancing”
○ New KafkaConsumer joins the consumer group
○ Existing KafkaConsumer leaves the consumer group
○ Existing KafkaConsumer does not send any heartbeat
○ Existing KafkaConsumer spends too much time between “poll” calls
● Existing KafkaConsumer does not send any heartbeat
○ “heartbeat.interval.ms” - defaults to 3000 ms
○ “session.timeout.ms” - defaults to 10000 ms (45000 ms since AK 3.0)
○ Need to ensure that the coordinator receives at least one heartbeat message within the session timeout
○ May need to increase “session.timeout.ms” if the connection to the broker is bad
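The relationship between the two heartbeat settings is simple arithmetic. A small sketch follows, with hypothetical function names; the one-third rule it checks is the general guideline from the Kafka consumer configuration documentation, not something stated on the slide.

```python
def heartbeats_per_session(heartbeat_interval_ms: int, session_timeout_ms: int) -> int:
    """Number of heartbeats a consumer attempts within one session timeout."""
    return session_timeout_ms // heartbeat_interval_ms


def follows_one_third_guideline(heartbeat_interval_ms: int, session_timeout_ms: int) -> bool:
    # Guideline: heartbeat.interval.ms should be no higher than
    # one third of session.timeout.ms.
    return heartbeat_interval_ms * 3 <= session_timeout_ms


# Defaults quoted above: 3000 ms heartbeat interval, 10000 ms session timeout.
print(heartbeats_per_session(3000, 10000))
print(follows_one_third_guideline(3000, 10000))
```

With the defaults quoted above, the consumer gets three heartbeat attempts per session timeout, so it can miss two in a row before the coordinator removes it from the group.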
Consumer Rebalancing
● Existing KafkaConsumer spends too much time between “poll” calls
○ “max.poll.records” - defaults to 500
○ “max.poll.interval.ms” - defaults to 300000 ms (5 minutes)
○ This must always be true to avoid rebalancing:
“max.poll.records” x “processing time per record” < “max.poll.interval.ms”
● So your application must control the processing time per record. If you cannot control this, the group can go into a rebalancing state continuously.
● If you do not know how long it takes to process each record, e.g. when calling a REST API that may fail and need to be retried indefinitely, you can call the “KafkaConsumer#pause” method and then keep calling “poll()” regularly. While the partitions are paused, “poll()” returns no records. Once you are ready to process the next record, call “KafkaConsumer#resume” and then “poll()”.
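The inequality above gives a per-record time budget. A minimal sketch with hypothetical function names, using the default values quoted on this slide:

```python
def max_processing_ms_per_record(max_poll_records: int, max_poll_interval_ms: int) -> float:
    """Upper bound on the average processing time per record, in milliseconds."""
    return max_poll_interval_ms / max_poll_records


def fits_poll_interval(max_poll_records: int, per_record_ms: float,
                       max_poll_interval_ms: int) -> bool:
    # max.poll.records x processing time per record < max.poll.interval.ms
    return max_poll_records * per_record_ms < max_poll_interval_ms


# Defaults: max.poll.records=500, max.poll.interval.ms=300000
print(max_processing_ms_per_record(500, 300000))  # 600.0
print(fits_poll_interval(500, 700, 300000))       # False
```

With the defaults, each record has to be processed in under 600 ms on average. If that is not achievable, lowering “max.poll.records” or raising “max.poll.interval.ms” restores the inequality.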
Thank you!
Any questions?

More Related Content

More from confluent

Citi TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep DiveCiti TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep Diveconfluent
 
Build real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with ConfluentBuild real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with Confluentconfluent
 
Q&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service MeshQ&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service Meshconfluent
 
Citi Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka MicroservicesCiti Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka Microservicesconfluent
 
Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3confluent
 
Citi Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging ModernizationCiti Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging Modernizationconfluent
 
Citi Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time dataCiti Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time dataconfluent
 
Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2confluent
 
Data In Motion Paris 2023
Data In Motion Paris 2023Data In Motion Paris 2023
Data In Motion Paris 2023confluent
 
Confluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with SynthesisConfluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with Synthesisconfluent
 
The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023confluent
 
The Playful Bond Between REST And Data Streams
The Playful Bond Between REST And Data StreamsThe Playful Bond Between REST And Data Streams
The Playful Bond Between REST And Data Streamsconfluent
 
The Journey to Data Mesh with Confluent
The Journey to Data Mesh with ConfluentThe Journey to Data Mesh with Confluent
The Journey to Data Mesh with Confluentconfluent
 
Citi Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and PerformanceCiti Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and Performanceconfluent
 
Confluent Partner Tech Talk with Reply
Confluent Partner Tech Talk with ReplyConfluent Partner Tech Talk with Reply
Confluent Partner Tech Talk with Replyconfluent
 
Citi Tech Talk Disaster Recovery Solutions Deep Dive
Citi Tech Talk  Disaster Recovery Solutions Deep DiveCiti Tech Talk  Disaster Recovery Solutions Deep Dive
Citi Tech Talk Disaster Recovery Solutions Deep Diveconfluent
 
Citi Tech Talk: Hybrid Cloud
Citi Tech Talk: Hybrid CloudCiti Tech Talk: Hybrid Cloud
Citi Tech Talk: Hybrid Cloudconfluent
 
Partner Tech Talk Q3: Q&A with PS - Migration and Upgrade
Partner Tech Talk Q3: Q&A with PS - Migration and UpgradePartner Tech Talk Q3: Q&A with PS - Migration and Upgrade
Partner Tech Talk Q3: Q&A with PS - Migration and Upgradeconfluent
 
Confluent Partner Tech Talk with QLIK
Confluent Partner Tech Talk with QLIKConfluent Partner Tech Talk with QLIK
Confluent Partner Tech Talk with QLIKconfluent
 
Real-time Streaming for Government and the Public Sector
Real-time Streaming for Government and the Public SectorReal-time Streaming for Government and the Public Sector
Real-time Streaming for Government and the Public Sectorconfluent
 

More from confluent (20)

Citi TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep DiveCiti TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep Dive
 
Build real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with ConfluentBuild real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with Confluent
 
Q&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service MeshQ&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service Mesh
 
Citi Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka MicroservicesCiti Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka Microservices
 
Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3
 
Citi Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging ModernizationCiti Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging Modernization
 
Citi Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time dataCiti Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time data
 
Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2
 
Data In Motion Paris 2023
Data In Motion Paris 2023Data In Motion Paris 2023
Data In Motion Paris 2023
 
Confluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with SynthesisConfluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with Synthesis
 
The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023
 
The Playful Bond Between REST And Data Streams
The Playful Bond Between REST And Data StreamsThe Playful Bond Between REST And Data Streams
The Playful Bond Between REST And Data Streams
 
The Journey to Data Mesh with Confluent
The Journey to Data Mesh with ConfluentThe Journey to Data Mesh with Confluent
The Journey to Data Mesh with Confluent
 
Citi Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and PerformanceCiti Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and Performance
 
Confluent Partner Tech Talk with Reply
Confluent Partner Tech Talk with ReplyConfluent Partner Tech Talk with Reply
Confluent Partner Tech Talk with Reply
 
Citi Tech Talk Disaster Recovery Solutions Deep Dive
Citi Tech Talk  Disaster Recovery Solutions Deep DiveCiti Tech Talk  Disaster Recovery Solutions Deep Dive
Citi Tech Talk Disaster Recovery Solutions Deep Dive
 
Citi Tech Talk: Hybrid Cloud
Citi Tech Talk: Hybrid CloudCiti Tech Talk: Hybrid Cloud
Citi Tech Talk: Hybrid Cloud
 
Partner Tech Talk Q3: Q&A with PS - Migration and Upgrade
Partner Tech Talk Q3: Q&A with PS - Migration and UpgradePartner Tech Talk Q3: Q&A with PS - Migration and Upgrade
Partner Tech Talk Q3: Q&A with PS - Migration and Upgrade
 
Confluent Partner Tech Talk with QLIK
Confluent Partner Tech Talk with QLIKConfluent Partner Tech Talk with QLIK
Confluent Partner Tech Talk with QLIK
 
Real-time Streaming for Government and the Public Sector
Real-time Streaming for Government and the Public SectorReal-time Streaming for Government and the Public Sector
Real-time Streaming for Government and the Public Sector
 

Recently uploaded

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 

Recently uploaded (20)

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 

Troubleshooting Apache Kafka® common issues

  • 2. Copyright 2020, Confluent, Inc. All rights reserved. This document may not be reproduced in any manner without the express written permission of Confluent, Inc. Who am I? 2 ● My name is BADAI AQRANDISTA ● I work for Confluent ● My job title is Technical Support Engineer ● My work is basically to “troubleshoot Kafka issues and fix them”
  • 3. Copyright 2020, Confluent, Inc. All rights reserved. This document may not be reproduced in any manner without the express written permission of Confluent, Inc. Kafka Overview 3 PRODUCER CONSUMER
  • 4. Copyright 2020, Confluent, Inc. All rights reserved. This document may not be reproduced in any manner without the express written permission of Confluent, Inc. Common Issues 4 ● Under Replicated Partitions ● Consumer Rebalancing
  • 5. Copyright 2020, Confluent, Inc. All rights reserved. This document may not be reproduced in any manner without the express written permission of Confluent, Inc. Under Replicated Partitions (URP) 5 ● Kafka organise its data in “topic” ● Each “topic” has one or more “partition” ● Each “partition” has one or more “replica” ● Each “replica” is hosted by one Kafka broker ● One “replica” acts as a “Leader” and the other “replica” act as a “Follower” ● All Produce and Consume goes to the “Leader” replica ● “Follower” replicates data from the “Leader” continuously ● When “Follower” is in sync with the “Leader”, the “Follower” will be listed in the “In-Sync Replicas” or ISR. ● When ISR count is less than the “replica” count, the “partition” is called “Under Replicated Partition”
  • 6. Copyright 2020, Confluent, Inc. All rights reserved. This document may not be reproduced in any manner without the express written permission of Confluent, Inc. Under Replicated Partitions (URP) 6 ● JMX Metrics: ○ “kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions” ○ “kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount”
  • 7. Copyright 2020, Confluent, Inc. All rights reserved. This document may not be reproduced in any manner without the express written permission of Confluent, Inc. Under Replicated Partition (URP) 7
  • 8. Copyright 2020, Confluent, Inc. All rights reserved. This document may not be reproduced in any manner without the express written permission of Confluent, Inc. Under Replicated Partitions (URP) 8 Broker 1 Broker 2 Broker 3 P0 P0 P0 P1 P1 P1
  • 9. Copyright 2020, Confluent, Inc. All rights reserved. This document may not be reproduced in any manner without the express written permission of Confluent, Inc. Temporary URP - Definition and Symptoms 9 ● URP that disappears by itself ● Can cause producer failure with NotEnoughReplicasException if: ○ Replication factor is 3 ○ Topic or broker configuration “min.insync.replicas” is set to 2 ○ “retries” is set to 0 - defaults to 2147483647 since AK 2.1.0 ● Transient issue and should not affect client application ● Broker log contains a group of “Shrinking ISR” messages and then a group of “Expanding ISR” messages [2020-11-18 19:19:13,751] INFO [Partition _confluent-metrics-4 broker=2] Shrinking ISR from 2,1,3 to 2,3. Leader: (highWatermark: 14459, endOffset: 14460). Out of sync replicas: (brokerId: 1, endOffset: 14459). (kafka.cluster.Partition) [2020-11-18 19:19:13,754] INFO [Partition _confluent-metrics-4 broker=2] ISR updated to [2,3] and zkVersion updated to [7] (kafka.cluster.Partition) [2020-11-18 19:19:13,786] INFO [Partition _confluent-metrics-4 broker=2] Expanding ISR from 2,3 to 2,3,1 (kafka.cluster.Partition)
  • 10. Temporary URP - Causes and Solutions
    ● Intermittent network latency between brokers
      ○ Confirm by running ping for 24 hours: “ping -c 86400 {broker}”
      ○ Increase broker config “replica.lag.time.max.ms” from 10000 (default before AK 2.5.0) to 30000 (default since AK 2.5.0)
      ○ Related configs: “offsets.commit.timeout.ms” and “zookeeper.session.timeout.ms”
    ● Produce rate > Replication rate
      ○ “kafka.network:type=RequestMetrics,name=RequestsPerSec,request={Produce|FetchConsumer|FetchFollower}”
      ○ “kafka.network:type=RequestMetrics,name=RequestBytes,request={Produce|FetchConsumer|FetchFollower}”
      ○ Caused by high-throughput producers using “acks=0” or “acks=1”
      ○ Set producer config “acks=all”. This forces producers to align their throughput with the replication throughput
      ○ Increase broker config “num.replica.fetchers” from 1 (default) to 2 or 3
      ○ Use faster hosts
  • 11. Permanent URP - Definition and Symptoms
    ● URP that requires a broker restart to resolve
    ● URP count stays non-zero continuously and sometimes increases
    ● Can cause producer failure with NotEnoughReplicasException if:
      ○ Replication factor is 3
      ○ Topic or broker configuration “min.insync.replicas” is set to 2
      ○ Producer config “retries” is set to 0 (it defaults to 2147483647 since AK 2.1.0)
    ● Should not affect client applications except in the above case
    ● Broker log contains a group of “Shrinking ISR” messages without any “Expanding ISR” messages:
      [2020-11-18 19:19:13,252] INFO [Partition _confluent-metrics-7 broker=2] Shrinking ISR from 2,1,3 to 2,3. Leader: (highWatermark: 14373, endOffset: 14374). Out of sync replicas: (brokerId: 1, endOffset: 14373). (kafka.cluster.Partition)
      [2020-11-18 19:19:13,257] INFO [Partition _confluent-metrics-7 broker=2] ISR updated to [2,3] and zkVersion updated to [7] (kafka.cluster.Partition)
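The log signatures of temporary and permanent URP differ only in whether a “Shrinking ISR” message is later matched by an “Expanding ISR” message. A rough way to tell them apart from a broker log is sketched below. The regular expression assumes the exact message format quoted on slides 9 and 11, which may differ in other Kafka versions, and `brokers_still_out_of_sync` is a hypothetical helper, not a Kafka tool.

```python
import re

# Matches the "Shrinking ISR" / "Expanding ISR" lines in the format quoted on the slides.
ISR_CHANGE = re.compile(
    r"\[Partition (?P<tp>\S+) broker=\d+\] "
    r"(?P<kind>Shrinking|Expanding) ISR from (?P<old>[\d,]+) to (?P<new>[\d,]+)"
)


def brokers_still_out_of_sync(log_lines):
    """Map partition -> broker ids dropped from the ISR and not yet re-added."""
    missing = {}
    for line in log_lines:
        m = ISR_CHANGE.search(line)
        if not m:
            continue  # e.g. the "ISR updated to ..." lines
        old = set(m["old"].split(","))
        new = set(m["new"].split(","))
        out = missing.setdefault(m["tp"], set())
        out |= old - new  # brokers removed by a shrink
        out -= new - old  # brokers restored by an expand
    return {tp: brokers for tp, brokers in missing.items() if brokers}


# Log lines copied from slides 9 and 11.
sample = [
    "[2020-11-18 19:19:13,252] INFO [Partition _confluent-metrics-7 broker=2] "
    "Shrinking ISR from 2,1,3 to 2,3. Leader: (highWatermark: 14373, endOffset: 14374). "
    "Out of sync replicas: (brokerId: 1, endOffset: 14373). (kafka.cluster.Partition)",
    "[2020-11-18 19:19:13,751] INFO [Partition _confluent-metrics-4 broker=2] "
    "Shrinking ISR from 2,1,3 to 2,3. Leader: (highWatermark: 14459, endOffset: 14460). "
    "Out of sync replicas: (brokerId: 1, endOffset: 14459). (kafka.cluster.Partition)",
    "[2020-11-18 19:19:13,786] INFO [Partition _confluent-metrics-4 broker=2] "
    "Expanding ISR from 2,3 to 2,3,1 (kafka.cluster.Partition)",
]
print(brokers_still_out_of_sync(sample))
```

A partition that stays in the result long after its shrink message is a candidate for permanent URP. How long to wait before drawing that conclusion is a judgment call that this sketch does not make.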
  • 12. Permanent URP - Causes and Solutions
    ● Disk issue (e.g. disk full or hardware failure)
      ○ Confirm at the disk level (with “df” or other commands)
      ○ If the disk is full, you can:
        ■ Remove old segment files on follower replicas
        ■ Delete files outside Kafka’s “log.dirs”
        ■ Expand the disk/filesystems
      ○ If the disk has failed, you can:
        ■ Replace the disk
      ○ Data loss is possible if:
        ■ Replication factor is 1
        ■ Producer uses “acks=1” or “acks=0”
      ○ Solution: After fixing the disk issue, restart the broker.
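The disk-level check can also be scripted. The sketch below uses Python's standard `shutil.disk_usage` as a stand-in for “df”. The 90% threshold and the function names are assumptions for illustration, and the example runs against the current directory because the real “log.dirs” paths are site-specific.

```python
import shutil


def disk_usage_percent(path: str) -> float:
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # guard for pseudo-filesystems that report no capacity
    return 100.0 * usage.used / usage.total


def nearly_full(log_dirs, threshold_percent: float = 90.0):
    """Return the directories whose filesystem usage is at or above the threshold."""
    return [d for d in log_dirs if disk_usage_percent(d) >= threshold_percent]


# Replace "." with the directories listed in the broker's log.dirs setting.
print(f"{disk_usage_percent('.'):.1f}% used")
print(nearly_full(["."]))
```

The percentage can differ slightly from the “Use%” column of “df”, which accounts for blocks reserved for the superuser.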
  • 13. Permanent URP - Causes and Solutions
    ● ReplicaFetcherThread crashes due to a bug
      ○ KAFKA-6649
        ■ Error message: org.apache.kafka.common.errors.OffsetOutOfRangeException: Cannot increment the log start offset to 2098535 of partition testtopic-84 since it is larger than the high watermark -1
        ■ Error message (KAFKA-7635): kafka.common.UnexpectedAppendOffsetException: Unexpected offset in append to topic.a-0. First offset 3389 is less than the next offset 3395. First 10 offsets in append: List(3389, 3390, 3391, 3392, 3393, 3394, 3395, 3396, 3397, 3398), last offset in append: 4945. Log start offset = 3353
        ■ Fixed in AK 1.1.0, with the fix for KAFKA-3978
      ○ KAFKA-8255
        ■ Error message: org.apache.kafka.common.errors.OffsetOutOfRangeException: Cannot increment the log start offset to 4808819 of partition __consumer_offsets-46 since it is larger than the high watermark 18761
        ■ Fixed in AK 2.2.1, with the fix for KAFKA-8306
  • 14. Permanent URP - Causes and Solutions
    ● ReplicaFetcherThread crashes due to a bug
      ○ Solution: Depends on the bug, but in general deleting the follower replica’s segment files fixes the issue.
      ○ If the leader goes offline, you may need to accept some data loss by setting “unclean.leader.election.enable=true” at the topic level and moving the controller
    ● Broker down
      ○ If a broker is down, you will see a non-zero URP count
      ○ Solution: Start the broker
  • 15. Consumer Rebalancing
    ● A KafkaConsumer goes through the following steps:
      ○ Join the consumer group and subscribe to one or more topics
      ○ Topic partitions are assigned across all KafkaConsumer instances in the group
      ○ Start consuming by calling “poll”
  • 16. Consumer Rebalancing
    ● When any of the following happens, the topic partitions have to be reallocated among the KafkaConsumer instances that are in the group at that time. This is called “Rebalancing”
      ○ A new KafkaConsumer joins the consumer group
      ○ An existing KafkaConsumer leaves the consumer group
      ○ An existing KafkaConsumer does not send any heartbeat
      ○ An existing KafkaConsumer spends too much time between “poll” calls
    ● Existing KafkaConsumer does not send any heartbeat
      ○ “heartbeat.interval.ms” - defaults to 3000 ms
      ○ “session.timeout.ms” - defaults to 10000 ms
      ○ The coordinator must receive at least one heartbeat message within the session timeout
      ○ You may need to increase “session.timeout.ms” if the connection to the broker is bad
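The relationship between the two heartbeat settings can be checked with simple arithmetic. The sketch below uses the defaults quoted on the slide (newer Kafka releases ship a larger “session.timeout.ms” default). It also applies the rule of thumb from the Kafka consumer configuration documentation that “heartbeat.interval.ms” should be no higher than one third of “session.timeout.ms”. The function names are hypothetical.

```python
def heartbeat_attempts(heartbeat_interval_ms: int, session_timeout_ms: int) -> int:
    """Number of heartbeats the consumer is scheduled to send within one session timeout."""
    return session_timeout_ms // heartbeat_interval_ms


def heartbeat_settings_ok(heartbeat_interval_ms: int, session_timeout_ms: int) -> bool:
    """Rule of thumb: the interval is at most one third of the session timeout."""
    return 3 * heartbeat_interval_ms <= session_timeout_ms


# Defaults quoted on the slide.
print(heartbeat_attempts(3000, 10000))
print(heartbeat_settings_ok(3000, 10000))
# Raising session.timeout.ms for a poor connection leaves room for more lost heartbeats.
print(heartbeat_attempts(3000, 30000))
```

With the slide's defaults the consumer gets three attempts per session window, so the coordinator evicts it only if all three heartbeats fail to arrive in time.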
  • 17. Consumer Rebalancing
    ● Existing KafkaConsumer spends too much time between “poll” calls
      ○ “max.poll.records” - defaults to 500
      ○ “max.poll.interval.ms” - defaults to 300000 ms (5 minutes)
      ○ This must always be true to avoid rebalancing: “max.poll.records” x “processing time per record” < “max.poll.interval.ms”
    ● So your application must control the processing time per record. If it cannot, the group can go into the rebalancing state continuously.
    ● If you do not know how long it takes to process each record (e.g. you call a REST API that may fail and needs to be retried indefinitely), you can call the “KafkaConsumer#pause” method and then keep calling “poll()” regularly. While paused, “poll()” returns an empty list of records. Once you are ready to process the next record, call “KafkaConsumer#resume” and then “poll()”.
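The inequality on this slide gives a per-record time budget. A short worked example with the default values follows. The function names are hypothetical, and the model assumes every “poll()” returns a full batch of “max.poll.records” records, which is the worst case.

```python
def max_processing_ms_per_record(max_poll_records: int = 500,
                                 max_poll_interval_ms: int = 300_000) -> float:
    """Upper bound on the average processing time per record, in milliseconds."""
    return max_poll_interval_ms / max_poll_records


def stays_in_group(processing_ms_per_record: float,
                   max_poll_records: int = 500,
                   max_poll_interval_ms: int = 300_000) -> bool:
    """The slide's condition: records x time per record < max.poll.interval.ms."""
    return max_poll_records * processing_ms_per_record < max_poll_interval_ms


print(max_processing_ms_per_record())  # per-record budget with the defaults
print(stays_in_group(500))             # 500 ms per record
print(stays_in_group(600))             # 600 ms per record
```

With the defaults the budget is 600 ms per record. An application that averages 500 ms per record stays within it, while one that needs 600 ms or more must either lower “max.poll.records” or raise “max.poll.interval.ms”.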