Badai Aqrandista, Confluent, Technical Support Engineer
This talk will show a few common issues encountered when working with Apache Kafka and how to troubleshoot and fix them. We will discuss Under Replicated Partitions reported by Kafka brokers: what they are, their possible causes, and how to fix them. We will also discuss Kafka consumer rebalancing: what it is, its possible causes, and how to fix it.
https://www.meetup.com/Singapore-Kafka-Meetup/events/274421645/
Who am I?
● My name is BADAI AQRANDISTA
● I work for Confluent
● My job title is Technical Support Engineer
● My work is basically to “troubleshoot Kafka issues and fix them”
Kafka Overview
[Diagram: PRODUCER → Kafka brokers → CONSUMER]
Common Issues
● Under Replicated Partitions
● Consumer Rebalancing
Under Replicated Partitions (URP)
● Kafka organises its data in “topics”
● Each “topic” has one or more “partitions”
● Each “partition” has one or more “replicas”
● Each “replica” is hosted by one Kafka broker
● One “replica” acts as the “Leader” and the other “replicas” act as “Followers”
● All produce and consume requests go to the “Leader” replica
● Each “Follower” replicates data from the “Leader” continuously
● When a “Follower” is in sync with the “Leader”, it is listed in the “In-Sync Replicas” or ISR
● When the ISR count is less than the replica count, the partition is called an “Under Replicated Partition” (a sketch of checking this follows)
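To see which partitions are currently under replicated, you can compare each partition’s ISR size to its replica count, e.g. with “kafka-topics --describe --under-replicated-partitions”. Below is a minimal sketch of the same check using the Java AdminClient (the broker address is an assumption):

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

public class UrpChecker {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // List all topics, then compare each partition's ISR size to its replica count
            Set<String> topics = admin.listTopics().names().get();
            Map<String, TopicDescription> descriptions =
                    admin.describeTopics(topics).all().get();
            for (TopicDescription topic : descriptions.values()) {
                for (TopicPartitionInfo p : topic.partitions()) {
                    if (p.isr().size() < p.replicas().size()) {
                        System.out.printf("URP: %s-%d isr=%s replicas=%s%n",
                                topic.name(), p.partition(), p.isr(), p.replicas());
                    }
                }
            }
        }
    }
}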
Under Replicated Partitions (URP)
● JMX Metrics:
○ “kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions”
○ “kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount”
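These are standard broker gauges. A minimal sketch of reading them over a JMX connection, assuming the broker was started with remote JMX enabled on port 9999 (the host and port are assumptions):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class UrpMetrics {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbeans = connector.getMBeanServerConnection();
            for (String name : new String[] {
                    "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions",
                    "kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount"}) {
                // Kafka gauges expose their current value under the "Value" attribute
                Object value = mbeans.getAttribute(new ObjectName(name), "Value");
                System.out.println(name + " = " + value);
            }
        }
    }
}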
Under Replicated Partitions (URP)
[Diagram: partitions P0 and P1, each with 3 replicas, one replica per broker]

Broker 1   Broker 2   Broker 3
   P0         P0         P0
   P1         P1         P1
Temporary URP - Definition and Symptoms
● A URP that disappears by itself
● Can cause producer failures with NotEnoughReplicasException if:
○ Replication factor is 3
○ Topic or broker configuration “min.insync.replicas” is set to 2
○ “retries” is set to 0 (it defaults to 2147483647 since AK 2.1.0)
● Otherwise a transient issue that should not affect client applications
● The broker log contains a group of “Shrinking ISR” messages followed by a group of “Expanding ISR” messages:
[2020-11-18 19:19:13,751] INFO [Partition _confluent-metrics-4 broker=2] Shrinking ISR from 2,1,3 to 2,3. Leader: (highWatermark: 14459, endOffset: 14460). Out of sync replicas: (brokerId: 1, endOffset: 14459). (kafka.cluster.Partition)
[2020-11-18 19:19:13,754] INFO [Partition _confluent-metrics-4 broker=2] ISR updated to [2,3] and zkVersion updated to [7] (kafka.cluster.Partition)
[2020-11-18 19:19:13,786] INFO [Partition _confluent-metrics-4 broker=2] Expanding ISR from 2,3 to 2,3,1 (kafka.cluster.Partition)
Temporary URP - Causes and Solutions
● Intermittent network latency between brokers
○ Confirm by pinging for 24 hours: “ping -c 86400 {broker}”
○ Increase the broker config “replica.lag.time.max.ms” from 10000 (the default in AK < 2.5) to 30000 (the default in AK >= 2.5.0)
○ Related configs: “offsets.commit.timeout.ms” and “zookeeper.session.timeout.ms”
● Produce rate > replication rate
○ Confirm with the JMX metrics:
○ “kafka.network:type=RequestMetrics,name=RequestsPerSec,request={Produce|FetchConsumer|FetchFollower}”
○ “kafka.network:type=RequestMetrics,name=RequestBytes,request={Produce|FetchConsumer|FetchFollower}”
○ Usually caused by high-throughput producers using “acks=0” or “acks=1”
○ Set the producer config “acks=all” - this forces producers to align their throughput with the replication throughput (see the sketch below)
○ Increase the broker config “num.replica.fetchers” from 1 (the default) to 2 or 3
○ Use faster hosts
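A minimal producer sketch of the “acks=all” recommendation (broker address, topic, and serializers are illustrative assumptions). With “acks=all”, each batch waits for the full ISR to acknowledge, so the producer cannot outrun replication:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class SafeProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Wait for all in-sync replicas to acknowledge each batch; with the default
        // retries (2147483647 since AK 2.1.0), temporary URPs are retried, not fatal
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test-topic", "key", "value"));
        }
    }
}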
Permanent URP - Definition and Symptoms
● A URP that requires a broker restart to resolve
● The URP count stays non-zero continuously and sometimes increases
● Can cause producer failures with NotEnoughReplicasException if:
○ Replication factor is 3
○ Topic or broker configuration “min.insync.replicas” is set to 2
○ “retries” is set to 0 (it defaults to 2147483647 since AK 2.1.0)
● Should not affect client applications except in the above case
● The broker log contains a group of “Shrinking ISR” messages without any “Expanding ISR” messages:
[2020-11-18 19:19:13,252] INFO [Partition _confluent-metrics-7 broker=2] Shrinking ISR from 2,1,3 to 2,3. Leader: (highWatermark: 14373, endOffset: 14374). Out of sync replicas: (brokerId: 1, endOffset: 14373). (kafka.cluster.Partition)
[2020-11-18 19:19:13,257] INFO [Partition _confluent-metrics-7 broker=2] ISR updated to [2,3] and zkVersion updated to [7] (kafka.cluster.Partition)
Permanent URP - Causes and Solutions
● Disk issue (e.g. disk full or hardware failure)
○ Confirm at the disk level (with “df” or other commands)
○ If the disk is full, you can:
■ Remove old segment files on follower replicas
■ Delete files outside Kafka’s “log.dirs”
■ Expand the disk/filesystem
○ If the disk failed, you can:
■ Replace the disk
○ Data loss is possible if:
■ Replication factor is 1
■ The producer uses “acks=1” or “acks=0”
○ Solution: after fixing the disk issue, restart the broker
Permanent URP - Causes and Solutions
● ReplicaFetcherThread crashes due to a bug
○ KAFKA-6649
■ Error message: org.apache.kafka.common.errors.OffsetOutOfRangeException: Cannot increment the log start offset to 2098535 of partition testtopic-84 since it is larger than the high watermark -1
■ Error message (KAFKA-7635): kafka.common.UnexpectedAppendOffsetException: Unexpected offset in append to topic.a-0. First offset 3389 is less than the next offset 3395. First 10 offsets in append: List(3389, 3390, 3391, 3392, 3393, 3394, 3395, 3396, 3397, 3398), last offset in append: 4945. Log start offset = 3353
■ Fixed in AK 1.1.0, with the fix for KAFKA-3978
○ KAFKA-8255
■ Error message: org.apache.kafka.common.errors.OffsetOutOfRangeException: Cannot increment the log start offset to 4808819 of partition __consumer_offsets-46 since it is larger than the high watermark 18761
■ Fixed in AK 2.2.1, with the fix for KAFKA-8306
Permanent URP - Causes and Solutions
● ReplicaFetcherThread crashes due to a bug
○ Solution: depends on the bug, but in general, deleting the follower replica’s segment files fixes the issue
○ If the leader goes offline, you may need to accept some data loss by setting “unclean.leader.election.enable=true” at the topic level and moving the controller
● Broker down
○ Obviously, if a broker is down, you will see a non-zero URP count
○ Solution: start the broker
Consumer Rebalancing
● A KafkaConsumer goes through the following steps (a sketch follows this list):
○ Join the consumer group and subscribe to one or more topics
○ Topic partitions are assigned across all KafkaConsumers in the group
○ Start consuming by calling “poll”
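A minimal sketch of this lifecycle with the Java client (broker address, group id, and topic name are assumptions):

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Joining the group and partition assignment happen inside subscribe/poll
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s-%d@%d: %s%n",
                            record.topic(), record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}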
Consumer Rebalancing
● A rebalance happens when the group has to reassign topic partitions across the KafkaConsumers that are in the group at that moment; this is called “Rebalancing”. It is triggered when:
○ A new KafkaConsumer joins the consumer group
○ An existing KafkaConsumer leaves the consumer group
○ An existing KafkaConsumer does not send any heartbeat
○ An existing KafkaConsumer spends too much time between calls to “poll”
● Existing KafkaConsumer does not send any heartbeat
○ “heartbeat.interval.ms” - defaults to 3000 ms
○ “session.timeout.ms” - defaults to 10000 ms
○ The group coordinator must receive at least one heartbeat within the session timeout, or it removes the consumer from the group
○ You may need to increase “session.timeout.ms” if the connection to the broker is unreliable (an illustrative snippet follows)
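For example, an illustrative tweak to the consumer Properties from the sketch above (the values are assumptions, not recommendations from the talk):

// Tolerate longer gaps between heartbeats on a flaky network
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "30000");    // up from the 10000 ms default
props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "10000"); // roughly 1/3 of the session timeout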
Consumer Rebalancing
● Existing KafkaConsumer spends too much time between calls to “poll”
○ “max.poll.records” - defaults to 500
○ “max.poll.interval.ms” - defaults to 300000 ms (5 minutes)
○ To avoid rebalancing, this must always hold:
“max.poll.records” x “processing time per record” < “max.poll.interval.ms”
○ For example, at the default 500 records per poll, averaging 1 second per record means a batch can take 500 seconds, well over the 300-second default, so the consumer is evicted and the group rebalances
● Your application must therefore bound the processing time per record. If it cannot, the group can end up rebalancing continuously.
● If you do not know how long each record will take to process (e.g. it calls a REST API that may fail and must be retried indefinitely), call the “KafkaConsumer#pause” method and keep calling “poll()” regularly; while paused, “poll()” returns an empty set of records but keeps the consumer in the group. Once you are ready to process the next record, call “KafkaConsumer#resume” and then “poll()” again. A sketch follows.
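Below is a minimal sketch of this pause/resume pattern; tryProcess() is a hypothetical stand-in for per-record work (e.g. a REST call) that returns false when it needs another attempt:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;

public class PauseResumeLoop {
    // Hypothetical per-record processing that may need to be retried indefinitely
    static boolean tryProcess(ConsumerRecord<String, String> record) {
        return true; // placeholder: e.g. a REST call that can fail
    }

    static void run(KafkaConsumer<String, String> consumer) {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                consumer.pause(consumer.assignment()); // stop fetching more records
                while (!tryProcess(record)) {
                    // While paused, poll() returns no records but still counts as
                    // progress against max.poll.interval.ms, keeping us in the group
                    consumer.poll(Duration.ofSeconds(1));
                }
                consumer.resume(consumer.assignment());
            }
        }
    }
}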