Badai Aqrandista, Confluent, Technical Support Engineer
This talk will show a few common issues encountered when working with Apache Kafka and how to troubleshoot and fix them. We will discuss Under Replicated Partitions reported by Kafka brokers: what they are, their possible causes, and how to fix them. We will also discuss Kafka consumer rebalancing: what it is, its possible causes, and how to fix it.
https://www.meetup.com/Singapore-Kafka-Meetup/events/274421645/
Who am I?
● My name is BADAI AQRANDISTA
● I work for Confluent
● My job title is Technical Support Engineer
● My work is basically to “troubleshoot Kafka issues and fix them”
Kafka Overview
[Diagram: PRODUCER → Kafka brokers → CONSUMER]
Common Issues
● Under Replicated Partitions
● Consumer Rebalancing
Under Replicated Partitions (URP)
● Kafka organises its data in “topics”
● Each “topic” has one or more “partitions”
● Each “partition” has one or more “replicas”
● Each “replica” is hosted by one Kafka broker
● One “replica” acts as the “Leader” and the other “replicas” act as “Followers”
● All produce and consume requests go to the “Leader” replica
● Each “Follower” replicates data from the “Leader” continuously
● When a “Follower” is in sync with the “Leader”, it is listed in the “In-Sync Replicas” or ISR
● When the ISR count is less than the replica count, the partition is called an “Under Replicated Partition” (a sketch of checking this follows)
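To see which partitions are currently under replicated, you can compare each partition’s ISR size to its replica count, e.g. with “kafka-topics --describe --under-replicated-partitions”. Below is a minimal sketch of the same check using the Java AdminClient (the broker address is an assumption):

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

public class UrpChecker {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // List all topics, then compare each partition's ISR size to its replica count
            Set<String> topics = admin.listTopics().names().get();
            Map<String, TopicDescription> descriptions =
                    admin.describeTopics(topics).all().get();
            for (TopicDescription topic : descriptions.values()) {
                for (TopicPartitionInfo p : topic.partitions()) {
                    if (p.isr().size() < p.replicas().size()) {
                        System.out.printf("URP: %s-%d isr=%s replicas=%s%n",
                                topic.name(), p.partition(), p.isr(), p.replicas());
                    }
                }
            }
        }
    }
}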
Under Replicated Partitions (URP)
● JMX Metrics:
○ “kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions”
○ “kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount”
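These are standard broker gauges. A minimal sketch of reading them over a JMX connection, assuming the broker was started with remote JMX enabled on port 9999 (the host and port are assumptions):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class UrpMetrics {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbeans = connector.getMBeanServerConnection();
            for (String name : new String[] {
                    "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions",
                    "kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount"}) {
                // Kafka gauges expose their current value under the "Value" attribute
                Object value = mbeans.getAttribute(new ObjectName(name), "Value");
                System.out.println(name + " = " + value);
            }
        }
    }
}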
Under Replicated Partitions (URP)
[Diagram: partitions P0 and P1, each with 3 replicas, one replica per broker]

Broker 1   Broker 2   Broker 3
   P0         P0         P0
   P1         P1         P1
Temporary URP - Definition and Symptoms
● A URP that disappears by itself
● Can cause producer failures with NotEnoughReplicasException if:
○ Replication factor is 3
○ Topic or broker configuration “min.insync.replicas” is set to 2
○ “retries” is set to 0 (it defaults to 2147483647 since AK 2.1.0)
● Otherwise a transient issue that should not affect client applications
● The broker log contains a group of “Shrinking ISR” messages followed by a group of “Expanding ISR” messages:
[2020-11-18 19:19:13,751] INFO [Partition _confluent-metrics-4 broker=2] Shrinking ISR from 2,1,3 to 2,3. Leader: (highWatermark: 14459, endOffset: 14460). Out of sync replicas: (brokerId: 1, endOffset: 14459). (kafka.cluster.Partition)
[2020-11-18 19:19:13,754] INFO [Partition _confluent-metrics-4 broker=2] ISR updated to [2,3] and zkVersion updated to [7] (kafka.cluster.Partition)
[2020-11-18 19:19:13,786] INFO [Partition _confluent-metrics-4 broker=2] Expanding ISR from 2,3 to 2,3,1 (kafka.cluster.Partition)
Temporary URP - Causes and Solutions
● Intermittent network latency between brokers
○ Confirm by pinging for 24 hours: “ping -c 86400 {broker}”
○ Increase the broker config “replica.lag.time.max.ms” from 10000 (the default in AK < 2.5) to 30000 (the default in AK >= 2.5.0)
○ Related configs: “offsets.commit.timeout.ms” and “zookeeper.session.timeout.ms”
● Produce rate > replication rate
○ Confirm with the JMX metrics:
○ “kafka.network:type=RequestMetrics,name=RequestsPerSec,request={Produce|FetchConsumer|FetchFollower}”
○ “kafka.network:type=RequestMetrics,name=RequestBytes,request={Produce|FetchConsumer|FetchFollower}”
○ Usually caused by high-throughput producers using “acks=0” or “acks=1”
○ Set the producer config “acks=all” - this forces producers to align their throughput with the replication throughput (see the sketch below)
○ Increase the broker config “num.replica.fetchers” from 1 (the default) to 2 or 3
○ Use faster hosts
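A minimal producer sketch of the “acks=all” recommendation (broker address, topic, and serializers are illustrative assumptions). With “acks=all”, each batch waits for the full ISR to acknowledge, so the producer cannot outrun replication:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class SafeProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Wait for all in-sync replicas to acknowledge each batch; with the default
        // retries (2147483647 since AK 2.1.0), temporary URPs are retried, not fatal
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test-topic", "key", "value"));
        }
    }
}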
Permanent URP - Definition and Symptoms
● A URP that requires a broker restart to resolve
● The URP count stays non-zero continuously and sometimes increases
● Can cause producer failures with NotEnoughReplicasException if:
○ Replication factor is 3
○ Topic or broker configuration “min.insync.replicas” is set to 2
○ “retries” is set to 0 (it defaults to 2147483647 since AK 2.1.0)
● Should not affect client applications except in the above case
● The broker log contains a group of “Shrinking ISR” messages without any “Expanding ISR” messages:
[2020-11-18 19:19:13,252] INFO [Partition _confluent-metrics-7 broker=2] Shrinking ISR from 2,1,3 to 2,3. Leader: (highWatermark: 14373, endOffset: 14374). Out of sync replicas: (brokerId: 1, endOffset: 14373). (kafka.cluster.Partition)
[2020-11-18 19:19:13,257] INFO [Partition _confluent-metrics-7 broker=2] ISR updated to [2,3] and zkVersion updated to [7] (kafka.cluster.Partition)
Permanent URP - Causes and Solutions
● Disk issue (e.g. disk full or hardware failure)
○ Confirm at the disk level (with “df” or other commands)
○ If the disk is full, you can:
■ Remove old segment files on follower replicas
■ Delete files outside Kafka’s “log.dirs”
■ Expand the disk/filesystem
○ If the disk failed, you can:
■ Replace the disk
○ Data loss is possible if:
■ Replication factor is 1
■ The producer uses “acks=1” or “acks=0”
○ Solution: after fixing the disk issue, restart the broker
Permanent URP - Causes and Solutions
● ReplicaFetcherThread crashes due to a bug
○ KAFKA-6649
■ Error message: org.apache.kafka.common.errors.OffsetOutOfRangeException: Cannot increment the log start offset to 2098535 of partition testtopic-84 since it is larger than the high watermark -1
■ Error message (KAFKA-7635): kafka.common.UnexpectedAppendOffsetException: Unexpected offset in append to topic.a-0. First offset 3389 is less than the next offset 3395. First 10 offsets in append: List(3389, 3390, 3391, 3392, 3393, 3394, 3395, 3396, 3397, 3398), last offset in append: 4945. Log start offset = 3353
■ Fixed in AK 1.1.0, with the fix for KAFKA-3978
○ KAFKA-8255
■ Error message: org.apache.kafka.common.errors.OffsetOutOfRangeException: Cannot increment the log start offset to 4808819 of partition __consumer_offsets-46 since it is larger than the high watermark 18761
■ Fixed in AK 2.2.1, with the fix for KAFKA-8306
Permanent URP - Causes and Solutions
● ReplicaFetcherThread crashes due to a bug
○ Solution: depends on the bug, but in general, deleting the follower replica’s segment files fixes the issue
○ If the leader goes offline, you may need to accept some data loss by setting “unclean.leader.election.enable=true” at the topic level and moving the controller
● Broker down
○ Obviously, if a broker is down, you will see a non-zero URP count
○ Solution: start the broker
Consumer Rebalancing
● A KafkaConsumer goes through the following steps (a sketch follows this list):
○ Join the consumer group and subscribe to one or more topics
○ Topic partitions are assigned across all KafkaConsumers in the group
○ Start consuming by calling “poll”
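A minimal sketch of this lifecycle with the Java client (broker address, group id, and topic name are assumptions):

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Joining the group and partition assignment happen inside subscribe/poll
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s-%d@%d: %s%n",
                            record.topic(), record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}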
Consumer Rebalancing
● A rebalance happens when the group has to reassign topic partitions across the KafkaConsumers that are in the group at that moment; this is called “Rebalancing”. It is triggered when:
○ A new KafkaConsumer joins the consumer group
○ An existing KafkaConsumer leaves the consumer group
○ An existing KafkaConsumer does not send any heartbeat
○ An existing KafkaConsumer spends too much time between calls to “poll”
● Existing KafkaConsumer does not send any heartbeat
○ “heartbeat.interval.ms” - defaults to 3000 ms
○ “session.timeout.ms” - defaults to 10000 ms
○ The group coordinator must receive at least one heartbeat within the session timeout, or it removes the consumer from the group
○ You may need to increase “session.timeout.ms” if the connection to the broker is unreliable (an illustrative snippet follows)
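For example, an illustrative tweak to the consumer Properties from the sketch above (the values are assumptions, not recommendations from the talk):

// Tolerate longer gaps between heartbeats on a flaky network
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "30000");    // up from the 10000 ms default
props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "10000"); // roughly 1/3 of the session timeout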
Consumer Rebalancing
● Existing KafkaConsumer spends too much time between calls to “poll”
○ “max.poll.records” - defaults to 500
○ “max.poll.interval.ms” - defaults to 300000 ms (5 minutes)
○ To avoid rebalancing, this must always hold:
“max.poll.records” x “processing time per record” < “max.poll.interval.ms”
○ For example, at the default 500 records per poll, averaging 1 second per record means a batch can take 500 seconds, well over the 300-second default, so the consumer is evicted and the group rebalances
● Your application must therefore bound the processing time per record. If it cannot, the group can end up rebalancing continuously.
● If you do not know how long each record will take to process (e.g. it calls a REST API that may fail and must be retried indefinitely), call the “KafkaConsumer#pause” method and keep calling “poll()” regularly; while paused, “poll()” returns an empty set of records but keeps the consumer in the group. Once you are ready to process the next record, call “KafkaConsumer#resume” and then “poll()” again. A sketch follows.
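Below is a minimal sketch of this pause/resume pattern; tryProcess() is a hypothetical stand-in for per-record work (e.g. a REST call) that returns false when it needs another attempt:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;

public class PauseResumeLoop {
    // Hypothetical per-record processing that may need to be retried indefinitely
    static boolean tryProcess(ConsumerRecord<String, String> record) {
        return true; // placeholder: e.g. a REST call that can fail
    }

    static void run(KafkaConsumer<String, String> consumer) {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                consumer.pause(consumer.assignment()); // stop fetching more records
                while (!tryProcess(record)) {
                    // While paused, poll() returns no records but still counts as
                    // progress against max.poll.interval.ms, keeping us in the group
                    consumer.poll(Duration.ofSeconds(1));
                }
                consumer.resume(consumer.assignment());
            }
        }
    }
}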