Presentation at Strata Data Conference 2018, New York
The controller is the brain of Apache Kafka. A big part of what the controller does is to maintain the consistency of the replicas and determine which replica can be used to serve the clients, especially during individual broker failure.
Jun Rao outlines the main data flow in the controller: in particular, when a broker fails, how the controller automatically promotes another replica to leader so it can keep serving the clients, and when a broker is restarted, how the controller resumes the replication pipeline on that broker.
Jun then describes recent improvements to the controller that allow it to handle certain edge cases correctly and increase its performance, which allows for more partitions in a Kafka cluster.
3. Kafka adoption in enterprises
• 6 of the top 10 travel companies
• 8 of the top 10 insurance companies
• 7 of the top 10 global banks
• 9 of the top 10 telecom companies
5. High Level Data Flow in Replication
[Diagram: one partition (topic1-part1) replicated across brokers 1–3. The producer sends a record to the leader on broker 1 (step 1); the followers on brokers 2 and 3 fetch it (step 2); the leader commits the record (step 3) and acks the producer (step 4); the consumer reads committed records from the leader.]
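The flow above can be sketched as a tiny simulation. `Replica`, `Partition`, and `high_watermark` here are illustrative stand-ins, not Kafka's actual classes; the point is only the commit rule: a record is visible to consumers once every in-sync replica has it.

```python
class Replica:
    """One copy of a partition's log on a broker (illustrative, not Kafka's class)."""
    def __init__(self, broker_id):
        self.broker_id = broker_id
        self.log = []

class Partition:
    def __init__(self, leader, followers):
        self.leader = leader
        self.followers = followers
        self.high_watermark = 0  # offset below which records are committed

    def produce(self, record):
        self.leader.log.append(record)        # (1) producer writes to the leader
        for f in self.followers:              # (2) followers fetch the record
            f.log.append(record)
        # (3) commit: the record is committed once every in-sync replica has it
        self.high_watermark = min(len(r.log) for r in [self.leader] + self.followers)
        return "ack"                          # (4) leader acks the producer

    def consume(self):
        # consumers only see committed records (below the high watermark)
        return self.leader.log[:self.high_watermark]

p = Partition(Replica(1), [Replica(2), Replica(3)])
p.produce("m1")
p.produce("m2")
print(p.consume())  # ['m1', 'm2']
```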
6. What's the controller
• One broker in the cluster acts as the controller
• Monitors the liveness of brokers
• Elects new leaders on broker failure
• Communicates new leaders to brokers
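The election duty can be sketched minimally. `elect_new_leaders` and its arguments are made-up names, not Kafka's API: on a broker failure, each partition the failed broker led gets a new leader chosen from its surviving in-sync replicas.

```python
def elect_new_leaders(assignments, isr, failed_broker):
    """assignments: partition -> current leader broker id.
    isr: partition -> list of in-sync replica broker ids.
    Returns a new leader for each partition the failed broker led."""
    new_leaders = {}
    for partition, leader in assignments.items():
        if leader == failed_broker:
            # pick any surviving in-sync replica as the new leader
            candidates = [b for b in isr[partition] if b != failed_broker]
            if candidates:
                new_leaders[partition] = candidates[0]
    return new_leaders

assignments = {"t-0": 2, "t-1": 2, "t-2": 1}
isr = {"t-0": [2, 1], "t-1": [2, 3], "t-2": [1, 3]}
print(elect_new_leaders(assignments, isr, failed_broker=2))
# {'t-0': 1, 't-1': 3}
```

Only partitions led by the failed broker change hands; t-2, led by broker 1, is untouched.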
10. Issues with controlled shutdown (pre 1.1)
[Diagram: controlled shutdown. Broker 2 is the leader for partitions t-0 and t-1; broker 1 holds their other replicas; ZooKeeper stores the partition state (/topics/t/0 → 2, /topics/t/1 → 2). The controller updates ZooKeeper one znode at a time and then notifies the brokers (steps 3–5).]
• Writes to ZK are serial. Impact: longer shutdown time.
• Communication of new leaders is not batched. Impact: client timeouts.
14. Issues with controller failover (pre 1.1)
[Diagram: controller failover. The new controller reads the full partition state from ZooKeeper (/controller → broker 2, /topic/t1/0 → leader:1, /topic/t1/1 → leader:3, …, /topic/t1/9 → leader:2) before it can manage brokers 0–3 (steps 1–3).]
• Reads from ZK are serial. Impact: availability.
• A zombie old controller may keep acting. Impact: inconsistency.
15. Performance improvements in 1.1
• Controller uses the async ZK API for reads/writes
• Controller communicates new leaders to brokers in batches
Old (serial): part 1, then part 2, then part 3, then part 4, one request at a time.
New (pipelined): parts 1–4 are in flight concurrently.
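The batching half of the improvement can be illustrated with a small sketch (`batch_leader_updates` is a hypothetical helper, not Kafka's API): instead of sending one leader-change request per partition, all changes bound for the same broker are grouped into a single request.

```python
from collections import defaultdict

def batch_leader_updates(leader_changes):
    """leader_changes: list of (partition, new_leader, replica_broker_ids).
    Returns one request per broker carrying every change it must learn about."""
    requests = defaultdict(list)
    for partition, new_leader, replicas in leader_changes:
        for broker in replicas:  # every broker hosting a replica must hear the update
            requests[broker].append((partition, new_leader))
    return dict(requests)

changes = [("t-0", 1, [1, 2]), ("t-1", 3, [2, 3]), ("t-2", 1, [1, 3])]
reqs = batch_leader_updates(changes)
# broker 1 now receives a single request covering both t-0 and t-2,
# instead of two separate per-partition requests
```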
16. Controlled shutdown (post 1.1)
[Diagram: the same controlled-shutdown scenario. Broker 2 is the leader for partitions t-0 and t-1; broker 1 holds their other replicas; ZooKeeper stores the partition state (/topics/t/0 → 2, /topics/t/1 → 2). The controller updates ZooKeeper and notifies the brokers (steps 3–5).]
• Writes to ZK are pipelined.
• Communication of new leaders is batched.
17. Controller failover (post 1.1)
[Diagram: the same controller-failover scenario. The new controller reads the partition state from ZooKeeper (/controller → broker 2, /topic/t1/0 → leader:1, /topic/t1/1 → leader:3, …, /topic/t1/9 → leader:2) for brokers 0–3 (steps 1–3).]
• Reads from ZK are pipelined.
18. Results for controlled shutdown
• 5 ZK nodes and 5 brokers on different racks
• 25K topics, 1 partition each, 2 replicas
• 10K partitions per broker

                          Kafka 1.0.0   Kafka 1.1.0
Controlled shutdown time  6.5 minutes   3 seconds
19. Results for controller failover
• 5 ZK nodes and 5 brokers on different racks
• 2K topics, 50 partitions each, 1 replica
• Controller failover: reload 100K partitions from ZK

                   Kafka 1.0.0   Kafka 1.1.0
State reload time  28 seconds    14 seconds
20. Fencing zombie controller
• ZK session expiration
  • Better handling in the controller (1.1)
• Controller path deletion
  • Writes to ZK conditioned on controller epoch (expected in 2.1)
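The epoch-conditioned write can be sketched as follows. `EpochFencedStore` is an illustrative stand-in for the metadata store, not Kafka's implementation (which conditions its ZooKeeper writes on the controller epoch): any write carrying a stale epoch is rejected, so a zombie controller cannot overwrite the new controller's state.

```python
class EpochFencedStore:
    """Toy metadata store that rejects writes from stale controller epochs."""
    def __init__(self):
        self.latest_epoch = 0
        self.data = {}

    def conditional_write(self, key, value, epoch):
        if epoch < self.latest_epoch:
            return False          # stale epoch: the zombie's write is fenced off
        self.latest_epoch = epoch  # remember the newest epoch seen
        self.data[key] = value
        return True

store = EpochFencedStore()
store.conditional_write("/topics/t/0", "leader:1", epoch=1)       # old controller
store.conditional_write("/topics/t/0", "leader:2", epoch=2)       # new controller
ok = store.conditional_write("/topics/t/0", "leader:1", epoch=1)  # zombie retries
print(ok, store.data["/topics/t/0"])  # False leader:2
```

The zombie's late write is refused, and the state chosen by the current controller survives.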
21. Controller failover (expected in 2.1)
[Diagram: the controller-failover scenario again, now with the zombie old controller fenced. ZooKeeper records /controller → broker 2, and writes from the old controller no longer take effect on brokers 0–3 (steps 1–2).]
22. Summary
• Significant performance improvements in the controller in 1.1
  • Allows 10X more partitions in a Kafka cluster
• Better fencing of the zombie controller in 1.1 and 2.1
• More details in KAFKA-5027
23. Future work in controller
• Further improvement of controller failover
  • Standby controller
• Better handling of quick broker restarts (KAFKA-1120)
  • Broker generation