2. Architecture
(Diagram: brokers BK1–BK5 registered with a three-node ZooKeeper ensemble ZK1–ZK3)
Producer
- Serialize data
- Identify partition
- Send data (sync or async)
- Wait for ack when sending synchronously
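The "identify partition" step can be sketched as follows. Kafka's real default partitioner hashes the serialized key with murmur2 (and uses sticky partitioning for null keys); this sketch uses a stand-in hash just to show the "hash the key, mod the partition count" idea.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch of the producer's partition-selection step. The real default
// partitioner uses a murmur2 hash of the serialized key; Arrays.hashCode
// is a stand-in to illustrate the idea.
public class PartitionSketch {
    static int partitionFor(byte[] serializedKey, int numPartitions) {
        // Mask off the sign bit so the result is a valid partition index.
        return (Arrays.hashCode(serializedKey) & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        byte[] key = "order-42".getBytes(StandardCharsets.UTF_8);
        int p1 = partitionFor(key, 6);
        int p2 = partitionFor(key, 6);
        // The same key always lands on the same partition.
        System.out.println(p1 == p2 && p1 >= 0 && p1 < 6); // prints "true"
    }
}
```

Because partitioning is a pure function of the key, all messages with the same key preserve their order within one partition.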
Consumer
- Assigned a partition
- Polls for messages
- Sends heartbeats to the group coordinator
- Partition rebalance on membership change
- First consumer becomes the consumer group leader
- New partitions are assigned to the mount point with the fewest partitions
- A partition is restricted to one disk / mount point: log.dirs
- Max message size: message.max.bytes
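The log.dirs note above can be sketched: when a broker creates a new partition, it places it in the data directory that currently holds the fewest partitions. The directory names below are illustrative.

```java
import java.util.Map;

// Sketch of how a broker picks a log.dirs entry for a new partition:
// choose the directory that currently holds the fewest partitions.
public class LogDirChooser {
    static String pickDir(Map<String, Integer> partitionsPerDir) {
        String best = null;
        for (Map.Entry<String, Integer> e : partitionsPerDir.entrySet()) {
            if (best == null || e.getValue() < partitionsPerDir.get(best)) {
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical mount points with their current partition counts.
        Map<String, Integer> counts =
                Map.of("/data/kafka1", 12, "/data/kafka2", 7, "/data/kafka3", 9);
        System.out.println(pickDir(counts)); // prints "/data/kafka2"
    }
}
```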
Brokers
- All brokers register with ZooKeeper
- The first broker to start becomes the controller
- The controller is responsible for partition leader election
- Controller failure and re-election use ZooKeeper watch notifications
- The broker that creates the /controller znode becomes the controller
- A controller epoch, maintained via a ZooKeeper conditional increment operation, fences out stale controllers
- First consumer in a group becomes the group leader
- One broker is assigned as the group coordinator
- A change in the number of consumers triggers a partition rebalance
- All reads go through the partition leader
- The leader keeps track of the ISR (in-sync replicas)
/brokers/ids – ephemeral nodes – watched by Kafka components
/controller – ZooKeeper watch
- All writes go through the partition leader
- Writes block if in-sync replicas fall below min.insync.replicas
- On broker failure, the next in-sync replica is elected partition leader
(Diagram: follower replicas issuing fetch requests to the partition leader)
9. Fetch Request
• Topic, partition, offset
• Limits the data returned
– By size or number of messages
• Uses zero-copy for performance
– Data moves from the file system cache straight to the network buffer
• A minimum fetch size can be set to reduce network traffic (fetch.min.bytes)
– A maximum wait time in ms can also be set (fetch.max.wait.ms)
• Consumers only see data that has been replicated – the high water mark
• replica.lag.time.max.ms – a follower lagging longer than this is dropped from the ISR
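The high-water-mark rule above can be sketched: the HWM is the minimum log end offset across the in-sync replicas, and a fetch only returns records below it.

```java
import java.util.List;

// Sketch of high-water-mark visibility: consumers may only read offsets below
// the minimum log-end-offset of the in-sync replicas.
public class HighWaterMark {
    static long highWaterMark(List<Long> isrLogEndOffsets) {
        return isrLogEndOffsets.stream().mapToLong(Long::longValue).min().orElse(0L);
    }

    static List<String> visible(List<String> log, long hwm) {
        // Offsets here are just list indices; records at offset >= hwm are hidden.
        return log.subList(0, (int) Math.min(hwm, log.size()));
    }

    public static void main(String[] args) {
        List<String> log = List.of("m0", "m1", "m2", "m3"); // leader has 4 records
        long hwm = highWaterMark(List.of(4L, 2L, 3L));      // slowest ISR member at offset 2
        System.out.println(visible(log, hwm));              // prints "[m0, m1]"
    }
}
```

This is why a slow (but still in-sync) follower delays consumers: records past the slowest ISR member's offset stay invisible until replication catches up.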
10. Partition Allocation
• Rack awareness
• Equal distribution
• Leader at node A, followers at A+1, A+2
• Each partition is assigned to the directory with the fewest partitions
• Partitions are divided into segments
– A segment rolls at 1 GB or 1 week's worth of data, whichever comes first
• File handles stay open for all segments in all partitions
– The OS ulimit needs to be raised to allow enough open file handles
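The placement rule "leader at node A, followers at A+1, A+2" can be sketched as modular assignment over the broker list; rack awareness is omitted here for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of round-robin replica placement: partition i gets its leader on
// broker i mod N and its followers on the next brokers in sequence.
// Rack-aware placement is omitted.
public class ReplicaPlacement {
    static List<Integer> replicasFor(int partition, int numBrokers, int replicationFactor) {
        List<Integer> replicas = new ArrayList<>();
        for (int r = 0; r < replicationFactor; r++) {
            replicas.add((partition + r) % numBrokers); // leader first, then followers
        }
        return replicas;
    }

    public static void main(String[] args) {
        // 5 brokers, replication factor 3.
        System.out.println(replicasFor(0, 5, 3)); // prints "[0, 1, 2]"
        System.out.println(replicasFor(4, 5, 3)); // prints "[4, 0, 1]"
    }
}
```

Spreading leader and followers over consecutive brokers keeps the replica count per broker roughly equal, which is the "equal distribution" goal above.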
11. Files
• Indexes
– Map offsets in segment files -> positions within segments
– Each index corresponds to a data segment
– Indexes are purged along with their data
• Compaction
– Retention policy: delete|compact
– log.cleaner.enabled
– To delete a message, produce a message with the same key and a NULL value (a tombstone)
– The active segment is never compacted
– Compaction runs on topics where at least 50% of the log is dirty
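The compaction rules above can be sketched: keep only the latest value per key, and drop a key entirely when its latest record is a NULL tombstone. (A real cleaner also excludes the active segment; this toy compacts the whole list.)

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of log compaction: retain only the newest value per key; a record
// with a NULL value (a tombstone) deletes the key entirely.
public class CompactionSketch {
    record Rec(String key, String value) {} // value == null means tombstone

    static Map<String, String> compact(List<Rec> log) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (Rec r : log) {
            if (r.value() == null) {
                latest.remove(r.key());      // tombstone removes the key
            } else {
                latest.put(r.key(), r.value()); // newer value replaces older
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        List<Rec> log = List.of(
                new Rec("user1", "a"), new Rec("user2", "b"),
                new Rec("user1", "c"), new Rec("user2", null)); // tombstone for user2
        System.out.println(compact(log)); // prints "{user1=c}"
    }
}
```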
Message format (data file): Offset | Magic / Checksum | Compression Codec | Timestamp | Key Size | Key | Value Size | Value
12. Consumer
• Consumer Group
– Consumers -> partitions
• More consumers than partitions -> idle consumers
• Adding/dropping consumers -> partition rebalancing
– While rebalancing, the group can't consume messages
• Group membership is maintained by heartbeats to the group coordinator
– Heartbeats are sent during poll()
• A commit records the offset of the last message consumed
• A consumer crash leaves its assigned partition unprocessed until rebalance
– session.timeout.ms / max.poll.interval.ms
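The two liveness settings above can be set on the consumer as plain properties. The values shown are examples, not recommendations, and the group name is hypothetical.

```java
import java.util.Properties;

// Illustrative consumer liveness configuration. session.timeout.ms governs
// how long the coordinator waits for heartbeats before declaring the consumer
// dead; max.poll.interval.ms bounds the gap between poll() calls.
public class ConsumerLivenessConfig {
    static Properties livenessProps() {
        Properties props = new Properties();
        props.setProperty("group.id", "orders-service");     // hypothetical group name
        props.setProperty("session.timeout.ms", "10000");    // heartbeat deadline
        props.setProperty("max.poll.interval.ms", "300000"); // max time between poll() calls
        return props;
    }

    public static void main(String[] args) {
        System.out.println(livenessProps().getProperty("session.timeout.ms")); // prints "10000"
    }
}
```

Exceeding max.poll.interval.ms (for example, slow processing of a large batch) also triggers a rebalance, even if heartbeats are still arriving.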
13. Consumer Group
• One broker acts as the Group Coordinator
• Consumers make a JoinGroup request
• The first consumer becomes the group leader
• The leader receives details about all consumers
• It assigns partitions to consumers using a "PartitionAssignor"
• It sends the assignments to the Group Coordinator
• The Group Coordinator sends each consumer its relevant information, such as its assigned partitions
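The leader's PartitionAssignor step can be sketched with a simple round-robin assignment. Kafka ships range, round-robin, sticky, and cooperative-sticky assignors; this toy shows only the core idea of spreading partitions over members.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy round-robin partition assignor: the group leader spreads partitions
// evenly over the current members. Real Kafka assignors are more involved.
public class AssignorSketch {
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new HashMap<>();
        for (String c : consumers) assignment.put(c, new ArrayList<>());
        for (int p = 0; p < numPartitions; p++) {
            assignment.get(consumers.get(p % consumers.size())).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // 2 consumers, 5 partitions: the first consumer gets the extra partition.
        Map<String, List<Integer>> a = assign(List.of("c1", "c2"), 5);
        System.out.println(a.get("c1")); // prints "[0, 2, 4]"
        System.out.println(a.get("c2")); // prints "[1, 3]"
    }
}
```

With more consumers than partitions, some members end up with empty lists, which is the "idle consumers" case in section 12.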
16. Commits & Offsets
• The __consumer_offsets topic stores committed offsets
– enable.auto.commit = true
– auto.commit.interval.ms = 5000 (5 s, the default)
• With auto commit, data can be processed twice or missed during rebalancing
• Disable auto commit – enable.auto.commit = false
– poll() and process all the records
– Then commit:
• commitSync()
• commitAsync()
• commitAsync(new OffsetCommitCallback())
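The manual-commit pattern above (poll, process, then commitSync) can be simulated in-memory: since the offset only advances after processing, a crash between processing and commit causes reprocessing rather than loss (at-least-once delivery). The broker-side offset store is faked with a field here.

```java
import java.util.ArrayList;
import java.util.List;

// In-memory simulation of manual offset commits: process the batch first,
// then commit. A crash between processing and commit means the batch is
// re-read on restart (at-least-once), never lost.
public class ManualCommitSketch {
    long committedOffset = 0;                 // stand-in for __consumer_offsets
    final List<String> processed = new ArrayList<>();

    void pollAndProcess(List<String> log, boolean crashBeforeCommit) {
        List<String> batch = log.subList((int) committedOffset, log.size());
        processed.addAll(batch);              // process all records from poll()
        if (crashBeforeCommit) return;        // crash: offset never committed
        committedOffset = log.size();         // commitSync() analogue
    }

    public static void main(String[] args) {
        ManualCommitSketch c = new ManualCommitSketch();
        List<String> log = List.of("m0", "m1", "m2");
        c.pollAndProcess(log, true);           // crash before commit
        c.pollAndProcess(log, false);          // restart: same batch processed again
        System.out.println(c.processed);       // prints "[m0, m1, m2, m0, m1, m2]"
        System.out.println(c.committedOffset); // prints "3"
    }
}
```

The duplicate processing in the output is exactly the rebalancing hazard noted above; committing before processing would instead risk losing the batch (at-most-once).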