• A leader is elected among the Kafka Streams application instances using
the Apache Curator leader election recipe.
• The leader pushes any changes to StreamsMetadata to ZooKeeper.
• The client watches the ZooKeeper node and is notified of metadata changes.
• The client builds a partition -> (node, port) map and uses it to fetch the
state related to a given key.
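The client-side routing table described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the class `OneHopRouter`, the `HostPort` type, and the `onMetadataChange` callback are hypothetical names, and the hash-modulo partitioner is a stand-in that would have to match whatever partitioner the producers really use.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical client-side routing table: partition -> (host, port). */
class OneHopRouter {
    /** Address of the instance hosting the state for a partition. */
    static final class HostPort {
        final String host;
        final int port;
        HostPort(String host, int port) { this.host = host; this.port = port; }
        @Override public String toString() { return host + ":" + port; }
    }

    private final int numPartitions;
    private final Map<Integer, HostPort> routes = new ConcurrentHashMap<>();

    OneHopRouter(int numPartitions) { this.numPartitions = numPartitions; }

    /** Intended to be called from the ZooKeeper watch callback with the latest metadata. */
    void onMetadataChange(Map<Integer, HostPort> latest) {
        routes.putAll(latest);
        // Drop partitions that no longer appear in the published metadata.
        routes.keySet().retainAll(latest.keySet());
    }

    /** Assumed partitioner: non-negative hash of the key bytes modulo the partition count. */
    int partitionFor(String key) {
        int h = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        return (h & 0x7fffffff) % numPartitions;
    }

    /** Returns the instance to query directly, or null if the partition is unknown. */
    HostPort locate(String key) {
        return routes.get(partitionFor(key));
    }
}
```

With such a map the client can send a query for a key straight to the owning instance, which is what makes the lookup a single hop.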
OneHop Queries Implementation Details
• Currently, a state store is queryable only when its task is in the RUNNING state.
• When the primary task fails, a standby is promoted to primary. Before it
can start processing messages or queries, it must rebuild its state by
reading from the local state store and the changelog Kafka topic.
• As the standby can be arbitrarily far behind the primary, the amount of
changelog to read can be huge. During this time the state store remains
non-queryable.
• We made changes so that the store is queryable even when the task is in the
RESTORATION (PARTITION_ASSIGNED) state.
• This increases the availability of microservices built on top of Kafka Streams.
Queryable State during Restoration
• RocksDB is used for storing state in Kafka Streams, and RocksDB
works best with SSDs.
• As state grows, it becomes costlier to keep all the data on SSDs.
• We need to move not-so-recent data to HDDs, where it is still
queryable but does not occupy space on SSDs.
• We implemented configurable storage policies, such as an Archival Policy
and a TTL Policy.
• The Archival Policy moves data that has not been touched for a long
(configurable) time from SSD to HDD.
• On top of this, the user can also choose the TTL Policy to completely
remove state from HDD that has not been modified for a long time.
• In addition to the actual data in the store, we also store the time at which
each key was modified. This information is stored in RocksDB.
• This data in RocksDB has the format <timestamp>#<key> -> key.
• Since RocksDB supports efficient range queries, this data allows us to
find keys that have not been modified for a long time.
• This helps us enforce the storage policies.
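The timestamp index above can be illustrated with a small sketch. It uses a `TreeMap` as a stand-in for RocksDB's ordered keyspace; the class and method names are hypothetical, and the zero-padded timestamp is an assumption made here so that lexicographic order matches numeric order (timestamps are assumed to be non-negative).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical modification-time index with entries "<timestamp>#<key>" -> key. */
class ModificationIndex {
    // Stand-in for RocksDB: a sorted map offers the same kind of ordered range scan.
    private final TreeMap<String, String> index = new TreeMap<>();
    // Last modification time per key, used to delete the stale index entry on update.
    private final Map<String, Long> lastModified = new HashMap<>();

    private static String indexKey(long timestampMs, String key) {
        // Fixed-width, zero-padded timestamp so string order equals time order.
        return String.format("%019d#%s", timestampMs, key);
    }

    /** Records that the key was modified at the given time. */
    void touch(String key, long timestampMs) {
        Long previous = lastModified.put(key, timestampMs);
        if (previous != null) {
            index.remove(indexKey(previous, key));
        }
        index.put(indexKey(timestampMs, key), key);
    }

    /** Keys whose last modification is strictly older than cutoffMs, oldest first. */
    List<String> keysNotModifiedSince(long cutoffMs) {
        String upperBound = String.format("%019d#", cutoffMs);
        return new ArrayList<>(index.headMap(upperBound, false).values());
    }
}
```

A policy runner could call `keysNotModifiedSince` periodically, moving the returned keys to HDD for archival or deleting them for TTL.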
Storage Policies Implementation Details
• Currently, Kafka Streams does not support rack-awareness while
assigning tasks.
• This means that both the primary and standby tasks may be
allocated on the same rack.
• This would result in poor fault tolerance in case of whole-rack failures.
• We have a config in StreamsConfig called "RACK_ID_CONFIG", which
helps StickyPartitionAssignor assign tasks in such a way that no two
replica tasks are on the same rack, if possible.
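The "different rack if possible" rule can be sketched as below. This is only an illustration of the placement preference, not the assignor's actual logic: `RackAwarePlacement` and `chooseStandby` are hypothetical names, the rack of each node is assumed to be known and non-null, and load balancing and stickiness are ignored.

```java
import java.util.Map;

/** Hypothetical helper that picks a node for a task's standby replica. */
class RackAwarePlacement {
    /**
     * nodeToRack maps node id -> rack id; primaryNode hosts the primary task.
     * Prefers a node on a different rack, falls back to any other node on the
     * same rack, and returns null when no other node exists.
     */
    static String chooseStandby(String primaryNode, Map<String, String> nodeToRack) {
        String primaryRack = nodeToRack.get(primaryNode);
        String fallback = null;
        for (Map.Entry<String, String> entry : nodeToRack.entrySet()) {
            if (entry.getKey().equals(primaryNode)) {
                continue; // never co-locate the standby with the primary
            }
            if (!entry.getValue().equals(primaryRack)) {
                return entry.getKey(); // different rack: best choice
            }
            if (fallback == null) {
                fallback = entry.getKey(); // same rack: use only if nothing better
            }
        }
        return fallback;
    }
}
```

The fallback branch corresponds to the "if possible" qualifier: when every other node shares the primary's rack, a same-rack standby is still better than none.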
Rack Aware Task allocation
• Changelog topics are log-compacted Kafka topics with infinite retention.
• As the amount of application state increases, changelog topics also grow
in size, and infinite retention leads to ever-growing storage consumption on
the Kafka cluster.
• To reduce space pressure on the Kafka cluster, we implemented a mechanism
that allows a configurable retention time for changelog Kafka topics.
• When a standby task fails and is restarted on a different machine, it tries to
copy state directly from the machine on which the primary task is running.
It also replays the changelog Kafka topic so that the state on the
standby task node is up to date.
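The slides do not say how the finite retention is configured. One way to bound a compacted topic with stock Kafka topic settings is to combine compaction with time-based deletion, sketched below. The helper `ChangelogRetention.topicConfig` is hypothetical; `cleanup.policy` and `retention.ms` are standard Kafka topic-level configuration keys.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper that builds topic-level overrides for a changelog topic. */
class ChangelogRetention {
    static Map<String, String> topicConfig(long retentionDays) {
        Map<String, String> config = new HashMap<>();
        // "compact,delete" keeps log compaction and also enables time-based deletion.
        config.put("cleanup.policy", "compact,delete");
        // Log segments older than this become eligible for deletion.
        config.put("retention.ms", String.valueOf(TimeUnit.DAYS.toMillis(retentionDays)));
        return config;
    }
}
```

With bounded retention the changelog alone may no longer contain the full state, which is consistent with the standby first copying state from the primary's machine and then replaying the changelog.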
Finite Retention in Changelog Topic
• The Kafka Streams library allows application developers to write streaming
applications.
• Kafka Streams borrowed a few ideas from Apache Samza and provided
new features like Standby Tasks and Interactive Query support.
• We implemented a few features that help improve
performance (one-hop queries), availability (queryable state during the
RESTORATION phase), fault tolerance (rack-aware task allocation), etc.