Kafka Streams Architecture and Key Features


Giridhar Addepalli

Streaming applications using Kafka Streams


Agenda
01 Kafka Streams Architecture
02 Active / Standby Tasks
03 Interactive Queries
04 One-Hop Queries
05 Queryable State during Restoration
06 Storage Policies
07 Rack-Aware Task Allocation
08 Finite Retention in Changelog Topic
09 Conclusion
10 Q/A

One-Hop Queries Implementation Details
• A leader is elected among the Kafka Streams application instances using
the Apache Curator leader election recipe.
• The leader pushes any changes to StreamsMetadata to ZooKeeper.
• The client watches the ZooKeeper node and is notified of metadata changes.
• The client builds a partition -> (node, port) map and uses it to fetch the
state for a given key directly from the instance that hosts it.
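
The client-side routing map described above can be sketched as follows. This is a minimal illustration, not the presenters' implementation: the class name OneHopRouter, the callback name on_metadata_change, and the shape of the metadata dictionary are hypothetical, and zlib.crc32 is only a stand-in hash. A real client would have to use the same partitioner as the producer that wrote the input topic, and the callback would be driven by a ZooKeeper watch rather than called by hand.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash (assumption): a real client must mirror the partitioner
    # used by the producer that wrote the input topic.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class OneHopRouter:
    """Hypothetical client-side cache of partition -> (host, port)."""

    def __init__(self, num_partitions: int):
        self.num_partitions = num_partitions
        self.routes = {}  # partition -> (host, port)

    def on_metadata_change(self, metadata: dict) -> None:
        # metadata maps (host, port) -> list of partitions hosted there.
        # In the design above this would be invoked from a ZooKeeper watch.
        routes = {}
        for endpoint, partitions in metadata.items():
            for partition in partitions:
                routes[partition] = endpoint
        self.routes = routes  # replace the whole map in one assignment

    def endpoint_for(self, key: str):
        # Returns the instance to query, or None if the partition is unknown.
        return self.routes.get(partition_for(key, self.num_partitions))


router = OneHopRouter(num_partitions=4)
router.on_metadata_change({("host-a", 7070): [0, 1], ("host-b", 7070): [2, 3]})
print(router.endpoint_for("user-42"))
```

Because the client resolves the owning instance locally, a lookup needs a single network hop instead of asking an arbitrary instance that then forwards the request.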

Queryable State during Restoration
• Currently, a state store is queryable only when its task is in the
RUNNING state.
• When the primary task fails, a standby is promoted to primary. Before it
can start processing messages or queries, it must rebuild its state by
reading from the local state store and the changelog Kafka topic.
• Because a standby can be arbitrarily far behind the primary, the amount
of changelog to read can be huge. During this time the state store
remains non-queryable.
• We made changes so that the store is queryable even when the task is in
the RESTORATION (PARTITION_ASSIGNED) state.
• This increases the availability of microservices built on top of Kafka
Streams.
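
The difference between the two behaviours can be illustrated with a toy store that checks the task state before answering a query. This is only a sketch under assumptions: the classes below are hypothetical and do not correspond to Kafka Streams classes, the state names follow the wording on the slide, and a store queried during restoration may presumably return values that are older than what the failed primary last held.

```python
from enum import Enum


class TaskState(Enum):
    CREATED = "CREATED"
    PARTITION_ASSIGNED = "PARTITION_ASSIGNED"  # restoration phase, per the slide
    RUNNING = "RUNNING"


class StoreNotQueryable(Exception):
    pass


class QueryableStore:
    """Hypothetical store that gates reads on the owning task's state."""

    def __init__(self, allow_reads_during_restoration: bool):
        self.state = TaskState.CREATED
        self.data = {}
        self.queryable_states = {TaskState.RUNNING}
        if allow_reads_during_restoration:
            self.queryable_states.add(TaskState.PARTITION_ASSIGNED)

    def apply_changelog_record(self, key, value):
        # Restoration replays changelog records into the local store.
        self.data[key] = value

    def get(self, key):
        if self.state not in self.queryable_states:
            raise StoreNotQueryable(self.state.value)
        return self.data.get(key)


store = QueryableStore(allow_reads_during_restoration=True)
store.state = TaskState.PARTITION_ASSIGNED
store.apply_changelog_record("user-42", 7)
print(store.get("user-42"))  # served while restoration is still in progress
```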

Storage Policies
• Kafka Streams uses RocksDB to store state, and RocksDB works best on
SSDs.
• As state grows, it becomes costly to keep all of it on SSDs.
• We need to move not-so-recent data to HDDs, where it is still
queryable but does not occupy space on SSDs.
• We implemented configurable storage policies, such as an Archival Policy
and a TTL Policy.
• The Archival Policy moves data that has not been touched for a long
(configurable) time from SSD to HDD.
• On top of this, the user can also enable the TTL Policy to completely
remove state from HDD that has not been modified for a long time.
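
As a rough sketch of how the two policies could combine, the snippet below decides where an entry belongs from how long it has been idle. The class StoragePolicy, its field names, and the "ssd" / "hdd" / "delete" labels are hypothetical and are not taken from the presenters' code. It also assumes that both policies are driven by the last-modified time, although the slide describes archival in terms of data not being "touched".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoragePolicy:
    archive_after_s: float         # idle time before moving SSD -> HDD
    ttl_s: Optional[float] = None  # idle time before deletion; None disables TTL

    def placement(self, last_modified_s: float, now_s: float) -> str:
        idle = now_s - last_modified_s
        if self.ttl_s is not None and idle >= self.ttl_s:
            return "delete"
        if idle >= self.archive_after_s:
            return "hdd"
        return "ssd"


policy = StoragePolicy(archive_after_s=3600, ttl_s=86400)
print(policy.placement(last_modified_s=0, now_s=7200))
```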

Storage Policies Implementation Details
• In addition to the actual data in the store, we also record the time at
which each key was last modified. This information is stored in RocksDB.
• The entries in RocksDB have the format <timestamp>#<key> -> key.
• Because RocksDB supports efficient range queries, this index lets us
find the keys that have not been modified for a long time.
• This is what lets us enforce the storage policies.
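
The <timestamp>#<key> index can be mimicked with a sorted list, which stands in here for RocksDB's ordered key space. This is a sketch under assumptions rather than the presenters' code: timestamps are zero-padded to a fixed width so that string order matches numeric order, and a separate key -> timestamp map (hypothetical) is kept so that the stale index entry can be dropped when a key is modified again.

```python
import bisect


class ModificationIndex:
    WIDTH = 20  # zero-pad timestamps so lexicographic order equals numeric order

    def __init__(self):
        self.entries = []        # sorted "<timestamp>#<key>" strings
        self.last_modified = {}  # key -> timestamp (assumption, see lead-in)

    def _entry(self, ts: int, key: str) -> str:
        return f"{ts:0{self.WIDTH}d}#{key}"

    def touch(self, key: str, ts: int) -> None:
        # Remove the previous index entry for this key, if any, then add the new one.
        old = self.last_modified.get(key)
        if old is not None:
            i = bisect.bisect_left(self.entries, self._entry(old, key))
            del self.entries[i]
        bisect.insort(self.entries, self._entry(ts, key))
        self.last_modified[key] = ts

    def keys_not_modified_since(self, cutoff_ts: int) -> list:
        # Range scan: every entry that sorts before "<cutoff>#" is older than the cutoff.
        end = bisect.bisect_left(self.entries, self._entry(cutoff_ts, ""))
        return [entry.split("#", 1)[1] for entry in self.entries[:end]]


index = ModificationIndex()
index.touch("a", 100)
index.touch("b", 200)
index.touch("c", 300)
print(index.keys_not_modified_since(250))
```

The scan only visits entries older than the cutoff, which is why an ordered store makes policy enforcement cheap compared with checking every key.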

Rack-Aware Task Allocation
• Currently, Kafka Streams does not support rack awareness when
allocating tasks.
• This means that the primary and standby tasks can both be allocated
on the same rack.
• This results in poor fault tolerance if a whole rack fails.
• We have a config in StreamsConfig called "RACK_ID_CONFIG" that helps
StickyPartitionAssignor assign tasks so that, where possible, no two
replicas of a task are on the same rack.
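
The "different rack if possible" rule can be sketched as a small placement function. This is an illustration only: the function place_standby and its arguments are hypothetical, it does not reproduce the logic of the assignor named on the slide, and choosing the least-loaded candidate is an assumed tie-breaking rule.

```python
def place_standby(active_instance, instance_racks, load):
    """Choose an instance for the standby copy of a task.

    instance_racks: instance id -> rack id
    load: instance id -> number of tasks already assigned
    """
    active_rack = instance_racks[active_instance]
    candidates = [i for i in instance_racks if i != active_instance]
    if not candidates:
        return None  # no other instance available to host a standby
    other_rack = [i for i in candidates if instance_racks[i] != active_rack]
    pool = other_rack or candidates  # same rack only when unavoidable
    return min(pool, key=lambda i: (load.get(i, 0), i))


racks = {"i1": "rack-1", "i2": "rack-1", "i3": "rack-2"}
print(place_standby("i1", racks, {"i2": 0, "i3": 5}))
```

In the usage above the standby goes to the instance on the other rack even though it carries more load, which reflects the priority of rack separation over balance in this sketch.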

Finite Retention in Changelog Topic
• Changelog topics are log-compacted Kafka topics with infinite retention
time.
• As the amount of application state increases, the changelog topics also
grow, and infinite retention leads to ever-growing storage consumption
on the Kafka cluster.
• To reduce space pressure on the Kafka cluster, we implemented a mechanism
that allows a configurable retention time for the changelog Kafka topic.
• When a standby task fails and is restarted on a different machine, it
tries to copy state directly from the machine on which the primary task
is running. It also replays the changelog Kafka topic so that the state
on the standby task's node is up to date.
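
The recovery path in the last bullet (copy from the primary, then replay the changelog) can be sketched as below. The function restore_standby, the idea of a snapshot offset, and the record layout are assumptions made for illustration, and the sketch treats a None value as a delete marker. The slides do not describe how the copy and the replay are coordinated.

```python
def restore_standby(primary_snapshot, snapshot_offset, changelog):
    """Rebuild standby state from a copy of the primary plus changelog replay.

    primary_snapshot: dict copied from the primary task's store
    snapshot_offset: last changelog offset already reflected in the snapshot
    changelog: retained records as (offset, key, value), in offset order;
               a value of None is treated as a delete
    """
    state = dict(primary_snapshot)
    for offset, key, value in changelog:
        if offset <= snapshot_offset:
            continue  # already contained in the copied snapshot
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state


snapshot = {"a": 1, "b": 2}
retained = [(9, "a", 0), (11, "b", 3), (12, "a", None), (13, "c", 7)]
print(restore_standby(snapshot, 10, retained))
```

With finite retention the changelog alone may no longer hold every record needed to rebuild the store, which is presumably why the copy from the primary comes first and the replay only covers the records after it.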

Conclusion
• The Kafka Streams library lets application developers write streaming
applications.
• Kafka Streams borrowed a few ideas from Apache Samza and added new
features such as Standby Tasks and Interactive Query support.
• We implemented a few features that improve performance (one-hop
queries), availability (queryable state during the RESTORATION phase),
and fault tolerance (rack-aware task allocation), among others.

