ksqlDB: A Stream-Relational Database System

ksqlDB: A Stream-Relational
Database System
Matthias J. Sax | Software Engineer
@MatthiasJSax

The Vision of a Streaming Database
2
SQL abstraction over mutable TABLEs and immutable append-only STREAMs
• OLTP?
• transaction, concurrent updates
• OLAP?
• complex ad-hoc queries
• Online Stream Processing:
• Continuous queries over STREAMs and TABLEs
• Simple state lookups into TABLEs
• Subscribe to data streams
• Streaming ETL
• Materialized View maintenance

Apache Kafka and ksqlDB
3
Broker Cluster
“Storage Layer”
ksqlDB Cluster
“Compute Layer”
Network

Replicated & partitioned, immutable append-only log of messages
Apache Kafka: Data Model
4

Messages:
• plain bytes: key (for partitioning), value
• long timestamp
Topic cleanup policies:
• retention: purge old messages (time or size based)
• compaction: keep the latest message per key
Apache Kafka: Data Model (cont.)
5

ksqlDB: Data Model
Streams and tables of structured records
7
CREATE STREAM clickstream (
time BIGINT,
url VARCHAR,
status INTEGER,
bytes INTEGER,
user_id VARCHAR,
agent VARCHAR)
WITH (
kafka_topic = ‘cs_topic',
value_format = 'JSON'
);
CREATE TABLE users (
user_id INTEGER PRIMARY KEY,
registered_at LONG,
username VARCHAR,
name VARCHAR,
city VARCHAR,
level VARCHAR)
WITH (
kafka_topic = ‘users_topic',
value_format = 'AVRO’
);

ksqlDB: Queries
Persistent continuous queries (CQ)
• Take one or more input STREAMs/TABLEs and compute a result STREAM or TABLE
• Deployed in the ksqlDB servers
• Executed in a data-parallel manner
• Scalable
• Fault-tolerant
8
CREATE STREAM resultStream
AS SELECT...
CREATE TABLE resultTable
AS SELECT...

ksqlDB: Queries (cont.)
Transient client queries
• CLI / Java client / etc.
• Pull queries: simple state lookups against TABLEs
• “classic” queries
• Limited in ksqlDB (no aggregations or joins)
• Push queries: subscription to a result STREAM
9

ksqlDB: Overview
10
Kafka Cluster
(stores STREAMs/TABLEs as partitioned logs)
ksqlDB cluster
Upstream
Producers
CQ
Downstream
Consumers
App App
CQ
Push (Subscription) Pull (Query)

Queries Semantics
12
Streaming queries have temporal semantics based on event-time
TABLE queries:
• Like regular SQL
• However: TABLEs are treated as “versioned” tables that evolve over time
How to query a STREAM?

Simple Stream Queries
13
Filter, projection etc (stateless)
CREATE STREAM user_clicks AS
SELECT user_id, status, ucase(agent)
FROM clickstream
WHERE user_id = ‘mjsax’;

Stream Aggregation
14
Windowed / non-windowed: returns a continuously updating TABLE
CREATE TABLE clicks AS
SELECT user_id, COUNT(url)
FROM clickstream
WHERE bytes > 1024
WINDOW TUMBLIND
(size 30 seconds)
GROUP BY user_id
HAVING COUNT(url) > 20;

Stream-Stream Join
15
Sliding-window join:
14:041 14:162 14:083
14:01A 14:11B 14:23C
14:041⨝A 14:162⨝B 14:113⨝B
max(l.ts; r.ts)
CREATE STREAM joinedStream AS
SELECT *
FROM leftStream AS l JOIN rightStream AS r
WITHIN 5 minutes ON l.id = r.id;

Chaining stream-stream joins is not associative!
• Order matters: ⨝(s1,s2,s3) != (s1 ⨝ s2) ⨝ s3 != (s1 ⨝ s3) ⨝ s2 != (s2 ⨝ s3) ⨝ s1
N-way Stream-Stream Join
16

17
14:06X 14:21Y
14:212⨝Y⨝b
14:16b14:11a
14:011 14:26314:162
* window size=5min

18
14:06X 14:21Y
14:011 14:26314:162
14:212⨝Y14:061⨝X 14:263⨝Y
* window size=5min

19
14:16b14:11a
14:212⨝Y⨝b14:111⨝Y⨝a
14:212⨝Y14:061⨝X 14:263⨝Y
* window size=5min

Table Queries
20
As in regular SQL, however…

2121
Do you think that’s a table you are querying ?

Table Queries
22
Filters, projections, and aggregations are executed on changelog streams
Query

Stream-Table Join
23
Data enrichment via table lookups

Stream-Table Join (cont.)
24
Stream-Table join is a temporal join
14:01a 14:03b 14:05c 14:08b 14:11a
14:02… 14:04… 14:07…14:06… 14:10…
14:01a
14:03b
14:05c
14:05
14:01a
14:08b
14:05c
14:08
14:11a
14:08b
14:05c
14:11
14:01a
14:03b
14:03
14:01a
14:01
14:06 14:07 14:1014:0414:02

Table-Table Join
25
Table-Table join is a temporal join
users
details
profile
v1 v5 v6
v2 v6
v2 v5 v6

Persistent Queries
Executed as Kafka Streams programs
Kafka Streams:
• Java client library (part of the Apache Kafka project)
• High level DSL to process Kafka topics as KStreams and KTables
• Executes a dataflow program, represented as a directed graph of operators
• filter()/map()
• groupBy()/windowedBy()
• aggregate()
• join()
• Etc.
• Scalable
• Fault-Tolerant
27

Persistent Queries
Kafka Stream topologies:
28
Parallel execution:

Kafka Streams Internals
Consumers, Producers, and RocksDB
29

Kafka Streams Internals (cont.)
Data repartitioning and fault-tolerance
30

Fault-Tolerance and Hot Standbys
31
Recovery: replying the
changelog topic
Hot standbys: eagerly
replaying the changelog
topic
Hot standbys allow for
instant fail-over.
Hot standbys allow for HA
pull queries.

Elasticity
32
Dynamic scaling as special case of store recovery

Work in Progress
Streaming SQL
• Concise and more powerful language
• Improved time/operator semantics
Consistency guarantees
• ksqlDB is an async system: vector clock approach for improved time tracking
Applications and transient queries
• More efficient and more scalable pull/push queries
• Improved INSERT/UPDATE/DELETE support?
34

Future Work
Query optimization
• Currently state: rule based
• filter push down
• merging of repartition topic
• merging of input/output/changelog topics
• Streaming cost model and cost-based optimization
• Adaptive re-optimization at runtime
• Query merging/splitting
Runtime improvements:
• Internal and optimized (binary) data format to avoid expensive (de)serialization costs
• Task assignment: load balancing vs stickyness
Richer SQL:
• Sub-query support
35

References
• KSQL: Streaming SQL Engine for Apache Kafka
Hojjat Jafarpour, Rohan Desai, Damian Guy
EDBT '19: Proceedings of the 22nd International Conference on Extending Database Technology, 2019
• Streams and Tables: Two Sides of the Same Coin
Matthias J. Sax, Guozhang Wang, Matthias Weidlich, Johann-Christoph Freytag
BIRTE '18: Proceedings of the International Workshop on Real-Time Business Intelligence and Analytics, 2018
• https://kafka.apache.org/books-and-papers
• https://ksqldb.io/
36

Thanks! We are hiring!
@MatthiasJSax
matthias@confluent.io | mjsax@apache.org

ksqlDB: A Stream-Relational Database System

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to ksqlDB: A Stream-Relational Database System

Similar to ksqlDB: A Stream-Relational Database System (20)

More from confluent

More from confluent (20)

Recently uploaded

Recently uploaded (20)

ksqlDB: A Stream-Relational Database System