Talk URL: https://kafka-summit.org/sessions/kafka-102-streams-tables-way/
Video recording: https://www.confluent.io/kafka-summit-san-francisco-2019/kafka-102-streams-and-tables-all-the-way-down
Abstract: Streams and Tables are the foundation of event streaming with Kafka, and they power nearly every conceivable use case: payment processing, change data capture, streaming ETL, real-time alerting for connected cars, and even the lowly WordCount example. Tables are familiar to most of us from the world of databases, whereas streams are a rather new concept. Trying to leverage Kafka without understanding streams and tables is like building a rocket ship without understanding the laws of physics: a mission bound to fail. In this session for developers, operators, and architects alike, we take a deep dive into these two fundamental primitives of Kafka’s data model. We discuss how streams and tables, including global tables, relate to each other and to topics, partitioning, compaction, and serialization (Kafka’s storage layer), and how they interplay to process data, react to data changes, and manage state in an elastic, scalable, fault-tolerant manner (Kafka’s compute layer). Developers will better understand how to use streams and tables to build event-driven applications with Kafka Streams and KSQL, and we answer questions such as “How can I query my tables?” and “What is data co-partitioning, and how does it affect my join?”. Operators will better understand how these applications run in production, answering questions such as “How do I scale my application?” and “When my application crashes, how will it recover its state?”. At a higher level, we will explore how Kafka uses streams and tables to turn the database inside out and put it back together.
Kafka 102: Streams and Tables All the Way Down | Kafka Summit San Francisco 2019
Slide 1
Kafka 102:
Streams and Tables All the Way Down
Kafka Summit San Francisco, Sep 2019
Michael G. Noll
Technologist, Office of the CTO, Confluent
@miguno
Slide 8
Event Streaming Paradigm
Kafka is highly scalable, durable, persistent, maintains order, and is fast (low latency).
Example: Kafka as the source of truth at The New York Times, storing every article since 1851. Normalized assets (images, articles, bylines, etc.) are denormalized into a “Content View”.
https://www.confluent.io/blog/publishing-apache-kafka-new-york-times/
Streams record history, even hundreds of years of it.
Slide 9
Streams record history: “The ledger of sales.”
Tables represent state: “The sales totals.”
Slide 10
1. e4 e5
2. Nf3 Nc6
3. Bc4 Bc5
4. d3 Nf6
5. Nbd2
Streams record history: “The sequence of moves.”
Tables represent state: “The state of the board.”
Slide 12
The key to mutability is … the event.key!
                                                 Stream    Table
Has unique key constraint?                       No        Yes
First event with key ‘alice’ arrives             INSERT    INSERT
Another event with key ‘alice’ arrives           INSERT    UPDATE
Event with key ‘alice’ and value == null arrives INSERT    DELETE
Event with key == null arrives                   INSERT    <ignored>

RDBMS analogy: a Stream is ~ a Table that has no unique key and is append-only.
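The table above can be sketched in a few lines of Python. This is an illustrative model, not Kafka's actual implementation: a stream appends every event, while a table upserts by key and treats a null value as a tombstone.

```python
# A minimal sketch of how a stream and a table interpret
# the same sequence of keyed events.

def apply_to_stream(stream, key, value):
    """A stream is append-only: every event is an INSERT."""
    stream.append((key, value))

def apply_to_table(table, key, value):
    """A table upserts by key; a null value is a tombstone (DELETE);
    an event without a key is ignored."""
    if key is None:
        return                # <ignored>
    if value is None:
        table.pop(key, None)  # DELETE
    else:
        table[key] = value    # INSERT or UPDATE

stream, table = [], {}
for key, value in [("alice", "Paris"), ("alice", "Rome"), ("alice", None)]:
    apply_to_stream(stream, key, value)
    apply_to_table(table, key, value)

# The stream retains all three events; the table ends up empty because
# the final tombstone deleted key 'alice'.
```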
Slide 15
Streams record history. Tables represent state.
Duality: an aggregation (like SUM, COUNT) turns a stream into a table, and the table’s changes form a stream again.*
*See Streams and Tables: Two Sides of the Same Coin, M. Sax et al., BIRTE ’18
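The duality can be sketched in plain Python (illustrative, not the Kafka Streams API): a COUNT aggregation folds a stream into a table, each table update is emitted as a change event, and replaying that change stream reconstructs the table exactly.

```python
# Sketch of the stream-table duality: aggregate a stream into a table
# while emitting a changelog, then rebuild the table from the changelog.

def aggregate_count(stream_events):
    table, changelog = {}, []
    for key, _value in stream_events:
        table[key] = table.get(key, 0) + 1   # update the table...
        changelog.append((key, table[key]))  # ...and emit the change
    return table, changelog

def replay(changelog):
    """Replaying the stream of changes reconstructs the table."""
    table = {}
    for key, count in changelog:
        table[key] = count
    return table

events = [("alice", "e4"), ("bob", "e5"), ("alice", "Nf3")]
table, changelog = aggregate_count(events)
# table == {"alice": 2, "bob": 1}, and replay(changelog) == table
```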
Slide 19
Data storage of a Kafka topic is partitioned
which impacts data processing as we see later
[Diagram: a topic’s storage is split across partitions P1–P4]
Slide 20
Producers determine target partition of an event
through a partitioning function ƒ(event.key)
[Diagram: two producer clients write to a topic’s partitions P1–P4; an event is sent and appended to partition 1]
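A default-style partitioning function can be sketched as a key hash modulo the partition count. Kafka's default producer partitioner actually uses murmur2 on the serialized key; CRC32 below is just a stand-in to keep the sketch dependency-free.

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Map an event key deterministically to a partition
    (stand-in for Kafka's murmur2-based default partitioner)."""
    return zlib.crc32(key) % num_partitions

# The same key always maps to the same partition...
assert partition_for(b"alice", 4) == partition_for(b"alice", 4)
# ...but changing the partition count generally changes the mapping,
# which is one way the same key can end up in different partitions.
```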
Slide 21
Events with the same key should be in the same partition
to ensure proper ordering of related events
to ensure that processing by consumers returns the expected results
[Diagram: yellow events (same key) are always stored in partition 3, no matter which producer client sends them]
Slide 22
Top causes for the same key ending up in different partitions:
1. You increased or decreased the number of partitions
2. A producer uses a custom partitioner
→ Be careful in these situations!
Slide 23
Partitions play a central role in Kafka
Topics are partitioned. Partitions enable scalability, elasticity, fault-tolerance.
In the storage layer (brokers), data is stored in, replicated based on, log-compacted based on, and read from and written to partitions. In the processing layer (Kafka Streams, KSQL, etc.), data is joined and processed based on partitions.
Slide 25
Topics live in the Kafka storage layer, the ‘filesystem’ (brokers).
Streams and tables live in the Kafka processing layer (Kafka Streams, KSQL).
Slide 26
Topics vs. Streams and Tables
Storage layer (brokers): a Topic is raw bytes, e.g. 00100 11101 11000 00011 00100 00110
Processing layer (KSQL, Kafka Streams): a Stream is the topic plus a schema (serdes), e.g. alice→Paris, bob→Sydney, alice→Rome
A Table is the stream plus an aggregation, e.g. alice→Rome, bob→Sydney
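This layering can be sketched as follows (JSON stands in for the serde, and the field names are made up for illustration): the topic holds raw bytes, the serde turns them into a typed stream, and an aggregation over the stream yields a table.

```python
import json

# Storage layer: a topic is just raw bytes.
raw_topic = [b'{"key": "alice", "value": "Paris"}',
             b'{"key": "bob", "value": "Sydney"}',
             b'{"key": "alice", "value": "Rome"}']

# Processing layer: a stream is the topic read through a schema (serde).
stream = [(e["key"], e["value"]) for e in map(json.loads, raw_topic)]

# A table is the stream plus an aggregation (here: latest value per key).
table = {}
for key, value in stream:
    table[key] = value

# table == {"alice": "Rome", "bob": "Sydney"}
```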
Slide 28
Streams and Tables are partitioned, too
And so is their processing!
[Diagram: partitions P1–P4 feed Stream Tasks 1–4; each task holds its partition’s share of the KTable / TABLE state, e.g. 2 GB, 3 GB, 5 GB, and 2 GB]
Slide 29
Global Tables give complete data to every task
Great for joins without re-keying the input, or for broadcasting information
[Diagram: with a GlobalKTable, every stream task holds the full data of all partitions: 2 + 3 + 5 + 2 = 12 GB each]
Slide 30
Tables are cached in ‘state stores’ on local disk
Tables and other state don’t need to fit into RAM.
This enables large-scale state and saves $$$ on cloud instances.
But: The ‘source of truth’ of a table is its underlying topic!
[Diagram: Stream Task 1 caches its Table and Global Table on local disk under /var (default storage engine: RocksDB)]
Slide 31
Tables and other state are always fault-tolerant
because they are backed by Kafka topics (cf. event sourcing).
[Diagram: Stream Task 1 on machine A (KSQL, Kafka Streams) performs a streaming backup via the network into the table’s changelog topic (log-compacted) in Kafka storage; after failover, Stream Task 1 on machine B performs a streaming restore via the network from that topic]
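The backup/restore cycle can be sketched like this (a plain list stands in for the changelog topic; no actual Kafka involved): every local state update is also appended to the changelog, and after a crash the state is rebuilt by replaying it.

```python
changelog_topic = []  # stand-in for the table's changelog topic in Kafka

def run_task_on_machine_a(events):
    """Every local state update is also appended to the changelog
    (the streaming backup)."""
    table = {}
    for key, value in events:
        table[key] = value
        changelog_topic.append((key, value))
    return table

def recover_task_on_machine_b():
    """After failover, the state is rebuilt by replaying the changelog
    (the streaming restore)."""
    table = {}
    for key, value in changelog_topic:
        table[key] = value
    return table

state_a = run_task_on_machine_a([("alice", "Paris"), ("alice", "Rome")])
# ...machine A crashes; machine B takes over:
state_b = recover_task_on_machine_b()
assert state_a == state_b == {"alice": "Rome"}
```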
Slide 32
Standby Replicas speed up application recovery
App instances can optionally maintain copies of another instance’s local state stores to minimize failover times
[Diagram: with num.standby.replicas = 0 (default), App Instance 1 runs Stream Task 1 and App Instance 2 runs Stream Task 2; with num.standby.replicas = 1, App Instance 1 also maintains a standby copy of Stream Task 2, so on failure Stream Task 2 fails over to App Instance 1 quickly]
Slide 34
Tables <-> topic log-compaction
A table’s underlying topic is compacted by default
to save Kafka storage space
to speed up failover and recovery for processing
[Diagram: the full log contains multiple events per key (alice→Paris, alice→Bern, alice→Rome, bob→Sydney, bob→Zurich); after compaction only the latest value per key remains: alice→Rome and bob→Zurich]
Note: Compaction intentionally removes part of a table’s history. If you need the full
history and don’t have the historic data elsewhere, consider disabling compaction.
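Log compaction can be sketched as "keep only the latest event per key, and drop tombstoned keys" (an illustrative model; real compaction runs incrementally in the background on log segments):

```python
def compact(log):
    """Keep only the most recent value per key; a None value is a
    tombstone that removes the key entirely."""
    latest = {}
    for key, value in log:   # later events overwrite earlier ones
        latest[key] = value
    return [(k, v) for k, v in latest.items() if v is not None]

log = [("alice", "Paris"), ("alice", "Bern"), ("bob", "Sydney"),
       ("alice", "Rome"), ("bob", "Zurich")]
assert compact(log) == [("alice", "Rome"), ("bob", "Zurich")]
assert compact([("alice", "Rome"), ("alice", None)]) == []  # tombstone
```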
Slide 35
TL;DR for Log Compaction
Have a Stream?
→ Disable log compaction for its topic (= default)
Have a Table?
→ Enable log compaction for its topic (= default)
Disable it only when needed; see the previous slide
Slide 36
Topics vs. Streams and Tables

Concept      | Schema         | Partitioned | Unbounded | Ordering | Mutable | Unique key constraint | Fault-tolerant
Storage layer:
Topic        | No (raw bytes) | Yes         | Yes       | Yes      | No      | No                    | Yes
Processing layer:
Stream       | Yes            | Yes         | Yes       | Yes      | No      | No                    | Yes
Table        | Yes            | Yes         | No*       | No       | Yes     | Yes                   | Yes
Global Table | Yes            | No          | No*       | No       | Yes     | Yes                   | Yes

*Generally speaking the answer is Yes but, in practice, tables are almost always bounded due to a finite key space.
Slide 38
Max processing parallelism = #input partitions
[Diagram: a topic’s partitions P1–P4 are consumed by Application Instances 1–4; Application Instances 5 and 6 sit idle because there are only four partitions]
→ Need higher parallelism? Increase the original topic’s partition count.
→ Higher parallelism for just one use case? Derive a new topic from the original with a higher partition count. Lower its retention to save storage.
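The parallelism cap can be sketched with a simple partition-to-instance assignment (round-robin here; Kafka's actual partition assignors differ in detail): with more instances than partitions, the surplus instances get no work.

```python
def assign(partitions, instances):
    """Spread partitions over instances round-robin; with more
    instances than partitions, the surplus instances stay idle."""
    assignment = {inst: [] for inst in instances}
    for idx, partition in enumerate(partitions):
        assignment[instances[idx % len(instances)]].append(partition)
    return assignment

partitions = ["P1", "P2", "P3", "P4"]
instances = [f"instance-{i}" for i in range(1, 7)]   # 6 instances
work = assign(partitions, instances)
# instance-5 and instance-6 are idle: only 4 partitions exist
assert work["instance-5"] == [] and work["instance-6"] == []
```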
Slide 39
How to increase the number of partitions when needed
KSQL example: the statement below creates a new stream with the desired number of partitions.
CREATE STREAM products_repartitioned
  WITH (PARTITIONS=30) AS
  SELECT * FROM products;
Slide 40
‘Hot’ partitions can be problematic, often caused by
1. Events not being evenly distributed across partitions
2. Events being evenly distributed, but certain events taking longer to process

Strategies to address hot partitions include
1a. Ingress: find a better partitioning function ƒ(event.key) for producers
1b. Storage: re-partition the data into a new topic if you can’t change the original
2. Scale processing vertically, e.g. with more powerful CPU instances
[Diagram: partitions P1–P4, with one hot partition receiving a disproportionate share of events]
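Cause 1 can be made visible by counting events per partition: a heavily skewed key distribution concentrates load on one partition. CRC32 modulo stands in for the partitioner, and the key names are made up.

```python
import zlib
from collections import Counter

def partition_for(key: str, num_partitions: int) -> int:
    return zlib.crc32(key.encode()) % num_partitions

# 90% of events share one key, so that key's partition runs 'hot'.
keys = ["big-customer"] * 90 + [f"user-{i}" for i in range(10)]
load = Counter(partition_for(k, 4) for k in keys)
hot_partition = max(load, key=load.get)
assert load[hot_partition] >= 90
```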
Slide 41
Joining Streams and Tables
Data must be ‘co-partitioned’
[Diagram: a Stream joined with a Table produces a join output (a stream)]
Slide 42
Joining Streams and Tables
Data must be ‘co-partitioned’
[Diagram: the stream-side event (alice, Paris) from the stream’s partition P2 has a matching entry for alice in the table’s partition P2, so the join emits female]
Slide 43
Joining Streams and Tables
Data is looked up in same partition number
Scenario 2
[Diagram: key ‘alice’ exists in multiple table partitions (male in P1, female in P2); the entry in P2 (female) is used because the stream-side event (alice, Paris) comes from the stream’s partition P2]
Slide 44
Joining Streams and Tables
Data is looked up in same partition number
Scenario 3
[Diagram: key ‘alice’ exists only in the table’s P1, which is != P2; the stream-side event (alice, Paris) from the stream’s P2 finds no match in the table’s P2, so the join result is null]
Slide 45
Data co-partitioning requirements in detail
1. Same keying scheme for both input sides
2. Same number of partitions
3. Same partitioning function ƒ(event.key)
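A stream-table join under these three rules can be sketched as: compute the stream event's partition number, then look up the key in the table partition with that same number. CRC32 stands in for the shared partitioning function.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    return zlib.crc32(key.encode()) % num_partitions

def stream_table_join(stream_event, table_partitions):
    """Look up the stream event's key in the table partition with the
    SAME partition number; the match is None when the key is absent."""
    key, value = stream_event
    p = partition_for(key, len(table_partitions))
    return (key, value, table_partitions[p].get(key))

num_partitions = 3
table_partitions = [dict() for _ in range(num_partitions)]
# The table side is written with the same partitioner -> co-partitioned:
table_partitions[partition_for("alice", num_partitions)]["alice"] = "female"

assert stream_table_join(("alice", "Paris"), table_partitions) == \
    ("alice", "Paris", "female")
```

If the table side had been written with a different partitioner or partition count, the lookup would land in the wrong partition and silently return None, which is exactly the failure mode of scenario 3.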
Further Reading on Joining Streams and Tables:
https://www.confluent.io/kafka-summit-sf18/zen-and-the-art-of-streaming-joins
https://docs.confluent.io/current/ksql/docs/developer-guide/partition-data.html
Slide 46
Why is that so?
Because of how input data is mapped to stream tasks
[Diagram: Stream Task 2 reads the stream’s P2 and the table’s P2 from the storage layer via the network and maintains the corresponding processing state; each stream task is wired to the same partition number on every input]
Slide 47
How to re-partition your data when needed
KSQL example: the statement below creates a new stream with a changed number of partitions and a new field as the event.key (so that its data is now correctly co-partitioned for joining).
CREATE STREAM products_repartitioned
WITH (PARTITIONS=42) AS
SELECT * FROM products
PARTITION BY product_id;
Slide 48
Joining Streams and Global Tables
No need to worry about co-partitioning!
That’s because each stream task has the full data from all the table’s partitions:
[Diagram: a Stream joined with a Global Table produces a join output (a stream); every stream task, e.g. Stream Task 1, holds 2 + 3 + 5 + 2 = 12 GB, i.e. the entire table]
Slide 49
Capacity Planning and Sizing
Sorry, not covered here! I recommend:
KSQL Performance Tuning for Fun and Profit, by Nick Dearden
October 1, 2019 @ 2:50-3:30 pm, Stream Processing track
https://kafka-summit.org/sessions/ksql-performance-tuning-fun-profit/
Slide 50
Streams and Tables in KSQL
Process event streams to create new, continuously updated streams or tables with a push query:

CREATE TABLE OrderTotals AS SELECT * FROM ... EMIT CHANGES

[Diagram: Streams 01–03 feed a continuously running push query that maintains a Table]
Slide 51
Streams and Tables in KSQL
Query tables similar to a relational database with a pull query (new feature):

SELECT * FROM OrderTotals WHERE region = 'Europe'

[Diagram: a pull query against a Table returns a result, like a lookup in a relational database]
Slide 52
Streams and Tables in KSQL
Query tables from other apps with push or pull queries
Other applications (Java, Go, Python, etc.) can directly query tables via the network (KSQL REST API):

SELECT * FROM OrderTotals WHERE region = 'Europe'
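From Python, such a query is sent as a JSON payload to the KSQL server's REST API. The sketch below only builds the request body; the server address and exact endpoint path are deployment details and not shown here.

```python
import json

def build_query_request(sql: str) -> str:
    """Build the JSON body for a query request against the
    KSQL REST API (statement goes in the "ksql" field)."""
    return json.dumps({"ksql": sql, "streamsProperties": {}})

body = build_query_request(
    "SELECT * FROM OrderTotals WHERE region = 'Europe';")
# POST this body to the KSQL server's query endpoint over HTTP.
```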
Slide 53
Streaming import/export of Tables
KSQL integrates with Kafka Connect
CREATE SOURCE CONNECTOR my-postgres-jdbc WITH (
  connector.class = "io.confluent.connect.jdbc.JdbcSourceConnector",
  connection.url = "jdbc:postgresql://dbserver:5432/my-db", ...);
New feature
[Diagram: KSQL controls Kafka Connect, which controls the streaming import/export of tables]
Slide 54
KSQL example use case
Creating an event-driven dashboard from a customer database
[Diagram: a customers table in a database → Kafka Connect streams change events into a Stream → aggregations are computed in real time into a Table → the continuously updating results flow as a Stream into Elasticsearch, where they land as a Table (index) powering the dashboard]