ksqlDB is a stream processing SQL engine that enables stream processing on top of Apache Kafka. ksqlDB is based on Kafka Streams and provides capabilities for consuming messages from Kafka, analysing them in near real-time with a SQL-like language and producing results back to a Kafka topic. Not a single line of Java code has to be written, and you can reuse your SQL know-how. This lowers the bar for getting started with stream processing significantly.
ksqlDB offers powerful stream processing capabilities such as joins, aggregations, time windows and support for event time. In this talk I will present how ksqlDB integrates with the Kafka ecosystem and demonstrate how easy it is to implement a solution using ksqlDB for the most part. This will be done in a live demo on a fictitious IoT sample.
2. Guido
Working at Trivadis for more than 23 years
Consultant, Trainer, Platform Architect for Java, Oracle, SOA and Big Data / Fast Data
Oracle Groundbreaker Ambassador & Oracle ACE Director
@gschmutz guidoschmutz.wordpress.com
5. Kafka Streams - Overview
• Designed as a simple and lightweight library in
Apache Kafka
• Part of Apache Kafka project
(https://kafka.apache.org/documentation/streams/)
• no other dependencies than Kafka
• Supports fault-tolerant local state
• Supports Windowing (Fixed, Sliding and Session)
and Stream-Stream / Stream-Table Joins
• Millisecond processing latency, no micro-batching
• At-least-once and exactly-once processing
guarantees
KTable<Integer, Customer> customers =
  builder.table("customer");
KStream<Integer, Order> orders =
  builder.stream("order");
KStream<Integer, String> enriched =
  orders.leftJoin(customers, …);
enriched.to("orderEnriched");
(Diagram: a Java application embedding the Kafka Streams library, reading from and writing to the trucking_driver topic on the Kafka cluster)
6. ksqlDB: Streaming SQL Engine for Apache Kafka
• Separate open source project, not part of Apache
Kafka (http://ksqldb.io)
• The simplest way to process streams of data in real-time
• Enables stream processing with zero coding
• Use a familiar language (SQL dialect)
• Powered by Kafka and Kafka Streams
• scalable, distributed, mature
• Create materialized views over streams
• Receive real-time push updates or pull current state
on demand
• Kafka native - All you need is Kafka
(Diagram: the KSQL engine, built on Kafka Streams, runs inside the ksqlDB server against the trucking_driver topic on the Kafka cluster; the KSQL CLI submits commands to it)
8. Terminology
Stream
• an unbounded sequence of structured data
(“facts”)
• Facts in a stream are immutable: new facts can
be inserted to a stream, but existing facts can
never be updated or deleted
• Streams can be created from a Kafka topic or
derived from an existing stream
• A stream’s underlying data is durably stored
(persisted) within a Kafka topic on the Kafka
brokers
Table
• materialized view of events with only the
latest value per key
• a view of a stream, or another table, and
represents a collection of evolving facts
• the equivalent of a traditional database table
but enriched by streaming semantics such as
windowing
• Facts in a table are mutable: new facts can be
inserted to the table, and existing facts can be
updated or deleted
• Tables can be created from a Kafka topic or
derived from existing streams and tables
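The two abstractions map directly onto DDL statements. As a minimal sketch (topic and column names are made up for illustration), the same kind of Kafka topic could back either a stream or a table:

```sql
-- Stream: every record is an immutable fact
CREATE STREAM vehicle_position_s (
  vehicleId VARCHAR KEY,
  latitude DOUBLE,
  longitude DOUBLE)
WITH (kafka_topic='vehicle_position', value_format='JSON');

-- Table: only the latest value per key is retained
CREATE TABLE vehicle_position_t (
  vehicleId VARCHAR PRIMARY KEY,
  latitude DOUBLE,
  longitude DOUBLE)
WITH (kafka_topic='vehicle_position', value_format='JSON');
```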
13. Demo 1 – Create a new STREAM with refinement to AVRO
ksql> CREATE STREAM IF NOT EXISTS vehicle_tracking_refined_s
WITH (kafka_topic='vehicle_tracking_refined',
value_format='AVRO',
value_avro_schema_full_name=
'com.trivadis.avro.VehicleTrackingRefined')
AS SELECT truckId AS ROWKEY
, 'Tracking_SysA' AS source
, timestamp
, AS_VALUE(truckId) AS vehicleId
, driverId
, routeId
, eventType
, latitude
, longitude
, correlationId
FROM vehicle_tracking_sysA_s
PARTITION BY truckId
EMIT CHANGES;
14. Demo 1 – Create a STREAM on vehicle_tracking_sysB
ksql> CREATE STREAM IF NOT EXISTS vehicle_tracking_sysB_s (
ROWKEY VARCHAR KEY,
system VARCHAR,
timestamp VARCHAR,
vehicleId VARCHAR,
driverId BIGINT,
routeId BIGINT,
eventType VARCHAR,
latLong VARCHAR,
correlationId VARCHAR)
WITH (kafka_topic='vehicle_tracking_sysB',
value_format='DELIMITED');
15. Demo 1 – INSERT INTO existing stream SELECT from other stream
ksql> INSERT INTO vehicle_tracking_refined_s
SELECT ROWKEY
, 'Tracking_SysB' AS source
, timestamp
, vehicleId
, driverId
, routeId
, eventType
, cast(split(latLong,':')[1] as DOUBLE) as latitude
, CAST(split(latLong,':')[2] AS DOUBLE) as longitude
, correlationId
FROM vehicle_tracking_sysB_s
EMIT CHANGES;
17. CREATE STREAM
Create a new stream, backed by a Kafka topic, with the specified columns and properties
Supported column data types:
• BOOLEAN, INTEGER, BIGINT, DOUBLE, VARCHAR or STRING
• ARRAY<ArrayType>
• MAP<VARCHAR, ValueType>
• STRUCT<FieldName FieldType, ...>
Supports the following serialization formats: CSV, JSON, AVRO
• KSQL adds the implicit columns ROWTIME and ROWKEY to every stream
CREATE STREAM stream_name ( { column_name data_type } [, ...] )
WITH ( property_name = expression [, ...] );
18. SELECT (Push Query)
Push a continuous stream of updates to the ksqlDB stream or table
Result of this statement will not be persisted in a Kafka topic and will only be printed out in the console
This is a continuous query, to stop the query in the CLI press CTRL-C
• from_item is one of the following: stream_name, table_name
SELECT select_expr [, ...]
FROM from_item
[ LEFT JOIN join_table ON join_criteria ]
[ WINDOW window_expression ]
[ WHERE condition ]
[ GROUP BY grouping_expression ]
[ HAVING having_expression ]
EMIT output_refinement
[ LIMIT count ];
19. Functions
Scalar Functions
• ABS, ROUND, CEIL, FLOOR
• ARRAYCONTAINS
• CONCAT, SUBSTRING, TRIM
• EXTRACTJSONFIELD
• GEO_DISTANCE
• LCASE, UCASE
• MASK, MASK_KEEP_LEFT, MASK_KEEP_RIGHT,
MASK_LEFT, MASK_RIGHT
• RANDOM
• STRINGTOTIMESTAMP, TIMESTAMPTOSTRING
Aggregate Functions
• COUNT
• MAX
• MIN
• SUM
• TOPK
• TOPKDISTINCT
User-Defined Functions (UDF) and User-Defined
Aggregate Functions (UDAF)
• Currently only supported using Java
https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/functions/
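To give a feel for how these compose, here is a sketch of a scalar and an aggregate function applied to the streams used in the demos (column names as used there):

```sql
-- Scalar functions applied per row
SELECT driverId
     , MASK_LEFT(correlationId, 4) AS masked_corr
     , TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd HH:mm:ss') AS event_time
FROM vehicle_tracking_refined_s
EMIT CHANGES;

-- Aggregate function over a grouping
SELECT eventType, COUNT(*) AS nof
FROM problematic_driving_s
GROUP BY eventType
EMIT CHANGES;
```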
20. CREATE STREAM … AS SELECT …
Create a new KSQL stream along with the corresponding Kafka topic and stream the result of the SELECT
query as a changelog into the topic
WINDOW clause can only be used if the from_item is a stream
CREATE STREAM stream_name
[WITH ( property_name = expression [, ...] )]
AS SELECT select_expr [, ...]
FROM from_stream [ LEFT | FULL | INNER ]
JOIN [join_table | join_stream]
[ WITHIN [(before TIMEUNIT, after TIMEUNIT) | N TIMEUNIT] ] ON join_criteria
[ WHERE condition ]
[PARTITION BY column_name];
21. INSERT INTO … SELECT …
Stream the result of the SELECT query into an existing stream and its underlying topic
schema and partitioning column produced by the query must match the stream’s schema and key
If the schema and partitioning column are incompatible with the stream, then the statement will
return an error
stream_name and from_item must both
refer to a Stream. Tables are not supported!
CREATE STREAM stream_name ...;
INSERT INTO stream_name
SELECT select_expr [, ...]
FROM from_stream
[ WHERE condition ]
[ PARTITION BY column_name ];
22. Two Types of Queries
Pull queries
• allow you to fetch the current state of a materialized
view
• Because materialized views are updated incrementally
as new events arrive, pull queries run with predictably
low latency
Push queries
• enable you to subscribe to materialized view updates
and stream changes
• When new events arrive, push queries emit
refinements, so your event streaming applications can
react to new information in real-time
-- Push query: subscribe to changes
SELECT …
FROM vehicle_position_s
EMIT CHANGES;

-- Pull query: fetch current state
SELECT …
FROM vehicle_position_t
WHERE vehicleId = 10;
24. Demo 2 – Pull Query on vehicle_tracking
ksql> CREATE TABLE IF NOT EXISTS vehicle_tracking_refined_t
WITH (kafka_topic = 'vehicle_tracking_refined_t')
AS SELECT CAST(vehicleId AS BIGINT) vehicleId
, latest_by_offset(driverId) driverId
, latest_by_offset(source) source
, latest_by_offset(eventType) eventType
, latest_by_offset(latitude) latitude
, latest_by_offset(longitude) longitude
FROM vehicle_tracking_refined_s
GROUP BY CAST(vehicleId AS BIGINT)
EMIT CHANGES;
ksql> SELECT * FROM vehicle_tracking_refined_t
WHERE vehicleId = 42;
25. SELECT (Pull Query)
Pulls the current value from the materialized table and terminates
The result of this statement isn't persisted in a Kafka topic and is printed out only in the console
Pull queries enable you to fetch the current state of a materialized view
They're a great match for request/response flows and can be used with ksqlDB REST API
SELECT select_expr [, ...]
FROM aggregate_table
WHERE key_column = key
[ AND window_bounds ];
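For windowed tables, the optional window_bounds predicate restricts which windows are returned. A sketch against the tumbling-window table built in Demo 5 (the bound values are illustrative):

```sql
SELECT winstart, winend, eventType, nof
FROM event_type_by_1hour_tumbl_t
WHERE eventType = 'Overspeed'
  AND WINDOWSTART >= '2020-11-16T21:00:00'
  AND WINDOWEND <= '2020-11-16T23:00:00';
```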
30. Demo 4 – Create a TABLE on logisticsdb_driver
ksql> CREATE TABLE IF NOT EXISTS driver_t (
id BIGINT PRIMARY KEY,
first_name VARCHAR,
last_name VARCHAR,
available VARCHAR,
birthdate VARCHAR)
WITH (kafka_topic='logisticsdb_driver',
value_format='JSON');
ksql> SELECT * FROM driver_t EMIT CHANGES;
+----------+-------------------+---------------------+-------------------+-------------------------------+
|ID |FIRST_NAME |LAST_NAME |AVAILABLE |BIRTHDATE |
+----------+-------------------+---------------------+-------------------+-------------------------------+
|28 |Della |Mcdonald |Y |3491 |
|31 |Rosemarie |Ruiz |Y |3917 |
|12 |Laurence |Lindsey |Y |3060 |
|22 |Patricia |Coleman |Y |3875 |
|11 |Micky |Isaacson |Y |973 |
31. Demo 4 – Create a Stream with Enrichment by Driver
ksql> CREATE STREAM IF NOT EXISTS problematic_driving_and_driver_s
WITH (kafka_topic='problematic_driving_and_driver',
value_format='AVRO', partitions=8)
AS SELECT pd.driverId
, d.first_name
, d.last_name
, d.available
, pd.vehicleId
, pd.routeId
, pd.eventType
FROM problematic_driving_s pd
LEFT JOIN driver_t d
ON pd.driverId = d.id;
ksql> select * from problematic_driving_and_driver_s EMIT CHANGES;
1539713095921 | 11 | 11 | Micky | Isaacson | 67 | 160405074 | Lane Departure |
39.01 | -93.85
1539713113254 | 11 | 11 | Micky | Isaacson | 67 | 160405074 | Unsafe following
distance | 39.0 | -93.65
32. CREATE CONNECTOR
Create a new connector in the Kafka Connect cluster
with the configuration passed in the WITH clause
Kafka Connect is an open source component of
Apache Kafka that simplifies loading and exporting
data between Kafka and external systems
ksqlDB provides functionality to manage and
integrate with Connect
CREATE SOURCE | SINK CONNECTOR [IF NOT EXISTS] connector_name
WITH( property_name = expression [, ...]);
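As an illustration, a source connector that ingests the driver table used in Demo 4 might look as follows (connector class and connection settings are assumptions for a Confluent JDBC source against a local PostgreSQL; adapt them to your setup):

```sql
CREATE SOURCE CONNECTOR IF NOT EXISTS jdbc_driver_sc WITH (
  'connector.class'          = 'io.confluent.connect.jdbc.JdbcSourceConnector',
  'connection.url'           = 'jdbc:postgresql://postgresql/logisticsdb',
  'connection.user'          = 'demo',
  'connection.password'      = 'demo',
  'mode'                     = 'incrementing',
  'incrementing.column.name' = 'id',
  'table.whitelist'          = 'driver',
  'topic.prefix'             = 'logisticsdb_');
```

With this configuration the table `driver` would land in the topic `logisticsdb_driver`, matching the topic the `driver_t` table in Demo 4 is created on.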
Source: ksqlDB Documentation
33. CREATE TABLE
Create a new table with the specified columns and properties
Supports same data types as CREATE STREAM
KSQL adds the implicit columns ROWTIME and ROWKEY to every table as well
KSQL has currently the following requirements for creating a table from a Kafka topic
• message key must also be present as a field/column in the Kafka message value
• message key must be in VARCHAR aka STRING format
CREATE TABLE table_name ( { column_name data_type } [, ...] )
WITH ( property_name = expression [, ...] );
35. Windowing
• Tumbling Window
• Hopping Window
• Session Window

-- Tumbling window (fixed size, non-overlapping)
SELECT item_id, SUM(quantity)
FROM orders
WINDOW TUMBLING (SIZE 20 SECONDS)
GROUP BY item_id;

-- Hopping window (fixed size, overlapping)
SELECT item_id, SUM(quantity)
FROM orders
WINDOW HOPPING (SIZE 20 SECONDS,
ADVANCE BY 5 SECONDS)
GROUP BY item_id;

-- Session window (gap-based)
SELECT item_id, SUM(quantity)
FROM orders
WINDOW SESSION (20 SECONDS)
GROUP BY item_id;
36. Demo 5 – SELECT COUNT … GROUP BY
ksql> CREATE TABLE event_type_by_1hour_tumbl_t
WITH (kafka_topic = 'event_type_by_1hour_tumbl_t')
AS SELECT windowstart AS winstart
, windowend AS winend
, eventType
, count(*) AS nof
FROM problematic_driving_s
WINDOW TUMBLING (SIZE 60 minutes)
GROUP BY eventType;
ksql> SELECT TIMESTAMPTOSTRING(WINDOWSTART,'yyyy-MM-dd HH:mm:ss','CET') wsf
, TIMESTAMPTOSTRING(WINDOWEND,'yyyy-MM-dd HH:mm:ss','CET') wef
, eventType
, nof
FROM event_type_by_1hour_tumbl_t
EMIT CHANGES;
+----------------------------+---------------------------+---------------------------------+-------+
|WSF |WEF |EVENTTYPE |NOF |
+----------------------------+---------------------------+---------------------------------+-------+
|2020-11-16 21:00:00 |2020-11-16 22:00:00 |Unsafe following distance |1 |
|2020-11-16 21:00:00 |2020-11-16 22:00:00 |Lane Departure |1 |
|2020-11-16 21:00:00 |2020-11-16 22:00:00 |Unsafe tail distance |1 |
|2020-11-16 21:00:00 |2020-11-16 22:00:00 |Overspeed |1 |
|2020-11-16 21:00:00 |2020-11-16 22:00:00 |Overspeed |3 |
37. ksqlDB REST API
• The /status endpoint lets you poll the status of the command
• The /info resource gives you information about the status of a ksqlDB Server
• The /ksql resource runs a sequence of SQL statements
• The /query resource lets you stream the output records of a SELECT statement via a chunked
transfer encoding
curl -X POST -H 'Content-Type: application/vnd.ksql.v1+json' \
  -i http://dataplatform:8088/query \
  --data '{ "ksql":
    "SELECT * FROM problematic_driving_s EMIT CHANGES;",
    "streamsProperties": {
      "ksql.streams.auto.offset.reset": "latest" }
  }'
38. ksqlDB Native Client
• ksqlDB ships with a lightweight Java client
• enables sending requests easily to a ksqlDB server from within your Java application
• alternative to using the REST API
• Supports
• pull and push queries
• inserting new rows of data into existing ksqlDB streams
• creation and management of new streams and tables
• persistent queries
• admin operations such as listing streams, tables, and topics
https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-clients/java-client/
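A minimal sketch of a push query with the Java client (artifact io.confluent.ksql:ksqldb-api-client; host and port are assumptions for a locally running ksqlDB server):

```java
import io.confluent.ksql.api.client.Client;
import io.confluent.ksql.api.client.ClientOptions;
import io.confluent.ksql.api.client.Row;
import io.confluent.ksql.api.client.StreamedQueryResult;

public class PushQueryExample {
  public static void main(String[] args) throws Exception {
    ClientOptions options = ClientOptions.create()
        .setHost("localhost")   // assumption: local ksqlDB server
        .setPort(8088);
    Client client = Client.create(options);

    // Push query: subscribe to new rows as they arrive
    StreamedQueryResult result =
        client.streamQuery("SELECT * FROM problematic_driving_s EMIT CHANGES;").get();
    Row row = result.poll();    // blocks until the next row is available
    System.out.println(row.values());

    client.close();
  }
}
```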
39. Choosing the Right API
Consumer / Producer API
• Java, C#, C++, Scala, Python, Node.js, Go, PHP, …
• subscribe(), poll(), send(), flush()
• Anything Kafka
Kafka Streams
• Fluent Java API
• mapValues(), filter(), flush()
• Stream Analytics
KSQL
• SQL dialect
• SELECT … FROM …, JOIN ... WHERE, GROUP BY
• Stream Analytics
Kafka Connect
• Declarative, configuration-based
• REST API
• Out-of-the-box connectors
• Stream Integration
Flexibility ← → Simplicity
Source: adapted from Confluent
40. You are welcome to join us at the Expo area.
We're looking forward to meeting you.
Link to the Expo area:
https://www.vinivia-event-manager.io/e/DOAG/portal/expo/29731
My other talks at DOAG 2020:
18.11 – 10:00 – Big Data, Data Lake, Data Serialization Formats
18.11 – 13:00 – The Role of the Event Hub in a Modern Data Architecture
19.11 – 13:00 – Kafka Live Demo: Implementing a Streaming Solution #slideless