Streaming SQL for Apache Kafka

1Confidential
KSQL
An Open Source Streaming SQL Engine for Apache Kafka
Kai Waehner
Technology Evangelist
kontakt@kai-waehner.de
LinkedIn
@KaiWaehner
www.confluent.io
www.kai-waehner.de

2KSQL- Streaming SQL for Apache Kafka
Agenda
1) Apache Kafka Ecosystem
2) Motivation for KSQL
3) KSQL Concepts
4) Live Demo
5) KSQL Architecture
6) Use Case: Clickstream Analysis
7) Getting Started

Agenda
3) KSQL Concepts
4) Live Demo
7) Getting Started

Apache Kafka - A Distributed, Scalable Commit Log

The Log ConnectorsConnectors
Producer Consumer
Streaming Engine
Apache Kafka – The Rise of a Streaming Platform

Apache Kafka – The Rise of a Streaming Platform

KSQL – A Streaming SQL Engine for Apache Kafka

Agenda
3) KSQL Concepts
4) Live Demo
7) Getting Started

Why KSQL?
Population
CodingSophistication
Realm of Stream Processing
New, Expanded Realm
BI
Analysts
Core
Developers
Data
Engineers
Core Developers
who don’t like
Java

Trade-Offs
• subscribe()
• poll()
• send()
• flush()
• mapValues()
• filter()
• punctuate()
• Select…from…
• Join…where…
• Group by..
Flexibility Simplicity
Kafka Streams KSQL
Consumer
Producer

What is it for ?
Streaming ETL
• Kafka is popular for data pipelines
• KSQL enables easy transformations of data within the pipe
CREATE STREAM vip_actions AS
SELECT userid, page, action FROM clickstream c
LEFT JOIN users u ON c.userid = u.user_id
WHERE u.level = 'Platinum';

What is it for ?
Simple Derivations of Existing Topics
• One-liner to re-partition and / or re-key a topic for new uses
CREATE STREAM views_by_userid
WITH (PARTITIONS=6,
VALUE_FORMAT=‘JSON’,
TIMESTAMP=‘view_time’) AS
SELECT *
FROM clickstream
PARTITION BY user_id;

What is it for ?
Analytics, e.g. Anomaly Detection
• Identifying patterns or anomalies in real-time data, surfaced in milliseconds
CREATE TABLE possible_fraud AS
SELECT card_number, count(*)
FROM authorization_attempts
WINDOW TUMBLING (SIZE 5 MINUTES)
GROUP BY card_number
HAVING count(*) > 3;

What is it for ?
Real Time Monitoring
• Log data monitoring, tracking and alerting
• Sensor / IoT data
CREATE TABLE error_counts AS
SELECT error_code, count(*)
FROM monitoring_stream
WINDOW TUMBLING (SIZE 1 MINUTE)
WHERE type = 'ERROR'
GROUP BY error_code;

Where is KSQL not such a great fit (at least today)?
Powerful ad-hoc query
○ Limited span of time usually
retained in Kafka
○ No indexes
BI reports (Tableau etc.)
○ No indexes
○ No JDBC (most Bi tools are not
good with continuous results!)

Agenda
3) KSQL Concepts
4) Live Demo
7) Getting Started

KSQL – A Streaming SQL Engine for Apache Kafka

KSQL Concepts
● No need for source code
• Zero, none at all, not even one line.
• No SerDes, no generics, no lambdas, ...
● All the Kafka Streams “magic” out-of-the-box
• Exactly Once Semantics
• Windowing
• Event-time aggregation
• Late-arriving data
• Distributed, fault-tolerant, scalable, ...

STREAM and TABLE as first-class citizens

CREATE STREAM AS syntax
CREATE STREAM `stream_name`
[WITH (`property = expression` [, …] ) ]
AS SELECT `select_expr` [, ...]
FROM `from_item` [, ...]
[ WHERE `condition` ]
[ PARTITION BY `column_name` ]
● where property can be any of the following:
KAFKA_TOPIC = name - what to call the sink topic
FORMAT = DELIMITED | JSON | AVRO - defaults to format of input stream
AVROSCHEMAFILE = path/to/file - if FORMAT=AVRO, where the output schema file will be written to
PARTITIONS = # - number of partitions in sink topic
TIMESTAMP = column - The name of the column to use as the timestamp. This can be used to define the
event time.

CREATE TABLE AS syntax
CREATE TABLE `stream_name`
[WITH ( `property_name = expression` [, ...] )]
[ WINDOW `window_expression` ]
[ GROUP BY `grouping expression` ]
[ HAVING `having_expression` ]
● where property values are same as for ‚Create Streams as Select‘

SELECT statement syntax
SELECT `select_expr` [, ...]
[ LIMIT n ]
where from_item is one of the following:
stream_or_table_name [ [ AS ] alias]
from_item LEFT JOIN from_item ON join_condition

WINDOWing
● Not ANSI SQL ! à Continuous Queries
● Three types supported (same as Kafka Streams):
• TUMBLING (= SLIDING)
• SELECT appname, ip, COUNT(appname) AS problem_count FROM
logstream WINDOW TUMBLING (size 1 minute) WHERE loglevel='ERROR'
GROUP BY appname, ip;
• HOPPING
• SELECT itemid, SUM(arraycol[0]) FROM orders WINDOW HOPPING (
size 20 second, advance by 5 second) GROUP BY itemid;
• SESSION
• SELECT itemid, SUM(sales_price) FROM orders WINDOW SESSION (20
second) GROUP BY itemid;

Agenda
3) KSQL Concepts
4) Live Demo
7) Getting Started

Create a STREAM and a TABLE from Kafka Topics
ksql> CREATE STREAM pageviews_original (viewtime bigint, userid varchar, pageid varchar) WITH (kafka_topic='pageviews',
value_format='DELIMITED');
ksql> CREATE TABLE users_original (registertime bigint, gender varchar, regionid varchar, userid varchar) WITH
(kafka_topic='users', value_format='JSON');
ksql> SELECT pageid FROM pageviews_original LIMIT 3;
ksql> CREATE STREAM pageviews_female AS SELECT users_original.userid AS userid, pageid, regionid, gender FROM
pageviews_original LEFT JOIN users_original ON pageviews_original.userid = users_original.userid WHERE gender = 'FEMALE';

Live Demo – KSQL Hello World

Agenda
3) KSQL Concepts
4) Live Demo
7) Getting Started

KSQL - Components
KSQL has 3 main components:
1. The CLI, designed to be familiar to users of MySQL, Postgres etc.
2. The Engine which actually runs the Kafka Streams topologies
3. The REST server interface enables an Engine to receive instructions from the CLI
(Note that you also need a Kafka Cluster… KSQL is deployed independently)

Kafka Cluster
JVM
KSQL EngineRESTKSQL>
#1 STAND-ALONE AKA ‘LOCAL MODE’

#1 STAND-ALONE AKA ‘LOCAL MODE’
Starts a CLI, an Engine,
and a REST server all
in the same JVM
Ideal for laptop development
• Start with default settings:
• > bin/ksql-cli local
Or with customized settings:
• > bin/ksql-cli local –-properties-file foo/bar/ksql.properties

#2 CLIENT-SERVER
Kafka Cluster
JVM
KSQL Engine
REST
KSQL>
JVM
KSQL Engine
REST
JVM
KSQL Engine
REST

#2 CLIENT-SERVER
Start any number
of Server nodes
• > bin/ksql-server-start
Start any number of CLIs and
specify ‘remote’ server address
• >bin/ksql-cli remote http://myserver:8090
All running Engines share the processing load
• Technically, instances of the same Kafka Streams Applications
• Scale up / down without restart

#3 AS PRE-DEFINED APP
Kafka Cluster
JVM
KSQL Engine
JVM
KSQL Engine
JVM
KSQL Engine

#3 AS PRE-DEFINED APP
Running the KSQL server
with a pre-defined set of
instructions/queries
• Version control your queries and
transformations as code
Start any number of Engine
instances
• Pass a file of KSQL statements to execute
• > bin/ksql-node query-file=foo/bar.sql
All running Engines share the processing load
• Technically, instances of the same Kafka Streams Applications
• Scale up/down without restart

Dedicating resources

How do you deploy applications?

Where to develop and operate your applications?

Agenda
3) KSQL Concepts
4) Live Demo
7) Getting Started

Demo: Clickstream Analysis
Kafka
Producer
Elastic
search
Grafana
Kafka
Cluster
Kafka
Connect
KSQL
Stream of
Log Events

Demo: Clickstream Analysis
• https://github.com/confluentinc/ksql/tree/0.1.x/ksql-clickstream-demo#clickstream-analysis
• Leverages Apache Kafka, Kafka Connect, KSQL, Elasticsearch and Grafana
• 5min screencast: https://www.youtube.com/watch?v=A45uRzJiv7I
• Setup in 5 minutes (with or without Docker)
SELECT STREAM
CEIL(timestamp TO HOUR) AS timeWindow, productId,
COUNT(*) AS hourlyOrders, SUM(units) AS units
FROM Orders GROUP BY CEIL(timestamp TO HOUR),
productId;
timeWindow | productId | hourlyOrders | units
------------+-----------+--------------+-------
08:00:00 | 10 | 2 | 5
08:00:00 | 20 | 1 | 8
09:00:00 | 10 | 4 | 22
09:00:00 | 40 | 1 | 45
... | ... | ... | ...

Live Demo – KSQL Clickstream Analysis

Agenda
3) KSQL Concepts
4) Live Demo
7) Getting Started

KSQL Quick Start
github.com/confluentinc/ksql
Local runtime
or
Docker container

Remember: Developer Preview!
Caveats of Developer Preview
• No ORDER BY yet
• No Stream-stream joins yet
• Limited function library
• Avro support only via workaround
• Breaking API / Syntax changes still possible
BE EXCITED, BUT BE ADVISED

Resources and Next Steps
Get Involved
• Try the Quickstart on GitHub
• Check out the code
• Play with the examples
The point of a developer preview is to improve things—together!
https://github.com/confluentinc/ksql
http://confluent.io/ksql
https://slackpass.io/confluentcommunity #ksql

Kai Waehner
Technology Evangelist
kontakt@kai-waehner.de
@KaiWaehner
www.confluent.io
www.kai-waehner.de
LinkedIn
Questions? Feedback?
Please contact me…
Come to our booth…
Come to Kafka Summit London in April 2018…

Appendix

KSQL Concepts
● STREAM and TABLE as first-class citizens
● Interpretations of topic content
● STREAM - data in motion
● TABLE - collected state of a stream
• One record per key (per window)
• Current values (compacted topic) not yet
• Changelog
● STREAM – TABLE Joins

Schema & Format
● A Kafka broker knows how to move bytes
• Technically a key-value message (byte[], byte[])
● To enable declarative SQL-like queries and transformations we have to define
a richer structure
● Structural metadata maintained in an in-memory catalog
• DDL is recorded in a special topic

Schema & Format
Start with message (value) format
● JSON - the simplest choice
● DELIMITED - in this preview, the implicit delimiter is a comma and the escaping rules are built-in.
Will be expanded.
● AVRO - requires that you also supply a schema-file (.avsc)
Pseudo-columns are automatically provided
• ROWKEY, ROWTIME - for querying the message key and timestamp
• (PARTITION, OFFSET coming soon)
• CREATE STREAM pageview (viewtime bigint, userid varchar, pageid varchar) WITH
(value_format = 'delimited', kafka_topic='my_pageview_topic');

Schema & Datatypes
● varchar / string
● boolean / bool
● integer / int
● bigint / long
● double
● array(of_type) - of-type must be primitive (no nested Array or Map yet)
● map(key_type, value_type) - key-type must be string, value-type must be
primitive

Interactive Querying
● Great for iterative development
● LIST (or SHOW) STREAMS / TABLES
● DESCRIBE STREAM / TABLE
● SELECT
• Selects rows from a KSQL stream or table.
• The result of this statement will be printed out in the console.
• To stop the continuous query in the CLI press Ctrl+C.

SELECT statement syntax
SELECT `select_expr` [, ...]
[ LIMIT n ]
where from_item is one of the following:
stream_or_table_name [ [ AS ] alias]
from_item LEFT JOIN from_item ON join_condition

WINDOWing
● Not ANSI SQL ! à Continuous Queries :-)
● Three types supported (same as KStreams):
• TUMBLING (= SLIDING)
• SELECT appname, ip, COUNT(appname) AS problem_count FROM
logstream WINDOW TUMBLING (size 1 minute) WHERE loglevel='ERROR'
GROUP BY appname, ip;
• HOPPING
• SELECT itemid, SUM(arraycol[0]) FROM orders WINDOW HOPPING (
size 20 second, advance by 5 second) GROUP BY itemid;
• SESSION
• SELECT itemid, SUM(sales_price) FROM orders WINDOW SESSION (20
second) GROUP BY itemid;

CREATE STREAM AS SELECT
● Once your query is ready and you want to run your query non-interactively
• CREATE STREAM AS SELECT ...;
● Creates a new KSQL Stream along with the corresponding Kafka topic and
streams the result of the SELECT query into the topic
● To find what streams are already running:
• SHOW QUERIES;
● If you need to stop one:
• TERMINATE query_id;

CREATE STREAM AS syntax
CREATE STREAM `stream_name`
[WITH (`property = expression` [, …] ) ]
[ PARTITION BY `column_name` ]
● where property can be any of the following:
KAFKA_TOPIC = name - what to call the sink topic
FORMAT = DELIMITED | JSON | AVRO - defaults to format of input stream
AVROSCHEMAFILE = path/to/file - if FORMAT=AVRO, where the output schema file will be written to
PARTITIONS = # - number of partitions in sink topic
TIMESTAMP = column - The name of the column to use as the timestamp. This can be used to define the
event time.

CREATE TABLE AS SELECT
● Once your query is ready and you want to run it non-interactively
● CREATE TABLE AS SELECT ...;
● Just like ‚CREATE STREAM AS SELECT‘ but for aggregations

CREATE TABLE AS syntax
CREATE TABLE `stream_name`
[WITH ( `property_name = expression` [, ...] )]
● where property values are same as for ‚Create Streams as Select‘

Functions
● Scalar Functions:
• CONCAT, IFNULL, LCASE, LEN, SUBSTRING,TRIM, UCASE
• ABS, CEIL, FLOOR, RANDOM, ROUND
• StringToTimestamp, TimestampToString
• GetStringFromJSON
• CAST
● Aggregate Functions:
• SUM, COUNT, MIN, MAX
● User- defined Functions:
• Java Interface

Session Variables
● Just as in MySQL, ORCL etc. there are settings to control how your CLI
behaves
● Set any property the KStreams consumers/producers will understand
● Defaults can be set in the ksql.properties file
● To see a list of currently set or default variable values:
• ksql> show properties;
● Useful examples:
• num.stream.threads=4
• commit.interval.ms=1000
• cache.max.bytes.buffering=2000000
● TIP! - Your new best friend for testing or building a demo is:
• ksql> set ‘auto.offset.reset’ = ‘earliest’;

Streaming SQL for Apache Kafka

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Streaming SQL for Apache Kafka

Similar to Streaming SQL for Apache Kafka (20)

More from Kai Wähner

More from Kai Wähner (20)

Recently uploaded

Recently uploaded (20)

Streaming SQL for Apache Kafka