2. Confidential
August 2017, Kafka Summit SF Announcement
A Developer Preview of
KSQL
A Streaming SQL Engine for Apache Kafka™
from Confluent
3. Agenda
● What is KSQL for?
● Why KSQL?
● KSQL concepts
● Demo: Working with KSQL to process and visualize data
● Core concepts: Stream and Table
● Understand the KSQL ecosystem
● Roadmap
4. What is it for?
● Streaming ETL
○ Kafka is popular for data pipelines.
○ KSQL enables easy transformation of data within the pipeline
● Anomaly Detection
○ Identifying patterns or anomalies in
real-time data, surfaced in milliseconds
● Monitoring
○ Log data monitoring, tracking, and alerting
○ Sensor / IoT data
CREATE STREAM vip_actions AS
SELECT userid, page, action
FROM clickstream c
LEFT JOIN users u ON c.userid = u.user_id
WHERE u.level = 'Platinum';
CREATE TABLE possible_fraud AS
SELECT card_number, count(*)
FROM auth_attempts
WINDOW TUMBLING (SIZE 5 SECONDS)
GROUP BY card_number HAVING count(*) > 3;
CREATE TABLE error_counts AS
SELECT error_code, count(*)
FROM monitoring_strm
WINDOW TUMBLING (SIZE 1 MINUTE)
WHERE type = 'ERROR'
GROUP BY error_code;
5. Why KSQL?
Stream processing development is hard: it requires developer skills, which puts it out of
reach for data scientists, analysts, and other non-developers.
● SQL-based; simple and intuitive
● SQL simplifies deployment - no JARs, artifacts, or binaries; just run SQL
● Interact with and access your data via the CLI: SELECT * FROM <stream> WHERE <condition>;
● Easily get data into and out of Kafka (and process it along the way)
● Use SQL to process your data by leveraging Kafka Streams
● Built on Kafka and its Streams API: distributed, scalable, reliable, and real-time.
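For instance, an interactive CLI session might look like the following sketch (the stream and column names here are hypothetical):

```sql
-- Query live data straight from the CLI; no build or deploy step.
-- 'pageviews', 'userid', 'page', and 'region' are hypothetical names.
ksql> SELECT userid, page FROM pageviews WHERE region = 'EMEA' LIMIT 5;
```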
6. KSQL Concepts
● STREAM and TABLE as first-class citizens
● Interpretations of topic content
● STREAM - data in motion
● TABLE - collected state of a stream (aggregations)
○ One record per key (per window)
○ Current values (compacted topic) ← Not yet in KSQL
○ Changelog
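To make the stream/table duality concrete, the same underlying data can be read both ways; a minimal sketch (topic and column names are hypothetical):

```sql
-- Read the topic as a stream: each record is an immutable fact.
ksql> CREATE STREAM balance_changes (user varchar, amount bigint)
      WITH (kafka_topic = 'balance_changes', value_format = 'json');

-- Derive a table: the collected state, one evolving row per key.
ksql> CREATE TABLE balance_totals AS
      SELECT user, SUM(amount) AS balance
      FROM balance_changes GROUP BY user;
```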
11. Building a Stream
STREAM: A stream is an unbounded sequence of structured data (“facts”).
For example, a stream of financial transactions such as “Alice sent $100 to Bob, then
Charlie sent $50 to Bob”.
Facts in a stream are immutable: new facts can be inserted into a stream, but existing
facts can never be updated or deleted.
CREATE STREAM clickstream (_time bigint, time varchar, ip varchar, request
varchar, status int, userid varchar, bytes bigint, agent varchar)
WITH (kafka_topic = 'clickstream', value_format = 'json');
12. KSQL> Working with Streams
1. ksql> list TOPICS;
2. ksql> CREATE STREAM clickstream (_time bigint, time varchar, ip varchar,
request varchar, status int, userid varchar, bytes bigint, agent varchar)
with (kafka_topic = 'clickstream', value_format = 'json');
3. ksql> list STREAMS;
4. ksql> DESCRIBE CLICKSTREAM;
5. ksql> SELECT * from CLICKSTREAM limit 10;
6. ksql> SELECT * from CLICKSTREAM WHERE request like '%html%';
13. Create and Interact with a Table
TABLE: A table is a view of a STREAM and represents a collection of evolving facts.
We could have a table that contains the latest financial information such as:
“Bob’s current account balance is $150”.
Similar to a traditional database table but enriched by streaming semantics such as
windowing.
● Facts in a table are mutable: new facts can be inserted into the table, and
existing facts can be updated or deleted.
● Tables can be created from a Kafka topic or derived from streams and tables.
CREATE TABLE IP_SUM as SELECT ip, sum(bytes)/1024 as kbytes
FROM CLICKSTREAM WINDOW SESSION (300 second) GROUP BY ip;
14. KSQL CLI> Build a TABLE using SELECT
ksql> SELECT ip, sum(bytes)/1024 as kbytes FROM CLICKSTREAM WINDOW SESSION (300 second)
GROUP BY ip;
111.145.8.144 | 4
222.245.174.248 | 5
233.90.225.227 | 39
<<snip>>
ksql> CREATE TABLE IP_SUM as SELECT ip, sum(bytes)/1024 as kbytes FROM CLICKSTREAM
window SESSION (300 second) GROUP BY ip;
ksql> SELECT * from IP_SUM limit 10;
1504788602258 | 233.173.215.103 : Window{start=1504788556778 end=-} | 233.173.215.103 | 374
<<snip>>
15. KSQL CLI> Build a TABLE using SELECT (cont.)
ksql> LIST TABLES;
Table Name | Kafka Topic | Format | Windowed
----------------------------------------------
IP_SUM | IP_SUM | JSON | true
ksql> DESCRIBE IP_SUM;
Field | Type
---------------------------
ROWTIME | BIGINT
ROWKEY | VARCHAR(STRING)
IP | VARCHAR(STRING)
KBYTES | BIGINT
ksql> SELECT * from IP_SUM where IP like '%33%' limit 10;
1505314606146 | 233.203.236.146 : Window{start=1505314602405 end=-} | 233.203.236.146 | 4
16. Visualize the Table in Grafana
1. Build a timestamped TABLE from the table; Elasticsearch needs timestamped data
ksql> CREATE TABLE IP_SUM_TS as SELECT rowTime as event_ts, * FROM IP_SUM;
2. Start Elasticsearch
$ /etc/init.d/elasticsearch start
[....] Starting Elasticsearch Server
3. Start Grafana
$ /etc/init.d/grafana-server start
4. Connect the table IP_SUM_TS to Elasticsearch and add the datasource to Grafana
# cd /usr/share/doc/ksql-clickstream-demo/
# ./ksql-connect-es-grafana.sh ip_sum_ts
22. Recap: Stream-Table duality
● STREAM and TABLE as first-class citizens
● Interpretations of topic content
● STREAM - data in motion
● TABLE - collected state of a stream (aggregations)
○ One record per key (per window)
○ Current values (compacted topic) ← Not yet in KSQL
○ Changelog
23. Window Aggregations
Three types supported (same as Kafka Streams):
● TUMBLING: Fixed-size, non-overlapping, gap-less windows
○ SELECT ip, count(*) AS hits FROM clickstream
WINDOW TUMBLING (size 1 minute) GROUP BY ip;
● HOPPING: Fixed-size, overlapping windows
○ SELECT ip, SUM(bytes) AS bytes_per_ip_and_bucket FROM clickstream
WINDOW HOPPING ( size 20 second, advance by 5 second) GROUP BY ip;
● SESSION: Dynamically-sized, non-overlapping, data-driven windows
○ SELECT ip, SUM(bytes) AS bytes_per_ip FROM clickstream
WINDOW SESSION (20 second) GROUP BY ip;
More: http://docs.confluent.io/current/streams/developer-guide.html#windowing
24. Resources and Admin
● LIST TOPICS;
● LIST STREAMS;
● LIST TABLES;
● SHOW PROPERTIES;
● LIST QUERIES;
● If you need to stop one:
○ TERMINATE <query-id>;
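A cleanup session might look like this sketch (the query id shown is illustrative; take the real id from the LIST QUERIES output):

```sql
ksql> LIST QUERIES;   -- note the id of the persistent query to stop
ksql> TERMINATE 2;    -- stops the query with id 2
```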
26. Developing in KSQL
● Interactive development using the CLI
● Capture SQL commands in a file, e.g. stream-application.sql
● Automate setup into your CI
● <more tools coming>
set 'commit.interval.ms'='2000';
set 'cache.max.bytes.buffering'='10000000';
set 'auto.offset.reset'='earliest';
DROP STREAM clickstream;
CREATE STREAM clickstream (_time bigint, time varchar, ip varchar, request varchar,
status int, userid int, bytes bigint, agent varchar) with (kafka_topic = 'clickstream',
value_format = 'json');
DROP TABLE events_per_min;
create table events_per_min as select userid, count(*) as events from clickstream
window TUMBLING (size 10 second) group by userid;
-- VIEW - Enrich with rowTime
DROP TABLE events_per_min_ts;
CREATE TABLE events_per_min_ts as select rowTime as event_ts, * from
events_per_min;
28. Mode #1: Stand-alone, aka 'local mode'
● Starts a CLI, an Engine, and a REST server all in
the same JVM
● Ideal for laptop development etc.
○ Use with default settings:
> bin/ksql-cli local
○ Or with customized settings:
> bin/ksql-cli local --properties-file
foo/bar/ksql.properties
● Careful with service and command topic naming!
(more on this in a moment...)
29. Mode #2: Client-Server
● Start any number of Server nodes
○ > bin/ksql-server-start
○ > bin/ksql-server-start --properties-file
foo.properties
● Start any number of CLIs, specifying a server
address as the 'remote' endpoint
○ > bin/ksql-cli remote http://server:8090
● All Engines share the work
○ Instances of the same KStreams Apps
○ Scale up/down without restarting
30. KSQL Session Variables
● Just as in MySQL, Oracle, etc., there are settings to control how your CLI behaves
● Defaults can be set in the ksql.properties file
● To see a list of currently set or default variable values:
○ ksql> show properties;
● Useful examples:
○ num.stream.threads=4
○ commit.interval.ms=1000
○ cache.max.bytes.buffering=2000000
● TIP! - Your new best friend for testing or building a demo is:
○ ksql> set 'auto.offset.reset' = 'earliest';
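The other properties listed above are set the same way in a session; the values here are illustrative:

```sql
ksql> set 'num.stream.threads' = '4';
ksql> set 'commit.interval.ms' = '1000';
ksql> set 'cache.max.bytes.buffering' = '2000000';
```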
31. Roadmap, 2018
● GA of current feature set. Improved quality, stability, and operations
● Complete our view of what a SQL streaming platform should provide for
Streams and Tables
● Additional aggregate functions. We will continue to expand the set of analytics
functions
● Testing tools. Many data platforms suffer from an inherent inability to test. With
KSQL, testing capability is a primary focus, and we will provide frameworks to
support continuous integration and unit testing
[subject to change]
32. Kafka Summit is coming to London!
April 23-24, 2018
Subscribe for updates on CFP, sponsorships and more at
www.kafka-summit.org