This document discusses KSQL, a streaming SQL engine for Apache Kafka. It begins by explaining key KSQL concepts such as streams, tables, queries, and windowing. It then provides examples of creating streams and tables from Kafka topics, exploring their contents, and joining streams to tables. Finally, it outlines common KSQL usage patterns such as streaming ETL, anomaly detection, real-time monitoring, and data transformation, with code examples in KSQL's declarative language throughout.
2. Neil is a senior engineer and technologist at Confluent, the
company founded by the creators of Apache Kafka®. He has over
20 years of experience working on distributed computing,
messaging and stream processing. He has built or redesigned
commercial messaging platforms and distributed caching products,
and developed large-scale bespoke systems for tier-1 banks.
After a period at ThoughtWorks, he went on to build some of the
first distributed risk engines in financial services. In 2008 he
launched a startup that specialised in distributed data analytics and
visualization. Prior to joining Confluent he was the CTO at a
fintech consultancy.
Neil Avery
Senior Engineer and Technologist, Confluent
7. KSQL Concepts
• Streams are first-class citizens
• Tables are first-class citizens
• Two types of query: persistent and transient
• Persistent queries populate streams and tables
• Transient queries serve interactive user sessions
• All queries run until terminated
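The distinction between the two query types can be sketched with the clickstream example used later in the deck (illustrative only; assumes a clickstream stream with a status column, as on the following slides, and an errors stream name chosen here for illustration):

```sql
-- Transient query: results stream back to the CLI session
-- and stop when the session terminates the query
SELECT status, bytes FROM clickstream;

-- Persistent query: continuously populates a new stream backed by
-- a Kafka topic, and keeps running until explicitly terminated
CREATE STREAM errors AS
  SELECT * FROM clickstream WHERE status >= 400;
```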
8. CREATE STREAM clickstream
WITH (
  value_format='JSON',
  kafka_topic='my_clickstream_topic'
);
Creating a Stream
• “Data in motion”
• Let’s say we have a topic called my_clickstream_topic
• The topic contains JSON data
• KSQL now knows about that topic
9. Exploring that Stream
SELECT status, bytes
FROM clickstream
WHERE user_agent =
'Mozilla/5.0 (compatible; MSIE 6.0)';
• Now that the stream exists, we can examine its contents
• Simple, declarative filtering
• A non-persistent query
10. CREATE TABLE users
(user_id int, registered_At long …)
WITH (
  key='user_id',
  kafka_topic='clickstream_users',
  value_format='JSON'
);
Creating a Table
• “A stateful view of data in motion”
• Can be built from a Kafka topic OR another KSQL stream
• We have a topic called clickstream_users
• The topic contains JSON-formatted changelog data
11. CREATE TABLE events_per_min AS
SELECT user_id, count(*) AS events
FROM clickstream
WINDOW TUMBLING (SIZE 10 SECONDS)
GROUP BY user_id;
Creating a Table
• Derived from a stream
• Windowed aggregate
12. Inspecting that Table
SELECT user_id, username
FROM users
WHERE level = 'Platinum';
• Now that the table exists, we can examine its contents
• Simple, declarative filtering
• A non-persistent query
13. Joining a Stream to a Table
• Now that we have clickstream and users, we can join them
• This allows us to do filtering of clicks on a user attribute
CREATE STREAM vip_actions AS
SELECT userid, page, action FROM clickstream c
LEFT JOIN users u ON c.userid = u.user_id
WHERE u.level = 'Platinum';
15. KSQL for Streaming ETL
• Kafka is popular for data pipelines.
• KSQL enables easy transformations of data within the pipe.
• Transforming data while moving from Kafka to another system.
CREATE STREAM vip_actions AS
SELECT userid, page, action FROM clickstream c
LEFT JOIN users u ON c.userid = u.user_id
WHERE u.level = 'Platinum';
16. KSQL for Anomaly Detection
CREATE TABLE possible_fraud AS
SELECT card_number, count(*)
FROM authorization_attempts
WINDOW TUMBLING (SIZE 5 SECONDS)
GROUP BY card_number
HAVING count(*) > 3;
Identifying patterns or anomalies in real-time data,
surfaced in milliseconds
17. KSQL for Real-Time Monitoring
• Log data monitoring, tracking and alerting
• Sensor / IoT data
CREATE TABLE error_counts AS
SELECT error_code, count(*)
FROM monitoring_stream
WINDOW TUMBLING (SIZE 1 MINUTE)
WHERE type = 'ERROR'
GROUP BY error_code;
18. KSQL for Data Transformation
CREATE STREAM views_by_userid
WITH (PARTITIONS=6,
  VALUE_FORMAT='JSON',
  TIMESTAMP='view_time') AS
SELECT * FROM clickstream PARTITION BY user_id;
Make simple derivations of existing topics from the command line
19. Stream Patterns
• Stream
• Stream with partitioning (scaling out)
• Stream (fork)
• Table view (windowed, derived from a stream)
• Table (scaled out across nodes)
• Stream-table join (left join)
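As a concrete sketch of the fork pattern: two persistent queries can consume the same source stream independently, each materializing its own Kafka-backed output (the stream names and predicates here are illustrative, not from the deck; assumes the clickstream stream with a status column from the earlier slides):

```sql
-- Fork: both queries read the full clickstream independently;
-- each writes its filtered results to its own backing topic
CREATE STREAM click_errors AS
  SELECT * FROM clickstream WHERE status >= 400;

CREATE STREAM click_success AS
  SELECT * FROM clickstream WHERE status < 400;
```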
22. • Streams, Tables and Queries
• Many streaming patterns
• The clickstream demo is a good place to 'grok' the concepts
Recap
23. Resources and Next Steps
https://github.com/confluentinc/ksql
http://confluent.io/ksql
https://slackpass.io/confluentcommunity #ksql
Sign up for:
- Part 2: Development
- Part 3: Deployment
https://www.confluent.io/empowering-streams-through-ksql