Kafka as your Data Lake - is it Feasible?
Guido Schmutz
Kafka Summit 2020
Working at Trivadis for more than 23 years
Consultant, Trainer, Platform Architect for Java,
Oracle, SOA and Big Data / Fast Data
Oracle Groundbreaker Ambassador & Oracle ACE
1. What is a Data Lake?
2. Four Architecture Blueprints for “treating Kafka as a Data Lake”
3. Summary
Demo environment and code samples available here: [7]
What is a Data Lake?
What is a Data Lake? Traditional Data Lake
[Figure: Bulk Source; File Import / SQL Import; Big Data Platform on a Hadoop cluster with “Native” Raw storage; Data Consumer]
Initial Idea of Data Lake
• Single store of all data (incl. raw data) in the enterprise
• Put an end to data silos
• Reporting, Visualization, Analytics and Machine Learning
• Focus on Schema-on-Read
Tech for 1st Gen Data Lake
• HDFS, MapReduce, Pig, Hive, Impala, Flume,
Tech for 2nd Gen Data Lake (Cloud native)
• Object Store (S3, Azure Blob Storage, …), Spark,
Flink, Presto, StreamSets, …
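Schema-on-Read can be sketched in a few lines of Python: raw events are stored untouched, and a schema is applied only when a consumer reads them. This is a minimal illustration; the sample events, the `READ_SCHEMA` mapping and the `read_with_schema` helper are assumptions made for the example and are not part of the demo environment.

```python
import json

# Raw zone: events are kept exactly as they arrived, with no schema enforced on write.
# The events below are hypothetical sample data.
RAW_EVENTS = [
    '{"timestamp": 1591108796605, "truckId": 98, "driverId": 27, "eventType": "Normal"}',
    '{"timestamp": 1591108797605, "truckId": 21, "driverId": 19, "eventType": "Overspeed", "extra": "x"}',
]

# Hypothetical read-time schema: field name -> converter applied while reading.
READ_SCHEMA = {"truckId": int, "driverId": int, "eventType": str}


def read_with_schema(raw_events, schema):
    """Apply the schema while reading; unknown fields are ignored, missing ones become None."""
    rows = []
    for raw in raw_events:
        doc = json.loads(raw)
        rows.append({name: (conv(doc[name]) if name in doc else None)
                     for name, conv in schema.items()})
    return rows


rows = read_with_schema(RAW_EVENTS, READ_SCHEMA)
```

A different consumer could read the same raw events with a different schema, which is the point of deferring the schema to read time.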
[Figure labels: SQL / Search Engine; BI Apps; Data Science; high latency]
Traditional Data Lake Zones
“Streaming Data Lake” – aka. Kappa Architecture
[Figure: Bulk Source, Event Source and Stream Source; Event Hub with Replay; Stream Processing Platform (Stream Processor V1.0 / State V1.0 / Result V1.0 and Stream Processor V2.0 / State V2.0 / Result V2.0); (Big) Data Platform on a Hadoop cluster (“Native” Raw, SQL / Search); Data Flow and Bulk Data Flow; BI Apps; Data Science]
[8] Questioning the Lambda Architecture – by Jay Kreps
“Streaming Data Lake” Zones
[Figure: Event Hub with Replay; Stream Processing Platform (Stream Processor V1.0 / State V1.0 / Result V1.0 and Stream Processor V2.0 / State V2.0 / Result V2.0); (Big) Data Platform on a Hadoop cluster (“Native” Raw, SQL / Search); Data Flow and Bulk Data Flow; BI Apps; Data Science]
Moving the Source of Truth to Event Hub
Turning the database inside-out!
[Figure: Bulk Source and Event Source; Event Hub; Stream Processing Platform (Stream Processor V1.0 / State V1.0 / Result V1.0 and Stream Processor V2.0 / State V2.0 / Result V2.0); (Big) Data Platform on a Hadoop cluster (“Native” Raw, SQL / Search); Data Flow; BI Apps; Data Science]
[1] Turning the database inside out with Apache Samza – by Martin Kleppmann
[2] It’s Okay To Store Data In Apache Kafka – by Jay Kreps
is it feasible?
Confluent Enterprise Tiered Storage
Data Retention
• Never
• Time (TTL) or Size-based
• Log compaction-based
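The retention options above can be modelled on an in-memory log. The sketch below is a simplification under stated assumptions: the `Record` type and the three helper functions are hypothetical, size-based retention is reduced to a record count, and real brokers apply retention per log segment rather than per record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    offset: int
    timestamp_ms: int
    key: str
    value: str


def retain_by_time(log, now_ms, retention_ms):
    """Time-based retention: keep records that are at most retention_ms old."""
    return [r for r in log if now_ms - r.timestamp_ms <= retention_ms]


def retain_by_size(log, max_records):
    """Size-based retention, simplified to a record count: keep the newest records."""
    return log[-max_records:] if max_records > 0 else []


def compact(log):
    """Log compaction: keep only the latest record per key, in offset order."""
    latest = {}
    for r in log:
        latest[r.key] = r  # later offsets overwrite earlier ones for the same key
    return sorted(latest.values(), key=lambda r: r.offset)


# Hypothetical sample log with two keys.
LOG = [
    Record(0, 1_000, "driver-27", "v1"),
    Record(1, 2_000, "driver-19", "v1"),
    Record(2, 3_000, "driver-27", "v2"),
    Record(3, 4_000, "driver-27", "v3"),
]
```

With retention set to "never", none of these functions is applied and the full log stays available for replay.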
Tiered Storage uses two tiers of storage
• Local (same local disks on brokers)
• Remote (Object storage, currently AWS S3 only)
Enables Kafka to be used as long-term storage
• Transparent (no ETL pipelines needed)
• Cheaper storage for cold data
• Better scalability and less complex operations
[Figure: Broker 1, Broker 2 and Broker 3, each with hot and cold data]
[3] Infinite Storage in Confluent Platform – by Lucas Bradstreet, Dhruvil Shah, Manveer Chawla
[4] KIP-405: Kafka Tiered Storage – Kafka Improvement Proposal
Four Architecture Blueprints for “treating Kafka as a Data Lake”
How can you access a Kafka topic?
Streaming Queries
• Latest - start at end and continuously consume new data
• Earliest – start at beginning and consume history and then continuously consume new data
• Seek to offset – start at a given offset, consume history, and continuously consume new data
• Seek to timestamp – start at given timestamp, consume history and continuously consume new data
Batch Queries
• From start offset to end offset – start at a given offset and consume until another offset
• From start timestamp to end timestamp – start at a given timestamp and consume until another timestamp
• Full scan – Scan the complete topic from start to end
All above access options can be applied on topic or on a set of partitions
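The access options listed above can be illustrated on a single partition modelled as a Python list. This is only a sketch: the functions are hypothetical helpers and not a Kafka client API, and the timestamp lookup assumes that timestamps never decrease within the partition.

```python
from bisect import bisect_left

# One partition modelled as a list of (timestamp_ms, value); the list index is the offset.
PARTITION = [(1_000, "a"), (2_000, "b"), (3_000, "c"), (4_000, "d"), (5_000, "e")]


def offset_for_timestamp(partition, ts_ms):
    """Earliest offset whose timestamp is >= ts_ms (len(partition) if there is none)."""
    timestamps = [ts for ts, _ in partition]
    return bisect_left(timestamps, ts_ms)


def batch_by_offset(partition, start, end):
    """Batch query: read offsets start (inclusive) to end (exclusive), then stop."""
    return [value for _, value in partition[start:end]]


def batch_by_timestamp(partition, start_ts, end_ts):
    """Batch query: translate both timestamps to offsets, then read that range."""
    return batch_by_offset(partition,
                           offset_for_timestamp(partition, start_ts),
                           offset_for_timestamp(partition, end_ts))


def streaming_start_offset(partition, mode, value=None):
    """Starting position of a streaming query; the consumer keeps reading afterwards."""
    if mode == "latest":
        return len(partition)
    if mode == "earliest":
        return 0
    if mode == "offset":
        return value
    if mode == "timestamp":
        return offset_for_timestamp(partition, value)
    raise ValueError(f"unknown mode: {mode}")
```

A full scan is the special case `batch_by_offset(PARTITION, 0, len(PARTITION))`; applying the same functions to several partitions gives the topic-level variants.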
BP-1: “Streaming” Data Lake
• Using Stream Processing tools to perform processing on “data in motion” instead of in batch
• Can consume from multiple sources
• Works well if no or limited history is needed
• Queryable State Stores, aka. Interactive Queries or Pull Queries
[5] Streaming Machine Learning with Tiered Storage and Without a Data Lake – by Kai Waehner
BP-1_1: “Streaming” Data Lake with ksqlDB / Kafka Streams
• Kafka Streams or ksqlDB fit perfectly
• Using ksqlDB pull queries to retrieve
current state of materialized views
• Store results in another Kafka topic to
persist state store information
• Can be combined with BP-4 to store
results/state in a database
Feature checklist (Yes/No per feature, see Architecture Blueprints Overview): Supports JSON, Supports Avro, Supports Protobuf, Schema Registry Integration, Timestamp Filter Pushdown, Offset Filter Pushdown, Partition Filter Pushdown, Supports Produce Operation, Supports Exactly Once
Demo Use Case – Vehicle Tracking
Sample driver record:
27, Walter, Ward, Y, 24-JUL-85, 2017-10-02 15:19:00
Sample truck position records:
2020-06-02 14:39:56.605,98,27,803014426, Wichita to Little Rock Route2,
2020-06-02 14:39:56.605,21,19,803014427, Wichita to Little Rock Route3,
Processing: aggregate by eventType over time window into driving_agg, served via pull queries
Zones: Raw, Refined, Usage Opt
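The "aggregate by eventType over time window" step and the pull query against its result can be sketched in plain Python. This is an in-memory stand-in and not ksqlDB or Kafka Streams code; the window size, the `DrivingAgg` class and the sample events (including the "Overspeed" event type) are assumptions for illustration.

```python
from collections import defaultdict

WINDOW_MS = 60_000  # hypothetical tumbling window size


def window_start(timestamp_ms, size_ms=WINDOW_MS):
    """Start of the tumbling window that contains timestamp_ms."""
    return timestamp_ms - (timestamp_ms % size_ms)


class DrivingAgg:
    """In-memory stand-in for a materialized view keyed by (eventType, window start)."""

    def __init__(self):
        self.counts = defaultdict(int)

    def on_event(self, event):
        """Update the view incrementally for every incoming event."""
        key = (event["eventType"], window_start(event["timestamp"]))
        self.counts[key] += 1

    def pull(self, event_type):
        """Pull-query analogue: current count per window for one eventType."""
        return {start: n for (etype, start), n in sorted(self.counts.items())
                if etype == event_type}


# Hypothetical sample events.
EVENTS = [
    {"timestamp": 5_000, "eventType": "Normal"},
    {"timestamp": 20_000, "eventType": "Overspeed"},
    {"timestamp": 59_999, "eventType": "Overspeed"},
    {"timestamp": 60_000, "eventType": "Overspeed"},
]

view = DrivingAgg()
for e in EVENTS:
    view.on_event(e)
```

The view is updated as events arrive, while `pull` returns the state at the moment of the call and then finishes, which is the difference between the continuous aggregation and a pull query.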
BP-2: Batch Processing with Event Hub as Source
• Using a Batch Processing framework
to process Event Hub data
retrospectively (full history available)
• Write back results to Event Hub
• Read and join multiple sources
• Can be combined with Advanced
Analytics capabilities (i.e. machine
learning / AI)
BP-2_1: Apache Spark with Kafka as Source
• Apache Spark is a unified analytics
engine for large-scale data processing
• Provides complex analytics through
MLlib and GraphX
• Can consume from/produce to Kafka
both in Streaming as well as Batch
• Use Data Frame / Dataset abstraction
as you would with other data sources
Feature checklist (Yes/No per feature, see Architecture Blueprints Overview): Supports JSON, Supports Avro, Supports Protobuf, Schema Registry Integration, Timestamp Filter Pushdown, Offset Filter Pushdown, Partition Filter Pushdown, Supports Produce Operation, Supports Exactly Once
BP-2_1: Apache Spark with Kafka as Source
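from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, TimestampType, LongType, StringType, DoubleType

# Expected structure of the JSON value; `spark` is an existing SparkSession
truckPositionSchema = (StructType().add("timestamp", TimestampType())
    .add("driverId", LongType())
    .add("routeId", LongType())
    .add("eventType", StringType())
    .add("latitude", DoubleType())
    .add("longitude", DoubleType())
    .add("correlationId", StringType()))

# Assumes a batch read (spark.read); spark.readStream would give the streaming variant
rawDf = (spark.read.format("kafka")
    .option("kafka.bootstrap.servers", "kafka-1:19092,kafka-2:19093")
    .option("subscribe", "truck_position")
    .load())

# Parse the value as JSON; the struct is aliased as json so that json.timestamp resolves
jsonDf = rawDf.selectExpr("CAST(value AS string)")
jsonDf = jsonDf.select(from_json(jsonDf.value, truckPositionSchema).alias("json")).selectExpr("json.*",
    "cast(cast (json.timestamp as double) / 1000 as timestamp) as eventTime")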
BP-2_1: Apache Spark with Kafka as Source
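A batch read does not have to cover the whole topic. Spark's Kafka source accepts `startingOffsets` and `endingOffsets` options, either as `earliest` / `latest` or as a JSON string with one offset per partition, where -2 stands for earliest and -1 for latest. The helper below only builds such an option dictionary; `kafka_batch_options` is a hypothetical name, and the example assumes that `truck_position` has exactly two partitions.

```python
import json

EARLIEST, LATEST = -2, -1  # special offset values understood by Spark's Kafka source


def kafka_batch_options(bootstrap_servers, topic, starting=None, ending=None):
    """Build options for spark.read.format("kafka") with an optional offset range.

    starting / ending map partition number -> offset; None means earliest / latest.
    """
    def encode(offsets, default):
        if offsets is None:
            return default
        return json.dumps({topic: {str(p): o for p, o in sorted(offsets.items())}})

    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": encode(starting, "earliest"),
        "endingOffsets": encode(ending, "latest"),
    }


opts = kafka_batch_options("kafka-1:19092,kafka-2:19093", "truck_position",
                           starting={0: 100, 1: EARLIEST},
                           ending={0: 200, 1: LATEST})

# Usage with an existing SparkSession named `spark` (not created in this sketch):
# rawDf = spark.read.format("kafka").options(**opts).load()
```

Building the options separately keeps the offset range testable without a running Spark cluster.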
BP-3: Batch Query with Event Hub as Source
• Using a Query Virtualization framework to consume (query) Event Hub data retrospectively (full history available)
• Optionally produce (insert) data into Event Hub
• Read and join multiple sources
• Based on SQL and with the full power of SQL at hand (functions and optionally UDF/UDAF/UDTF)
• Batch SQL not Streaming SQL
BP-3_1: Presto with Kafka as Source
• Presto is a distributed SQL query
engine for big data
• Supports accessing data from multiple
systems within a single query
• Supports Kafka as a source (query)
and as a target (insert for raw & json)
• Does not yet support pushdown of
timestamp queries
• Starburst Enterprise Presto provides fine-grained access control
Feature checklist (Yes/No per feature, see Architecture Blueprints Overview): Supports JSON, Supports Avro, Supports Protobuf, Schema Registry Integration, Timestamp Filter Pushdown, Offset Filter Pushdown, Partition Filter Pushdown, Supports Produce Operation, Supports Exactly Once
BP-3_1: Presto with Kafka as Source
kafka.table-names=truck_position, truck_driver
select * from truck_position;
BP-3_1: Presto with Kafka as Source
Table definition files: etc/kafka/truck_position.json and etc/kafka/truck_driver.json
etc/kafka/truck_position.json:
{
  "tableName": "truck_position",
  "schemaName": "logistics",
  "topicName": "truck_position",
  "key": {
    "dataFormat": "raw",
    "fields": [
      {
        "name": "kafka_key",
        "dataFormat": "BYTE",
        "type": "VARCHAR",
        "hidden": "false"
      }
    ]
  },
  "message": {
    "dataFormat": "json",
    "fields": [
      {
        "name": "timestamp",
        "mapping": "timestamp",
        "type": "BIGINT"
      },
      {
        "name": "truck_id",
        "mapping": "truckId",
        "type": "BIGINT"
      }
    ]
  }
}
BP-3_1: Presto with Kafka as Source
select * from truck_position
select * from truck_driver
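A table definition such as etc/kafka/truck_position.json tells the query engine how to turn each Kafka message into a row: every field has a column `name` and a `mapping` that points into the JSON message. The sketch below mimics that idea in plain Python; it is not Presto's decoder, `decode_row` is a hypothetical helper, and the sample message is made up.

```python
import json

# Reduced copy of the table definition: only the message part is needed here.
TABLE_DESCRIPTION = {
    "tableName": "truck_position",
    "message": {
        "dataFormat": "json",
        "fields": [
            {"name": "timestamp", "mapping": "timestamp", "type": "BIGINT"},
            {"name": "truck_id", "mapping": "truckId", "type": "BIGINT"},
        ],
    },
}


def decode_row(message_value, description):
    """Map one JSON message to a row: column name -> value found at the field's mapping."""
    doc = json.loads(message_value)
    return {f["name"]: doc.get(f["mapping"]) for f in description["message"]["fields"]}


# Hypothetical message value; driverId is not mapped and therefore not part of the row.
row = decode_row('{"timestamp": 1591108796605, "truckId": 98, "driverId": 27}',
                 TABLE_DESCRIPTION)
```

Only mapped fields become columns, so attributes that are missing from the table definition stay invisible to `select *` until a field entry is added for them.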
BP-3_1: Presto with Kafka as Source
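Join truck_position with truck_driver (removing non-compacted entries using Presto WINDOW Function)
-- Assumes truck_driver has an id column and that an inner join is intended
SELECT d.id, d.first_name, d.last_name, t.*
FROM truck_position t
JOIN (
  SELECT *
  FROM truck_driver
  -- keep only the latest row per driver
  WHERE (last_update) IN
    (SELECT LAST_VALUE(last_update) OVER (PARTITION BY id
       ORDER BY last_update
       RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
     FROM truck_driver) ) d
ON t.driver_id = d.id
WHERE t.event_type != 'Normal';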
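The intent of that query (first reduce truck_driver to the latest entry per driver, then join it to the non-normal truck positions) can be sketched in plain Python. The `id` key, the inner-join behaviour and all sample rows are assumptions for illustration.

```python
def latest_per_key(rows, key, order_by):
    """Keep, for every key, only the row with the highest order_by value."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[order_by] >= current[order_by]:
            latest[row[key]] = row
    return latest


def join_positions_with_drivers(positions, drivers):
    """Join non-'Normal' positions with the latest driver row (inner join, an assumption)."""
    current = latest_per_key(drivers, key="id", order_by="last_update")
    joined = []
    for p in positions:
        d = current.get(p["driver_id"])
        if p["event_type"] != "Normal" and d is not None:
            joined.append({"id": d["id"], "first_name": d["first_name"],
                           "last_name": d["last_name"], **p})
    return joined


# Hypothetical sample rows: driver 27 appears twice because the topic is not yet compacted.
DRIVERS = [
    {"id": 27, "first_name": "Walter", "last_name": "Ward", "last_update": 1},
    {"id": 27, "first_name": "Walter", "last_name": "Ward-Miller", "last_update": 2},
    {"id": 19, "first_name": "Ann", "last_name": "Lee", "last_update": 1},
]
POSITIONS = [
    {"driver_id": 27, "event_type": "Overspeed"},
    {"driver_id": 19, "event_type": "Normal"},
]

result = join_positions_with_drivers(POSITIONS, DRIVERS)
```

Without the deduplication step, driver 27 would match twice and the join would return one row per historical version of the driver record.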
BP-3_2: Apache Drill with Kafka as Source
• Apache Drill is a schema-free SQL
Query Engine for Hadoop, NoSQL and
Cloud Storage
• Supports accessing data from multiple
systems within a single query
• Can push down filters on partitions,
timestamp and offset
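Why filter pushdown matters can be shown with a small model: when partition and offset predicates are handed to the scan, only the requested range is read, whereas without pushdown every record is read and filtered afterwards. This is a conceptual sketch with hypothetical function names and is not Drill code; a timestamp predicate would first be translated into an offset per partition and then handled like the offset range here.

```python
def scan_with_pushdown(partitions, partition_id=None, min_offset=None, max_offset=None):
    """Return (matching records, records read) with the filters applied inside the scan."""
    read, out = 0, []
    for pid, records in partitions.items():
        if partition_id is not None and pid != partition_id:
            continue  # partition filter pushdown: skip the whole partition
        lo = 0 if min_offset is None else max(min_offset, 0)
        hi = len(records) if max_offset is None else min(max_offset + 1, len(records))
        for offset in range(lo, hi):  # offset filter pushdown: read only this range
            read += 1
            out.append((pid, offset, records[offset]))
    return out, read


def scan_without_pushdown(partitions, predicate):
    """Read every record and filter afterwards: same result, more records read."""
    read, out = 0, []
    for pid, records in partitions.items():
        for offset, value in enumerate(records):
            read += 1
            if predicate(pid, offset):
                out.append((pid, offset, value))
    return out, read


# Hypothetical topic with two partitions of four records each.
PARTITIONS = {0: ["a0", "a1", "a2", "a3"], 1: ["b0", "b1", "b2", "b3"]}

pushed, read_pushed = scan_with_pushdown(PARTITIONS, partition_id=1, min_offset=1, max_offset=2)
full, read_full = scan_without_pushdown(PARTITIONS, lambda pid, off: pid == 1 and 1 <= off <= 2)
```

Both scans return the same two records, but the pushed-down scan reads two records while the full scan reads all eight; on a topic holding the full history this difference dominates query time.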
Feature checklist (Yes/No per feature, see Architecture Blueprints Overview): Supports JSON, Supports Avro, Supports Protobuf, Schema Registry Integration, Timestamp Filter Pushdown, Offset Filter Pushdown, Partition Filter Pushdown, Supports Produce Operation, Supports Exactly Once
BP-3_3: Hive/Spark SQL with Kafka as Source
• Apache Hive facilitates reading, writing, and managing large datasets residing in distributed storage using SQL
• Part of any Hadoop distribution
• A special storage handler allows access to Kafka topics via Hive external tables
• Spark SQL on data frame as shown in BP-2_1 or by integrating Hive
Feature checklist (Yes/No per feature, see Architecture Blueprints Overview): Supports JSON, Supports Avro, Supports Protobuf, Schema Registry Integration, Timestamp Filter Pushdown, Offset Filter Pushdown, Partition Filter Pushdown, Supports Produce Operation, Supports Exactly Once
BP-3_4: Oracle Access to Kafka with Kafka as Source
• Oracle SQL Access to Kafka is a PL/SQL
package that enables Oracle SQL to
query Kafka topics via DB views and
underlying external tables [6]
• Runs in the Oracle database
• Supports Kafka as a source (query) but
not (yet) as a target
• Use Oracle SQL to access the Kafka
topics and optionally join to RDBMS
Feature checklist (Yes/No per feature, see Architecture Blueprints Overview): Supports JSON, Supports Avro, Supports Protobuf, Schema Registry Integration, Timestamp Filter Pushdown, Offset Filter Pushdown, Partition Filter Pushdown, Supports Produce Operation, Supports Exactly Once
BP-4: Use any storage as “materialized view”
• Use any persistence
technology to provide a
“Materialized View” to the
Data Consumers
• Can be provided in
retrospective, run-once,
batch or streaming update
(in sync) mode
• “On Demand” use cases
• Provide a sandbox
environment for data
• Provide part of Kafka topics
materialized in object storage
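A "materialized view" in another store can be as simple as replaying a topic into a table keyed by the record key. The sketch below uses SQLite from the Python standard library as a stand-in for "any persistence technology"; the table name, the column layout and the sample records are assumptions for illustration.

```python
import sqlite3


def materialize(conn, records):
    """Replay (id, last_name, offset) records in offset order; the last write per id wins."""
    conn.execute("""CREATE TABLE IF NOT EXISTS truck_driver_view (
                        id INTEGER PRIMARY KEY,
                        last_name TEXT,
                        kafka_offset INTEGER)""")
    conn.executemany(
        "INSERT OR REPLACE INTO truck_driver_view (id, last_name, kafka_offset) VALUES (?, ?, ?)",
        sorted(records, key=lambda r: r[2]))
    conn.commit()


# Hypothetical records read from a truck_driver topic: (id, last_name, offset).
RECORDS = [(27, "Ward", 0), (19, "Smith", 1), (27, "Ward-Jones", 2)]

conn = sqlite3.connect(":memory:")
materialize(conn, RECORDS)
rows = conn.execute(
    "SELECT id, last_name, kafka_offset FROM truck_driver_view ORDER BY id").fetchall()
```

Storing the offset next to each row lets a run-once or batch update resume from the highest offset already materialized, while a streaming update would call the same upsert for every new record.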
Architecture Blueprints Overview
Feature                       BP-1_1  BP-2_1  BP-3_1  BP-3_2  BP-3_3  BP-3_4
Supports JSON                 🟢      🟢      🟢      🟢      🟢      🟢
Supports Avro                 🟢      🟢      🟢      🔴      🟢      🔴
Supports Protobuf             🟢      🔴      🔴      🔴      🔴      🔴
Schema Registry Integration   🟢      🟢      🔴      🔴      🔴      🔴
Timestamp Filter Pushdown     ⚪      🟢      🔴      🟢      🟢      🟢
Offset Filter Pushdown        ⚪      🟢      🔴      🟢      🟢      🟢
Partition Filter Pushdown     ⚪      🔴      🔴      🟢      🟢      🟢
Supports Produce Operation    🟢      🟢      🟢      🔴      🟢      🔴
Supports Exactly Once         🟢      🔴      🔴      🔴      🟢      🔴
• BP-1_1: Streaming Data Lake using Kafka Streams / ksqlDB
• BP-2_1: Apache Spark with Kafka as Source
• BP-3_1: Presto with Kafka as Source
• BP-3_2: Apache Drill with Kafka as Source
• BP-3_3: Hive/Spark SQL with Kafka as Source
• BP-3_4: Oracle Access to Kafka with Kafka as Source
• Move processing / analytics from batch to stream processing pipelines
• Event Hub (Kafka) as the single source of truth => turning the database inside out!
• Everything else is just a “Materialized View” of the Event Hub topic data
• Can still be HDFS, Object Store (S3, …) but also Kudu or Parquet
• NoSQL Databases & Relational Databases
• In-Memory Databases
• Confluent Platform Tiered Storage makes long-term storage feasible
• Does not apply for large, unstructured data (images, videos, …) => separate path around Event Hub
necessary, but sending metadata through Event Hub
• This is the result of a Proof-of-Concept: only functional tests have been done so far, performance tests are still outstanding
1. Turning the database inside out with Apache Samza – by Martin Kleppmann
2. It’s Okay To Store Data In Apache Kafka – by Jay Kreps
3. Infinite Storage in Confluent Platform – by Lucas Bradstreet, Dhruvil Shah, Manveer Chawla
4. KIP-405: Kafka Tiered Storage – Kafka Improvement Proposal
5. Streaming Machine Learning with Tiered Storage and Without a Data Lake – by Kai Waehner
6. Read data from Kafka topic using Oracle SQL Access to Kafka (OSAK) - by Mohammad H.
7. Demo environment and code samples - by Guido Schmutz (on GitHub)
8. Questioning the Lambda Architecture - by Jay Kreps
Updates to Slides
24.8.2020 – Presto supports Avro
24.8.2020 – Presto supports Insert for raw and json
20.7.2020 – initial version
Location Analytics - Real Time Geofencing using Apache KafkaGuido Schmutz
Building Event-Driven (Micro) Services with Apache Kafka
Building Event-Driven (Micro) Services with Apache KafkaBuilding Event-Driven (Micro) Services with Apache Kafka
Building Event-Driven (Micro) Services with Apache KafkaGuido Schmutz
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream ProcessingGuido Schmutz

Kafka as your Data Lake - is it Feasible?

  • 2. Guido Schmutz (@gschmutz): Working at Trivadis for more than 23 years. Consultant, Trainer, Platform Architect for Java, Oracle, SOA and Big Data / Fast Data. Oracle Groundbreaker Ambassador & Oracle ACE Director. 195th edition
  • 3. Agenda 1. What is a Data Lake? 2. Four Architecture Blueprints for “treating Kafka as a Data Lake” 3. Summary Demo environment and code samples available here:
  • 4. What is a Data Lake?
  • 5. What is a Data Lake? Traditional Data Lake Architecture
      [Diagram labels: Bulk Source (DB Extract, File, DB); File Import / SQL Import; Big Data Platform with Storage (“Native” Raw, Raw, Refined / Usage Opt), Parallel Processing and Query Engine; Data Consumer (SQL / Search, BI Apps, Data Science Workbench); annotation "high latency"]
      Initial Idea of Data Lake
        • Single store of all data (incl. raw data) in the enterprise
        • Put an end to data silos
        • Reporting, Visualization, Analytics and Machine Learning
        • Focus on Schema-on-Read
      Tech for 1st Gen Data Lake
        • HDFS, MapReduce, Pig, Hive, Impala, Flume, Sqoop
      Tech for 2nd Gen Data Lake (Cloud native)
        • Object Store (S3, Azure Blob Storage, …), Spark, Flink, Presto, StreamSets, …
  • 7. “Streaming Data Lake” – aka. Kappa Architecture
      [Diagram labels: Bulk Source (Location, DB Extract, File, Weather, DB); Event Source (IoT Data, Mobile Apps, Social, Change Data Capture); Event Stream; Event Hub; Source of Truth; Stream Processing Platform (Stream Processor V1.0 / State V1.0, Stream Processor V2.0 / State V2.0, Result V1.0, Result V2.0, API (Switcher)); Result Stream; Bulk Data Flow; Reply Bulk Data Flow; (Big) Data Platform with Storage (“Native” Raw, Raw, Refined / Usage Opt), Parallel Processing and Query Engine; Data Consumer (BI Apps, Dashboard, Serving, SQL / Search, Data Science Workbench)]
      [8] – Questioning the Lambda Architecture – by Jay Kreps
  • 9. Moving the Source of Truth to Event Hub: turning the database inside-out!
      [Diagram: same components as on slide 7]
      [1] Turning the database inside out with Apache Samza – by Martin Kleppmann
  • 10. Moving the Source of Truth to Event Hub: is it feasible?
      [Diagram: same components as on slide 7]
      [2] – It’s Okay To Store Data In Apache Kafka – by Jay Kreps
  • 11. Confluent Enterprise Tiered Storage
      Data Retention
        • Never
        • Time (TTL) or Size-based
        • Log-Compacted based
      Tiered Storage uses two tiers of storage
        • Local (same local disks on brokers)
        • Remote (Object storage, currently AWS S3 only)
      Enables Kafka to be a long-term storage solution
        • Transparent (no ETL pipelines needed)
        • Cheaper storage for cold data
        • Better scalability and less complex operations
      [Diagram labels: Broker 1, Broker 2, Broker 3, Object Storage; hot, cold]
      [3] Infinite Storage in Confluent Platform – by Lucas Bradstreet, Dhruvil Shah, Manveer Chawla
      [4] KIP-405: Kafka Tiered Storage – Kafka Improvement Proposal
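The three retention options on this slide can be illustrated with a small in-memory model. This is only a sketch under simplifying assumptions: the `Record` type, the function names and the sample log are hypothetical, size is counted in records rather than bytes, and a real broker applies retention per log segment rather than per record.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Record:
    offset: int
    timestamp_ms: int
    key: str
    value: str


def retain_by_time(log: List[Record], now_ms: int, retention_ms: int) -> List[Record]:
    """Time-based retention: keep records no older than retention_ms."""
    return [r for r in log if now_ms - r.timestamp_ms <= retention_ms]


def retain_by_size(log: List[Record], max_records: int) -> List[Record]:
    """Size-based retention, counted in records here for simplicity."""
    return log[-max_records:] if max_records > 0 else []


def compact(log: List[Record]) -> List[Record]:
    """Log compaction: keep only the latest record per key, in offset order."""
    latest: Dict[str, Record] = {}
    for r in log:
        latest[r.key] = r  # later offsets overwrite earlier ones
    return sorted(latest.values(), key=lambda r: r.offset)


# Hypothetical sample log with two keys.
log = [
    Record(0, 1_000, "truck-1", "Normal"),
    Record(1, 2_000, "truck-2", "Normal"),
    Record(2, 3_000, "truck-1", "Overspeed"),
    Record(3, 4_000, "truck-2", "Normal"),
]
```

With "Never" as the retention setting none of these functions would be applied, which is the case that tiered storage is meant to make affordable.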
  • 12. Four Architecture Blueprints for “treating Kafka as a Data Lake”
  • 13. How can you access a Kafka topic?
      Streaming Queries
        • Latest – start at end and continuously consume new data
        • Earliest – start at beginning, consume history and then continuously consume new data
        • Seek to offset – start at a given offset, consume history and then continuously consume new data
        • Seek to timestamp – start at a given timestamp, consume history and then continuously consume new data
      Batch Queries
        • From start offset to end offset – start at a given offset and consume until another offset
        • From start timestamp to end timestamp – start at a given timestamp and consume until another timestamp
        • Full scan – scan the complete topic from start to end
      All of the above access options can be applied to a topic or to a set of partitions
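The seek and range options can be sketched on a single partition modeled as an in-memory list. This is an illustration only, not a Kafka client: the tuple layout, the function names and the exclusive end bounds are assumptions, timestamps are assumed to be non-decreasing within the partition, and the "continuously consume new data" part of the streaming modes is not modeled.

```python
from bisect import bisect_left
from typing import List, Tuple

# One partition as a list of (offset, timestamp_ms, value), ordered by offset.
Partition = List[Tuple[int, int, str]]


def seek_to_offset(p: Partition, offset: int) -> Partition:
    """History from a given offset up to the current end of the partition."""
    return [r for r in p if r[0] >= offset]


def seek_to_timestamp(p: Partition, ts_ms: int) -> Partition:
    """History from the first record whose timestamp is >= ts_ms."""
    timestamps = [r[1] for r in p]
    return p[bisect_left(timestamps, ts_ms):]


def offset_range(p: Partition, start: int, end: int) -> Partition:
    """Batch query: offsets in [start, end)."""
    return [r for r in p if start <= r[0] < end]


def timestamp_range(p: Partition, start_ms: int, end_ms: int) -> Partition:
    """Batch query: timestamps in [start_ms, end_ms)."""
    return [r for r in p if start_ms <= r[1] < end_ms]


# Hypothetical sample partition.
partition: Partition = [(0, 100, "a"), (1, 200, "b"), (2, 300, "c"), (3, 400, "d")]
```

"Earliest" and "Full scan" correspond to reading the whole list; "Latest" corresponds to starting after the last element.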
  • 14. BP-1: “Streaming” Data Lake
        • Using Stream Processing tools to perform processing on “data in motion” instead of in batch
        • Can consume from multiple sources
        • Works well if no or limited history is needed
        • Queryable State Stores, aka. Interactive Queries or Pull Queries
      [5] Streaming Machine Learning with Tiered Storage and Without a Data Lake – by Kai Waehner
  • 15. BP-1_1: “Streaming” Data Lake with ksqlDB / Kafka Streams
        • Kafka Streams or ksqlDB fit perfectly
        • Using ksqlDB pull queries to retrieve current state of materialized views
        • Store results in another Kafka topic to persist state store information
        • Can be combined with BP-4 to store results/state in a database
      Capability checklist on this slide (ratings are listed in the overview on slide 34): Supports JSON, Supports Avro, Supports Protobuf, Schema Registry Integration, Timestamp Filter Pushdown, Offset Filter Pushdown, Partition Filter Pushdown, Supports Produce Operation, Supports Exactly Once
  • 16. Demo Use Case – Vehicle Tracking
      [Diagram labels: Truck-1, Truck-2, Truck-n; truck_position; Refinement; truck_position_avro; detect_problematic_driving; problematic_driving; Truck Driver; jdbc-source; truck_driver; join_problematic_driving_driver; problematic_driving_driver; aggregate by eventType over time window; problematic_driving_agg; Pull query; console consumer; zones Raw, Refined, Usage Opt]
      Sample truck position records:
        2020-06-02 14:39:56.605,98,27,803014426, Wichita to Little Rock Route2, Normal,38.65,90.21,5187297736652502631
        2020-06-02 14:39:56.605,21,19,803014427, Wichita to Little Rock Route3, Overspeed,32.35,91.21,5187297736652502632
      Sample driver row:
        27, Walter, Ward, Y, 24-JUL-85, 2017-10-02 15:19:00
      Console consumer output:
        {"id":19,"firstName":"Walter","lastName":"Ward","available":"Y","birthdate":"24-JUL-85","last_update":1506923052012}
      Pull query result:
        Overspeed,10,10:00:00,10:00:059
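The "aggregate by eventType over time window" step of the demo can be sketched in plain Python as a tumbling-window count. This is not the ksqlDB statement used in the demo; the function name, the window size and the sample events are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def count_by_event_type(
    events: Iterable[Tuple[int, str]], window_ms: int
) -> Dict[Tuple[int, str], int]:
    """Count events per (window start, event type) using tumbling windows.

    events: (timestamp_ms, event_type) pairs.
    """
    counts: Counter = Counter()
    for ts, event_type in events:
        window_start = ts - (ts % window_ms)  # align to the window boundary
        counts[(window_start, event_type)] += 1
    return dict(counts)


# Hypothetical events; "Normal" events are filtered out before aggregating,
# mirroring the problematic-driving detection step.
events = [(1_000, "Overspeed"), (2_500, "Normal"), (4_900, "Overspeed"), (5_100, "Overspeed")]
problematic = [e for e in events if e[1] != "Normal"]
result = count_by_event_type(problematic, window_ms=5_000)
```

A pull query against the aggregate then amounts to a lookup in `result` by window start and event type.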
  • 18. BP-2: Batch Processing with Event Hub as Source
        • Using a Batch Processing framework to process Event Hub data retrospectively (full history available)
        • Write back results to Event Hub
        • Read and join multiple sources
        • Can be combined with Advanced Analytics capabilities (e.g. machine learning / AI)
  • 19. BP-2_1: Apache Spark with Kafka as Source
        • Apache Spark is a unified analytics engine for large-scale data processing
        • Provides complex analytics through MLlib and GraphX
        • Can consume from/produce to Kafka both in Streaming as well as Batch Mode
        • Use Data Frame / Dataset abstraction as you would with other data sources
      Capability checklist on this slide (ratings are listed in the overview on slide 34): Supports JSON, Supports Avro, Supports Protobuf, Schema Registry Integration, Timestamp Filter Pushdown, Offset Filter Pushdown, Partition Filter Pushdown, Supports Produce Operation, Supports Exactly Once
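  • 21. BP-2_1: Apache Spark with Kafka as Source
      from pyspark.sql.functions import from_json
      from pyspark.sql.types import StructType, TimestampType, LongType, StringType, DoubleType

      # Schema of the JSON payload in the truck_position topic
      truckPositionSchema = (StructType().add("timestamp", TimestampType())
          .add("truckId", LongType())
          .add("driverId", LongType())
          .add("routeId", LongType())
          .add("eventType", StringType())
          .add("latitude", DoubleType())
          .add("longitude", DoubleType())
          .add("correlationId", StringType()))

      # Batch read from Kafka; assumes an existing SparkSession named spark.
      # The reader call was lost in the transcript and is reconstructed here.
      rawDf = (spark.read.format("kafka")
          .option("kafka.bootstrap.servers", "kafka-1:19092,kafka-2:19093")
          .option("subscribe", "truck_position")
          .load())

      # Kafka values arrive as bytes: cast to string, then parse the JSON
      jsonDf = rawDf.selectExpr("CAST(value AS string)")
      jsonDf = (jsonDf.select(from_json(jsonDf.value, truckPositionSchema).alias("json"))
          .selectExpr("json.*", "cast(cast (json.timestamp as double) / 1000 as timestamp) as eventTime"))
  • 22. BP-2_1: Apache Spark with Kafka as Source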
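The parsing step in the listing above (decode the Kafka value, apply the schema, turn epoch milliseconds into an event time) can be sketched without Spark using only the standard library. The class and function names are hypothetical, and the JSON shape of the sample is an assumption built from the schema field names and the values of the sample record on slide 16.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class TruckPosition:
    event_time: datetime
    truck_id: int
    driver_id: int
    route_id: int
    event_type: str
    latitude: float
    longitude: float
    correlation_id: str


def parse_truck_position(raw: bytes) -> TruckPosition:
    """Decode one JSON-encoded Kafka value into a typed record."""
    doc = json.loads(raw.decode("utf-8"))
    return TruckPosition(
        # Epoch milliseconds -> seconds -> timezone-aware datetime
        event_time=datetime.fromtimestamp(doc["timestamp"] / 1000, tz=timezone.utc),
        truck_id=int(doc["truckId"]),
        driver_id=int(doc["driverId"]),
        route_id=int(doc["routeId"]),
        event_type=doc["eventType"],
        latitude=float(doc["latitude"]),
        longitude=float(doc["longitude"]),
        correlation_id=str(doc["correlationId"]),
    )


# Hypothetical message value.
sample = (
    b'{"timestamp": 1591108796605, "truckId": 21, "driverId": 19, '
    b'"routeId": 803014427, "eventType": "Overspeed", "latitude": 32.35, '
    b'"longitude": 91.21, "correlationId": "5187297736652502632"}'
)
position = parse_truck_position(sample)
```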
  • 23. BP-3: Batch Query with Event Hub as Source
        • Using a Query Virtualization framework to consume (query) Event Hub data retrospectively (full history available)
        • Optionally produce (insert) data into Event Hub
        • Read and join multiple sources
        • Based on SQL and with the full power of SQL at hand (functions and optionally UDF/UDAF/UDTF)
        • Batch SQL, not Streaming SQL
  • 24. BP-3_1: Presto with Kafka as Source
        • Presto is a distributed SQL query engine for big data
        • Supports accessing data from multiple systems within a single query
        • Supports Kafka as a source (query) and as a target (insert for raw & json)
        • Does not yet support pushdown of timestamp queries
        • Starburst Enterprise Presto provides fine-grained access control
      Capability checklist on this slide (ratings are listed in the overview on slide 34): Supports JSON, Supports Avro, Supports Protobuf, Schema Registry Integration, Timestamp Filter Pushdown, Offset Filter Pushdown, Partition Filter Pushdown, Supports Produce Operation, Supports Exactly Once
  • 26. BP-3_1: Presto with Kafka as Source
      kafka.nodes=kafka-1:9092
      kafka.table-names=truck_position, truck_driver
      kafka.default-schema=logistics
      kafka.hide-internal-columns=false
      kafka.table-description-dir=etc/kafka

      select * from truck_position;
  • 27. BP-3_1: Presto with Kafka as Source
      etc/kafka/truck_position.json
      {
        "tableName": "truck_position",
        "schemaName": "logistics",
        "topicName": "truck_position",
        "key": {
          "dataFormat": "raw",
          "fields": [
            { "name": "kafka_key", "dataFormat": "BYTE", "type": "VARCHAR", "hidden": "false" }
          ]
        },
        "message": {
          "dataFormat": "json",
          "fields": [
            { "name": "timestamp", "mapping": "timestamp", "type": "BIGINT" },
            { "name": "truck_id", "mapping": "truckId", "type": "BIGINT" },
            ...
      etc/kafka/truck_driver.json
      (the slide repeats the truck_position listing above under this file name)
  • 28. BP-3_1: Presto with Kafka as Source
      select * from truck_position
      select * from truck_driver
  • 29. BP-3_1: Presto with Kafka as Source
      Join truck_position with truck_driver (removing non-compacted entries using a Presto window function)
      SELECT d.id, d.first_name, d.last_name, t.*
      FROM truck_position t
      LEFT JOIN (
        SELECT *
        FROM truck_driver
        WHERE (last_update) IN (
          SELECT LAST_VALUE(last_update) OVER (
                   PARTITION BY id
                   ORDER BY last_update
                   RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_update
          FROM truck_driver)
      ) d ON t.driver_id = d.id
      WHERE t.event_type != 'Normal';
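The intent of this query, keeping only the most recent truck_driver entry per id and left-joining it to the non-"Normal" positions, can be sketched in plain Python. The function names and sample rows are hypothetical; the column names follow the query. Note that this sketch picks the latest row per id directly, which is the stated goal of the query rather than a line-by-line translation of its IN subquery.

```python
from typing import Dict, List, Optional


def latest_by_id(drivers: List[dict]) -> Dict[int, dict]:
    """Keep only the row with the highest last_update per driver id."""
    latest: Dict[int, dict] = {}
    for d in drivers:
        current = latest.get(d["id"])
        if current is None or d["last_update"] >= current["last_update"]:
            latest[d["id"]] = d
    return latest


def join_problematic(positions: List[dict], drivers: List[dict]) -> List[dict]:
    """Left join non-'Normal' positions to the latest driver row."""
    latest = latest_by_id(drivers)
    joined = []
    for p in positions:
        if p["event_type"] == "Normal":
            continue
        d: Optional[dict] = latest.get(p["driver_id"])
        joined.append({
            **p,
            # Left join: driver columns stay None when no driver matches
            "first_name": d["first_name"] if d else None,
            "last_name": d["last_name"] if d else None,
        })
    return joined


# Hypothetical rows: driver 19 appears twice because the topic is not compacted yet.
drivers = [
    {"id": 19, "first_name": "Walter", "last_name": "Ward", "last_update": 1},
    {"id": 19, "first_name": "Walter", "last_name": "Ward-Smith", "last_update": 2},
]
positions = [
    {"driver_id": 19, "event_type": "Overspeed"},
    {"driver_id": 27, "event_type": "Normal"},
    {"driver_id": 42, "event_type": "Lane Departure"},
]
joined = join_problematic(positions, drivers)
```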
  • 30. BP-3_2: Apache Drill with Kafka as Source
        • Apache Drill is a schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage
        • Supports accessing data from multiple systems within a single query
        • Can push down filters on partitions, timestamp and offset
      Capability checklist on this slide (ratings are listed in the overview on slide 34): Supports JSON, Supports Avro, Supports Protobuf, Schema Registry Integration, Timestamp Filter Pushdown, Offset Filter Pushdown, Partition Filter Pushdown, Supports Produce Operation, Supports Exactly Once
  • 31. BP-3_3: Hive/Spark SQL with Kafka as Source
        • Apache Hive facilitates reading, writing, and managing large datasets residing in distributed storage using SQL
        • Part of any Hadoop distribution
        • A special storage handler allows access to Kafka topics via Hive external tables
        • Spark SQL on data frame as shown in BP-2_1 or by integrating Hive Metastore
      Capability checklist on this slide (ratings are listed in the overview on slide 34): Supports JSON, Supports Avro, Supports Protobuf, Schema Registry Integration, Timestamp Filter Pushdown, Offset Filter Pushdown, Partition Filter Pushdown, Supports Produce Operation, Supports Exactly Once
  • 32. BP-3_4: Oracle Access to Kafka with Kafka as Source
        • Oracle SQL Access to Kafka is a PL/SQL package that enables Oracle SQL to query Kafka topics via DB views and underlying external tables [6]
        • Runs in the Oracle database
        • Supports Kafka as a source (query) but not (yet) as a target
        • Use Oracle SQL to access the Kafka topics and optionally join to RDBMS tables
      Capability checklist on this slide (ratings are listed in the overview on slide 34): Supports JSON, Supports Avro, Supports Protobuf, Schema Registry Integration, Timestamp Filter Pushdown, Offset Filter Pushdown, Partition Filter Pushdown, Supports Produce Operation, Supports Exactly Once
  • 33. BP-4: Use any storage as “materialized view”
        • Use any persistence technology to provide a “Materialized View” to the Data Consumers
        • Can be provided in retrospective, run-once, batch or streaming update (in sync) mode
        • “On Demand” use cases
        • Provide a sandbox environment for data scientists
        • Provide part of Kafka topics materialized in object storage
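As a minimal sketch of BP-4, the snippet below replays a list of records, standing in for a Kafka topic, into an SQLite table that acts as the materialized view. SQLite, the table layout, the `last_offset` column and the sample records are assumptions chosen to keep the example self-contained; the slides do not prescribe a particular store.

```python
import sqlite3
from typing import Iterable, Tuple


def materialize(conn: sqlite3.Connection, records: Iterable[Tuple[int, dict]]) -> None:
    """Apply (offset, driver) records in order; the latest record per id wins."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS truck_driver ("
        "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, last_offset INTEGER)"
    )
    for offset, driver in records:
        # INSERT OR REPLACE acts as an upsert on the primary key
        conn.execute(
            "INSERT OR REPLACE INTO truck_driver (id, first_name, last_name, last_offset) "
            "VALUES (?, ?, ?, ?)",
            (driver["id"], driver["firstName"], driver["lastName"], offset),
        )
    conn.commit()


# Hypothetical topic content: driver 19 is updated at offset 2.
topic = [
    (0, {"id": 19, "firstName": "Walter", "lastName": "Ward"}),
    (1, {"id": 20, "firstName": "Anna", "lastName": "Lee"}),
    (2, {"id": 19, "firstName": "Walter", "lastName": "Ward-Smith"}),
]

conn = sqlite3.connect(":memory:")
materialize(conn, topic)
rows = conn.execute(
    "SELECT id, first_name, last_name, last_offset FROM truck_driver ORDER BY id"
).fetchall()
```

Storing the last applied offset alongside each row is one way to let a run-once or batch refresh resume where it left off; a streaming update would call `materialize` continuously with new records.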
  • 34. Architecture Blueprints Overview
      Capability | BP1_1 (Streaming) | BP2_1 (Batch Processing) | BP3_1 (Query) | BP3_2 (Query) | BP3_3 (Query) | BP3_4 (Query)
      Supports JSON | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢
      Supports Avro | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | 🔴
      Supports Protobuf | 🟢 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴
      Schema Registry Integration | 🟢 | 🟢 | 🔴 | 🔴 | 🔴 | 🔴
      Timestamp Filter Pushdown | ⚪ | 🟢 | 🔴 | 🟢 | 🟢 | 🟢
      Offset Filter Pushdown | ⚪ | 🟢 | 🔴 | 🟢 | 🟢 | 🟢
      Partition Filter Pushdown | ⚪ | 🔴 | 🔴 | 🟢 | 🟢 | 🟢
      Supports Produce Operation | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | 🔴
      Supports Exactly Once | 🟢 | 🔴 | 🔴 | 🔴 | 🟢 | 🔴
      Blueprints
        • BP-1_1: Streaming Data Lake using Kafka Streams / ksqlDB
        • BP-2_1: Apache Spark with Kafka as Source
        • BP-3_1: Presto with Kafka as Source
        • BP-3_2: Apache Drill with Kafka as Source
        • BP-3_3: Hive/Spark SQL with Kafka as Source
        • BP-3_4: Oracle Access to Kafka with Kafka as Source
  • 36. Summary
        • Move processing / analytics from batch to stream processing pipelines
        • Event Hub (Kafka) as the single source of truth => turning the database inside out!
        • Everything else is just a “Materialized View” of the Event Hub topics data
            • Can still be HDFS, Object Store (S3, …) but also Kudu on Parquet
            • NoSQL Databases & Relational Databases
            • In-Memory Databases
        • Confluent Platform Tiered Storage makes long-term storage feasible
        • Does not apply to large, unstructured data (images, videos, …) => a separate path around Event Hub is necessary, with only the metadata sent through Event Hub
        • This is the result of a Proof-of-Concept: only functional tests have been done so far, performance tests will follow
  • 37. References
      1. Turning the database inside out with Apache Samza – by Martin Kleppmann
      2. It’s Okay To Store Data In Apache Kafka – by Jay Kreps
      3. Infinite Storage in Confluent Platform – by Lucas Bradstreet, Dhruvil Shah, Manveer Chawla
      4. KIP-405: Kafka Tiered Storage – Kafka Improvement Proposal
      5. Streaming Machine Learning with Tiered Storage and Without a Data Lake – by Kai Waehner
      6. Read data from Kafka topic using Oracle SQL Access to Kafka (OSAK) – by Mohammad H. AbdelQader
      7. Demo environment and code samples – by Guido Schmutz (on GitHub)
      8. Questioning the Lambda Architecture – by Jay Kreps
  • 38. Updates to Slides
      24.8.2020 – Presto supports Avro
      24.8.2020 – Presto supports Insert for raw and json
      20.7.2020 – initial version