BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENF
HAMBURG KOPENHAGEN LAUSANNE MÜNCHEN STUTTGART WIEN ZÜRICH
Spark (Structured) Streaming vs.
Kafka Streams
Two stream processing platforms compared
Guido Schmutz
3.12.2018
@gschmutz guidoschmutz.wordpress.com
Agenda
1. Introducing Stream Processing
2. Spark Streaming vs. Kafka Streams – Overview
3. Spark Structured Streaming vs. Kafka Streams – in Action
4. Summary
Guido Schmutz
Working at Trivadis for more than 22 years
Oracle Groundbreaker Ambassador & Oracle ACE Director
Consultant, Trainer, Software Architect for Java, Oracle, SOA and
Big Data / Fast Data
Head of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 30 years of software development experience
Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz
135th edition
Introducing Stream Processing
“Data at Rest” vs. “Data in Motion”
[Figure: "Data at Rest" and "Data in Motion", each shown with the steps Store, Analyze and Act]
Reference Architecture for Modern Data Analytics
[Figure: reference architecture diagram. Labeled components: bulk sources (DB, extract, file) and event sources (location, IoT data, mobile apps, social, telemetry); file import / SQL import, change data capture, data flows and event streams; edge node with rules, event hub and storage; Hadoop cluster (big data) with raw and refined storage, parallel processing and SQL export; stream processor with state, API, stream analytics and results DB; microservices with state, logic and API; search service; BI tools, enterprise data warehouse, search / explore and enterprise apps.]
Two Types of Stream Processing
(from Gartner)
Stream Data Integration
• Primarily cover streaming ETL
• Integration of data source and data sinks
• Filter and transform data
• (Enrich data)
• Route data
Stream Analytics
• calculating aggregates & detecting patterns
to generate higher-level, more relevant
summary information (complex events =>
used to be CEP)
• Complex events may signify threats or
opportunities that require a response
Stream Processing & Analytics Ecosystem
[Figure: landscape of open source and closed source products, grouped into Stream Analytics, Stream Data Integration, Event Hub and Edge. Source: adapted from Tibco]
Introduction to Stream Processing
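Example Use Case
[Figure: example pipeline. Truck-1, Truck-2 and Truck-3 feed truck_position; the flow continues through detect_dangerous_driving and join_dangerous_driving_driver (joined with Truck Driver data from a jdbc-source) to dangerous_driving_driver, and through Count By Event Type with Window (1m, 30s) to count_by_event_type.]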
Spark Streaming vs. Kafka Streams
- Overview
Spark (Structured) Streaming
Spark Streaming
• 1st generation
• one of the first APIs to enable stream
processing using high-level functional
operators like map and reduce
• Like the RDD API, the DStreams API is based
on relatively low-level operations on
Java/Python objects
• Used by many organizations in production
Spark Structured Streaming
• 2nd generation
• Structured API through DataFrames /
Datasets rather than RDDs
• Easier code reuse between batch and
streaming
• marked production ready in Spark 2.2.0
• Support for Java, Scala, Python, R and SQL
• Focus of this talk
Apache Spark Streaming as part of Spark Stack
[Figure: Spark stack diagram. Libraries: Spark (Structured) Streaming, Advanced Analytics, Libraries & Ecosystem. Structured API: Datasets, Data Frame, SQL. Low Level API: Resilient Distributed Dataset (RDD), Distributed Variables. Cluster Resource Managers: Spark Standalone, MESOS / Kubernetes, YARN. Data Sources / Data Sinks: HDFS, S3, RDBMS & NoSQL, Kafka.]
Kafka Streams – part of Kafka Core
• Designed as a simple and lightweight
library in Apache Kafka
• no external dependencies on systems
other than Apache Kafka
• Part of open source Apache Kafka,
introduced in 0.10+
• Leverages Kafka as its internal
messaging layer
• Support for Java and SQL (KSQL)
Spark Structured Streaming vs.
Kafka Streams – in Action
Infrastructure
Spark Structured Streaming
• Runs as part of a full Spark stack
• Cluster can be either Spark
Standalone, YARN-based or
container-based
• Many cloud options
Kafka Streams
• Just a Java library
• Runs anywhere Java runs: Web
Container, Java Application,
Container-based …
Main Abstractions
Dataset/Data Frame API
• DataFrames and Datasets can represent
static, bounded data, as well as streaming,
unbounded data
• Use readStream() instead of read()
Transformation & Actions
• Almost all transformations from Spark
bounded data processing (Batch) are also
usable for streaming
Input Sources and Sinks
Triggers
• triggers define when data is output
• As soon as last group is finished
• Fixed interval between micro-batches
• One-time micro-batch
Output Mode
• Define how data is output
• Append – only add new records to
output
• Update – update changed records in
place
• Complete – rewrite full output
Main Abstractions
Topology
val schema = new StructType()
  .add(...)
// Input
val inputDf = spark
  .readStream
  .format(...)
  .option(...)
  .load()
// Filter
val filteredDf = inputDf.where(...)
// Output
val query = filteredDf
  .writeStream
  .format(...)
  .option(...)
  .start()
Main Abstractions
Stream Processing Application
• program that uses Kafka Streams library
Topology
• logic that needs to be performed by stream
processing
• functional DSL or low-level Processor API
Stream Processor
• a node in the processor topology
KStream
• Abstraction of a stream of records
• Interpreted as events
KTable
• Abstraction of a change log stream
• Interpreted as update of same record (by
key)
GlobalKTable
• Like KTable, but not partitioned => all data
is available on all parallel application
instances
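The difference between the KStream and KTable interpretations can be sketched without the Kafka Streams library. The plain-Java sketch below is only an illustration: the class and method names are hypothetical and are not Kafka Streams API. It reads the same keyed records once as independent events and once as a changelog in which a later record replaces the earlier one for the same key.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names are Kafka Streams API.
final class StreamVsTable {

    // Stream interpretation: every record is an independent event, so values add up.
    static Map<String, Integer> sumAsEvents(List<Map.Entry<String, Integer>> records) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> r : records) {
            sums.merge(r.getKey(), r.getValue(), Integer::sum);
        }
        return sums;
    }

    // Table interpretation: every record is an update, so the latest value per key wins.
    static Map<String, Integer> latestAsUpdates(List<Map.Entry<String, Integer>> records) {
        Map<String, Integer> table = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> r : records) {
            table.put(r.getKey(), r.getValue());
        }
        return table;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> records = List.of(
                Map.entry("alice", 1), Map.entry("bob", 5), Map.entry("alice", 3));
        System.out.println(sumAsEvents(records));     // {alice=4, bob=5}
        System.out.println(latestAsUpdates(records)); // {alice=3, bob=5}
    }
}
```

The same input yields different results for "alice" because the two abstractions disagree on what a second record with the same key means.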
Main Abstractions
Topology
public static void main(String[] args) {
  Properties streamsConfiguration = new Properties();
  streamsConfiguration.put(...);
  final StreamsBuilder builder = new StreamsBuilder();
  // Input
  KStream<..,..> stream = builder.stream(...);
  // Filter
  KStream<..,..> filtered = stream.filter(…);
  // Output
  filtered.to(...);
  KafkaStreams streams = new KafkaStreams(
      builder.build(), streamsConfiguration);
  streams.start();
}
Streaming Data Sources
• File Source
• Reads files as a stream of data
• Supports text, csv, json, orc, parquet
• Files must be atomically placed
• Kafka Source
• Reads from Kafka Topic
• Supports Kafka brokers 0.10.0 or higher
• Socket Source (for testing)
• Reads UTF8 text from socket
connection
• Rate Source (for testing)
• Generate data at specified number of
rows per second
val rawDf = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "broker-1:9092")
.option("subscribe", "truck_position")
.load()
Streaming Data Sources
"Kafka only"
KStream from Topic
KTable from Topic
Use Kafka Connect for reading
other data sources into Kafka
first
KStream<String, TruckPosition> positions =
builder.stream("truck_position"
, Consumed.with(Serdes.String()
, truckPositionSerde));
KTable<String, Driver> driver =
builder.table("trucking_driver"
, Consumed.with(Serdes.String()
, driverSerde)
, Materialized.as("driver-store"));
Streaming Sinks
• File Sink – stores output to a directory
• Kafka Sink – publishes to Kafka
• Foreach Sink - Runs arbitrary computation on the records in the output
• Console Sink – for debugging, prints output to console
• Memory Sink – for debugging, stores output in-memory table
val query = jsonTruckPlusDriverDf
.selectExpr("to_json(struct(*)) AS value")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "broker-1:9092")
.option("topic", "dangerous_driving")
.option("checkpointLocation", "/tmp")
.start()
Streaming Sinks
"Kafka only"
Use Kafka Connect for
writing out to other targets
KStream<String, TruckPosition> posDriver = ..
posDriver.to("dangerous_driving"
  , Produced.with(Serdes.String()
  , truckPositionDriverSerde));
For testing only:
// print to system output
posDriver.print(Printed.toSysOut());
// roughly a shortcut for
posDriver.foreach((key,value) ->
  System.out.println(key + "=" + value));
Processing Model: Event-at-a-time vs. Micro Batch
Micro-Batch Processing
• Splits incoming stream in
small batches
• Higher latency
• Fault tolerance easier
Event-at-a-time Processing
• Events processed as they arrive
• low-latency
• fault tolerance expensive
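The latency difference between the two models can be made concrete with a small plain-Java sketch. It is a simplified model with hypothetical names, not Spark or Kafka Streams API, and it ignores the time spent on the processing itself. In it, a micro-batch engine handles all events of an interval when that interval closes, so each event waits for the close, while an event-at-a-time engine would handle each event on arrival.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical illustration; not Spark or Kafka Streams API.
// Assumes non-negative arrival times in milliseconds.
final class ProcessingModels {

    // Micro-batch: events arriving in the same interval are processed together
    // when that interval closes. The map key is the batch close time in ms.
    static TreeMap<Long, List<Long>> microBatches(List<Long> arrivalsMs, long intervalMs) {
        TreeMap<Long, List<Long>> batches = new TreeMap<>();
        for (long arrival : arrivalsMs) {
            long batchClose = (arrival / intervalMs + 1) * intervalMs;
            batches.computeIfAbsent(batchClose, k -> new ArrayList<>()).add(arrival);
        }
        return batches;
    }

    // Time an event waits for its batch to close. Event-at-a-time processing
    // has no such wait in this model, since each event is handled on arrival.
    static long batchWaitMs(long arrivalMs, long intervalMs) {
        return (arrivalMs / intervalMs + 1) * intervalMs - arrivalMs;
    }

    public static void main(String[] args) {
        List<Long> arrivals = List.of(100L, 450L, 900L, 1200L);
        System.out.println(microBatches(arrivals, 1000L)); // {1000=[100, 450, 900], 2000=[1200]}
        System.out.println(batchWaitMs(100L, 1000L));      // 900
    }
}
```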
Stateless Operations – Selection & Projection
Most common operations on
DataFrame/Dataset are supported for
streaming as well
select, filter, map, flatMap, …
KStream and KTable interfaces support
variety of transformation operations
filter, filterNot, map, mapValues,
flatMap, flatMapValues, branch,
selectKey, groupByKey …
val filteredDf =
truckPosDf.where(
"eventType !='Normal'")
KStream<String, TruckPosition> filtered =
  positions.filter((key,value) ->
    !value.eventType.equals("Normal")
  );
Stateful Operations – Aggregations
Held in distributed memory with option to
spill to disk (fault tolerant through
checkpointing to Hadoop-like FS)
Output modes: Complete, Append,
Update
count, sum, mapGroupsWithState,
flatMapGroupsWithState, reduce ...
Require state store which can be
in-memory, RocksDB or custom impl (fault
tolerant through Kafka topics)
Result of Aggregation is a KTable
count, reduce, aggregate (sum and avg
are built on top of these) ...
val c = source
.withWatermark("timestamp"
, "10 minutes")
.groupBy()
.count()
KTable<..> c = stream
.groupByKey(..)
.count(...);
Stateful Operations – Time Abstraction
[Figure: an event's event time, ingestion time and processing time shown against a clock; adapted from Matthias Niehoff (Codecentric)]
Stateful Operations – Time Abstraction
Event Time
• New with Spark Structured Streaming
• Extracted from the message (payload)
Ingestion Time
• for sources which capture ingestion time
Processing Time
• “Old” Spark Streaming only supported
processing time
• generate the timestamp upon processing
Event Time
• Point in time when event occurred
• Extracted from the message (payload or
header)
Ingestion Time
• Point in time when event is stored in Kafka
(sent in message header)
Processing Time
• Point in time when event happens to be
processed by stream processing
application
Spark Structured Streaming snippets:
df.withColumn("processingTime"
  , current_timestamp())
.option("includeTimestamp", true)
Stateful Operations - Windowing
streams are unbounded
need some meaningful time frames to do
computations (e.g. aggregations)
Computations over events done using
windows of data
Windows are tracked per unique key
[Figure: fixed, sliding and session windows laid over a stream of data along a time axis]
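How an event is assigned to fixed and sliding (hopping) windows can be sketched in plain Java. The names below are hypothetical and this is not the API of either framework. The sketch assumes non-negative timestamps in milliseconds and windows aligned to time 0, and the example reuses the 1 minute / 30 seconds setting from the use case.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration; assumes non-negative timestamps and windows aligned to time 0.
final class WindowAssigner {

    // Fixed (tumbling) window: each timestamp belongs to exactly one window.
    static long tumblingStart(long tsMs, long sizeMs) {
        return tsMs - (tsMs % sizeMs);
    }

    // Sliding (hopping) window: a window of sizeMs starts every advanceMs, so windows
    // overlap when advanceMs < sizeMs. Returns the start of every window containing tsMs.
    static List<Long> hoppingStarts(long tsMs, long sizeMs, long advanceMs) {
        List<Long> starts = new ArrayList<>();
        long latest = tsMs - (tsMs % advanceMs);
        for (long s = latest; s > tsMs - sizeMs && s >= 0; s -= advanceMs) {
            starts.add(s);
        }
        Collections.reverse(starts); // ascending order
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(tumblingStart(70_000L, 60_000L));          // 60000
        System.out.println(hoppingStarts(70_000L, 60_000L, 30_000L)); // [30000, 60000]
    }
}
```

With a 30 second advance, the event at 70 s is counted in two overlapping one-minute windows, which is why sliding-window aggregates can report the same event more than once.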
Stateful Operations - Windowing
Support for Tumbling & Hopping
(Sliding) Time Windows
Handling Late Data with
Watermarking
// watermark and window must use the same event-time column
val c = source
  .withWatermark("timestamp"
    , "10 minutes")
  .groupBy(window($"timestamp"
    , "1 minute"
    , "30 seconds")
    , $"word")
  .count()
[Figure: watermarking, plotted as event time against processing time. The watermark trails the max event time by a gap of 10 mins (time labels 12:10, 12:20, 12:25). Data older than the watermark is not expected and gets discarded.]
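The watermark rule can be sketched in a few lines of plain Java. The class below is a hypothetical illustration, not Spark API, and it simplifies in one important way: it moves the watermark after every event, whereas Spark advances it between micro-batches. The rule itself is the one from the slide: the watermark is the max event time seen so far minus the trailing gap, and events older than it are discarded.

```java
// Hypothetical illustration of a watermark; not Spark API.
// Simplification: the watermark advances per event rather than per micro-batch.
final class WatermarkTracker {
    private final long delayMs;
    private long maxEventTimeMs = Long.MIN_VALUE;

    WatermarkTracker(long delayMs) {
        this.delayMs = delayMs;
    }

    // Max event time seen so far minus the trailing gap.
    long watermark() {
        return maxEventTimeMs == Long.MIN_VALUE ? Long.MIN_VALUE : maxEventTimeMs - delayMs;
    }

    // Returns false for events older than the current watermark (they are discarded).
    boolean accept(long eventTimeMs) {
        if (eventTimeMs < watermark()) {
            return false;
        }
        maxEventTimeMs = Math.max(maxEventTimeMs, eventTimeMs);
        return true;
    }

    public static void main(String[] args) {
        long min = 60_000L;
        WatermarkTracker tracker = new WatermarkTracker(10 * min);
        System.out.println(tracker.accept(20 * min)); // true, watermark is now minute 10
        System.out.println(tracker.accept(12 * min)); // true, late but within the gap
        System.out.println(tracker.accept(5 * min));  // false, older than the watermark
    }
}
```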
Stateful Operations - Windowing
Support for Tumbling & Hopping Windows
Support for Session Windows
Handling Late Data with Data
Retention (optional)
KTable<..> c = stream
.groupByKey(...)
.windowedBy(
SessionWindows
.with(5 * 60 * 1000)
).count();
KTable<..> c = stream
.groupByKey(..)
.windowedBy(
TimeWindows.of(60 * 1000)
.advanceBy(30 * 1000)
.until(10 * 60 * 1000)
).count(...);
[Figure: data retention, plotted as event time against processing time. The retention boundary trails the max event time by a gap of 10 mins (time labels 12:10, 12:20, 12:25). Data older than that is not expected and gets discarded.]
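Session windows differ from time windows in that their boundaries follow the data. The plain-Java sketch below uses hypothetical names and is not Kafka Streams API. It groups the ascending event times of one key into sessions that close after an inactivity gap, and it assumes that a difference exactly equal to the gap still belongs to the same session.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of session windows; not Kafka Streams API.
final class Sessionizer {

    // Groups ascending timestamps of a single key into sessions.
    // Each result element is a {start, end} pair in milliseconds.
    static List<long[]> sessions(List<Long> sortedTsMs, long gapMs) {
        List<long[]> result = new ArrayList<>();
        long[] current = null;
        for (long ts : sortedTsMs) {
            if (current != null && ts - current[1] <= gapMs) {
                current[1] = ts;               // still active: extend the session
            } else {
                current = new long[] {ts, ts}; // gap exceeded (or first event): new session
                result.add(current);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // 5 minute inactivity gap, as in SessionWindows.with(5 * 60 * 1000)
        for (long[] s : sessions(List.of(0L, 60_000L, 400_000L), 5 * 60 * 1000L)) {
            System.out.println(Arrays.toString(s)); // [0, 60000] then [400000, 400000]
        }
    }
}
```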
Stateful Operations - Joins
Challenges of joining streams
1. Data streams need to be aligned as they
come because they have different timestamps
2. since streams are never-ending, the joins
must be limited; otherwise join will never end
3. join needs to produce results continuously as
there is no end to the data
Stream to Static (Table) Join
Stream to Stream Join (one window join)
Stream to Stream Join (two window join)
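The second challenge, bounding the join, is what a join window addresses. The following plain-Java sketch uses a hypothetical event type and method, not the API of either framework. It pairs events from two streams only when the keys match and the event times lie within the window. It loops over two finite lists for clarity; a real engine would buffer each side in state and expire entries once the window has passed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a windowed stream-to-stream inner join.
final class WindowedJoin {

    // Hypothetical event shape: key, event time in ms, payload.
    static final class Event {
        final String key;
        final long timeMs;
        final String value;

        Event(String key, long timeMs, String value) {
            this.key = key;
            this.timeMs = timeMs;
            this.value = value;
        }
    }

    // Pairs events with equal keys whose event times differ by at most windowMs.
    static List<String> join(List<Event> left, List<Event> right, long windowMs) {
        List<String> out = new ArrayList<>();
        for (Event l : left) {
            for (Event r : right) {
                if (l.key.equals(r.key) && Math.abs(l.timeMs - r.timeMs) <= windowMs) {
                    out.add(l.key + ":" + l.value + "+" + r.value);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Event> positions = List.of(new Event("t1", 1_000L, "pos-a"), new Event("t2", 2_000L, "pos-b"));
        List<Event> alerts = List.of(new Event("t1", 3_000L, "alert-x"), new Event("t2", 20_000L, "alert-y"));
        System.out.println(join(positions, alerts, 5_000L)); // [t1:pos-a+alert-x]
    }
}
```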
Stateful Operations - Joins
Stream-to-Static and Stream-to-Stream
(since 2.3) Joins on Dataset/DataFrame
Watermarking helps Spark to know for
how long to retain data
• Optional for Inner Joins
• Mandatory for Outer Joins
val jsonTruckPlusDriverDf =
jsonFilteredDf.join(driverDf
, Seq("driverId")
, "left")
Source: Spark Documentation
Supports following joins
• KStream-to-KStream
• KTable-to-KTable
• KStream-to-KTable
• KStream-to-GlobalKTable
• KTable-to-GlobalKTable
Stateful Operations - Joins
KStream<String, TruckPositionDriver> joined =
  filteredRekeyed.leftJoin(driver
    // in a left join, right is null when no driver matches the key
    , (left,right) -> new TruckPositionDriver(left
      , right == null ? "" : StringUtils.defaultIfEmpty(right.first_name,"")
      , right == null ? "" : StringUtils.defaultIfEmpty(right.last_name,""))
    , Joined.with(Serdes.String()
      , truckPositionSerde
      , driverSerde));
Source: Confluent Documentation
There is more ….
• Streaming Deduplication
• Run-Once Trigger / fixed Interval
Micro-Batching
• Continuous Trigger with fixed
checkpoint interval (experimental in
2.3)
• Streaming Machine Learning
• REPL
• Queryable State
• Processor API
• Exactly Once Processing
• Microservices with Kafka Streams
• Automatic Scale-up / Scale-Down
• Stand-by replica of local state
• Streaming SQL
There is more … Streaming SQL with KSQL
• Enables stream processing with
zero coding required
• The simplest way to process
(structured) streams of data in
real-time
• Powered by Kafka Streams
• KSQL server with REST API
• Spark SQL also offers SQL on
streaming data, but not as a
“first-class citizen”
ksql> CREATE STREAM truck_position_s 
(timestamp BIGINT, 
truckId BIGINT, 
driverId BIGINT, 
routeId BIGINT, 
eventType VARCHAR, 
latitude DOUBLE, 
longitude DOUBLE) 
WITH (kafka_topic='truck_position', 
value_format='JSON');
ksql> SELECT * FROM truck_position_s;
1506922133306 | "truck/13/position0 | 2017-10-02T07:28:53 | 31 | 13 | 371182829 | Memphis to Little Rock | Normal | 41.76 | -89.6 | -2084263951914664106
ksql> SELECT * FROM truck_position_s
WHERE eventType != 'Normal';
Summary
Spark Structured Streaming vs. Kafka Streams
Spark Structured Streaming
• Runs on top of a Spark cluster
• Reuse your investments into Spark
(knowledge and maybe code)
• An HDFS-like file system needs to be
available
• Higher latency due to micro-batching
• Multi-Language support: Java, Python,
Scala, R
• Supports ad-hoc, notebook-style
development/environment
Kafka Streams
• Available as a Java library
• Can be the implementation choice of a
microservice
• Can only work with Kafka for both input and
output
• Low latency due to continuous processing
• Currently only supports Java; Scala support
available soon
• KSQL abstraction provides SQL on top of
Kafka Streams
Comparison
Feature | Kafka Streams | Spark Streaming | Spark Structured Streaming
Language Options | Java (KIP for Scala), KSQL | Scala, Java, Python, R, SQL | Scala, Java, Python, R, SQL
Processing Model | Continuous Streaming | Micro-Batching | Micro-Batching
Core Abstraction | KStream / KTable | DStream (RDD) | Data Frame / Dataset
Programming Model | Declarative/Imperative | Declarative | Declarative
Time Support | Event / Ingestion / Processing | Processing | Event / Ingestion / Processing
State Support | Memory / RocksDB + Kafka | Memory / Disk | Memory / Disk
Time Window Support | Fixed, Sliding, Session | Fixed, Sliding | Fixed, Sliding
Join | Stream-Static, Stream-Stream | Stream-Static | Stream-Static, Stream-Stream (2.3)
Event Pattern detection | No | No | No
Query Language Support | KSQL | No | Spark SQL (limited)
Queryable State | Interactive Queries | No | No
Scalability & Reliability | Yes | Yes | Yes
Guarantees | At Least Once/Exactly Once | At Least Once/Exactly Once (partial) | At Least Once/Exactly Once (partial)
Latency | Sub-second | seconds | seconds
Deployment | Java Library | Cluster (with HDFS like FS) | Cluster (with HDFS like FS)
Technology on its own won't help you.
You need to know how to use it properly.

More Related Content

What's hot

Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDatabricks
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
ksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database SystemksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database Systemconfluent
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!Guido Schmutz
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDatabricks
 
Hello, kafka! (an introduction to apache kafka)
Hello, kafka! (an introduction to apache kafka)Hello, kafka! (an introduction to apache kafka)
Hello, kafka! (an introduction to apache kafka)Timothy Spann
 
Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka IntroductionAmita Mirajkar
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionDatabricks
 
Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?confluent
 
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...HostedbyConfluent
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataDataWorks Summit/Hadoop Summit
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Databricks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 

What's hot (20)

Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
 
Kafka 101
Kafka 101Kafka 101
Kafka 101
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
ksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database SystemksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database System
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
 
Hello, kafka! (an introduction to apache kafka)
Hello, kafka! (an introduction to apache kafka)Hello, kafka! (an introduction to apache kafka)
Hello, kafka! (an introduction to apache kafka)
 
Rds data lake @ Robinhood
Rds data lake @ Robinhood Rds data lake @ Robinhood
Rds data lake @ Robinhood
 
Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka Introduction
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for Production
 
Kafka presentation
Kafka presentationKafka presentation
Kafka presentation
 
Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?
 
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
 

Similar to Spark (Structured) Streaming vs. Kafka Streams

Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Databricks
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Helena Edelson
 
SSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine LearningSSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine Learningfelixcss
 
SSR: Structured Streaming on R for Machine Learning with Felix Cheung
SSR: Structured Streaming on R for Machine Learning with Felix CheungSSR: Structured Streaming on R for Machine Learning with Felix Cheung
SSR: Structured Streaming on R for Machine Learning with Felix CheungDatabricks
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingDatabricks
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkDatabricks
 
Kafka Connect & Kafka Streams/KSQL - powerful ecosystem around Kafka core
Kafka Connect & Kafka Streams/KSQL - powerful ecosystem around Kafka coreKafka Connect & Kafka Streams/KSQL - powerful ecosystem around Kafka core
Kafka Connect & Kafka Streams/KSQL - powerful ecosystem around Kafka coreGuido Schmutz
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIDatabricks
 
Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!Guido Schmutz
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleEvan Chan
 
Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!Guido Schmutz
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedwhoschek
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
KSQL - Stream Processing simplified!
KSQL - Stream Processing simplified!KSQL - Stream Processing simplified!
KSQL - Stream Processing simplified!Guido Schmutz
 
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache KafkaSolutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache KafkaGuido Schmutz
 

Similar to Spark (Structured) Streaming vs. Kafka Streams (20)

Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
SSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine LearningSSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine Learning
 
SSR: Structured Streaming on R for Machine Learning with Felix Cheung
SSR: Structured Streaming on R for Machine Learning with Felix CheungSSR: Structured Streaming on R for Machine Learning with Felix Cheung
SSR: Structured Streaming on R for Machine Learning with Felix Cheung
 
Spark (Structured) Streaming vs. Kafka Streams

  • 1. BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENF HAMBURG KOPENHAGEN LAUSANNE MÜNCHEN STUTTGART WIEN ZÜRICH Spark (Structured) Streaming vs. Kafka Streams Two stream processing platforms compared Guido Schmutz 3.12.2018 @gschmutz guidoschmutz.wordpress.com
  • 2. Agenda 1. Introducing Stream Processing 2. Spark Streaming vs. Kafka Streams – Overview 3. Spark Structured Streaming vs. Kafka Streams – in Action 4. Summary
  • 3. Guido Schmutz Working at Trivadis for more than 22 years Oracle Groundbreaker Ambassador & Oracle ACE Director Consultant, Trainer Software Architect for Java, Oracle, SOA and Big Data / Fast Data Head of Trivadis Architecture Board Technology Manager @ Trivadis More than 30 years of software development experience Contact: guido.schmutz@trivadis.com Blog: http://guidoschmutz.wordpress.com Slideshare: http://www.slideshare.net/gschmutz Twitter: gschmutz 135th edition
  • 5. “Data at Rest” vs. “Data in Motion”
      [Figure: two panels, “Data at Rest” and “Data in Motion”, each combining the steps Store, Analyze and Act.]
      (Slide footer: Architectures of Big Data Applications)
  • 6. Reference Architecture for Modern Data Analytics
      [Figure: architecture diagram. Labels in the diagram: Bulk Source (Location, DB, Extract, File), Event Source (IoT Data, Mobile Apps, Social), File Import / SQL Import, Change Data Capture, Event Hub, Event Stream, Telemetry, Data Flow, Edge Node (Rules, Event Hub, Storage), Stream Processor (State, API), Stream Analytics, Results DB, Hadoop Cluster / Big Data (Parallel Processing, Storage: Raw and Refined, SQL, Export), Enterprise Data Warehouse, BI Tools, Search / Explore, Search Service, Microservices (Microservice State, API), Enterprise Apps (Logic, API).]
  • 7. Two Types of Stream Processing (from Gartner)
      Stream Data Integration
        • Primarily covers streaming ETL
        • Integration of data sources and data sinks
        • Filter and transform data
        • (Enrich data)
        • Route data
      Stream Analytics
        • Calculating aggregates and detecting patterns to generate higher-level, more relevant summary information (complex events; this used to be called CEP)
        • Complex events may signify threats or opportunities that require a response
  • 8. Stream Processing & Analytics Ecosystem Stream Analytics Event Hub Open Source Closed Source Stream Data Integration Source: adapted from Tibco Edge Introduction to Stream Processing
  • 11. Spark Streaming vs. Kafka Streams - Overview
  • 12. Spark (Structured) Streaming
      Spark Streaming
        • 1st generation
        • One of the first APIs to enable stream processing using high-level functional operators like map and reduce
        • Like the RDD API, the DStreams API is based on relatively low-level operations on Java/Python objects
        • Used by many organizations in production
      Spark Structured Streaming
        • 2nd generation
        • Structured API through DataFrames / Datasets rather than RDDs
        • Easier code reuse between batch and streaming
        • Marked production ready in Spark 2.2.0
        • Support for Java, Scala, Python, R and SQL
        • Focus of this talk
  • 13. Apache Spark Streaming as part of Spark Stack
      [Figure: layered diagram of the Spark stack. Libraries: Spark (Structured) Streaming, Advanced Analytics, Libraries & Ecosystem. Structured API: Data Frame, Datasets, SQL. Low Level API: Resilient Distributed Dataset (RDD), Distributed Variables. Cluster Resource Managers: Spark Standalone, MESOS / Kubernetes, YARN. Data Sources / Data Sinks: HDFS, S3, RDBMS & NoSQL, Kafka.]
  • 14. Kafka Streams – part of Kafka Core
        • Designed as a simple and lightweight library in Apache Kafka
        • No external dependencies on systems other than Apache Kafka
        • Part of open source Apache Kafka, introduced in version 0.10
        • Leverages Kafka as its internal messaging layer
        • Support for Java and SQL (KSQL)
  • 15. Spark Structured Streaming vs. Kafka Streams – in Action
  • 16. Infrastructure
      Spark Structured Streaming
        • Runs as part of a full Spark stack
        • Cluster can be Spark Standalone, YARN-based or container-based
        • Many cloud options
      Kafka Streams
        • Just a Java library
        • Runs anywhere Java runs: web container, Java application, container-based …
  • 17. Main Abstractions (Spark Structured Streaming)
      Dataset/DataFrame API
        • DataFrames and Datasets can represent static, bounded data as well as streaming, unbounded data
        • Use readStream() instead of read()
      Transformations & Actions
        • Almost all transformations from Spark bounded data processing (batch) are also usable for streaming
      Input Sources and Sinks
      Triggers
        • Triggers define when data is output
        • As soon as the previous micro-batch has finished
        • Fixed interval between micro-batches
        • One-time micro-batch
      Output Mode
        • Defines how data is output
        • Append: only add new records to the output
        • Update: update changed records in place
        • Complete: rewrite the full output
  • 18. Main Abstractions: Topology (Spark Structured Streaming)
        val schema = new StructType()
          .add(...)

        val inputDf = spark
          .readStream
          .format(...)
          .option(...)
          .load()

        val filteredDf = inputDf.where(...)

        val query = filteredDf
          .writeStream
          .format(...)
          .option(...)
          .start()
      [Figure: topology with three nodes labelled I, F and O]
  • 19. Main Abstractions (Kafka Streams)
      Stream Processing Application
        • Program that uses the Kafka Streams library
      Topology
        • Logic that needs to be performed by stream processing
        • Functional DSL or low-level Processor API
      Stream Processor
        • A node in the processor topology
      KStream
        • Abstraction of a stream of records
        • Interpreted as events
      KTable
        • Abstraction of a change log stream
        • Interpreted as an update of the same record (by key)
      GlobalKTable
        • Like KTable, but not partitioned, so all data is available on all parallel application instances
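The difference between the KStream and KTable interpretations can be illustrated without any Kafka dependency. The following is a minimal plain-Java sketch, not Kafka Streams code: the class name `StreamTableDuality`, the keys and the `Overspeed` value are made up for illustration. The same three records are read once as independent events and once as a change log in which a later record replaces the earlier value for its key.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; no Kafka Streams classes are used here.
class StreamTableDuality {

    /** KStream-like view: every record is an independent event, so all are kept. */
    static List<String> asEventStream(List<String[]> records) {
        List<String> events = new ArrayList<>();
        for (String[] r : records) {
            events.add(r[0] + "=" + r[1]);
        }
        return events;
    }

    /** KTable-like view: a record is an update of the previous value for the same key. */
    static Map<String, String> asChangelogTable(List<String[]> records) {
        Map<String, String> table = new LinkedHashMap<>();
        for (String[] r : records) {
            table.put(r[0], r[1]);
        }
        return table;
    }

    public static void main(String[] args) {
        List<String[]> records = new ArrayList<>();
        records.add(new String[] {"driver-1", "Normal"});      // hypothetical keys and values
        records.add(new String[] {"driver-2", "Normal"});
        records.add(new String[] {"driver-1", "Overspeed"});
        System.out.println(asEventStream(records));     // three events
        System.out.println(asChangelogTable(records));  // two rows, latest value per key
    }
}
```

Read as a stream, the input yields three events; read as a table, it yields two rows, with `driver-1` holding only its latest value.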
  • 20. Main Abstractions: Topology (Kafka Streams)
        public static void main(String[] args) {
          Properties streamsConfiguration = new Properties();
          streamsConfiguration.put(...);

          final StreamsBuilder builder = new StreamsBuilder();

          KStream<..,..> stream = builder.stream(...);
          KStream<..,..> filtered = stream.filter(…);
          filtered.to(...);

          KafkaStreams streams = new KafkaStreams(
              builder.build(), streamsConfiguration);
          streams.start();
        }
      [Figure: topology with three nodes labelled I, F and O]
  • 21. Streaming Data Sources (Spark Structured Streaming)
        • File Source
            • Reads files as a stream of data
            • Supports text, csv, json, orc, parquet
            • Files must be atomically placed
        • Kafka Source
            • Reads from a Kafka topic
            • Supports Kafka broker 0.10 or higher
        • Socket Source (for testing)
            • Reads UTF-8 text from a socket connection
        • Rate Source (for testing)
            • Generates data at a specified number of rows per second

        val rawDf = spark
          .readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker-1:9092")
          .option("subscribe", "truck_position")
          .load()
  • 22. Streaming Data Sources (Kafka Streams)
        • "Kafka only"
        • KStream from topic
        • KTable from topic
        • Use Kafka Connect for reading other data sources into Kafka first

        KStream<String, TruckPosition> positions =
            builder.stream("truck_position",
                Consumed.with(Serdes.String(), truckPositionSerde));

        KTable<String, Driver> driver =
            builder.table("trucking_driver",
                Consumed.with(Serdes.String(), driverSerde),
                Materialized.as("driver-store"));
  • 23. Streaming Sinks (Spark Structured Streaming)
        • File Sink: stores output to a directory
        • Kafka Sink: publishes to Kafka
        • Foreach Sink: runs arbitrary computation on the records in the output
        • Console Sink: for debugging, prints output to the console
        • Memory Sink: for debugging, stores output in an in-memory table

        val query = jsonTruckPlusDriverDf
          .selectExpr("to_json(struct(*)) AS value")
          .writeStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker-1:9092")
          .option("topic", "dangerous_driving")
          .option("checkpointLocation", "/tmp")
          .start()
  • 24. Streaming Sinks (Kafka Streams)
        • "Kafka only"
        • Use Kafka Connect for writing out to other targets

        KStream<String, TruckPosition> posDriver = ..
        posDriver.to("dangerous_driving",
            Produced.with(Serdes.String(), truckPositionDriverSerde));

        • For testing only:

        KStream<String, TruckPosition> posDriver = ..
        // print to system output
        posDriver.print(Printed.toSysOut());
        // shortcut for
        posDriver.foreach((key, value) ->
            System.out.println(key + "=" + value));
  • 25. Processing Model: Event-at-a-time vs. Micro Batch
      Micro-Batch Processing
        • Splits the incoming stream into small batches
        • Higher latency
        • Fault tolerance is easier
      Event-at-a-time Processing
        • Events are processed as they arrive
        • Low latency
        • Fault tolerance is expensive
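The latency difference between the two models can be made concrete with a small calculation. This is a simplified sketch under stated assumptions: processing cost is ignored, timestamps are non-negative milliseconds, and a micro-batch only produces output once its interval has closed. The class and method names are hypothetical and belong to neither Spark nor Kafka Streams.

```java
// Simplified model: ignores processing cost and scheduling overhead.
class ProcessingModels {

    /** Event-at-a-time: a result can be emitted as soon as the event arrives. */
    static long eventAtATimeEmit(long arrivalMs) {
        return arrivalMs;
    }

    /** Micro-batch: the event waits until the batch interval it falls into has closed. */
    static long microBatchEmit(long arrivalMs, long batchIntervalMs) {
        long batchIndex = arrivalMs / batchIntervalMs;   // batch [index*interval, (index+1)*interval)
        return (batchIndex + 1) * batchIntervalMs;
    }

    public static void main(String[] args) {
        long arrival = 1250;
        System.out.println(eventAtATimeEmit(arrival));      // 1250
        System.out.println(microBatchEmit(arrival, 1000));  // 2000
    }
}
```

With a one-second batch interval, an event arriving at 1250 ms is not visible in the output before 2000 ms, which is the "higher latency" listed for micro-batching.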
  • 26. Stateless Operations – Selection & Projection
      Spark Structured Streaming
        • Most common operations on DataFrame/Dataset are supported for streaming as well
        • select, filter, map, flatMap, …

        val filteredDf = truckPosDf.where("eventType != 'Normal'")

      Kafka Streams
        • KStream and KTable interfaces support a variety of transformation operations
        • filter, filterNot, map, mapValues, flatMap, flatMapValues, branch, selectKey, groupByKey …

        KStream<> filtered = positions.filter((key, value) ->
            !value.eventType.equals("Normal"));
  • 27. Stateful Operations – Aggregations
      Spark Structured Streaming
        • State is held in distributed memory with the option to spill to disk (fault tolerant through checkpointing to a Hadoop-like file system)
        • Output modes: Complete, Append, Update
        • count, sum, mapGroupsWithState, flatMapGroupsWithState, reduce ...

        val c = source
          .withWatermark("timestamp", "10 minutes")
          .groupBy()
          .count()

      Kafka Streams
        • Requires a state store, which can be in-memory, RocksDB or a custom implementation (fault tolerant through Kafka topics)
        • Result of an aggregation is a KTable
        • count, sum, avg, reduce, aggregate ...

        KTable<..> c = stream
          .groupByKey(..)
          .count(...);
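What "stateful" means for an aggregation such as count can be sketched with an in-memory map standing in for the state store. This is an illustration only: the class `RunningCount` and its methods are hypothetical, and fault tolerance (checkpointing or changelog topics) is deliberately left out. `update` returns only the changed row, loosely mirroring the Update output mode, while `complete` returns the whole result, loosely mirroring Complete.

```java
import java.util.AbstractMap;
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for a state store; no fault tolerance.
class RunningCount {
    private final Map<String, Long> state = new TreeMap<>();

    /** Processes one record and returns only the row that changed. */
    Map.Entry<String, Long> update(String key) {
        long count = state.merge(key, 1L, Long::sum);
        return new AbstractMap.SimpleImmutableEntry<>(key, count);
    }

    /** Returns a copy of the full aggregation result. */
    Map<String, Long> complete() {
        return new TreeMap<>(state);
    }

    public static void main(String[] args) {
        RunningCount counts = new RunningCount();
        counts.update("truck-1");                       // hypothetical keys
        counts.update("truck-2");
        System.out.println(counts.update("truck-1"));   // only the changed row
        System.out.println(counts.complete());          // the full table
    }
}
```

The state has to survive between records, which is why both frameworks need a recovery mechanism for it.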
  • 28. Stateful Operations – Time Abstraction
      [Figure: events 1 to 5 on a clock timeline, showing event time, ingestion time and processing time; adapted from Matthias Niehoff (Codecentric)]
  • 29. Stateful Operations – Time Abstraction
      Spark Structured Streaming
        • Event Time
            • New with Spark Structured Streaming
            • Extracted from the message (payload)
        • Ingestion Time
            • For sources which capture ingestion time
        • Processing Time
            • “Old” Spark Streaming only supported processing time
            • Generate the timestamp upon processing

        df.withColumn("processingTime", current_timestamp())
        .option("includeTimestamp", true)

      Kafka Streams
        • Event Time
            • Point in time when the event occurred
            • Extracted from the message (payload or header)
        • Ingestion Time
            • Point in time when the event is stored in Kafka (sent in the message header)
        • Processing Time
            • Point in time when the event happens to be processed by the stream processing application
  • 30. Stateful Operations - Windowing
        • Streams are unbounded, so some meaningful time frames are needed to do computations (i.e. aggregations)
        • Computations over events are done using windows of data
        • Windows are tracked per unique key
        • Window types: Fixed Window, Sliding Window, Session Window
      [Figure: a stream of data along a time axis, divided into windows of data]
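How an event is assigned to fixed or sliding windows can be worked out with simple arithmetic. The sketch below is a plain-Java illustration, not framework code: `WindowAssigner` is a hypothetical name, windows are assumed to be half-open intervals [start, start + size) aligned to multiples of the advance, and windows with a negative start are skipped. With size equal to advance the windows are fixed (tumbling); with a smaller advance they overlap.

```java
import java.util.ArrayList;
import java.util.List;

// Assumes non-negative timestamps and windows aligned to multiples of advanceMs.
class WindowAssigner {

    /** Returns the start times (newest first) of all windows [start, start + sizeMs) containing ts. */
    static List<Long> windowStarts(long ts, long sizeMs, long advanceMs) {
        List<Long> starts = new ArrayList<>();
        long latestStart = ts - (ts % advanceMs);   // latest aligned start that is <= ts
        for (long start = latestStart; start > ts - sizeMs && start >= 0; start -= advanceMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // 1-minute windows advancing every 30 seconds
        System.out.println(windowStarts(70_000, 60_000, 30_000));
        // 1-minute fixed windows
        System.out.println(windowStarts(70_000, 60_000, 60_000));
    }
}
```

An event at 70 s falls into two overlapping windows in the sliding case and into exactly one window in the fixed case.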
  • 31. Stateful Operations - Windowing (Spark Structured Streaming)
        • Support for Tumbling & Hopping (Sliding) Time Windows
        • Handling late data with watermarking

        val c = source
          .withWatermark("timestamp", "10 minutes")
          .groupBy(window($"eventTime", "1 minutes", "30 seconds"), $"word")
          .count()

      [Figure: event time plotted against processing time. The watermark trails the max event time by a gap of 10 mins (e.g. max event time 12:20, watermark 12:10). Data older than the watermark is not expected and gets discarded.]
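The watermark rule from the figure (watermark = max event time seen so far minus the allowed delay) can be sketched in a few lines. This is a deliberately simplified plain-Java model with hypothetical names: it updates the watermark on every event, whereas a micro-batch engine would typically advance it between batches, and it drops any event older than the watermark.

```java
// Simplified per-event watermark model; time unit is whatever the caller uses.
class WatermarkTracker {
    private final long delay;
    private boolean seenAny = false;
    private long maxEventTime;

    WatermarkTracker(long delay) {
        this.delay = delay;
    }

    /** Current watermark, or Long.MIN_VALUE before the first event. */
    long watermark() {
        return seenAny ? maxEventTime - delay : Long.MIN_VALUE;
    }

    /** Returns true if the event is accepted, false if it is older than the watermark. */
    boolean accept(long eventTime) {
        if (eventTime < watermark()) {
            return false;                    // too late, discarded
        }
        if (!seenAny || eventTime > maxEventTime) {
            maxEventTime = eventTime;        // only ever moves forward
            seenAny = true;
        }
        return true;
    }

    public static void main(String[] args) {
        WatermarkTracker tracker = new WatermarkTracker(10);   // 10 minutes
        tracker.accept(12 * 60 + 20);                          // event at 12:20
        System.out.println(tracker.watermark());               // 730, i.e. 12:10
        System.out.println(tracker.accept(12 * 60 + 15));      // 12:15 is within the gap
        System.out.println(tracker.accept(12 * 60 + 5));       // 12:05 is older than the watermark
    }
}
```

Using minutes since midnight, the example reproduces the values in the figure: a max event time of 12:20 with a 10-minute gap gives a watermark of 12:10.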
  • 32. Stateful Operations - Windowing (Kafka Streams)
        • Support for Tumbling & Hopping Windows
        • Support for Session Windows
        • Handling late data with data retention (optional)

        KTable<..> c = stream
          .groupByKey(...)
          .windowedBy(
            SessionWindows.with(5 * 60 * 1000)
          ).count();

        KTable<..> c = stream
          .groupByKey(..)
          .windowedBy(
            TimeWindows.of(60 * 1000)
              .advanceBy(30 * 1000)
              .until(10 * 60 * 1000)
          ).count(...);

      [Figure: event time plotted against processing time. The data retention boundary trails the max event time by a gap of 10 mins (e.g. max event time 12:20, boundary 12:10). Data older than the boundary is not expected and gets discarded.]
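Session windows differ from time windows in that their boundaries come from the data: a session ends when no event arrives within the inactivity gap. The following plain-Java sketch groups the timestamps of a single key into sessions. It is an illustration with hypothetical names, it assumes the timestamps are already sorted, and it assumes that a new session starts only when the pause is strictly greater than the gap.

```java
import java.util.ArrayList;
import java.util.List;

// Groups sorted timestamps of one key into sessions separated by an inactivity gap.
class SessionGrouper {

    /** Returns one {start, end} pair per session. */
    static List<long[]> sessions(long[] sortedTimestamps, long gapMs) {
        List<long[]> result = new ArrayList<>();
        if (sortedTimestamps.length == 0) {
            return result;
        }
        long start = sortedTimestamps[0];
        long end = start;
        for (int i = 1; i < sortedTimestamps.length; i++) {
            long ts = sortedTimestamps[i];
            if (ts - end > gapMs) {              // pause longer than the gap closes the session
                result.add(new long[] {start, end});
                start = ts;
            }
            end = ts;
        }
        result.add(new long[] {start, end});     // last open session
        return result;
    }

    public static void main(String[] args) {
        long fiveMinutes = 5 * 60 * 1000;
        long[] events = {0, 60_000, 120_000, 900_000, 960_000};
        for (long[] s : sessions(events, fiveMinutes)) {
            System.out.println(s[0] + " .. " + s[1]);
        }
    }
}
```

With the 5-minute gap used on the slide, the five sample events form two sessions, because the pause between 120 s and 900 s exceeds the gap.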
  • 33. Stateful Operations - Joins
      Challenges of joining streams
        1. Data streams need to be aligned as they come, because they have different timestamps
        2. Since streams are never-ending, the joins must be limited; otherwise the join will never end
        3. The join needs to produce results continuously, as there is no end to the data
      Join types
        • Stream to Static (Table) Join
        • Stream to Stream Join (one window join)
        • Stream to Stream Join (two window join)
      [Figure: three timelines illustrating a Stream-to-Static Join and two Stream-to-Stream Joins]
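The second challenge, limiting the join, is usually addressed by only matching records whose timestamps lie within a window of each other. The sketch below shows that idea on two finite lists in plain Java. It is not how either framework implements joins: the class names, keys and values are hypothetical, and a real engine would keep per-key state and evict it rather than scan both inputs.

```java
import java.util.ArrayList;
import java.util.List;

// Time-bounded inner join over two finite event lists (illustration only).
class WindowedJoin {

    static final class Event {
        final String key;
        final long ts;
        final String value;

        Event(String key, long ts, String value) {
            this.key = key;
            this.ts = ts;
            this.value = value;
        }
    }

    /** Joins events with equal keys whose timestamps differ by at most windowMs. */
    static List<String> join(List<Event> left, List<Event> right, long windowMs) {
        List<String> out = new ArrayList<>();
        for (Event l : left) {
            for (Event r : right) {
                if (l.key.equals(r.key) && Math.abs(l.ts - r.ts) <= windowMs) {
                    out.add(l.key + ":" + l.value + "+" + r.value);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Event> positions = new ArrayList<>();
        positions.add(new Event("truck-1", 1_000, "position"));
        List<Event> alerts = new ArrayList<>();
        alerts.add(new Event("truck-1", 1_500, "alert"));       // within the window
        alerts.add(new Event("truck-1", 9_000, "late-alert"));  // outside the window
        alerts.add(new Event("truck-2", 1_000, "alert"));       // different key
        System.out.println(join(positions, alerts, 1_000));
    }
}
```

Because the window bounds how far apart two matching records can be, old records can be dropped once the window has passed, which keeps the join state finite.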
  • 34. Stateful Operations - Joins (Spark Structured Streaming)
        • Stream-to-Static and Stream-to-Stream (since 2.3) joins on Dataset/DataFrame
        • Watermarking helps Spark to know for how long to retain data
            • Optional for inner joins
            • Mandatory for outer joins

        val jsonTruckPlusDriverDf = jsonFilteredDf.join(driverDf, Seq("driverId"), "left")

      Source: Spark Documentation
  • 35. Stateful Operations - Joins (Kafka Streams)
      Supports the following joins
        • KStream-to-KStream
        • KTable-to-KTable
        • KStream-to-KTable
        • KStream-to-GlobalKTable
        • KTable-to-GlobalKTable

        KStream<String, TruckPositionDriver> joined =
            filteredRekeyed.leftJoin(driver,
                (left, right) -> new TruckPositionDriver(left,
                    StringUtils.defaultIfEmpty(right.first_name, ""),
                    StringUtils.defaultIfEmpty(right.last_name, "")),
                Joined.with(Serdes.String(), truckPositionSerde, driverSerde));

      Source: Confluent Documentation
  • 36. There is more …
      Spark Structured Streaming:
        • Streaming Deduplication
        • Run-Once Trigger / fixed Interval Micro-Batching
        • Continuous Trigger with fixed checkpoint interval (experimental in 2.3)
        • Streaming Machine Learning
        • REPL
      Kafka Streams:
        • Queryable State
        • Processor API
        • Exactly Once Processing
        • Microservices with Kafka Streams
        • Automatic Scale-up / Scale-Down
        • Stand-by replica of local state
        • Streaming SQL
  • 37. There is more … Streaming SQL with KSQL
      • Enables stream processing with zero coding required
      • The simplest way to process (structured) streams of data in real-time
      • Powered by Kafka Streams
      • KSQL server with REST API
      • Spark SQL also offers SQL on streaming data, but not as a "first-class citizen"
      Code:
        ksql> CREATE STREAM truck_position_s
                (timestamp BIGINT, truckId BIGINT, driverId BIGINT, routeId BIGINT,
                 eventType VARCHAR, latitude DOUBLE, longitude DOUBLE)
              WITH (kafka_topic='truck_position', value_format='JSON');
        ksql> SELECT * FROM truck_position_s;
        1506922133306 | "truck/13/position0 | 2017-10-02T07:28:53 | 31 | 13 | 371182829 | Memphis to Little Rock | Normal | 41.76 | -89.6 | -2084263951914664106
        ksql> SELECT * FROM truck_position_s WHERE eventType != 'Normal';
  • 39. Spark Structured Streaming vs. Kafka Streams
      Spark Structured Streaming:
        • Runs on top of a Spark cluster
        • Reuse your investments in Spark (knowledge and maybe code)
        • An HDFS-like file system needs to be available
        • Higher latency due to micro-batching
        • Multi-language support: Java, Python, Scala, R
        • Supports ad-hoc, notebook-style development/environment
      Kafka Streams:
        • Available as a Java library
        • Can be the implementation choice of a microservice
        • Can only work with Kafka for both input and output
        • Low latency due to continuous processing
        • Currently only supports Java; Scala support available soon
        • KSQL abstraction provides SQL on top of Kafka Streams
  • 40. Comparison
      Criterion | Kafka Streams | Spark Streaming | Spark Structured Streaming
      Language Options | Java (KIP for Scala), KSQL | Scala, Java, Python, R, SQL | Scala, Java, Python, R, SQL
      Processing Model | Continuous Streaming | Micro-Batching | Micro-Batching
      Core Abstraction | KStream / KTable | DStream (RDD) | Data Frame / Dataset
      Programming Model | Declarative/Imperative | Declarative | Declarative
      Time Support | Event / Ingestion / Processing | Processing | Event / Ingestion / Processing
      State Support | Memory / RocksDB + Kafka | Memory / Disk | Memory / Disk
      Time Window Support | Fixed, Sliding, Session | Fixed, Sliding | Fixed, Sliding
      Join | Stream-Static, Stream-Stream | Stream-Static | Stream-Static, Stream-Stream (2.3)
      Event Pattern Detection | No | No | No
      Query Language Support | KSQL | No | Spark SQL (limited)
      Queryable State | Interactive Queries | No | No
      Scalability & Reliability | Yes | Yes | Yes
      Guarantees | At Least Once / Exactly Once | At Least Once / Exactly Once (partial) | At Least Once / Exactly Once (partial)
      Latency | Sub-second | Seconds | Seconds
      Deployment | Java Library | Cluster (with HDFS-like FS) | Cluster (with HDFS-like FS)
  • 41. Technology on its own won't help you. You need to know how to use it properly.