Spark (Structured) Streaming vs. Kafka Streams


Independent of the source of data, the integration and analysis of event streams are becoming more important in the world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analyzed, often with many consumers or systems interested in all or part of the events. In this session we compare two popular streaming analytics solutions: Spark Streaming and Kafka Streams.
Spark is a fast and general engine for large-scale data processing, designed to provide a more efficient alternative to Hadoop MapReduce. Spark Streaming brings Spark's language-integrated API to stream processing, letting you write streaming applications the same way you write batch jobs. It supports both Java and Scala.
Kafka Streams is the stream processing solution that is part of Kafka. It is provided as a Java library and can therefore be easily integrated with any Java application.
This presentation shows how you can implement stream processing solutions with each of the two frameworks, discusses how they compare and highlights the differences and similarities.


  1. Spark (Structured) Streaming vs. Kafka Streams: Two stream processing platforms compared. Guido Schmutz, 3.12.2018, @gschmutz, guidoschmutz.wordpress.com
  2. Agenda
     1. Introducing Stream Processing
     2. Spark Streaming vs. Kafka Streams – Overview
     3. Spark Structured Streaming vs. Kafka Streams – in Action
     4. Summary
  3. Guido Schmutz
     • Working at Trivadis for more than 22 years
     • Oracle Groundbreaker Ambassador & Oracle ACE Director
     • Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data
     • Head of Trivadis Architecture Board
     • Technology Manager @ Trivadis
     • More than 30 years of software development experience
     • Contact: guido.schmutz@trivadis.com
     • Blog: http://guidoschmutz.wordpress.com
     • Slideshare: http://www.slideshare.net/gschmutz
     • Twitter: gschmutz
     • 135th edition
  4. Introducing Stream Processing
  5. “Data at Rest” vs. “Data in Motion” [Figure: two panels, "Data at Rest" and "Data in Motion", each showing the steps Store, Analyze and Act applied to a stream of bits. Footer: Architectures of Big Data Applications]
  6. Reference Architecture for Modern Data Analytics [Figure: architecture diagram. Recoverable labels: Bulk Source (File, DB Extract), Event Source (Location, IoT Data, Mobile Apps, Social, Telemetry), Event Hub, Event Stream, Change Data Capture, Data Flow, File Import / SQL Import, Big Data (Hadoop Cluster, Parallel Processing, Storage Raw / Refined), SQL Export, Stream Analytics (Stream Processor, State, API, Rules, Edge Node, Results DB), Microservices (Microservice, State, Logic, API), Enterprise Apps, Enterprise Data Warehouse, BI Tools, Search / Explore, Search Service]
  7. Two Types of Stream Processing (from Gartner)
     Stream Data Integration
     • Primarily covers streaming ETL
     • Integration of data sources and data sinks
     • Filter and transform data
     • (Enrich data)
     • Route data
     Stream Analytics
     • Calculating aggregates and detecting patterns to generate higher-level, more relevant summary information (complex events, formerly known as CEP)
     • Complex events may signify threats or opportunities that require a response
  8. Stream Processing & Analytics Ecosystem [Figure: product landscape grouped into Stream Analytics, Event Hub, Stream Data Integration and Edge, split into Open Source and Closed Source. Source: adapted from Tibco]
  9. Stream Processing & Analytics Ecosystem [Figure: same landscape as on the previous slide. Source: adapted from Tibco]
  10. Example Use Case [Figure: data flow diagram with the labels Truck-1, Truck-2, Truck-3, truck_position, detect_dangerous_driving, Truck Driver, jdbc-source, join_dangerous_driving_driver, dangerous_driving_driver, Count By Event Type, Window (1m, 30s), count_by_event_type]
  11. Spark Streaming vs. Kafka Streams - Overview
  12. Spark (Structured) Streaming
      Spark Streaming
      • 1st generation
      • One of the first APIs to enable stream processing using high-level functional operators like map and reduce
      • Like the RDD API, the DStreams API is based on relatively low-level operations on Java/Python objects
      • Used by many organizations in production
      Spark Structured Streaming
      • 2nd generation
      • Structured API through DataFrames / Datasets rather than RDDs
      • Easier code reuse between batch and streaming
      • Marked production-ready in Spark 2.2.0
      • Support for Java, Scala, Python, R and SQL
      • Focus of this talk
  13. Apache Spark Streaming as part of Spark Stack [Figure: Spark stack. Libraries & Ecosystem: Spark (Structured) Streaming, Advanced Analytics. Structured API: Data Frame, Datasets, SQL. Low Level API: Resilient Distributed Dataset (RDD), Distributed Variables. Cluster Resource Managers: Spark Standalone, MESOS / Kubernetes, YARN. Data Sources / Data Sinks: HDFS, S3, RDBMS & NoSQL, Kafka]
  14. Kafka Streams – part of Kafka Core
      • Designed as a simple and lightweight library in Apache Kafka
      • No external dependencies on systems other than Apache Kafka
      • Part of open source Apache Kafka, introduced in 0.10+
      • Leverages Kafka as its internal messaging layer
      • Support for Java and SQL (KSQL)
  15. Spark Structured Streaming vs. Kafka Streams – in Action
  16. Infrastructure
      Spark Structured Streaming
      • Runs as part of a full Spark stack
      • Cluster can be either Spark Standalone, YARN-based or container-based
      • Many cloud options
      Kafka Streams
      • Just a Java library
      • Runs anywhere Java runs: Web Container, Java Application, Container-based …
  17. Main Abstractions (Spark Structured Streaming)
      Dataset/Data Frame API
      • DataFrames and Datasets can represent static, bounded data, as well as streaming, unbounded data
      • Use readStream() instead of read()
      Transformations & Actions
      • Almost all transformations from Spark bounded data processing (batch) are also usable for streaming
      Input Sources and Sinks
      Triggers
      • Triggers define when data is output
      • As soon as the last group is finished
      • Fixed interval between micro-batches
      • One-time micro-batch
      Output Mode
      • Defines how data is output
      • Append – only add new records to output
      • Update – update changed records in place
      • Complete – rewrite full output
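  18. Main Abstractions: Topology (Spark Structured Streaming; the slide marks the three stages as I, F and O)
      val schema = new StructType()
        .add(...)
      // I: read the input stream
      val inputDf = spark
        .readStream
        .format(...)
        .option(...)
        .load()
      // F: filter
      val filteredDf = inputDf.where(...)
      // O: write the output stream
      val query = filteredDf
        .writeStream
        .format(...)
        .option(...)
        .start()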
  19. Main Abstractions (Kafka Streams)
      Stream Processing Application
      • Program that uses the Kafka Streams library
      Topology
      • Logic that needs to be performed by stream processing
      • Functional DSL or low-level Processor API
      Stream Processor
      • A node in the processor topology
      KStream
      • Abstraction of a stream of records
      • Interpreted as events
      KTable
      • Abstraction of a changelog stream
      • Interpreted as updates of the same record (by key)
      GlobalKTable
      • Like KTable, but not partitioned => all data is available on all parallel application instances
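
The difference between the KStream and KTable interpretations on slide 19 can be illustrated without Kafka at all. The following is a minimal plain-Java sketch, not the Kafka Streams API: the class `StreamVsTableSketch`, the `Rec` type and the event type `Overspeed` are hypothetical names chosen for this illustration. The same three records are read once as independent events and once as a changelog keyed by truck.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; none of this is the Kafka Streams API.
final class StreamVsTableSketch {

    // Hypothetical keyed record, e.g. (truck id, event type).
    static final class Rec {
        final String key;
        final String value;

        Rec(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // Record-stream view: every record is an independent event, so all are kept.
    static List<String> asEventStream(List<Rec> records) {
        List<String> events = new ArrayList<>();
        for (Rec r : records) {
            events.add(r.key + "=" + r.value);
        }
        return events;
    }

    // Changelog view: a record replaces the previous value stored for its key.
    static Map<String, String> asChangelogTable(List<Rec> records) {
        Map<String, String> table = new LinkedHashMap<>();
        for (Rec r : records) {
            table.put(r.key, r.value);
        }
        return table;
    }

    public static void main(String[] args) {
        List<Rec> records = Arrays.asList(
            new Rec("truck-1", "Normal"),
            new Rec("truck-2", "Normal"),
            new Rec("truck-1", "Overspeed"));
        System.out.println(asEventStream(records));    // all three events
        System.out.println(asChangelogTable(records)); // latest value per key
    }
}
```

In the stream view the two `truck-1` records are separate facts; in the table view the second one overwrites the first, which is why the result of an aggregation fits the table abstraction.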
  20. Main Abstractions: Topology (Kafka Streams; the slide marks the three stages as I, F and O)
      public static void main(String[] args) {
        Properties streamsConfiguration = new Properties();
        streamsConfiguration.put(...);
        final StreamsBuilder builder = new StreamsBuilder();
        // I: read the input topic
        KStream<..,..> stream = builder.stream(...);
        // F: filter
        KStream<..,..> filtered = stream.filter(...);
        // O: write to the output topic
        filtered.to(...);
        KafkaStreams streams = new KafkaStreams(
          builder.build(), streamsConfiguration);
        streams.start();
      }
  21. Streaming Data Sources (Spark Structured Streaming)
      • File Source
        • Reads files as a stream of data
        • Supports text, csv, json, orc, parquet
        • Files must be atomically placed
      • Kafka Source
        • Reads from Kafka Topic
        • Supports Kafka broker 0.10.0 or higher
      • Socket Source (for testing)
        • Reads UTF8 text from socket connection
      • Rate Source (for testing)
        • Generates data at a specified number of rows per second
      val rawDf = spark
        .readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker-1:9092")
        .option("subscribe", "truck_position")
        .load()
  22. Streaming Data Sources (Kafka Streams)
      • "Kafka only"
      • Use Kafka Connect for reading other data sources into Kafka first
      • KStream from Topic
        KStream<String, TruckPosition> positions =
          builder.stream("truck_position",
            Consumed.with(Serdes.String(), truckPositionSerde));
      • KTable from Topic
        KTable<String, Driver> driver =
          builder.table("trucking_driver",
            Consumed.with(Serdes.String(), driverSerde),
            Materialized.as("driver-store"));
  23. Streaming Sinks (Spark Structured Streaming)
      • File Sink – stores output to a directory
      • Kafka Sink – publishes to Kafka
      • Foreach Sink – runs arbitrary computation on the records in the output
      • Console Sink – for debugging, prints output to console
      • Memory Sink – for debugging, stores output in an in-memory table
      val query = jsonTruckPlusDriverDf
        .selectExpr("to_json(struct(*)) AS value")
        .writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker-1:9092")
        .option("topic", "dangerous_driving")
        .option("checkpointLocation", "/tmp")
        .start()
  24. Streaming Sinks (Kafka Streams)
      • "Kafka only"
      • Use Kafka Connect for writing out to other targets
        KStream<String, TruckPositionDriver> posDriver = ..
        posDriver.to("dangerous_driving",
          Produced.with(Serdes.String(), truckPositionDriverSerde));
      • For testing only:
        KStream<String, TruckPositionDriver> posDriver = ..
        // print to system output
        posDriver.print(Printed.toSysOut());
        // shortcut for
        posDriver.foreach((key, value) -> System.out.println(key + "=" + value));
  25. Processing Model: Event-at-a-time vs. Micro Batch
      Micro-Batch Processing
      • Splits the incoming stream into small batches
      • Higher latency
      • Fault tolerance easier
      Event-at-a-time Processing
      • Events processed as they arrive
      • Low latency
      • Fault tolerance expensive
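
The latency difference between the two processing models on slide 25 comes down to simple arithmetic. The sketch below is a plain-Java illustration under simplifying assumptions: the class and method names are hypothetical, timestamps are non-negative milliseconds, processing itself takes no time, and a micro-batch is handled exactly when its interval closes. Neither Spark nor Kafka Streams is involved.

```java
// Plain-Java illustration of when an event gets processed under each model.
final class ProcessingModelSketch {

    // Event-at-a-time: the event is handled the moment it arrives.
    static long eventAtATimeProcessedAt(long arrivalMs) {
        return arrivalMs;
    }

    // Micro-batch: the event waits until the batch interval it fell into closes.
    static long microBatchProcessedAt(long arrivalMs, long batchIntervalMs) {
        long batchStart = (arrivalMs / batchIntervalMs) * batchIntervalMs;
        return batchStart + batchIntervalMs;
    }

    public static void main(String[] args) {
        long batchIntervalMs = 1_000;
        long[] arrivals = {100, 950, 1_000, 2_300};
        for (long arrival : arrivals) {
            long waited = microBatchProcessedAt(arrival, batchIntervalMs)
                - eventAtATimeProcessedAt(arrival);
            System.out.println("arrival=" + arrival + "ms, extra wait in micro-batch=" + waited + "ms");
        }
    }
}
```

Under these assumptions an event waits up to one full batch interval before it is processed, which is the "higher latency" the slide attributes to micro-batching.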
  26. Stateless Operations – Selection & Projection
      Spark Structured Streaming
      • Most common operations on DataFrame/Dataset are supported for streaming as well
      • select, filter, map, flatMap, …
        val filteredDf = truckPosDf.where(
          "eventType != 'Normal'")
      Kafka Streams
      • KStream and KTable interfaces support a variety of transformation operations
      • filter, filterNot, map, mapValues, flatMap, flatMapValues, branch, selectKey, groupByKey …
        KStream<String, TruckPosition> filtered =
          positions.filter((key, value) ->
            !value.eventType.equals("Normal"));
  27. Stateful Operations – Aggregations
      Spark Structured Streaming
      • State is held in distributed memory with the option to spill to disk (fault tolerant through checkpointing to a Hadoop-like FS)
      • Output modes: Complete, Append, Update
      • count, sum, mapGroupsWithState, flatMapGroupsWithState, reduce ...
        val c = source
          .withWatermark("timestamp", "10 minutes")
          .groupBy()
          .count()
      Kafka Streams
      • Requires a state store, which can be in-memory, RocksDB or a custom implementation (fault tolerant through Kafka topics)
      • Result of an aggregation is a KTable
      • count, sum, avg, reduce, aggregate ...
        KTable<..> c = stream
          .groupByKey(..)
          .count(...);
  28. Stateful Operations – Time Abstraction [Figure: events 1 to 5 placed on a clock timeline to contrast Event Time, Ingestion Time and Processing Time; adapted from Matthias Niehoff (Codecentric)]
  29. Stateful Operations – Time Abstraction
      Spark Structured Streaming
      • Event Time
        • New with Spark Structured Streaming
        • Extracted from the message (payload)
      • Ingestion Time
        • For sources which capture ingestion time
          .option("includeTimestamp", true)
      • Processing Time
        • “Old” Spark Streaming only supported processing time
        • Generate the timestamp upon processing
          df.withColumn("processingTime", current_timestamp())
      Kafka Streams
      • Event Time
        • Point in time when the event occurred
        • Extracted from the message (payload or header)
      • Ingestion Time
        • Point in time when the event is stored in Kafka (sent in message header)
      • Processing Time
        • Point in time when the event happens to be processed by the stream processing application
  30. Stateful Operations - Windowing
      • Streams are unbounded
      • Need some meaningful time frames to do computations (e.g. aggregations)
      • Computations over events are done using windows of data
      • Windows are tracked per unique key
      [Figure: Fixed Window, Sliding Window and Session Window drawn as windows of data over a stream of data on a time axis]
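
How a single event ends up in one fixed (tumbling) window but in several sliding (hopping) windows can be worked out by hand. The following plain-Java sketch computes the window start times for one timestamp; the class and method names are hypothetical, and it assumes windows aligned to epoch 0, non-negative timestamps and an advance no larger than the window size. It is not how either framework is implemented internally.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of time-window assignment.
final class WindowAssignmentSketch {

    // Start times of all windows [start, start + sizeMs) that contain timestampMs.
    // With advanceMs == sizeMs the windows are tumbling and exactly one matches.
    static List<Long> windowStartsFor(long timestampMs, long sizeMs, long advanceMs) {
        List<Long> starts = new ArrayList<>();
        // Latest window start that is not after the timestamp.
        long latestStart = timestampMs - (timestampMs % advanceMs);
        for (long start = latestStart;
             start >= 0 && start > timestampMs - sizeMs;
             start -= advanceMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long eventTimeMs = 70_000; // 1 min 10 s after epoch 0
        // Hopping window of 1 minute advancing every 30 seconds.
        System.out.println(windowStartsFor(eventTimeMs, 60_000, 30_000));
        // Tumbling window of 1 minute.
        System.out.println(windowStartsFor(eventTimeMs, 60_000, 60_000));
    }
}
```

The hopping case uses the same 1 minute / 30 seconds combination as the "Window (1m, 30s)" step of the example use case, so every event is counted in two overlapping windows.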
  31. Stateful Operations - Windowing (Spark Structured Streaming)
      • Support for Tumbling & Hopping (Sliding) Time Windows
      • Handling Late Data with Watermarking
        // the watermark and the window must be defined on the same event-time column
        val c = source
          .withWatermark("timestamp", "10 minutes")
          .groupBy(window($"timestamp", "1 minute", "30 seconds"), $"word")
          .count()
      • Data older than the watermark is not expected and gets discarded
      [Figure: event time plotted against processing time; the watermark trails the max event time by a gap of 10 mins; time values shown: 12:10, 12:20, 12:25]
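
The watermark rule on slide 31 (watermark = max event time seen minus the allowed delay) can be sketched in a few lines. This is a simplified plain-Java illustration with hypothetical names: it moves the watermark after every event, whereas Spark advances it between micro-batches, and it only decides whether an event is accepted instead of managing window state. Times are minutes of the day, so 12:20 is 740.

```java
// Plain-Java illustration of a trailing watermark; not Spark code.
final class WatermarkSketch {
    private final long allowedDelay;
    private long maxEventTime;
    private boolean seenAny = false;

    WatermarkSketch(long allowedDelay) {
        this.allowedDelay = allowedDelay;
    }

    // The watermark trails the largest event time seen so far.
    long watermark() {
        return maxEventTime - allowedDelay;
    }

    // Returns false when the event is older than the current watermark.
    boolean accept(long eventTime) {
        if (seenAny && eventTime < watermark()) {
            return false; // too late, discarded
        }
        maxEventTime = seenAny ? Math.max(maxEventTime, eventTime) : eventTime;
        seenAny = true;
        return true;
    }

    public static void main(String[] args) {
        WatermarkSketch w = new WatermarkSketch(10); // trailing gap of 10 minutes
        long[] eventTimes = {740, 735, 725, 745, 732}; // 12:20, 12:15, 12:05, 12:25, 12:12
        for (long t : eventTimes) {
            System.out.println("event at minute " + t + " accepted: " + w.accept(t));
        }
    }
}
```

After the 12:20 event the watermark sits at 12:10, so the 12:15 event is still accepted while the 12:05 event is dropped.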
  32. Stateful Operations - Windowing (Kafka Streams)
      • Support for Tumbling & Hopping Windows
        KTable<..> c = stream
          .groupByKey(..)
          .windowedBy(
            TimeWindows.of(60 * 1000)
              .advanceBy(30 * 1000)
              .until(10 * 60 * 1000)
          ).count(...);
      • Support for Session Windows
        KTable<..> c = stream
          .groupByKey(...)
          .windowedBy(
            SessionWindows.with(5 * 60 * 1000)
          ).count();
      • Handling Late Data with Data Retention (optional)
      [Figure: event time plotted against processing time; the data retention boundary trails the max event time by a gap of 10 mins, and data older than that is not expected and gets discarded; time values shown: 12:10, 12:20, 12:25]
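
Session windows differ from time windows in that their boundaries come from the data. The plain-Java sketch below groups timestamps of one key into sessions using an inactivity gap of 5 minutes, the same value as in the `SessionWindows` snippet on slide 32. Names are hypothetical, the input is assumed to be sorted by time, and a new session starts only when the gap is strictly exceeded; that boundary rule is a choice made for this sketch, and a real engine also has to merge sessions when late events arrive.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of session windows for a single key.
final class SessionWindowSketch {

    // Returns sessions as {firstTimestamp, lastTimestamp} pairs.
    // Assumes the timestamps are already sorted in ascending order.
    static List<long[]> sessionize(long[] sortedTimestampsMs, long inactivityGapMs) {
        List<long[]> sessions = new ArrayList<>();
        long[] current = null;
        for (long ts : sortedTimestampsMs) {
            if (current != null && ts - current[1] <= inactivityGapMs) {
                current[1] = ts; // still active: extend the running session
            } else {
                current = new long[] {ts, ts}; // gap exceeded: open a new session
                sessions.add(current);
            }
        }
        return sessions;
    }

    public static void main(String[] args) {
        long gapMs = 5 * 60 * 1000;
        long[] timestamps = {0, 60_000, 120_000, 600_000, 630_000};
        for (long[] s : sessionize(timestamps, gapMs)) {
            System.out.println("session from " + s[0] + " to " + s[1]);
        }
    }
}
```

The first three events fall into one session because each follows the previous one within the gap; the pause of 8 minutes before the fourth event opens a second session.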
  33. Stateful Operations - Joins
      Challenges of joining streams
      1. Data streams need to be aligned as they arrive because they have different timestamps
      2. Since streams are never-ending, the joins must be limited; otherwise the join will never end
      3. The join needs to produce results continuously as there is no end to the data
      [Figure: three join types shown over time: Stream-to-Static (Table) Join, Stream-to-Stream Join (one window join), Stream-to-Stream Join (two window join)]
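
The third challenge, producing results continuously, is easiest to see for the stream-to-table case. The plain-Java sketch below keeps the latest table value per key and emits one joined result per stream event, using whatever the table holds at that moment. All names, including the driver ids and driver names, are hypothetical, and the code mirrors neither the Spark nor the Kafka Streams join API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of a left stream-to-table join.
final class StreamTableJoinSketch {
    private final Map<String, String> driverTable = new HashMap<>();
    private final List<String> output = new ArrayList<>();

    // Table side: an update replaces the current value for the key.
    void onTableUpdate(String driverId, String driverName) {
        driverTable.put(driverId, driverName);
    }

    // Stream side: each event is joined immediately against the current table state.
    // Left join semantics: an event without a match is still emitted.
    void onStreamEvent(String driverId, String eventType) {
        String name = driverTable.getOrDefault(driverId, "unknown");
        output.add(eventType + " by " + name);
    }

    List<String> output() {
        return output;
    }

    public static void main(String[] args) {
        StreamTableJoinSketch join = new StreamTableJoinSketch();
        join.onTableUpdate("11", "Anna");
        join.onStreamEvent("11", "Overspeed");      // matches Anna
        join.onStreamEvent("12", "Lane Departure"); // no driver known yet
        join.onTableUpdate("12", "Ben");
        join.onStreamEvent("12", "Overspeed");      // matches Ben
        join.output().forEach(System.out::println);
    }
}
```

The second event shows why timing matters: the driver record for key 12 arrives only afterwards, so the result for that event carries no driver name.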
  34. Stateful Operations - Joins (Spark Structured Streaming)
      • Stream-to-Static and Stream-to-Stream (since 2.3) joins on Dataset/DataFrame
      • Watermarking helps Spark to know for how long to retain data
        • Optional for inner joins
        • Mandatory for outer joins
      val jsonTruckPlusDriverDf =
        jsonFilteredDf.join(driverDf, Seq("driverId"), "left")
      Source: Spark Documentation
  35. Stateful Operations - Joins (Kafka Streams)
      Supports the following joins
      • KStream-to-KStream
      • KTable-to-KTable
      • KStream-to-KTable
      • KStream-to-GlobalKTable
      • KTable-to-GlobalKTable
      KStream<String, TruckPositionDriver> joined =
        filteredRekeyed.leftJoin(driver,
          (left, right) -> new TruckPositionDriver(left,
            StringUtils.defaultIfEmpty(right.first_name, ""),
            StringUtils.defaultIfEmpty(right.last_name, "")),
          Joined.with(Serdes.String(), truckPositionSerde, driverSerde));
      Source: Confluent Documentation
  36. There is more …
      Spark Structured Streaming
      • Streaming Deduplication
      • Run-Once Trigger / fixed Interval Micro-Batching
      • Continuous Trigger with fixed checkpoint interval (experimental in 2.3)
      • Streaming Machine Learning
      • REPL
      Kafka Streams
      • Queryable State
      • Processor API
      • Exactly Once Processing
      • Microservices with Kafka Streams
      • Automatic Scale-up / Scale-Down
      • Stand-by replica of local state
      • Streaming SQL
  37. There is more … Streaming SQL with KSQL
      • Enables stream processing with zero coding required
      • The simplest way to process (structured) streams of data in real-time
      • Powered by Kafka Streams
      • KSQL server with REST API
      • Spark SQL also offers SQL on streaming data, but not as a “first-class citizen”
      ksql> CREATE STREAM truck_position_s
              (timestamp BIGINT, truckId BIGINT, driverId BIGINT, routeId BIGINT,
               eventType VARCHAR, latitude DOUBLE, longitude DOUBLE)
              WITH (kafka_topic='truck_position', value_format='JSON');
      ksql> SELECT * FROM truck_position_s;
      1506922133306 | "truck/13/position0 | 2017-10-02T07:28:53 | 31 | 13 | 371182829 | Memphis to Little Rock | Normal | 41.76 | -89.6 | -2084263951914664106
      ksql> SELECT * FROM truck_position_s WHERE eventType != 'Normal';
  38. Summary
  39. Spark Structured Streaming vs. Kafka Streams
      Spark Structured Streaming
      • Runs on top of a Spark cluster
      • Reuse your investments into Spark (knowledge and maybe code)
      • An HDFS-like file system needs to be available
      • Higher latency due to micro-batching
      • Multi-language support: Java, Python, Scala, R
      • Supports ad-hoc, notebook-style development/environment
      Kafka Streams
      • Available as a Java library
      • Can be the implementation choice of a microservice
      • Can only work with Kafka for both input and output
      • Low latency due to continuous processing
      • Currently only supports Java; Scala support available soon
      • KSQL abstraction provides SQL on top of Kafka Streams
  40. Comparison
      Criterion | Kafka Streams | Spark Streaming | Spark Structured Streaming
      Language Options | Java (KIP for Scala), KSQL | Scala, Java, Python, R, SQL | Scala, Java, Python, R, SQL
      Processing Model | Continuous Streaming | Micro-Batching | Micro-Batching
      Core Abstraction | KStream / KTable | DStream (RDD) | Data Frame / Dataset
      Programming Model | Declarative/Imperative | Declarative | Declarative
      Time Support | Event / Ingestion / Processing | Processing | Event / Ingestion / Processing
      State Support | Memory / RocksDB + Kafka | Memory / Disk | Memory / Disk
      Time Window Support | Fixed, Sliding, Session | Fixed, Sliding | Fixed, Sliding
      Join | Stream-Static, Stream-Stream | Stream-Static | Stream-Static, Stream-Stream (2.3)
      Event Pattern Detection | No | No | No
      Query Language Support | KSQL | No | Spark SQL (limited)
      Queryable State | Interactive Queries | No | No
      Scalability & Reliability | Yes | Yes | Yes
      Guarantees | At Least Once / Exactly Once | At Least Once / Exactly Once (partial) | At Least Once / Exactly Once (partial)
      Latency | Sub-second | Seconds | Seconds
      Deployment | Java Library | Cluster (with HDFS-like FS) | Cluster (with HDFS-like FS)
  41. Technology on its own won't help you. You need to know how to use it properly.