Our product is transformational: it uses third-generation Big Data
technologies to deliver the most comprehensive form of
Digital Transformation
INTRODUCING
SPARK STRUCTURED
STREAMING
José Carlos García Serrano
Big Data Architect in Stratio.
I am from Granada, a computer science engineer from the ETSII,
with a postgraduate degree in Big Data and certifications in
Spark and AWS
def fanBoy(): Seq[Skills] = {
val functional = Seq(Scala, Akka)
val processing = Seq(Spark)
val noSql = Seq(MongoDB, Cassandra)
functional ++ processing ++ noSql
}
def aLongTimeAgo(): Seq[Skills] = {
val programming = Seq(Delphi, C++)
val processing = Seq(Hadoop)
val sql = Seq(Interbase, FireBird)
programming ++ processing ++ sql
}
INDEX
1. Introduction
2. Basic concepts
3. Continuous aggregations
4. Stateful operations
5. Kafka integration
6. Benchmark and other streaming frameworks
Introduction
1
One Streaming API to rule
them all
Spark Core + Spark SQL + Spark Streaming =
Spark Structured Streaming
Introduction
Introduction
Main features
● Integrated in SQL API
● Easy to use
● Continuous processing with exactly-once semantics
● Interactive queries
● Joins with static data integrating Streaming and Batch
● Continuous aggregations
● Stateful operations
● Low latency (<1ms)
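As a minimal sketch of these features working together (the app name, host and port are illustrative, assuming a Spark 2.x build with the SQL module), a complete word-count job from a socket source to the console looks roughly like this:

// Minimal sketch: socket source -> continuous aggregation -> console sink
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("intro-example").master("local[*]").getOrCreate()
import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Running word count, updated incrementally as new lines arrive
val counts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

val query = counts.writeStream
  .outputMode("complete")                       // aggregations without a watermark need complete/update
  .format("console")
  .trigger(Trigger.ProcessingTime("5 seconds")) // periodic micro-batches
  .start()

query.awaitTermination()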
Introduction
Basic concepts
2
Basic concepts
Load from Stream Sources
val csvDF = spark
.readStream
.option("sep", ";")
.schema(userSchema) // file sources require an explicit schema
.csv("/path/to/directory") // .csv(path) already loads the stream, so no .load() is needed
val kafkaDF = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1")
.option("subscribe", "topic1")
.load()
Kafka, Files (JSON, CSV, Parquet, text), Socket and more…
Basic concepts
Output Sinks
csvDF.writeStream
.format("parquet")
.option("path", "path/to/destination/dir")
.start()
csvDF.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1")
.option("topic", "topic2")
.start()
Kafka, Files(json, csv, parquet, txt), JDBC, Foreach, Memory and more ….
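The Foreach sink listed above has no example on the slides; a hedged sketch of a custom ForeachWriter (the printing logic is purely illustrative) could look like this:

import org.apache.spark.sql.{ForeachWriter, Row}

// Illustrative ForeachWriter that just prints every row of the stream
val printWriter = new ForeachWriter[Row] {
  def open(partitionId: Long, version: Long): Boolean = true // open resources per partition/epoch
  def process(row: Row): Unit = println(row)                  // handle a single row
  def close(errorOrNull: Throwable): Unit = ()                // release resources
}

csvDF.writeStream
  .foreach(printWriter)
  .start()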
Basic concepts
Triggers
csvDF.writeStream
.format("parquet")
.trigger(Trigger.Once())
.option("path", "path/to/destination/dir")
.start()
csvDF.writeStream
.format("kafka")
.trigger(Trigger.ProcessingTime("6 seconds"))
.option("kafka.bootstrap.servers", "host1:port1")
.option("topic", "topic2")
.start()
Continuous triggering, a single execution (Trigger.Once) or periodic execution (Trigger.ProcessingTime)
Basic concepts
Basic concepts
Integration with Spark SQL and Dataset API
val dataFrameStream : DataFrame = …….
dataFrameStream.createOrReplaceTempView("streamTable")
sparkSession.sql("select * from streamTable")
Catalyst and Tungsten optimizations!!!
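For the interactive queries mentioned in the features, one common pattern (sketched here, with an illustrative query name) is to write the stream into the in-memory sink and then query it with plain SQL:

// Sketch: interactive queries over a stream via the memory sink
val query = dataFrameStream.writeStream
  .format("memory")
  .queryName("streamTable")   // registered as a temp table under this name
  .outputMode("append")
  .start()

// Can be queried interactively at any time while the stream runs
sparkSession.sql("select count(*) from streamTable").show()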
Basic concepts
Mixing Streaming data with Batch data
val streamDf : DataFrame = spark.readStream
val staticDf : DataFrame = spark.read
streamDf.join(staticDf, "type")
Two worlds in a line of code!!!
// The same thing with the old DStream API:
dStream.foreachRDD { rdd =>
  val staticDf = spark.sql("select …...")
  val streamDf = spark.createDataFrame(rdd, schema)
  streamDf.join(staticDf)
}
Basic concepts
Data cleaning (drop duplicates)
val dataFrameStream : DataFrame = …….
dataFrameStream.dropDuplicates("idColumn")
Optimized in DataFrames with a watermark; without a watermark, all
past events are stored in the state
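A short sketch of the watermark-bounded variant (assuming the stream has an event-time column called "timestamp"):

// Deduplication with bounded state: events arriving later than the watermark are dropped,
// so old deduplication state can be cleaned up
dataFrameStream
  .withWatermark("timestamp", "10 minutes")
  .dropDuplicates("idColumn", "timestamp")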
Basic concepts
Output modes
● Complete -> The entire updated Result Table is written
(only with aggregations)
● Append -> Only the new rows appended to the Result Table are written
(no aggregations, joins, aggregations with a watermark)
● Update -> Only the rows that were updated in the Result Table are written
(no aggregations, with aggregations, no aggregations with a watermark)
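The output mode is chosen per query on writeStream; a minimal sketch (windowedCounts stands in for any aggregated stream):

windowedCounts.writeStream
  .outputMode("update")   // or "append" / "complete", subject to the rules above
  .format("console")
  .start()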
Basic concepts
Exactly once
● The source must manage offsets (Kafka or Kinesis)
● Checkpointing and write-ahead logs
● Idempotent sinks (transactional updates available in the file system sink)
Basic concepts
Continuous
aggregations
3
Continuous aggregations
SQL:
select count(*) from words group by word
Spark Dataset:
words.groupBy("word").count()
Minimal state storage!!
Continuous aggregations
Incremental queries with windowed aggregations
Continuous aggregations
Windowed aggregation example
val words : DataFrame = …..
words.groupBy(
window(
timeColumn = $"timestamp",
windowDuration = "10 minutes",
slideDuration = "5 minutes"
),$"word"
).count()
6 lines of code in Scala!!!
Now we know how to build
continuous and incremental
aggregations
We need to optimize these aggregations for production
deployments
Continuous aggregations
With watermarks, Spark expires late data
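A hedged sketch combining the previous windowed aggregation with a watermark (assuming an event-time column called "timestamp"), so Spark can drop the state of windows older than the threshold:

// Needs import org.apache.spark.sql.functions.window and spark.implicits._
words
  .withWatermark("timestamp", "10 minutes")   // events more than 10 minutes late are ignored
  .groupBy(
    window($"timestamp", "10 minutes", "5 minutes"),
    $"word"
  ).count()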
Continuous aggregations
Stateful operations
4
Stateful operations
Stateful operations
Checkpointing integrated
csvDF.writeStream
.format("parquet")
.option("checkpointLocation", "/tmp/checkpoint")
.option("path", "path/to/destination/dir")
.start()
Stateful operations
Easy API with all functionalities
csvDF.groupByKey(...)
.mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(mappingFunction)
def mappingFunction(key: K, value: Iterator[V], state: GroupState[S]): String = {
if (state.hasTimedOut) { // If called when timing out, remove the state
state.remove()
} else if (state.exists) { // If state exists, use it for processing
val existingState = state.get // Get the existing state
val shouldRemove = ... // Decide whether to remove the state
if (shouldRemove) {
state.remove() // Remove the state
} else {
val newState = ...
state.update(newState) // Set the new state
state.setTimeoutDuration("1 hour") // Set the timeout
}
} else {
val initialState = ...
state.update(initialState) // Set the initial state
state.setTimeoutDuration("1 hour") // Set the timeout
}
}
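As a concrete, hypothetical instance of the pattern above (the Event case class and the per-user count are illustrative), a stateful running count with a one-hour timeout could look like this:

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

case class Event(userId: String, action: String)

def countPerUser(key: String, events: Iterator[Event], state: GroupState[Long]): String = {
  if (state.hasTimedOut) {                       // no events for this key within the timeout
    val finalCount = state.get
    state.remove()
    s"$key expired with $finalCount events"
  } else {
    val newCount = state.getOption.getOrElse(0L) + events.size
    state.update(newCount)                       // keep the running count
    state.setTimeoutDuration("1 hour")           // expire idle keys after one hour
    s"$key has $newCount events so far"
  }
}

// Requires import spark.implicits._ for the encoders
val eventsDS = csvDF.as[Event]                   // illustrative: map the stream to Event
eventsDS
  .groupByKey(_.userId)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(countPerUser)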
“
If you want to build complex
aggregations, you should use this
API or the state functions in the
Streaming API
Kafka integration
5
Spark Streaming and
Structured Streaming should
be used with Kafka
Integrated as a Source and as a Sink!!
It is now also available for batch jobs
Kafka integration
Kafka integration
Reading streaming data
spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
.load()
Kafka integration
Reading streaming data for batch jobs
spark.read
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1,topic2")
.option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""")
.option("endingOffsets", """{"topic1":{"0":50,"1":-1},"topic2":{"0":-1}}""")
.load()
…
.option("subscribePattern", "topic.*")
.option("startingOffsets", "earliest")
.option("endingOffsets", "latest")
Kafka integration
Sink streaming data
streamDf.writeStream // writeStream is called on a streaming DataFrame, not on the SparkSession
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "topic1")
.start()
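The Kafka sink writes a "value" column (and optionally a "key"), so the output is usually shaped first; a hedged sketch (the id column and checkpoint path are illustrative):

streamDf
  .selectExpr("CAST(id AS STRING) AS key", "to_json(struct(*)) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  .option("checkpointLocation", "/tmp/kafka-sink-checkpoint")
  .start()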
Benchmark and other
streaming frameworks
6
Benchmark and other streaming frameworks
Benchmark and other streaming frameworks
Benchmark and other streaming frameworks
All that glitters is not
gold!!!
Contact me and check some examples in my GitHub:
● gserranojc@gmail.com
● www.linkedin.com/in/gserranojc
● https://github.com/compae/structured-streaming-tests
Spark documentation:
● https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
