Meetup spark structured streaming
1. Our product uses third-generation Big Data
technologies to deliver the most comprehensive
form of Digital Transformation
INTRODUCING
SPARK STRUCTURED
STREAMING
2. José Carlos García Serrano
Big Data Architect at Stratio.
I am from Granada, a computer science engineer from
ETSII, postgraduate in Big Data, and certified in
Spark and AWS
def fanBoy(): Seq[Skills] = {
  val functional = Seq(Scala, Akka)
  val processing = Seq(Spark)
  val noSql = Seq(MongoDB, Cassandra)
  functional ++ processing ++ noSql
}
def aLongTimeAgo(): Seq[Skills] = {
  val programming = Seq(Delphi, C++)
  val processing = Seq(Hadoop)
  val sql = Seq(Interbase, FireBird)
  programming ++ processing ++ sql
}
7. Introduction
Main features
● Integrated in SQL API
● Easy to use
● Continuous processing with exactly-once guarantees
● Interactive queries
● Joins with static data integrating Streaming and Batch
● Continuous aggregations
● Stateful operations
● Low latency (~1 ms in continuous processing mode)
14. Basic concepts
Integration with Spark SQL and Dataset API
val dataFrameStream: DataFrame = …….
dataFrameStream.createOrReplaceTempView("streamTable")
sparkSession.sql("select * from streamTable")
Catalyst and Tungsten optimizations!!!
15. Basic concepts
Mixing Streaming data with Batch data
val streamDf: DataFrame = spark.readStream.format(…).load()
val staticDf: DataFrame = spark.read.format(…).load()
streamDf.join(staticDf, "type")
Two worlds in a line of code!!!
dStream.foreachRDD { rdd =>
  val staticDf = spark.sql("select …...")
  val streamDf = spark.createDataFrame(rdd, schema)
  streamDf.join(staticDf)
}
16. Basic concepts
Data cleaning (drop duplicates)
val dataFrameStream: DataFrame = …….
dataFrameStream.dropDuplicates("idColumn")
Optimized in DataFrames with a watermark; without a watermark, all
past events are stored in the State
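The state trade-off can be illustrated without Spark. The sketch below is not Spark's implementation (`Dedup` and `watermarkMs` are invented names): it mimics how `dropDuplicates` with a watermark can evict ids whose event time falls behind the watermark, whereas without one every id ever seen must stay in state.

```scala
// Illustrative, Spark-free sketch of watermark-bounded deduplication.
final class Dedup(watermarkMs: Long) {
  private var seen = Map.empty[String, Long] // id -> event time
  private var maxEventTime = 0L

  // Returns true if the event is new (kept), false if it is a duplicate.
  def offer(id: String, eventTimeMs: Long): Boolean = {
    maxEventTime = math.max(maxEventTime, eventTimeMs)
    val threshold = maxEventTime - watermarkMs
    seen = seen.filter { case (_, t) => t >= threshold } // evict expired state
    if (seen.contains(id)) false
    else { seen += id -> eventTimeMs; true }
  }

  def stateSize: Int = seen.size
}
```

Note the consequence of eviction: an id re-sent long after the watermark has passed it is treated as new again, which is the price of bounded state.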
17. Basic concepts
Output modes
● Complete -> the entire updated Result Table is written
(only with aggregations)
● Append -> only the new rows appended to the Result Table are written
(no aggregations; joins; aggregations with a watermark)
● Update -> only the rows that were updated in the Result Table are written
(no aggregations; aggregations, with or without a watermark)
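The three modes can be illustrated with a Spark-free sketch of a running count aggregation (`ResultTable` is an invented name). It is a simplification: in real Structured Streaming, Append mode with aggregations only emits rows once the watermark finalizes them.

```scala
// Illustrative sketch of the three output modes over a running count.
final class ResultTable {
  private var counts = Map.empty[String, Long]

  // Process one micro-batch; return what each mode would emit.
  def process(batch: Seq[String]): (Map[String, Long], Map[String, Long], Map[String, Long]) = {
    val delta = batch.groupBy(identity).map { case (k, v) => k -> v.size.toLong }
    val newKeys = delta.keySet.diff(counts.keySet)  // rows never written before
    counts = counts ++ delta.map { case (k, n) => k -> (counts.getOrElse(k, 0L) + n) }
    val complete = counts                                             // whole table
    val append = counts.filter { case (k, _) => newKeys(k) }          // only new rows
    val update = counts.filter { case (k, _) => delta.contains(k) }   // only changed rows
    (complete, append, update)
  }
}
```

On the first batch all three modes coincide; they diverge as soon as an existing row is updated rather than appended.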
18. Basic concepts
Exactly-once
● Source must manage offsets (Kafka or Kinesis)
● Checkpointing and write-ahead logs
● Idempotent sink (transactional updates available in the file system sink)
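Why the sink must be idempotent: on failure, Structured Streaming replays the last batch from the checkpointed offsets, so the sink may receive the same batch twice. A sketch (with the invented name `IdempotentSink`) of how committing per batch id turns that redelivery into exactly-once output:

```scala
// Illustrative sketch: a sink that ignores batches it has already committed.
final class IdempotentSink {
  private var committed = Set.empty[Long]   // batch ids already written
  private var rows = Vector.empty[String]

  // Writing the same batch twice (e.g. replayed after a crash) has no effect.
  def write(batchId: Long, batch: Seq[String]): Unit =
    if (!committed(batchId)) {
      rows = rows ++ batch
      committed += batchId
    }

  def contents: Vector[String] = rows
}
```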
30. Stateful operations
Easy API with all functionalities
csvDF.groupByKey(...)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(mappingFunction)

def mappingFunction(key: K, values: Iterator[V], state: GroupState[S]): String = {
  if (state.hasTimedOut) { // If called when timing out, remove the state
    state.remove()
  } else if (state.exists) { // If state exists, use it for processing
    val existingState = state.get // Get the existing state
    val shouldRemove = ... // Decide whether to remove the state
    if (shouldRemove) {
      state.remove() // Remove the state
    } else {
      val newState = ...
      state.update(newState) // Set the new state
      state.setTimeoutDuration("1 hour") // Set the timeout
    }
  } else {
    val initialState = ...
    state.update(initialState) // Set the initial state
    state.setTimeoutDuration("1 hour") // Set the timeout
  }
}
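The template above can be made concrete without Spark. The sketch below is an invented, Spark-free analogue (`KeyedCounter`, `TimedOut`, and `Values` are illustrative names, not Spark APIs): it keeps a running count per key and removes the state on timeout, mirroring the branches of the mapping function.

```scala
// Spark-free sketch of the mapGroupsWithState branches (illustrative names).
sealed trait Event
case object TimedOut extends Event                  // analogue of state.hasTimedOut
final case class Values(vs: Seq[Int]) extends Event // a group's new values

final class KeyedCounter {
  private var state = Map.empty[String, Int] // analogue of GroupState[S] per key

  def mapGroup(key: String, event: Event): String = event match {
    case TimedOut =>                       // timed out: remove the state
      state -= key
      s"$key expired"
    case Values(vs) =>                     // existing or initial state: update it
      val total = state.getOrElse(key, 0) + vs.sum
      state += key -> total
      s"$key=$total"
  }
}
```

After a timeout the key starts again from its initial state, exactly as in the `state.remove()` branch above.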
31. “
If you want to build complex
aggregations, you should use this
API or the state functions in the
Streaming API
42. Contact me and check some examples in my GitHub:
● gserranojc@gmail.com
● www.linkedin.com/in/gserranojc
● https://github.com/compae/structured-streaming-tests
Spark documentation:
● https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html