Meetup spark structured streaming
1. Our product uses third-generation Big Data
technologies to deliver the most comprehensive
form of Digital Transformation
INTRODUCING
SPARK STRUCTURED
STREAMING
2. José Carlos García Serrano
Big Data Architect at Stratio.
I am from Granada, a computer science engineer from
ETSII, postgraduate in Big Data, and certified in
Spark and AWS
def fanBoy(): Seq[Skills] = {
  val functional = Seq(Scala, Akka)
  val processing = Seq(Spark)
  val noSql = Seq(MongoDB, Cassandra)
  functional ++ processing ++ noSql
}
def aLongTimeAgo(): Seq[Skills] = {
  val programming = Seq(Delphi, C++)
  val processing = Seq(Hadoop)
  val sql = Seq(Interbase, FireBird)
  programming ++ processing ++ sql
}
7. Introduction
Main features
● Integrated in SQL API
● Easy to use
● Continuous processing with exactly-once guarantees
● Interactive queries
● Joins with static data integrating Streaming and Batch
● Continuous aggregations
● Stateful operations
● Low latency (~1 ms in continuous processing mode)
14. Basic concepts
Integration with Spark SQL and Dataset API
val dataFrameStream: DataFrame = …….
dataFrameStream.createOrReplaceTempView("streamTable")
sparkSession.sql("select * from streamTable")
Catalyst and Tungsten optimizations!!!
15. Basic concepts
Mixing Streaming data with Batch data
val streamDf: DataFrame = spark.readStream.format(…).load()
val staticDf: DataFrame = spark.read.format(…).load()
streamDf.join(staticDf, "type")
Two worlds in a line of code!!!
dStream.foreachRDD { rdd =>
  val staticDf = spark.sql("select …...")
  val streamDf = spark.createDataFrame(rdd, schema)
  streamDf.join(staticDf)
}
16. Basic concepts
Data cleaning (drop duplicates)
val dataFrameStream: DataFrame = …….
dataFrameStream.dropDuplicates("idColumn")
Optimized in DataFrames with a watermark; without a watermark, all
past events are stored in the State
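The state trade-off can be illustrated without Spark. The sketch below is not Spark's implementation (`Dedup` and `watermarkMs` are invented names): it mimics how `dropDuplicates` with a watermark can evict ids whose event time falls behind the watermark, whereas without one every id ever seen must stay in state.

```scala
// Illustrative, Spark-free sketch of watermark-bounded deduplication.
final class Dedup(watermarkMs: Long) {
  private var seen = Map.empty[String, Long] // id -> event time
  private var maxEventTime = 0L

  // Returns true if the event is new (kept), false if it is a duplicate.
  def offer(id: String, eventTimeMs: Long): Boolean = {
    maxEventTime = math.max(maxEventTime, eventTimeMs)
    val threshold = maxEventTime - watermarkMs
    seen = seen.filter { case (_, t) => t >= threshold } // evict expired state
    if (seen.contains(id)) false
    else { seen += id -> eventTimeMs; true }
  }

  def stateSize: Int = seen.size
}
```

Note the consequence of eviction: an id re-sent long after the watermark has passed it is treated as new again, which is the price of bounded state.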
17. Basic concepts
Output modes
● Complete -> the entire updated Result Table is written
(only with aggregations)
● Append -> only the new rows appended to the Result Table are written
(no aggregations; joins; aggregations with a watermark)
● Update -> only the rows that were updated in the Result Table are written
(no aggregations; aggregations, with or without a watermark)
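The three modes can be illustrated with a Spark-free sketch of a running count aggregation (`ResultTable` is an invented name). It is a simplification: in real Structured Streaming, Append mode with aggregations only emits rows once the watermark finalizes them.

```scala
// Illustrative sketch of the three output modes over a running count.
final class ResultTable {
  private var counts = Map.empty[String, Long]

  // Process one micro-batch; return what each mode would emit.
  def process(batch: Seq[String]): (Map[String, Long], Map[String, Long], Map[String, Long]) = {
    val delta = batch.groupBy(identity).map { case (k, v) => k -> v.size.toLong }
    val newKeys = delta.keySet.diff(counts.keySet)  // rows never written before
    counts = counts ++ delta.map { case (k, n) => k -> (counts.getOrElse(k, 0L) + n) }
    val complete = counts                                             // whole table
    val append = counts.filter { case (k, _) => newKeys(k) }          // only new rows
    val update = counts.filter { case (k, _) => delta.contains(k) }   // only changed rows
    (complete, append, update)
  }
}
```

On the first batch all three modes coincide; they diverge as soon as an existing row is updated rather than appended.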
18. Basic concepts
Exactly-once
● Source must manage offsets (Kafka or Kinesis)
● Checkpointing and write-ahead logs
● Idempotent sink (transactional updates available in the file system sink)
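Why the sink must be idempotent: on failure, Structured Streaming replays the last batch from the checkpointed offsets, so the sink may receive the same batch twice. A sketch (with the invented name `IdempotentSink`) of how committing per batch id turns that redelivery into exactly-once output:

```scala
// Illustrative sketch: a sink that ignores batches it has already committed.
final class IdempotentSink {
  private var committed = Set.empty[Long]   // batch ids already written
  private var rows = Vector.empty[String]

  // Writing the same batch twice (e.g. replayed after a crash) has no effect.
  def write(batchId: Long, batch: Seq[String]): Unit =
    if (!committed(batchId)) {
      rows = rows ++ batch
      committed += batchId
    }

  def contents: Vector[String] = rows
}
```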
30. Stateful operations
Easy API with all functionalities
csvDF.groupByKey(...)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(mappingFunction)

def mappingFunction(key: K, values: Iterator[V], state: GroupState[S]): String = {
  if (state.hasTimedOut) { // If called when timing out, remove the state
    state.remove()
  } else if (state.exists) { // If state exists, use it for processing
    val existingState = state.get // Get the existing state
    val shouldRemove = ... // Decide whether to remove the state
    if (shouldRemove) {
      state.remove() // Remove the state
    } else {
      val newState = ...
      state.update(newState) // Set the new state
      state.setTimeoutDuration("1 hour") // Set the timeout
    }
  } else {
    val initialState = ...
    state.update(initialState) // Set the initial state
    state.setTimeoutDuration("1 hour") // Set the timeout
  }
}
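The template above can be made concrete without Spark. The sketch below is an invented, Spark-free analogue (`KeyedCounter`, `TimedOut`, and `Values` are illustrative names, not Spark APIs): it keeps a running count per key and removes the state on timeout, mirroring the branches of the mapping function.

```scala
// Spark-free sketch of the mapGroupsWithState branches (illustrative names).
sealed trait Event
case object TimedOut extends Event                  // analogue of state.hasTimedOut
final case class Values(vs: Seq[Int]) extends Event // a group's new values

final class KeyedCounter {
  private var state = Map.empty[String, Int] // analogue of GroupState[S] per key

  def mapGroup(key: String, event: Event): String = event match {
    case TimedOut =>                       // timed out: remove the state
      state -= key
      s"$key expired"
    case Values(vs) =>                     // existing or initial state: update it
      val total = state.getOrElse(key, 0) + vs.sum
      state += key -> total
      s"$key=$total"
  }
}
```

After a timeout the key starts again from its initial state, exactly as in the `state.remove()` branch above.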
31. “
If you want to build complex
aggregations, you should use this
API or the state functions in the
Streaming API
42. Contact me and check some examples in my GitHub:
● gserranojc@gmail.com
● www.linkedin.com/in/gserranojc
● https://github.com/compae/structured-streaming-tests
Spark documentation:
● https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html