Structured Streaming
For Machine Learning -
Why it sucks and how we’re working on it
kroszk@
Built with experimental APIs :)
Who am I?
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google focused on OSS Big Data
● Apache Spark PMC
● Contributor to a lot of other projects (including BEAM)
● Previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● Co-author of High Performance Spark & Learning Spark (+ more)
● Twitter: @holdenkarau
● Slideshare http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Github https://github.com/holdenk
● Related Spark Videos http://bit.ly/holdenSparkVideos
Quick side notes:
● In town for LinuxConf AU (come see me tomorrow @ 1:40 pm)
● My voice feels like shit today
○ If you can’t hear me let me know
○ If I take a break to have some tea, sorry!
● Depending on my voice I might just ask to do Q&A with e-mail later
What is going to be covered:
● Who I think y’all are
● What the fuck Spark is -- O’Reilly wouldn’t let me name chapter 1 this...
● Abridged Introduction to Datasets
● Abridged Introduction to Structured Streaming
● What Structured Streaming is and is not
● How to write simple structured streaming queries
● The “exciting” part: Building machine learning on top of structured streaming
● Possible future changes to make structured streaming & ML work together nicely
Torsten Reuschling
Who I think you wonderful humans are?
● Nice* people
● Don’t mind pictures of cats
● Possibly know some Apache Spark
● May or may not know the Dataset API
● Want to take advantage of Spark’s Structured Streaming
● May care about machine learning
● Possibly distracted by the new Zelda game?
ALPHA =~ Please don’t use this in production
We decided to change all the APIs again :p
Image by Mr Thinktank
What are Datasets?
● New in Spark 1.6 (comparatively old hat now)
● Provide a compile-time, strongly-typed version of DataFrames
● Make it easier to intermix functional & relational code
○ Do you hate writing UDFs? So do I!
● The basis of Structured Streaming (new in 2.0, with more changes in 2.3)
○ Still an experimental component (API will change in future versions)
Houser Wolf
Using Datasets to mix functional & relational:
val ds: Dataset[RawPanda] = ...
val happiness = ds.filter($"happy" === true).
select($"attributes"(0).as[Double]).
reduce((x, y) => x + y)
Sephiroty Magno Fiesta
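The ds above was elided on the slide; a minimal sketch of what it could look like (RawPanda's fields here are my assumptions, inferred from the queries shown):
case class RawPanda(id: Long, happy: Boolean, attributes: Array[Double])

import spark.implicits._
val ds: Dataset[RawPanda] = Seq(
  RawPanda(1L, happy = true, attributes = Array(0.5, -0.2)),
  RawPanda(2L, happy = false, attributes = Array(0.1, 0.3))
).toDS()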
So what was that?
ds.filter($"happy" === true).
select($"attributes"(0).as[Double]).
reduce((x, y) => x + y)
A typed query (specifies the return type). Without the as[] it would return a DataFrame (Dataset[Row]).
Traditional functional reduction: arbitrary Scala code :)
Robert Couse-Baker
And functional style maps:
/**
 * Functional map + Dataset, sums the positive attributes for the pandas
 */
def funMap(ds: Dataset[RawPanda]): Dataset[Double] = {
  ds.map{rp => rp.attributes.filter(_ > 0).sum}
}
And now we can use it for streaming too!
● Structured Streaming - new to Spark 2.0
○ Emphasis on new - be cautious when using
● Extends the Dataset & DataFrame APIs to represent continuous tables
● Still very early stages - but lots of really cool optimizations possible now
● We can build a machine learning pipeline with it together :)
○ Well we have to use some hacks - but ssssssh don’t tell TD
https://github.com/holdenk/spark-structured-streaming-ml
V2.0* Architecture
[diagram: a planner running on the microbatch thread repeatedly pulls from the streaming source, runs an incremental execution for each micro-batch, and writes to the streaming sink as time advances]
Rich Bowen
V2.3+* Architecture (SPARK-20928)
[diagram: instead of planning each micro-batch, the planner hands off to long-running tasks that continuously process new epochs from the streaming source to the streaming sink over time]
Rich Bowen
Aggregates: V2.0 API only for now?
abstract class UserDefinedAggregateFunction {
  def initialize(buffer: MutableAggregationBuffer): Unit
  def update(buffer: MutableAggregationBuffer, input: Row): Unit
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit
  def evaluate(buffer: Row): Any
}
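To make that API concrete, here's a sketch (mine, not from the talk) of a UDAF computing an average with a (count, sum) buffer - the same shape of per-key state the aggregation walk-through below uses:
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class AverageUDAF extends UserDefinedAggregateFunction {
  def inputSchema: StructType = new StructType().add("value", DoubleType)
  // Buffer holds the running (count, sum)
  def bufferSchema: StructType =
    new StructType().add("count", LongType).add("sum", DoubleType)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L   // count
    buffer(1) = 0.0  // sum
  }
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getLong(0) + 1L
      buffer(1) = buffer.getDouble(1) + input.getDouble(0)
    }
  }
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getDouble(1) + buffer2.getDouble(1)
  }
  def evaluate(buffer: Row): Any =
    if (buffer.getLong(0) == 0L) null
    else buffer.getDouble(1) / buffer.getLong(0)
}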
Get a streaming dataset
// Read a streaming dataframe
val schema = new StructType()
  .add("happiness", "double")
  .add("coffees", "integer")
val streamingDS = spark
  .readStream
  .schema(schema)
  .format("parquet")
  .load(path)
[diagram: Dataset (isStreaming = true) backed by a streaming source]
Build the recipe for each query
val happinessByCoffee = streamingDS
.groupBy($"coffees")
.agg(avg($"happiness"))
[diagram: Dataset (isStreaming = true) with an Aggregate node: groupBy = "coffees", expr = avg("happiness")]
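One way to actually materialize this recipe while poking at it (standard APIs; the console sink is handy for debugging, and aggregations need complete output mode):
val query = happinessByCoffee
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()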
Streaming Aggregation
[diagram: the batch physical plan (data → partial_avg → shuffle → avg) has stateful operators injected for the streaming microbatch physical plan: data → partial_avg → shuffle → avg → restore_state → avg → save_state, run against the sink once per trigger (t0, t1, t2)]
kitty.green66
Streaming Aggregation - Partial Average
[worked example: each executor pre-aggregates its (coffees, happiness) records into key -> (count, sum) partials; Executor 1 produces 0 -> (1, -150), 2 -> (1, 30), 8 -> (2, 350) and 0 -> (1, -100), 2 -> (3, 90); Executor 2 produces 0 -> (3, -350), 8 -> (1, 120) and 2 -> (3, 100), 8 -> (1, 100). The state store still holds the prior state: 0 -> (5, -800), 2 -> (3, 150), 8 -> (4, 450)]
Beatrice Murch
Streaming Aggregation - Shuffle
[worked example: the partials are shuffled so all updates for a key land together: 0 -> (1, -150), (3, -350), (1, -100); 2 -> (1, 30), (3, 100), (3, 90); 8 -> (1, 120), (1, 100), (2, 350). Prior state: 0 -> (5, -800), 2 -> (3, 150), 8 -> (4, 450)]
UnknownNet Photography
Streaming Aggregation - Merge Average
[worked example: the shuffled partials are merged per key into this batch's aggregate: 0 -> (5, -600), 2 -> (7, 220), 8 -> (4, 570). Prior state in the state store is still 0 -> (5, -800), 2 -> (3, 150), 8 -> (4, 450)]
elminium
Streaming Aggregation - Restore State
[worked example: the prior state is restored from the state store next to each batch aggregate: 0 -> (5, -600) alongside 0 -> (5, -800); 2 -> (7, 220) alongside 2 -> (3, 150); 8 -> (4, 570) alongside 8 -> (4, 450)]
elminium
Streaming Aggregation - Merge Average
[worked example: the batch aggregates and the restored state are merged into new totals: 0 -> (10, -1400), 2 -> (10, 370), 8 -> (8, 1020)]
elminium
Streaming Aggregation - Save State
[worked example: the merged totals 0 -> (10, -1400), 2 -> (10, 370), 8 -> (8, 1020) are written back to the state store, ready for the next micro-batch]
elminium
How to train a streaming ML model
1. Future: directly use structured streaming to create model streams via stateful aggregators
○ https://spark-summit.org/eu-2016/events/online-learning-with-structured-streaming/
2. Today: use the sink to collect model updates and store them on the driver
Stateful Aggregator
[diagram: a streaming aggregator physical plan - data → partial_agg → merge_agg → final_agg - extended with restore_initial and save_state: each partition (data1 … dataP) starts from the zero value or the saved state, partial aggregates are merged with the saved state into a new aggregate, and that new aggregate becomes the new model]
Pedro Ribeiro Simões
Stateful Aggregator - Restore initial state
[worked example: the current model weights [w_0^0, …, w_d^0] are read from the state store and handed to every partition; each partition on Executor 1 and Executor 2 holds its slice of labeled examples (y_0, x_0) … (y_{4k-1}, x_{4k-1})]
Stateful Aggregator - Partial Aggregation
[worked example: each partition folds its examples into its own copy of the weights, producing one partially-updated weight vector [w_0^0, …, w_d^0] per partition]
Stateful Aggregator - Merge Aggregators
[worked example: the per-partition models [w_0^{0,1}, …, w_d^{0,1}] … [w_0^{0,P}, …, w_d^{0,P}] are merged using a model combining scheme, e.g. weighted average]
Stateful Aggregator - Save State
[worked example: the combined model [w_0^1, …, w_d^1] is written back to the state store, becoming the initial state for the next micro-batch]
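To ground the shape of this, here's a sketch of a typed Aggregator that follows the partial_agg/merge_agg/final_agg structure above, minus restore_initial/save_state (which need engine support). WeightsBuf, ModelAverager, and the naive unweighted combine rule are my illustrative assumptions, not the talk's code:
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

case class WeightsBuf(count: Long, weights: Array[Double])

class ModelAverager(dim: Int)
    extends Aggregator[Array[Double], WeightsBuf, Array[Double]] {

  // zero: the state each partition starts from
  def zero: WeightsBuf = WeightsBuf(0L, Array.fill(dim)(0.0))

  // partial_agg: fold one record (here, a gradient vector) into the buffer
  def reduce(b: WeightsBuf, grad: Array[Double]): WeightsBuf = {
    var i = 0
    while (i < dim) { b.weights(i) += grad(i); i += 1 }
    WeightsBuf(b.count + 1, b.weights)
  }

  // merge_agg: combine partial buffers across partitions
  def merge(b1: WeightsBuf, b2: WeightsBuf): WeightsBuf = {
    var i = 0
    while (i < dim) { b1.weights(i) += b2.weights(i); i += 1 }
    WeightsBuf(b1.count + b2.count, b1.weights)
  }

  // final_agg: a simple unweighted model average
  def finish(b: WeightsBuf): Array[Double] =
    if (b.count == 0L) b.weights else b.weights.map(_ / b.count)

  def bufferEncoder: Encoder[WeightsBuf] = Encoders.product[WeightsBuf]
  // kryo keeps the sketch self-contained; a case class output would let
  // Spark use a columnar encoder instead
  def outputEncoder: Encoder[Array[Double]] = Encoders.kryo[Array[Double]]
}
Applied as new ModelAverager(dim).toColumn over a Dataset[Array[Double]] of per-record updates.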
Batch ML pipelines
[diagram: Pipeline stages Tokenizer → HashingTF → String Indexer → Naive Bayes; calling fit(df) on the Estimator pipeline produces a Transformer pipeline]
● In the batch setting, an estimator is trained on a dataset, and produces a static, immutable transformer.
● There is no communication between the two.
Streaming ML pipelines
[diagram: Tokenizer → HashingTF → Streaming String Indexer → Streaming Naive Bayes, with model updates flowing to a Model Sink and predictions to a Data sink]
Cool - let's build some ML with it!
Lauren Coolman
Streaming ML Pipelines (Proof of Concept)
[diagram: the pipeline Tokenizer → HashingTF → Streaming String Indexer → Streaming Naive Bayes appears twice - once as Estimator, once as a mutable Transformer - with fit(df) between them and state shared by the streaming stages]
● In this implementation, the estimator produces an initial transformer, and communicates updates to a specialized StreamingTransformer.
● Streaming transformers must provide a means of incorporating model updates into predictions.
Lauren Coolman
Streaming Estimator/Transformer (POC)
trait StreamingModel[S] extends Transformer {
  def update(updates: S): Unit
}
trait StreamingEstimator[S] extends Estimator {
  def model: StreamingModel[S]
  def update(batch: Dataset[_]): Unit
}
S: the sufficient statistics for model updates
BlinkenArea
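To make the flow concrete, a self-contained sketch that mirrors these traits without extending Spark ML's Transformer/Estimator (the names and the Double statistic are illustrative assumptions):
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions.sum

trait SimpleStreamingModel[S] {
  def update(updates: S): Unit
}

class RunningTotalModel extends SimpleStreamingModel[Double] {
  @volatile var total: Double = 0.0
  def update(updates: Double): Unit = { total += updates }
}

class RunningTotalEstimator(col: String) {
  val model = new RunningTotalModel
  // Called from the sink with each micro-batch: reduce to sufficient
  // statistics on the cluster, then merge into driver-side state.
  def update(batch: Dataset[Row]): Unit = {
    val row = batch.agg(sum(col)).head()
    if (!row.isNullAt(0)) model.update(row.getDouble(0))
  }
}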
Getting a micro-batch view with distributed collection*
case class ForeachDatasetSink(func: DataFrame => Unit) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    func(data)
  }
}
https://github.com/holdenk/spark-structured-streaming-ml
And doing some ML with it:
def evilTrain(df: DataFrame): StreamingQuery = {
  val sink = new ForeachDatasetSink({df: DataFrame => update(df)})
  val sparkSession = df.sparkSession
  val evilStreamingQueryManager = EvilStreamingQueryManager(sparkSession.streams)
  evilStreamingQueryManager.startQuery(
    Some("snb-train"),
    None,
    df,
    sink,
    OutputMode.Append())
}
And doing some ML with it:
def update(batch: Dataset[_]): Unit = {
  val newCountsByClass = add(batch) // aggregate the new batch
  model.update(newCountsByClass)    // merge with previous aggregates
}
And doing some ML with it* (algorithm specific)
def update(updates: Array[(Double, (Long, DenseVector))]): Unit = {
  updates.foreach { case (label, (numDocs, termCounts)) =>
    countsByClass.get(label) match {
      case Some((n, c)) =>
        // BLAS axpy: c += 1.0 * termCounts
        axpy(1.0, termCounts, c)
        countsByClass(label) = (n + numDocs, c)
      case None =>
        // new label encountered
        countsByClass += (label -> (numDocs, termCounts))
    }
  }
}
Non-Evil alternatives to our Evil:
● ForeachWriter exists
● Since everything runs on the executors it's difficult to update the model
● You could:
○ Use accumulators
○ Write the updates to Kafka (sketched below)
○ Send the updates to a param server of some type with RPC
○ Or do the evil things we did instead :)
● Wait for the “future?”: https://github.com/apache/spark/pull/15178
_torne
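A sketch of the Kafka option (my code, not the talk's; the topic name and String serialization are placeholders): each executor task publishes its serialized partial updates, and a separate driver-side consumer would merge them into the model.
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.ForeachWriter

class KafkaUpdateWriter(bootstrap: String, topic: String)
    extends ForeachWriter[String] {
  @transient private var producer: KafkaProducer[String, String] = _

  def open(partitionId: Long, version: Long): Boolean = {
    val props = new Properties()
    props.put("bootstrap.servers", bootstrap)
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
    true
  }

  def process(record: String): Unit = {
    producer.send(new ProducerRecord[String, String](topic, record))
  }

  def close(errorOrNull: Throwable): Unit = {
    if (producer != null) producer.close()
  }
}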
Working with the results - foreach (1 of 2)
val foreachWriter: ForeachWriter[T] = new ForeachWriter[T] {
  def open(partitionId: Long, version: Long): Boolean = {
    true // always open
  }
  def close(errorOrNull: Throwable): Unit = {
    // No close logic - if we wanted to copy updates per-batch
  }
  def process(record: T): Unit = {
    db.update(record)
  }
}
Working with the results - foreach (2 of 2)
// Apply foreach
happinessByCoffee.writeStream
  .outputMode(OutputMode.Complete())
  .foreach(foreachWriter)
  .start()
Structured Streaming in Review:
● Pre-2.3 Structured Streaming still uses Spark’s micro-batch approach
● 2.3 forward: new execution engine! Yes, this breaks everything
● One of the areas that Matei is researching
○ researching =~ future, research !~ today
Windell Oskay
Ok but where can we not use it?
● A lot of random methods on DataFrames & Datasets won’t work
● They will fail at runtime rather than compile time - so have tests!
● Anything which round-trips through an rdd() is going to be pretty sad (aka fail)
○ Lots of internals randomly do (like toJson) for historical reasons
● Need to run a query inside of a sink? That is not going to work
● Need a complex receiver type? Many receivers are not ported yet
● Also you will need distinct query names - even if you stop the previous query
● Aggregations and Append output mode (and the file sink only supports Append)
● DataFrame/Dataset transformations inside of a sink
Open questions for ML pipelines
● How to train and predict simultaneously, on the same data?
○ Transform thread should be executed first
○ Do we actually need to support this, or is this just a common demo?
● How to ensure robustness to failures?
○ Treat the output of training as a stream of models, with the same robustness guarantees as any structured streaming query
○ Work based on this approach has already been prototyped
● Model training must be idempotent - should not train on the same data twice
○ Leverage the batch ID, similar to `FileStreamSink` (see the sketch below)
● How to extend MLWritable for streaming
○ Spark’s format isn’t really all that useful - maybe PMML or PFA
Photo by
bullet101
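A sketch of that batch-ID trick, using the same (internal, evil) Sink API as ForeachDatasetSink above; in a real version the last-seen ID would be persisted rather than kept in memory:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

class IdempotentTrainingSink(train: DataFrame => Unit) extends Sink {
  @volatile private var lastBatchId: Long = -1L
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    if (batchId > lastBatchId) { // replayed batches re-arrive with old IDs
      train(data)
      lastBatchId = batchId
    }
  }
}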
Structured Streaming ML vs DStreams ML
What could be different for ML on structured streaming vs ML on DStreams?
● Structured streaming is built on the Spark SQL engine
○ Catalyst optimizer
○ Project Tungsten
● Pipeline integration
○ ML pipelines have been improved and iterated across 5 releases; we can leverage their mature design for streaming pipelines
○ This will make adding and working with new algorithms much easier than in the past
● Event time handling (see the sketch below)
○ Streaming ML algorithms typically use a decay factor
○ Structured streaming provides native support for event time, which is more appropriate for decay
Krzysztof Belczyński
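For instance, a sketch of event-time windowing on the running example (assuming the schema also had an eventTime timestamp column, which is my addition): the watermark bounds how much state is kept, which is the natural place to hang decay.
import org.apache.spark.sql.functions.{avg, window}

val decayedHappiness = streamingDS
  .withWatermark("eventTime", "30 minutes") // drop state for very late data
  .groupBy(window($"eventTime", "10 minutes"), $"coffees")
  .agg(avg($"happiness"))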
Batch vs Streaming Pipelines (Draft POC API)
// Batch:
val df = spark
  .read
  .schema(schema)
  .parquet(path)
val tokenizer = new RegexTokenizer()
val htf = new HashingTF()
val nb = new NaiveBayes()
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, htf, nb))
val pipelineModel = pipeline.fit(df)

// Streaming (draft POC API):
val df = spark
  .readStream
  .schema(schema)
  .parquet(path)
val tokenizer = new RegexTokenizer()
val htf = new HashingTF()
val snb = new StreamingNaiveBayes()
val pipeline = new StreamingPipeline()
  .setStages(Array(tokenizer, htf, snb))
  .setCheckpointLocation(path)
val query = pipeline.fitStreaming(df)
query.awaitTermination()
https://github.com/sethah/spark/tree/structured-streaming-fun
Additional Spark Resources
● Programming guide (along with JavaDoc, PyDoc,
ScalaDoc, etc.)
○ http://spark.apache.org/docs/latest/
● Books
● Videos
● Spark Office Hours
○ Normally in the bay area - will do Google Hangouts ones soon
○ follow me on twitter for future ones - https://twitter.com/holdenkarau
Structured Streaming Resources
● Programming guide (along with JavaDoc, PyDoc,
ScalaDoc, etc.)
○ http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
● https://github.com/holdenk/spark-structured-streaming-ml
● TD: https://spark-summit.org/2016/events/a-deep-dive-into-structured-streaming/
Surveys!!!!!!!! :D
● Interested in Structured Streaming?
○ http://bit.ly/structuredStreamingML - Let us know your thoughts
● Pssst: Care about Python DataFrame UDF Performance?
○ http://bit.ly/pySparkUDF
● Care about Spark Testing?
○ http://bit.ly/holdenTestingSpark
● Want to give me feedback on this talk?
○ http://bit.ly/holdenTalkFeedback
Michael Himbeault
And some upcoming talks:
● Jan
○ If interest tomorrow: Office Hours? Tweet me @holdenkarau
○ LinuxConf AU - tomorrow!
○ Data Day Texas - kind of far from Sydney but….
● Feb
○ FOSDEM - One on testing one on scaling
○ JFokus in Stockholm - Adding deep learning to Spark
○ I disappear for a week and pretend computers work
● March
○ Strata San Jose - Big Data Beyond the JVM
Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Coming soon: Spark in Action
High Performance Spark
Learning PySpark
High Performance Spark!
The gift of whichever holiday season is next!
Cats love it!**
You can buy it from that scrappy Seattle bookstore, Jeff
Bezos needs another newspaper and I want a cup of
coffee.
http://bit.ly/hkHighPerfSpark
Cat wave photo by Quinn Dombrowski
k thnx bye!
If you <3 testing & want to fill out
survey: http://bit.ly/holdenTestingSpark
Want to tell me (and or my boss) how
I’m doing?
http://bit.ly/holdenTalkFeedback
Want to e-mail me? Promise not to be
creepy? Ok:
holden@pigscanfly.ca
k thnx bye!
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
Will tweet results
“eventually” @holdenkarau
Any PySpark Users: Have some
simple UDFs you wish ran faster
you are willing to share?:
http://bit.ly/pySparkUDF
Pssst: Have feedback on the presentation? Give me a
shout (holden@pigscanfly.ca) if you feel comfortable doing
so :)
Appendix
Start a continuous query
import scala.concurrent.duration._
import org.apache.spark.sql.streaming.ProcessingTime

val query = happinessByCoffee
  .writeStream
  .format("parquet")
  .outputMode("complete")
  .trigger(ProcessingTime(5.seconds))
  .start()
[diagram: StreamingQuery with logicalPlan = source relation → groupBy → avg]
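Once started, the query runs on its own thread; a standard follow-up is to block on it (or stop it explicitly):
query.awaitTermination() // block until the query ends
// ...or later: query.stop()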
Launch a new thread to listen for new data
[diagram: the StreamingQuery (logicalPlan = source relation → groupBy → avg) spawns a MicroBatch Thread in the Listening state; the Source tracks Available Offsets and the Sink tracks Committed Offsets, both empty so far]
Neil Falzon
Write new offsets to WAL
[diagram: the MicroBatch Thread commits the newly Available Offsets 0, 1, 2 to the write-ahead log; Committed Offsets are still empty]
April Weeks
Check the source for new offsets
[diagram: the MicroBatch Thread calls getBatch() on the Source for batchId=42, covering offsets 0, 1, 2]
cat-observer
Get the “recipe” for this micro batch
[diagram: the logical plan (source relation → groupBy → avg) is transformed into a batchId=42 plan that scans the source for offsets 0, 1, 2, then runs groupBy → avg]
Jackie
Send the micro batch Dataset to the sink
[diagram: the MicroBatch Thread calls addBatch() on the Sink with the batchId=42 MicroBatch Dataset (isStreaming = false), backed by an incremental execution plan]
Jason Rojas
Commit and listen again
[diagram: offsets 0, 1, 2 move into Committed Offsets and the MicroBatch Thread goes back to Listening]
S Orchard
Execution Summary
● Each query has its own thread - asynchronous
● Sources must be replayable
● Use write-ahead-logs for durability
● Sinks must be idempotent
● Each batch is executed with an incremental execution plan
● Sinks get a micro batch view of the data
snaxor
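Side note: you can watch all of this happen from the driver with the standard StreamingQuery monitoring APIs (available since roughly Spark 2.1):
println(query.status)       // is a trigger active? waiting for data?
println(query.lastProgress) // stats for the most recent micro-batch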
Cool - let's build some ML with it!
Lauren Coolman
Get a dataframe
val schema = new StructType()
  .add("happiness", "double")
  .add("coffees", "integer")
val batchDS = spark
  .read
  .schema(schema)
  .format("parquet")
  .load(path)
[diagram: Dataset (isStreaming = false) backed by a data source]
Build the recipe for each query
val happinessByCoffee = batchDS
.groupBy($"coffees")
.agg(avg($"happiness"))
[diagram: Dataset (isStreaming = false) with an Aggregate node: groupBy = "coffees", expr = avg("happiness")]
Batch Aggregation - Partial Average
[worked example: each executor pre-aggregates its (coffees, happiness) records into key -> (count, sum) partials; Executor 1 produces 0 -> (1, -150), 2 -> (1, 30), 8 -> (2, 350) and 0 -> (1, -150), 2 -> (3, 90); Executor 2 produces 0 -> (3, -350), 8 -> (1, 120) and 2 -> (3, 100), 8 -> (1, 100)]
Batch Aggregation - Shuffle
[worked example: partials are shuffled so each key's updates are co-located: 0 -> (1, -150), (3, -350), (1, -150); 2 -> (1, 30), (3, 100), (3, 90); 8 -> (1, 120), (1, 100), (2, 350)]
Batch Aggregation - Final Average
[worked example: the shuffled partials merge into the final per-key (count, sum) results: 0 -> (5, -650), 2 -> (7, 220), 8 -> (4, 570)]
More Related Content

What's hot

Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017Holden Karau
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMHolden Karau
 
Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016Holden Karau
 
Introduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopIntroduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopHolden Karau
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesHolden Karau
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastHolden Karau
 
A fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFsA fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFsHolden Karau
 
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016Holden Karau
 
Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Holden Karau
 
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark MeetupBeyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark MeetupHolden Karau
 
Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016Holden Karau
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauSpark Summit
 
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015Holden Karau
 
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...Holden Karau
 
Debugging PySpark - PyCon US 2018
Debugging PySpark -  PyCon US 2018Debugging PySpark -  PyCon US 2018
Debugging PySpark - PyCon US 2018Holden Karau
 
Beyond shuffling - Scala Days Berlin 2016
Beyond shuffling - Scala Days Berlin 2016Beyond shuffling - Scala Days Berlin 2016
Beyond shuffling - Scala Days Berlin 2016Holden Karau
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Holden Karau
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Holden Karau
 
Holden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom ModelsHolden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom Modelssparktc
 
Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016Holden Karau
 

What's hot (20)

Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVM
 
Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016
 
Introduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopIntroduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines Workshop
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
 
A fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFsA fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFs
 
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
 
Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016
 
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark MeetupBeyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
 
Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden Karau
 
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
 
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
 
Debugging PySpark - PyCon US 2018
Debugging PySpark -  PyCon US 2018Debugging PySpark -  PyCon US 2018
Debugging PySpark - PyCon US 2018
 
Beyond shuffling - Scala Days Berlin 2016
Beyond shuffling - Scala Days Berlin 2016Beyond shuffling - Scala Days Berlin 2016
Beyond shuffling - Scala Days Berlin 2016
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018
 
Holden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom ModelsHolden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom Models
 
Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016
 

Similar to Streaming ML on Spark: Deprecated, experimental and internal ap is galore!

Structured streaming for machine learning
Structured streaming for machine learningStructured streaming for machine learning
Structured streaming for machine learningSeth Hendrickson
 
Presto anatomy
Presto anatomyPresto anatomy
Presto anatomyDongmin Yu
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Michael Rys
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Databricks
 
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...confluent
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Databricks
 
Best Practices in Handling Performance Issues
Best Practices in Handling Performance IssuesBest Practices in Handling Performance Issues
Best Practices in Handling Performance IssuesOdoo
 
[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기NAVER D2
 
Spark streaming with kafka
Spark streaming with kafkaSpark streaming with kafka
Spark streaming with kafkaDori Waldman
 
Spark stream - Kafka
Spark stream - Kafka Spark stream - Kafka
Spark stream - Kafka Dori Waldman
 
OpenStack API's and WSGI
OpenStack API's and WSGIOpenStack API's and WSGI
OpenStack API's and WSGIMike Pittaro
 
PHP applications/environments monitoring: APM & Pinba
PHP applications/environments monitoring: APM & PinbaPHP applications/environments monitoring: APM & Pinba
PHP applications/environments monitoring: APM & PinbaPatrick Allaert
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
 
How to measure everything - a million metrics per second with minimal develop...
How to measure everything - a million metrics per second with minimal develop...How to measure everything - a million metrics per second with minimal develop...
How to measure everything - a million metrics per second with minimal develop...Jos Boumans
 
Alexander Mostovenko "'Devide at impera' with GraphQL and SSR"
Alexander Mostovenko "'Devide at impera' with GraphQL and SSR"Alexander Mostovenko "'Devide at impera' with GraphQL and SSR"
Alexander Mostovenko "'Devide at impera' with GraphQL and SSR"Fwdays
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Monitoring Spark Applications
Monitoring Spark ApplicationsMonitoring Spark Applications
Monitoring Spark ApplicationsTzach Zohar
 
Golang Performance : microbenchmarks, profilers, and a war story
Golang Performance : microbenchmarks, profilers, and a war storyGolang Performance : microbenchmarks, profilers, and a war story
Golang Performance : microbenchmarks, profilers, and a war storyAerospike
 
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...Jason Dai
 

Similar to Streaming ML on Spark: Deprecated, experimental and internal ap is galore! (20)

Structured streaming for machine learning
Structured streaming for machine learningStructured streaming for machine learning
Structured streaming for machine learning
 
Presto anatomy
Presto anatomyPresto anatomy
Presto anatomy
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
 
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
 
Best Practices in Handling Performance Issues
Best Practices in Handling Performance IssuesBest Practices in Handling Performance Issues
Best Practices in Handling Performance Issues
 
[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기
 
Spark streaming with kafka
Spark streaming with kafkaSpark streaming with kafka
Spark streaming with kafka
 
Spark stream - Kafka
Spark stream - Kafka Spark stream - Kafka
Spark stream - Kafka
 
OpenStack API's and WSGI
OpenStack API's and WSGIOpenStack API's and WSGI
OpenStack API's and WSGI
 
PHP applications/environments monitoring: APM & Pinba
PHP applications/environments monitoring: APM & PinbaPHP applications/environments monitoring: APM & Pinba
PHP applications/environments monitoring: APM & Pinba
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
How to measure everything - a million metrics per second with minimal develop...
How to measure everything - a million metrics per second with minimal develop...How to measure everything - a million metrics per second with minimal develop...
How to measure everything - a million metrics per second with minimal develop...
 
Alexander Mostovenko "'Devide at impera' with GraphQL and SSR"
Alexander Mostovenko "'Devide at impera' with GraphQL and SSR"Alexander Mostovenko "'Devide at impera' with GraphQL and SSR"
Alexander Mostovenko "'Devide at impera' with GraphQL and SSR"
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Monitoring Spark Applications
Monitoring Spark ApplicationsMonitoring Spark Applications
Monitoring Spark Applications
 
Golang Performance : microbenchmarks, profilers, and a war story
Golang Performance : microbenchmarks, profilers, and a war storyGolang Performance : microbenchmarks, profilers, and a war story
Golang Performance : microbenchmarks, profilers, and a war story
 
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
 

Recently uploaded

Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.soniya singh
 
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...Delhi Call girls
 
VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...
VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...
VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...SUHANI PANDEY
 
Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.soniya singh
 
Russian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl ServiceRussian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl Servicegwenoracqe6
 
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge GraphsEleniIlkou
 
Real Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirtReal Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirtrahman018755
 
Al Barsha Night Partner +0567686026 Call Girls Dubai
Al Barsha Night Partner +0567686026 Call Girls  DubaiAl Barsha Night Partner +0567686026 Call Girls  Dubai
Al Barsha Night Partner +0567686026 Call Girls DubaiEscorts Call Girls
 
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024APNIC
 
Hot Call Girls |Delhi |Hauz Khas ☎ 9711199171 Book Your One night Stand
Hot Call Girls |Delhi |Hauz Khas ☎ 9711199171 Book Your One night StandHot Call Girls |Delhi |Hauz Khas ☎ 9711199171 Book Your One night Stand
Hot Call Girls |Delhi |Hauz Khas ☎ 9711199171 Book Your One night Standkumarajju5765
 
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...Sheetaleventcompany
 
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...SUHANI PANDEY
 
(+971568250507 ))# Young Call Girls in Ajman By Pakistani Call Girls in ...
(+971568250507  ))#  Young Call Girls  in Ajman  By Pakistani Call Girls  in ...(+971568250507  ))#  Young Call Girls  in Ajman  By Pakistani Call Girls  in ...
(+971568250507 ))# Young Call Girls in Ajman By Pakistani Call Girls in ...Escorts Call Girls
 

Recently uploaded (20)

Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.
 
Low Sexy Call Girls In Mohali 9053900678 🥵Have Save And Good Place 🥵
Low Sexy Call Girls In Mohali 9053900678 🥵Have Save And Good Place 🥵Low Sexy Call Girls In Mohali 9053900678 🥵Have Save And Good Place 🥵
Low Sexy Call Girls In Mohali 9053900678 🥵Have Save And Good Place 🥵
 
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
 
VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...
VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...
VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...
 
Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.
 
Russian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl ServiceRussian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl Service
 
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
 
Real Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirtReal Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirt
 
Al Barsha Night Partner +0567686026 Call Girls Dubai
Al Barsha Night Partner +0567686026 Call Girls  DubaiAl Barsha Night Partner +0567686026 Call Girls  Dubai
Al Barsha Night Partner +0567686026 Call Girls Dubai
 
Russian Call Girls in %(+971524965298 )# Call Girls in Dubai
Russian Call Girls in %(+971524965298  )#  Call Girls in DubaiRussian Call Girls in %(+971524965298  )#  Call Girls in Dubai
Russian Call Girls in %(+971524965298 )# Call Girls in Dubai
 
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
 
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
 
Hot Call Girls |Delhi |Hauz Khas ☎ 9711199171 Book Your One night Stand
Hot Call Girls |Delhi |Hauz Khas ☎ 9711199171 Book Your One night StandHot Call Girls |Delhi |Hauz Khas ☎ 9711199171 Book Your One night Stand
Hot Call Girls |Delhi |Hauz Khas ☎ 9711199171 Book Your One night Stand
 
6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...
6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...
6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...
 
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝
 
(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7
(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7
(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7
 
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
 
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
 
(+971568250507 ))# Young Call Girls in Ajman By Pakistani Call Girls in ...
(+971568250507  ))#  Young Call Girls  in Ajman  By Pakistani Call Girls  in ...(+971568250507  ))#  Young Call Girls  in Ajman  By Pakistani Call Girls  in ...
(+971568250507 ))# Young Call Girls in Ajman By Pakistani Call Girls in ...
 

Streaming ML on Spark: Deprecated, experimental and internal ap is galore!

  • 1. Structured Streaming For machine Learning - Why it sucks and how we’re working on it kroszk@ Built with experimental APIs :)
  • 3. Who am I? ● My name is Holden Karau ● Prefered pronouns are she/her ● Developer Advocate at Google focused on OSS Big Data ● Apache Spark PMC ● Contributor to a lot of other projects (including BEAM) ● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon ● co-author of High Performance Spark & Learning Spark (+ more) ● Twitter: @holdenkarau ● Slideshare http://www.slideshare.net/hkarau ● Linkedin https://www.linkedin.com/in/holdenkarau ● Github https://github.com/holdenk ● Related Spark Videos http://bit.ly/holdenSparkVideos
  • 4.
  • 5. Quick side notes: ● In town for LinuxConf AU (come see me tomorrow @ 1:40 pm) ● My voice feels like shit today ○ If you can’t hear me let me know ○ If I take a break to have some tea sorry! ● Depending on my voice I might just ask to do Q&A with e-mail later
  • 6. What is going to be covered: ● Who I think y’all are ● What the fuck Spark is -- O’Reilly wouldn’t let me name a chapter 1 this... ● Abridged Introduction to Datasets ● Abridged Introduction to Structured Streaming ● What Structured Streaming is and is not ● How to write simple structured streaming queries ● The “exciting” part: Building machine learning on top of structured streaming ● Possible future changes to make structured streaming & ML work together nicely Torsten Reuschling
  • 7. Who I think you wonderful humans are? ● Nice* people ● Don’t mind pictures of cats ● Possibly Know some Apache Spark ● May or may not know the Dataset API ● Want to take advantage of Spark’s Structured Streaming ● May care about machine learning ● Possible distracted with the new Zelda game?
  • 8. ALPHA =~ Please don’t use this in production We decided to change all the APIs again :p Image by Mr Thinktank
  • 9. What are Datasets? ● New in Spark 1.6 (comparatively old hat now) ● Provide compile time strongly typed version of DataFrames ● Make it easier to intermix functional & relational code ○ Do you hate writing UDFS? So do I! ● The basis of the Structured Streaming (new in 2.0 with more changes in 2.3) ○ Still an experimental component (API will change in future versions) Houser Wolf
  • 10. Using Datasets to mix functional & relational: val ds: Dataset[RawPanda] = ... val happiness = ds.filter($"happy" === true). select($"attributes"(0).as[Double]). reduce((x, y) => x + y) Sephiroty Magno Fiesta
  • 11. So what was that? ds.filter($"happy" === true). select($"attributes"(0).as[Double]). reduce((x, y) => x + y) A typed query (specifies the return type). Without the as[] will return a DataFrame (Dataset[Row]) Traditional functional reduction: arbitrary scala code :) Robert Couse-Baker
  • 12. And functional style maps: /** * Functional map + Dataset, sums the positive attributes for the pandas */ def funMap(ds: Dataset[RawPanda]): Dataset[Double] = { ds.map{rp => rp.attributes.filter(_ > 0).sum} }
  • 13. And now we can use it for streaming too! ● StructuredStreaming - new to Spark 2.0 ○ Emphasis on new - be cautious when using ● Extends the Dataset & DataFrame APIs to represent continuous tables ● Still very early stages - but lots of really cool optimizations possible now ● We can build a machine learning pipeline with it together :) ○ Well we have to use some hacks - but ssssssh don’t tell TD https://github.com/holdenk/spark-structured-streaming-ml
  • 16. Aggregates: V2.0 API only for now? abstract class UserDefinedAggregateFunction { def initialize(buffer: MutableAggregationBuffer): Unit def update(buffer: MutableAggregationBuffer, input: Row): Unit def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit def evaluate(buffer: Row): Any }
  • 17. Get a streaming dataset // Read a streaming dataframe val schema = new StructType() .add("happiness", "double") .add("coffees", "integer") val streamingDS = spark .readStream .schema(schema) .format(“parquet”) .load(path) Dataset isStreaming = true streaming source
  • 18. Build the recipe for each query val happinessByCoffee = streamingDS .groupBy($"coffees") .agg(avg($"happiness")) Dataset isStreaming = true streaming source Aggregate groupBy = “coffees” expr = avg(“happiness”)
  • 19. Streaming Aggregation data partial_avg avg shuffle Inject stateful operators Batch physical plan Streaming microbatch physical plan data partial_avg restore_state shuffle avg avg save_state sink sink sink t0 t1 t2 kitty.green66
  • 20. Streaming Aggregation - Partial Average (0, -150) (2, 30) (8, 200) (8, 150) (0, -100) (2, 30) (2, 10) (2, 50) 0 -> (1, -150) 2 -> (1, 30) 8 -> (2, 350) 0 -> (1, -100) 2 -> (3, 90) (0, -150) (0, 0) (0, -200) (8, 120) (8, 100) (2, 20) (2, 20) (2, 60) 0 -> (3, -350) 8 -> (1, 120) 2 -> (3, 100) 8 -> (1, 100) data partial_avg restore_state shuffle avg avg save_state 0 -> (5, -800) 2 -> (3, 150) 8 -> (4, 450) 0 -> (5, -800) 2 -> (3, 150) 8 -> (4, 450) Executor 1 Executor 2 Data source State store Beatrice Murch
  • 21. Streaming Aggregation - Shuffle (0, -150) (2, 30) (8, 200) (8, 150) (0, -100) (2, 30) (2, 10) (2, 50) (0, -150) (0, 0) (0, -200) (8, 120) (8, 100) (2, 20) (2, 20) (2, 60) data partial_avg restore_state shuffle avg avg save_state 0 -> (5, -800) 2 -> (3, 150) 8 -> (4, 450) 0 -> (1, -150) 0 -> (3, -350) 0 -> (1, -100) 2 -> (1, 30) 2 -> (3, 100) 2 -> (3, 90) 8 -> (1, 120) 8 -> (1, 100) 8 -> (2, 350) Executor 1 Executor 2 Data source 0 -> (5, -800) 2 -> (3, 150) 8 -> (4, 450) State store UnknownNet Photography
  • 22. Streaming Aggregation - Merge Average (0, -150) (2, 30) (8, 200) (8, 150) (0, -100) (2, 30) (2, 10) (2, 50) (0, -150) (0, 0) (0, -200) (8, 120) (8, 100) (2, 20) (2, 20) (2, 60) data partial_avg restore_state shuffle avg avg save_state 0 -> (5, -800) 2 -> (3, 150) 8 -> (4, 450) 0 -> (5, -600) 2 -> (7, 220) 8 -> (4, 570) Executor 1 Executor 2 Data source 0 -> (5, -800) 2 -> (3, 150) 8 -> (4, 450) State store elminium
  • 23. Streaming Aggregation - Restore State (0, -150) (2, 30) (8, 200) (8, 150) (0, -100) (2, 30) (2, 10) (2, 50) (0, -150) (0, 0) (0, -200) (8, 120) (8, 100) (2, 20) (2, 20) (2, 60) data partial_avg restore_state shuffle avg avg save_state 0 -> (5, -800) 2 -> (3, 150) 8 -> (4, 450) 0 -> (5, -600) 0 -> (5, -800) 2 -> (7, 220) 2 -> (3, 150) 8 -> (4, 570) 8 -> (4, 450) Executor 1 Executor 2 Data source 0 -> (5, -800) 2 -> (3, 150) 8 -> (4, 450) State store elminium
  • 24. Streaming Aggregation - Merge Average (0, -150) (2, 30) (8, 200) (8, 150) (0, -100) (2, 30) (2, 10) (2, 50) (0, -150) (0, 0) (0, -200) (8, 120) (8, 100) (2, 20) (2, 20) (2, 60) data partial_avg restore_state shuffle avg avg save_state 0 -> (5, -800) 2 -> (3, 150) 8 -> (4, 450) 0 -> (10, -1400) 2 -> (10, 370) 8 -> (8, 1020) 0 -> (5, -800) 2 -> (3, 150) 8 -> (4, 450) State store Executor 1 Executor 2 Data source elminium
  • 25. Streaming Aggregation - Save State (0, -150) (2, 30) (8, 200) (8, 150) (0, -100) (2, 30) (2, 10) (2, 50) (0, -150) (0, 0) (0, -200) (8, 120) (8, 100) (2, 20) (2, 20) (2, 60) data partial_avg restore_state shuffle avg avg save_state 0 -> (10, -1400) 2 -> (10, 370) 8 -> (8, 1020) 0 -> (10, -1400) 2 -> (10, 370) 8 -> (8, 1020) 0 -> (10, -1400) 2 -> (10, 370) 8 -> (8, 1020) State store Executor 1 Executor 2 Data source elminium
• 26. How to train a streaming ML model
1. Future: directly use structured streaming to create model streams via stateful aggregators
○ https://spark-summit.org/eu-2016/events/online-learning-with-structured-streaming/
2. Today: use the sink to collect model updates and store them on the driver
• 27. Stateful Aggregator
(diagram: the streaming aggregator plan data → partial_agg → merge_agg → final_agg starts from zero; the stateful aggregator plan instead uses restore_initial to pull the saved state, partially aggregates partitions data1 ... dataP, merges them into a new aggregate - the new model - and persists it with save_state) Pedro Ribeiro Simões
• 28. Stateful Aggregator - Restore initial state
(diagram: the saved weight vector [w0, …, wd] is restored from the state store and distributed to each partition of labeled examples (y, x))
• 29. Stateful Aggregator - Partial Aggregation
(diagram: each partition trains on its slice of (y, x) examples, producing a partial model per partition)
• 30. Stateful Aggregator - Merge Aggregators
(diagram: the per-partition models are combined with a model combining scheme, e.g. weighted average)
• 31. Stateful Aggregator - Save State
(diagram: the merged model's weight vector is saved back to the state store for the next batch)
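A hedged sketch of the merge step above: per-partition models combined by a weighted average of their weight vectors (PartialModel and the weighting are illustrative; the POC could use a different combining scheme):

case class PartialModel(numExamples: Long, weights: Array[Double]) {
  def merge(other: PartialModel): PartialModel = {
    val total = numExamples + other.numExamples
    // Weight each partition's model by how many examples it saw
    val combined = weights.zip(other.weights).map { case (a, b) =>
      (a * numExamples + b * other.numExamples) / total
    }
    PartialModel(total, combined)
  }
}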
• 32. Batch ML pipelines
(diagram: a batch pipeline Tokenizer → HashingTF → String Indexer → Naive Bayes; fit(df) turns the Estimator into a Transformer)
● In the batch setting, an estimator is trained on a dataset, and produces a static, immutable transformer.
● There is no communication between the two.
• 33. Streaming ML pipelines
(diagram: a streaming pipeline Tokenizer → HashingTF → Streaming String Indexer → Streaming Naive Bayes, with model updates flowing to a Model Sink and predictions to a Data sink)
• 34. Cool - let's build some ML with it! Lauren Coolman
• 35. Streaming ML Pipelines (Proof of Concept)
(diagram: Tokenizer → HashingTF → Streaming String Indexer → Streaming Naive Bayes; fit(df) produces a mutable Transformer that shares state with its Estimator)
● In this implementation, the estimator produces an initial transformer, and communicates updates to a specialized StreamingTransformer.
● Streaming transformers must provide a means of incorporating model updates into predictions.
Lauren Coolman
• 36. Streaming Estimator/Transformer (POC)
trait StreamingModel[S] extends Transformer {
  def update(updates: S): Unit
}
trait StreamingEstimator[S] extends Estimator {
  def model: StreamingModel[S]
  def update(batch: Dataset[_]): Unit
}
(S carries the sufficient statistics for model updates) BlinkenArea
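To make the contract concrete, a toy standalone sketch (deliberately not extending Spark's Transformer, so it stays self-contained): the sufficient statistics S are a (count, sum) pair, and update folds them into a running mean:

class StreamingMean {
  private var count = 0L
  private var sum = 0.0
  // S = (number of examples in the batch, sum of their values)
  def update(updates: (Long, Double)): Unit = {
    count += updates._1
    sum += updates._2
  }
  def predict: Double = if (count == 0) 0.0 else sum / count
}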
• 37. Getting a micro-batch view with distributed collection*
case class ForeachDatasetSink(func: DataFrame => Unit) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    func(data)
  }
}
https://github.com/holdenk/spark-structured-streaming-ml
• 38. And doing some ML with it:
def evilTrain(df: DataFrame): StreamingQuery = {
  val sink = new ForeachDatasetSink({df: DataFrame => update(df)})
  val sparkSession = df.sparkSession
  val evilStreamingQueryManager = EvilStreamingQueryManager(sparkSession.streams)
  evilStreamingQueryManager.startQuery(
    Some("snb-train"),
    None,
    df,
    sink,
    OutputMode.Append())
}
• 39. And doing some ML with it:
def update(batch: Dataset[_]): Unit = {
  val newCountsByClass = add(batch) // Aggregate the new batch
  model.update(newCountsByClass)    // Merge with previous aggregates
}
• 40. And doing some ML with it* (Algorithm specific)
def update(updates: Array[(Double, (Long, DenseVector))]): Unit = {
  updates.foreach { case (label, (numDocs, termCounts)) =>
    countsByClass.get(label) match {
      case Some((n, c)) =>
        axpy(1.0, termCounts, c)
        countsByClass(label) = (n + numDocs, c)
      case None =>
        // new label encountered
        countsByClass += (label -> (numDocs, termCounts))
    }
  }
}
• 41. Non-Evil alternatives to our Evil:
● ForeachWriter exists
● Since everything runs on the executors it's difficult to update the model
● You could:
○ Use accumulators
○ Write the updates to Kafka
○ Send the updates to a param server of some type with RPC
○ Or do the evil things we did instead :)
● Wait for the "future?": https://github.com/apache/spark/pull/15178
_torne
• 42. Working with the results - foreach (1 of 2)
val foreachWriter: ForeachWriter[T] = new ForeachWriter[T] {
  def open(partitionId: Long, version: Long): Boolean = {
    true // always open
  }
  def close(errorOrNull: Throwable): Unit = {
    // No close logic - if we wanted to copy updates per-batch
  }
  def process(record: T): Unit = {
    db.update(record)
  }
}
• 43. Working with the results - foreach (2 of 2)
// Apply foreach
happinessByCoffee.writeStream
  .outputMode(OutputMode.Complete())
  .foreach(foreachWriter)
  .start()
• 44. Structured Streaming in Review:
● Pre-2.3 Structured Streaming still uses Spark's micro-batch approach
● 2.3 forward: new execution engine! Yes, this breaks everything
● One of the areas that Matei is researching
○ Researching ==~ future, research !~ today
Windell Oskay
• 45. Ok but where can we not use it?
● A lot of random methods on DataFrames & Datasets won't work
● They will fail at runtime rather than compile time - so have tests! (see the sketch after this list)
● Anything which round-trips through an rdd() is going to be pretty sad (aka fail)
○ Lots of internals randomly do (like toJson) for historical reasons
● Need to run a query inside of a sink? That is not going to work
● Need a complex receiver type? Many receivers are not ported yet
● Also you will need distinct query names - even if you stop the previous query
● Aggregations don't work with Append output mode (and the file sink only supports Append)
● DataFrame/Dataset transformations inside of a sink won't work either
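A sketch of the runtime-failure point, assuming the streamingDS from earlier: round-tripping through an RDD on a streaming Dataset compiles fine and only blows up when the plan is checked, which is exactly why tests matter:

import org.apache.spark.sql.AnalysisException
try {
  streamingDS.rdd // compiles, but is unsupported on a streaming Dataset
} catch {
  case e: AnalysisException =>
    println(s"Caught at runtime, not compile time: ${e.getMessage}")
}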
• 46. Open questions for ML pipelines
● How to train and predict simultaneously, on the same data?
○ Transform thread should be executed first
○ Do we actually need to support this or is this just a common demo?
● How to ensure robustness to failures?
○ Treat the output of training as a stream of models, with the same robustness guarantees as any structured streaming query
○ Work based on this approach has already been prototyped
● Model training must be idempotent - it should not train on the same data twice
○ Leverage the batch ID, similar to FileStreamSink (see the sketch below)
● How to extend MLWritable for streaming
○ Spark's format isn't really all that useful - maybe PMML or PFA
Photo by bullet101
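A hedged sketch of the batch-ID idea from the idempotence bullet (illustrative names; the real FileStreamSink tracks committed batch IDs in a log rather than a variable, and update is the training method from slide 39):

var lastCommittedBatchId = -1L
def addBatch(batchId: Long, data: DataFrame): Unit = {
  // Replayed batches arrive with an already-seen ID, so we skip them
  if (batchId > lastCommittedBatchId) {
    update(data)
    lastCommittedBatchId = batchId
  }
}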
• 47. Structured Streaming ML vs DStreams ML
What could be different for ML on structured streaming vs ML on DStreams?
● Structured streaming is built on the Spark SQL engine
○ Catalyst optimizer
○ Project Tungsten
● Pipeline integration
○ ML pipelines have been improved and iterated across 5 releases; we can leverage their mature design for streaming pipelines
○ This will make adding and working with new algorithms much easier than in the past
● Event time handling
○ Streaming ML algorithms typically use a decay factor
○ Structured streaming provides native support for event time, which is more appropriate for decay (see the sketch below)
Krzysztof Belczyński
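To illustrate the event-time point: a decay keyed to elapsed event time, rather than to how many micro batches have passed, might look like this (the half-life and names are illustrative):

// Discount old state by elapsed event time, not by micro batch count
def decayFactor(elapsedSeconds: Double, halfLifeSeconds: Double): Double =
  math.pow(0.5, elapsedSeconds / halfLifeSeconds)

val keep = decayFactor(60.0, 120.0) // state 60s old, 120s half-life: keeps ~71% of its weight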
• 48. Batch vs Streaming Pipelines (Draft POC API)
// Batch:
val df = spark
  .read
  .schema(schema)
  .parquet(path)
val tokenizer = new RegexTokenizer()
val htf = new HashingTF()
val nb = new NaiveBayes()
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, htf, nb))
val pipelineModel = pipeline.fit(df)

// Streaming:
val df = spark
  .readStream
  .schema(schema)
  .parquet(path)
val tokenizer = new RegexTokenizer()
val htf = new HashingTF()
val snb = new StreamingNaiveBayes()
val pipeline = new StreamingPipeline()
  .setStages(Array(tokenizer, htf, snb))
  .setCheckpointLocation(path)
val query = pipeline.fitStreaming(df)
query.awaitTermination()
https://github.com/sethah/spark/tree/structured-streaming-fun
• 49. Additional Spark Resources
● Programming guide (along with JavaDoc, PyDoc, ScalaDoc, etc.)
○ http://spark.apache.org/docs/latest/
● Books
● Videos
● Spark Office Hours
○ Normally in the bay area - will do Google Hangouts ones soon
○ Follow me on twitter for future ones - https://twitter.com/holdenkarau
• 50. Structured Streaming Resources
● Programming guide (along with JavaDoc, PyDoc, ScalaDoc, etc.)
○ http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
● https://github.com/holdenk/spark-structured-streaming-ml
● TD's deep dive: https://spark-summit.org/2016/events/a-deep-dive-into-structured-streaming/
• 51. Surveys!!!!!!!! :D
● Interested in Structured Streaming?
○ http://bit.ly/structuredStreamingML - Let us know your thoughts
● Pssst: Care about Python DataFrame UDF performance?
○ http://bit.ly/pySparkUDF
● Care about Spark testing?
○ http://bit.ly/holdenTestingSpark
● Want to give me feedback on this talk?
○ http://bit.ly/holdenTalkFeedback
Michael Himbeault
• 52. And some upcoming talks:
● Jan
○ If there's interest tomorrow: office hours? Tweet me @holdenkarau
○ LinuxConf AU - tomorrow!
○ Data Day Texas - kind of far from Sydney but….
● Feb
○ FOSDEM - one on testing, one on scaling
○ JFokus in Stockholm - adding deep learning to Spark
○ I disappear for a week and pretend computers work
● March
○ Strata San Jose - Big Data Beyond the JVM
• 53. Books: Learning Spark, Fast Data Processing with Spark (out of date), Fast Data Processing with Spark (2nd edition), Advanced Analytics with Spark, High Performance Spark, Learning PySpark; coming soon: Spark in Action
• 54. High Performance Spark! The gift of whichever holiday season is next! Cats love it!** You can buy it from that scrappy Seattle bookstore; Jeff Bezos needs another newspaper, and I want a cup of coffee. http://bit.ly/hkHighPerfSpark
• 55. k thnx bye!
If you <3 testing & want to fill out the survey: http://bit.ly/holdenTestingSpark
Want to tell me (and/or my boss) how I'm doing? http://bit.ly/holdenTalkFeedback
Want to e-mail me? Promise not to be creepy? Ok: holden@pigscanfly.ca
Cat wave photo by Quinn Dombrowski
• 56. k thnx bye!
If you care about Spark testing and don't hate surveys: http://bit.ly/holdenTestingSpark
Will tweet results "eventually" @holdenkarau
Any PySpark users: have some simple UDFs you wish ran faster that you are willing to share? http://bit.ly/pySparkUDF
Pssst: Have feedback on the presentation? Give me a shout (holden@pigscanfly.ca) if you feel comfortable doing so :)
• 58. Start a continuous query
val query = happinessByCoffee
  .writeStream
  .format("parquet")
  .outputMode("complete")
  .trigger(ProcessingTime(5.seconds))
  .start()
(diagram: the StreamingQuery holds the logical plan source relation → groupBy → avg)
• 59. Launch a new thread to listen for new data
(diagram: the StreamingQuery's micro batch thread starts listening; the source tracks available offsets, the sink tracks committed offsets) Neil Falzon
• 60. Write new offsets to WAL
(diagram: newly available offsets 0, 1, 2 are committed to the write-ahead log before processing) April Weeks
• 61. Check the source for new offsets
(diagram: the micro batch thread calls getBatch() on the source for batchId = 42) cat-observer
• 62. Get the "recipe" for this micro batch
(diagram: the logical plan is transformed into a batch-specific plan, source scan → groupBy → avg, for batchId = 42) Jackie
• 63. Send the micro batch Dataset to the sink
(diagram: addBatch() hands the sink a micro batch Dataset with isStreaming = false, backed by an incremental execution plan) Jason Rojas
• 64. Commit and listen again
(diagram: committed offsets advance to 0, 1, 2 and the thread goes back to listening) S Orchard
• 65. Execution Summary
● Each query has its own thread - asynchronous
● Sources must be replayable
● Use write-ahead logs for durability
● Sinks must be idempotent
● Each batch is executed with an incremental execution plan
● Sinks get a micro batch view of the data
snaxor
• 66. Cool - let's build some ML with it! Lauren Coolman
• 67. Get a dataframe
val schema = new StructType()
  .add("happiness", "double")
  .add("coffees", "integer")
val batchDS = spark
  .read
  .schema(schema)
  .format("parquet")
  .load(path)
(diagram: a data source feeding a Dataset; isStreaming = false)
• 68. Build the recipe for each query
val happinessByCoffee = batchDS
  .groupBy($"coffees")
  .agg(avg($"happiness"))
(diagram: data source → Aggregate with groupBy = "coffees", expr = avg("happiness"); isStreaming = false)
• 69. Batch Aggregation - Partial Average
(diagram: each executor computes per-key (count, sum) partial averages over its slice of the data source)
• 70. Batch Aggregation - Shuffle
(diagram: the partial (count, sum) pairs are shuffled so all partials for a key land on one executor)
• 71. Batch Aggregation - Final Average
(diagram: partials are merged into a final per-key (count, sum) pair and averaged; unlike the streaming plan, no state store is involved)