Distributed systems can seem magical, and sometimes all of the magic works and our job succeeds. However, if you've worked with them for long enough you've found a few places where the magic starts to break down, revealing that it's actually a collection of several hundred garden gnomes* rather than a single large garden gnome.
This talk will use Apache Spark, Beam, Flink, Kafka, and Map Reduce to explore the world of data parallel distributed systems. We'll start with some happy pieces of magic, like combining different transformations into a single pass over the data, working between different languages, data partitioning, and lambda serialization. After each new piece of magic is introduced we'll look at how it breaks in one (or two) of the systems.
Come to be told it's not your fault everything is broken, or, if your distributed software still works, for an exciting preview of everything that's going to go wrong. Don't work with distributed systems? Come to be reassured you've made good life choices.
The magic of (data parallel) distributed systems and where it all breaks - Reversim
1. The Magic of (Data Parallel)
Distributed Systems
& where it breaks
Holden Karau
@holdenkarau
2. Holden:
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google
● Apache Spark PMC, Beam contributor
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & High Performance Spark
● Twitter: @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Code review livestreams: https://www.twitch.tv/holdenkarau /
https://www.youtube.com/user/holdenkarau
● Spark Talk Videos http://bit.ly/holdenSparkVideos
● Talk feedback (if you are so inclined): http://bit.ly/holdenTalkFeedback
4. Some links (slides & recordings will be at):
http://bit.ly/2E5qFsC
CatLoversShow
5. Who do I think you all are?
● Nice people*
● Possibly some knowledge of distributed systems
● Curious how systems fail
Amanda
6. What are some data parallel systems?
● Apache Spark
○ What if functional programming was distributed?
● Kafka Streams
○ What if all your data ever lived in Kafka?
● Apache Flink
○ What if Spark focused on streaming?
● Apache Beam
○ What if one more standard was the answer?
● Storm
○ What if we described everything like plumbing?
7. What is Spark?
● General purpose distributed system
○ With a really nice API including Python :)
● Apache project (one of the most active)
● Faster than Hadoop Map/Reduce
● Good when too big for a single machine
● Built on top of two abstractions for distributed data: RDDs & Datasets
8. What is the “magic” of these systems? Richard Gillin
Ethan Prater
9. What is the “magic” of these systems?
● A deal wherein we promise to only do certain things and the distributed system promises to make it look like we’re on one computer
● The deal quickly breaks though, so don’t feel bad about not holding up your end either.
Richard Gillin
10. Magic part #1: Pipelining
● Most of our work is done by transformations
○ Things like map
● Transformations return new objects representing this data
● These objects (often) don’t really exist
● Spark & Flink use lazy evaluation
● Beam/Dataflow use something closer to whole pipeline evaluation
● tl;dr - the data doesn’t exist until it “has” to
Photo by Dan G
12. Word count
val lines = sc.textFile("boop")
val words = lines.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val wordCounts = wordPairs.reduceByKey(_ + _)
wordCounts.saveAsTextFile("snoop")
Photo By: Will Keightley
13. Word count in Spark
val lines = sc.textFile("boop")
val words = lines.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val wordCounts = wordPairs.reduceByKey(_ + _)
wordCounts.saveAsTextFile("snoop")
No data is read or processed until after this line: saveAsTextFile is an “action”, which forces Spark to evaluate the RDD. Everything above it is a transformation.
daniilr
15. How the magic is awesome:
● Pipelining (can put maps, filter, flatMap together)
● Can do interesting optimizations by delaying work
● We use the DAG to recompute on failure
○ (writing data out to 3 disks on different machines is so last season)
○ The DAG puts the R in Resilient RDD, except DAG doesn’t have an R :(
Matthew Hurst
16. And where it reaches its limits:
● It doesn’t have a whole program view (no seeing the future)
○ Can only see up to the action, can’t see into the next one
○ So we have to help Spark out and cache (sketched below)
● Combining the transformations together makes it hard to know what failed
● It can only see the pieces it understands
○ can see two maps but can’t tell what each map is doing
r2hox
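A minimal sketch of the caching point above, in the same RDD style as the word count (the file path and variable names are made up for illustration):
val words = sc.textFile("boop").flatMap(_.split(" "))
words.cache() // hint: keep this around after the first evaluation
val total = words.count() // first action: reads the file & computes
val unique = words.distinct().count() // second action: reuses the cached words instead of replaying the lineage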
17. What about Beam (& friends)?
● Closer to a whole program view
● Makes implementing iterative algorithms like gradient descent a little more complicated
r2hox
18. Magically distributed Data
● At some point we have to do actual work :(
● The data is split up on a bunch of different machines
● If the data needs to be joined (or similar) Spark does a “shuffle” so it knows which keys are where
● Partitioners in Spark are deterministic on key input (sketched below)
○ And Kafka, and Flink and ….
● Not all data is partitioned with an explicit partitioner
○ (so the same key can be in different partitions in some RDDs, or in different Kafka partitions under certain circumstances)
ncfc0721
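A minimal sketch of what “deterministic on key input” means; this mirrors the modular hashing Spark’s HashPartitioner does (the helper name is ours):
def partitionFor(key: Any, numPartitions: Int): Int = {
  val rawMod = key.hashCode % numPartitions
  rawMod + (if (rawMod < 0) numPartitions else 0) // keep the result non-negative
}
partitionFor("94110", 4) // the same key always lands in the same partition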
20. Key-skew to the anti-rescue… :(
● Keys aren’t evenly distributed
○ Sales by zip code, or records by city, etc.
● Workers can become starved while others are overloaded
○ Think topics keyed by customer location as people wake up over time
● Naive grouping operations can fail
○ Like Spark & Flink’s groupByKey
● Liquid sharding can save some of this….
cinnamonster
21. An example about mustaches
● Sorting by key can put all of the records in the same partition
● Or just get bad performance
● So that we can handle data like the above, we can add some “junk” to our key (a code sketch follows the happy-shuffle slide)
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
Todd Klassy
22. Shuffle explosions :(
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
(94110, A, B)
(94110, A, C)
(94110, E, F)
(94110, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(94110, T, R)
(94110, T, R)
(67843, T, R)
(10003, A, R)
(10003, D, E)
javier_artiles
23. Happy Shuffle (100% less explosions)
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, U, R)
(94110, T, R)
(94110_A, A, B)
(94110_A, A, C)
(94110_A, A, R)
(94110_D, D, R)
(94110_U, U, R)
(10003_A, A, R)
(10003_D, D, E)
(67843_T, T, R)
(94110_E, E, R)
(94110_E, E, R)
(94110_E, E, F)
(94110_T, T, R)
Jennifer Williams
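A minimal sketch of the salting shown above (illustrative only: assumes an RDD of (key, count) pairs named pairs, and that keys never contain "_"):
import scala.util.Random
val salted = pairs.map { case (k, v) => (s"${k}_${Random.nextInt(10)}", v) } // spread each hot key over 10 salted keys
val partial = salted.reduceByKey(_ + _) // aggregate per salted key in parallel
val counts = partial.map { case (k, v) => (k.split("_")(0), v) }.reduceByKey(_ + _) // strip the salt & combine the partials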
24. key-skew + black boxes == more sadness
● There is a worse way to do WordCount
● We can use the seemingly safe thing called groupByKey
● Then compute the sum
● But since it’s on a slide of “more sadness” we know where this is going...
_torne
25. Bad word count :(
val lines = sc.textFile("boop")
val words = lines.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val grouped = wordPairs.groupByKey()
val wordCounts = grouped.mapValues(_.sum)
wordCounts.saveAsTextFile("snoop")
Tomomi
27. So what did we do instead?
● reduceByKey
○ Works when the types are the same (e.g. in our summing version)
● aggregateByKey
○ Doesn’t require the types to be the same (e.g. computing a stats model or similar)
● Allows Spark to pipeline the reduction & skip making the list
● We also get a map-side reduction (note the difference in shuffled read)
● tl;dr - the abstraction layer fails and we need to change our code (sketched below)
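A minimal sketch of both fixes, reusing wordPairs from the word count earlier (the (sum, count) accumulator is our own illustrative example):
// Same input & output type (Int): reduceByKey
val wordCounts = wordPairs.reduceByKey(_ + _)
// Different accumulator type, (Int, Int) for sum & count: aggregateByKey
val sumAndCount = wordPairs.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1), // fold one value into the accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2)) // merge accumulators (this is what enables the map-side reduction)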
29. Mini “fix”: SQL
● Understandable to a higher degree by the systems
● Implemented across systems
○ Sometimes shared. Sometimes unique. Often hybrid (ala Spark SQL/Hive and Beam/Calcite)
○ Easier for distributed systems to parse & understand
■ See KSQL + schema integrations
● Buuuuut (imho) way less fun
30. What can the optimizer do now?
● Understand which data is needed & which isn’t
○ More generally, pushing down predicates to the data sources is now possible without inspecting JVM bytecode. Doesn’t mean everyone does it.
● Understand the aggregate (“partial aggregates”)
○ Could sort of do this before, but not as awesomely, and only if we used reduceByKey - not groupByKey
● Have a serializer that understands the data
○ Most of these systems allow arbitrary data types, but restrict the data types in SQL-based land (a SQL sketch follows below).
Capes Treasures
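A minimal sketch of word count in SQL, where the optimizer can see the whole aggregate and do partial aggregation (assumes a SparkSession named spark and a registered view "words" with a word column):
val counts = spark.sql("SELECT word, count(*) AS cnt FROM words GROUP BY word")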
31. Or the hybrid approach:
● In Spark done with DataFrames/Datasets
● In Kafka we can have KTransform + KSQL
● In Beam Calcite is used in the middle anyways
● etc.
● Generally still allow arbitrary lambdas
● But give you more options to “help” the optimizer
● Less standard
32. Using Datasets to mix functional & relational style
val ds: Dataset[RawPanda] = ...
val happiness = ds.filter($"happy" === true).
  select($"attributes"(0).as[Double]).
  reduce((x, y) => x + y)
33. How does [R, Py]{BigData} work?
● Yet another level of indirection!
○ In Spark Py4J + pickling + magic, in Beam gRPC, in Flink mmapped files
● RDDs are generally RDDs of pickled objects
● Spark SQL (and DataFrames) avoid some of this
kristin klein
36. What do the gnomes look like?
# PySpark's persist: translate the Python storage level into its JVM
# equivalent, then delegate the actual persist to the Java RDD via Py4J
self.is_cached = True
javaStorageLevel = self.ctx._getJavaStorageLevel(storageLevel)
self._jrdd.persist(javaStorageLevel)
return self
37. So how does this break?
● Data from the Spark worker is serialized and piped to the Python worker
○ Multiple iterator-to-iterator transformations are still pipelined :) (sketched after this list)
● Double serialization cost makes everything more expensive
● Python worker startup takes a bit of extra time
● Python memory isn’t controlled by the JVM - easy to go over container limits if deploying on YARN or similar
● etc.
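A minimal sketch of an iterator-to-iterator transformation (Scala RDD style, with made-up names): the work is expressed lazily over each partition’s iterator, so several such steps can be pipelined without materializing a whole partition.
val cleaned = lines.mapPartitions { iter =>
  iter.map(_.trim).filter(_.nonEmpty) // consumes & produces iterators lazily
}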
38. Spark Videos
● Apache Spark Youtube Channel
● My Spark videos on YouTube -
○ http://bit.ly/holdenSparkVideos
● Spark Summit 2014 training
● Paco’s Introduction to Apache Spark
Paul Anderson
39. In summary
● Data Parallel distributed systems involve trade-offs
○ Not like good ones either
● It’s not your fault none of this works
● Be nice to the data engineers, they have seen Cthulhu
● Even though it’s a pain, you really need to test & validate these things
40. Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Spark in Action
High Performance Spark
Learning PySpark
42. Sign up for the mailing list @
http://www.distributedcomputing4kids.com
43. Office Hours?
● Is there interest in Spark office hours tomorrow?
○ If so tweet @ me @holdenkarau
○ Otherwise I’m going to hang on a beach w/coffee and “work”
44. And some upcoming talks:
● Twilio Conference (SF) - October
● PyCon CA (Toronto) - November
● Big Data Spain - November
● Scala eXchange (London, UK) - December
45. (Time Permitted) Optional Demo Of Failures!
● One of the few demos today where success is failure
● We can totally skip this and go into Q&A
● Pick your favourite failure:
○ Serialization
○ OOM
○ Key-Skew
○ WiFi :p
● Or if we’re short on time come talk to me after
46. k thnx bye!
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
Will tweet results “eventually” @holdenkarau
Pssst: Have feedback on the presentation? Give me a shout (holden@pigscanfly.ca) if you feel comfortable doing so :)