Distributed systems can seem magical, and sometimes all of the magic works and our job succeeds. However, if you've worked with them for long enough you've found a few places where the magic starts to break down, revealing that it's actually a collection of several hundred garden gnomes* rather than a single large garden gnome.
This talk will use Apache Spark, Beam, Flink, Kafka, and Map Reduce to explore the world of data parallel distributed systems. We'll start with some happy pieces of magic, like combining different transformations into a single pass over the data, working between different languages, data partitioning, and lambda serialization. After each new piece of magic is introduced we'll look at how it breaks in one (or two) of the systems.
Come to be told it's not your fault everything is broken, or, if your distributed software still works, for an exciting preview of everything that's going to go wrong. Don't work with distributed systems? Come to be reassured you've made good life choices.
The magic of (data parallel) distributed systems and where it all breaks - Reversim
1. The Magic of (Data Parallel)
Distributed Systems
& where it breaks
Holden Karau
@holdenkarau
2. Holden:
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google
● Apache Spark PMC, Beam contributor
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & High Performance Spark
● Twitter: @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Code review livestreams: https://www.twitch.tv/holdenkarau /
https://www.youtube.com/user/holdenkarau
● Spark Talk Videos http://bit.ly/holdenSparkVideos
● Talk feedback (if you are so inclined): http://bit.ly/holdenTalkFeedback
4. Some links (slides & recordings will be at):
http://bit.ly/2E5qFsC
CatLoversShow
5. Who do I think you all are?
● Nice people*
● Possibly some knowledge of distributed systems
● Curious how systems fail
Amanda
6. What are some data parallel systems?
● Apache Spark
○ What if functional programming was distributed?
● Kafka Streams
○ What if all your data ever lived in Kafka?
● Apache Flink
○ What if Spark focused on streaming?
● Apache Beam
○ What if one more standard was the answer?
● Storm
○ What if we described everything like plumbing?
7. What is Spark?
● General purpose distributed system
○ With a really nice API including Python :)
● Apache project (one of the most active)
● Faster than Hadoop Map/Reduce
● Good when too big for a single machine
● Built on top of two abstractions for distributed data: RDDs & Datasets
8. What is the “magic” of these systems? Richard Gillin
Ethan Prater
9. What is the “magic” of these systems?
● A deal wherein we promise to only do certain things and the distributed system promises to make it look like we’re on one computer
● The deal quickly breaks though, so don’t feel bad about not holding up your end either.
Richard Gillin
10. Magic part #1: Pipelining
● Most of our work is done by transformations
○ Things like map
● Transformations return new objects representing this data
● These objects (often) don’t really exist
● Spark & Flink use lazy evaluation
● Beam/Dataflow use something closer to whole pipeline evaluation
● tl;dr - the data doesn’t exist until it “has” to
Photo by Dan G
12. Word count
val lines = sc.textFile("boop")
val words = lines.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val wordCounts = wordPairs.reduceByKey(_ + _)
wordCounts.saveAsTextFile("snoop")
Photo By: Will Keightley
13. Word count in Spark
val lines = sc.textFile("boop")
val words = lines.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val wordCounts = wordPairs.reduceByKey(_ + _)
wordCounts.saveAsTextFile("snoop")
No data is read or processed until after this line: saveAsTextFile is an “action”, which forces Spark to evaluate the RDD. Everything above it is a transformation.
daniilr
15. How the magic is awesome:
● Pipelining (can put maps, filter, flatMap together)
● Can do interesting optimizations by delaying work
● We use the DAG to recompute on failure
○ (writing data out to 3 disks on different machines is so last season)
○ The DAG puts the R in Resilient RDD, except DAG doesn’t have an R :(
Matthew Hurst
16. And where it reaches its limits:
● It doesn’t have a whole program view (no seeing the future)
○ Can only see up to the action, can’t see into the next one
○ So we have to help Spark out and cache (sketched below)
● Combining the transformations together makes it hard to know what failed
● It can only see the pieces it understands
○ can see two maps but can’t tell what each map is doing
r2hox
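A minimal sketch of the caching point above, in the same RDD style as the word count (the file path and variable names are made up for illustration):
val words = sc.textFile("boop").flatMap(_.split(" "))
words.cache() // hint: keep this around after the first evaluation
val total = words.count() // first action: reads the file & computes
val unique = words.distinct().count() // second action: reuses the cached words instead of replaying the lineage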
17. What about Beam (& friends)?
● Closer to a whole program view
● Makes implementing iterative algorithms like gradient descent a little more complicated
r2hox
18. Magically distributed Data
● At some point we have to do actual work :(
● The data is split up on a bunch of different machines
● If the data needs to be joined (or similar) Spark does a “shuffle” so it knows which keys are where
● Partitioners in Spark are deterministic on key input (sketched below)
○ And Kafka, and Flink and ….
● Not all data is partitioned with an explicit partitioner
○ (so the same key can be in different partitions in some RDDs, or in different Kafka partitions under certain circumstances)
ncfc0721
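A minimal sketch of what “deterministic on key input” means; this mirrors the modular hashing Spark’s HashPartitioner does (the helper name is ours):
def partitionFor(key: Any, numPartitions: Int): Int = {
  val rawMod = key.hashCode % numPartitions
  rawMod + (if (rawMod < 0) numPartitions else 0) // keep the result non-negative
}
partitionFor("94110", 4) // the same key always lands in the same partition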
20. Key-skew to the anti-rescue… :(
● Keys aren’t evenly distributed
○ Sales by zip code, or records by city, etc.
● Workers can become starved while others are overloaded
○ Think topics keyed by customer location as people wake up over time
● Naive grouping operations can fail
○ Like Spark & Flink’s groupByKey
● Liquid sharding can save some of this….
cinnamonster
21. An example about mustaches
● Sorting by key can put all of the records in the same partition
● Or just get bad performance
● So that we can handle data like the above, we can add some “junk” to our key (a code sketch follows the happy-shuffle slide)
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
Todd Klassy
22. Shuffle explosions :(
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
(94110, A, B)
(94110, A, C)
(94110, E, F)
(94110, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(94110, T, R)
(94110, T, R)
(67843, T, R)
(10003, A, R)
(10003, D, E)
javier_artiles
23. Happy Shuffle (100% less explosions)
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, U, R)
(94110, T, R)
(94110_A, A, B)
(94110_A, A, C)
(94110_A, A, R)
(94110_D, D, R)
(94110_U, U, R)
(10003_A, A, R)
(10003_D, D, E)
(67843_T, T, R)
(94110_E, E, R)
(94110_E, E, R)
(94110_E, E, F)
(94110_T, T, R)
Jennifer Williams
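A minimal sketch of the salting shown above (illustrative only: assumes an RDD of (key, count) pairs named pairs, and that keys never contain "_"):
import scala.util.Random
val salted = pairs.map { case (k, v) => (s"${k}_${Random.nextInt(10)}", v) } // spread each hot key over 10 salted keys
val partial = salted.reduceByKey(_ + _) // aggregate per salted key in parallel
val counts = partial.map { case (k, v) => (k.split("_")(0), v) }.reduceByKey(_ + _) // strip the salt & combine the partials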
24. key-skew + black boxes == more sadness
● There is a worse way to do WordCount
● We can use the seemingly safe thing called groupByKey
● Then compute the sum
● But since it’s on a slide of “more sadness” we know where this is going...
_torne
25. Bad word count :(
val lines = sc.textFile("boop")
val words = lines.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val grouped = wordPairs.groupByKey()
val wordCounts = grouped.mapValues(_.sum)
wordCounts.saveAsTextFile("snoop")
Tomomi
27. So what did we do instead?
● reduceByKey
○ Works when the types are the same (e.g. in our summing version)
● aggregateByKey
○ Doesn’t require the types to be the same (e.g. computing a stats model or similar)
● Allows Spark to pipeline the reduction & skip making the list
● We also get a map-side reduction (note the difference in shuffled read)
● tl;dr - the abstraction layer fails and we need to change our code (sketched below)
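A minimal sketch of both fixes, reusing wordPairs from the word count earlier (the (sum, count) accumulator is our own illustrative example):
// Same input & output type (Int): reduceByKey
val wordCounts = wordPairs.reduceByKey(_ + _)
// Different accumulator type, (Int, Int) for sum & count: aggregateByKey
val sumAndCount = wordPairs.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1), // fold one value into the accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2)) // merge accumulators (this is what enables the map-side reduction)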
29. Mini “fix”: SQL
● Understandable to a higher degree by the systems
● Implemented across systems
○ Sometimes shared. Sometimes unique. Often hybrid (ala Spark SQL/Hive and Beam/Calcite)
○ Easier for distributed systems to parse & understand
■ See KSQL + schema integrations
● Buuuuut (imho) way less fun
30. What can the optimizer do now?
● Understand which data is needed & which isn’t
○ More generally, pushing down predicates to the data sources is now possible without inspecting JVM bytecode. Doesn’t mean everyone does it.
● Understand the aggregate (“partial aggregates”)
○ Could sort of do this before, but not as awesomely, and only if we used reduceByKey - not groupByKey
● Have a serializer that understands the data
○ Most of these systems allow arbitrary data types, but restrict the data types in SQL-based land (a SQL sketch follows below).
Capes Treasures
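A minimal sketch of word count in SQL, where the optimizer can see the whole aggregate and do partial aggregation (assumes a SparkSession named spark and a registered view "words" with a word column):
val counts = spark.sql("SELECT word, count(*) AS cnt FROM words GROUP BY word")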
31. Or the hybrid approach:
● In Spark done with DataFrames/Datasets
● In Kafka we can have KTransform + KSQL
● In Beam Calcite is used in the middle anyways
● etc.
● Generally still allow arbitrary lambdas
● But give you more options to “help” the optimizer
● Less standard
32. Using Datasets to mix functional & relational style
val ds: Dataset[RawPanda] = ...
val happiness = ds.filter($"happy" === true).
  select($"attributes"(0).as[Double]).
  reduce((x, y) => x + y)
33. How does [R, Py]{BigData} work?
● Yet another level of indirection!
○ In Spark Py4J + pickling + magic, in Beam gRPC, in Flink mmapped files
● RDDs are generally RDDs of pickled objects
● Spark SQL (and DataFrames) avoid some of this
kristin klein
36. What do the gnomes look like?
# PySpark's persist: translate the Python storage level into its JVM
# equivalent, then delegate the actual persist to the Java RDD via Py4J
self.is_cached = True
javaStorageLevel = self.ctx._getJavaStorageLevel(storageLevel)
self._jrdd.persist(javaStorageLevel)
return self
37. So how does this break?
● Data from the Spark worker is serialized and piped to the Python worker
○ Multiple iterator-to-iterator transformations are still pipelined :) (sketched after this list)
● Double serialization cost makes everything more expensive
● Python worker startup takes a bit of extra time
● Python memory isn’t controlled by the JVM - easy to go over container limits if deploying on YARN or similar
● etc.
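A minimal sketch of an iterator-to-iterator transformation (Scala RDD style, with made-up names): the work is expressed lazily over each partition’s iterator, so several such steps can be pipelined without materializing a whole partition.
val cleaned = lines.mapPartitions { iter =>
  iter.map(_.trim).filter(_.nonEmpty) // consumes & produces iterators lazily
}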
38. Spark Videos
● Apache Spark Youtube Channel
● My Spark videos on YouTube -
○ http://bit.ly/holdenSparkVideos
● Spark Summit 2014 training
● Paco’s Introduction to Apache Spark
Paul Anderson
39. In summary
● Data Parallel distributed systems involve trade-offs
○ Not like good ones either
● It’s not your fault none of this works
● Be nice to the data engineers, they have seen Cthulhu
● Even though it’s a pain, you really need to test & validate these things
40. Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Spark in Action
High Performance Spark
Learning PySpark
42. Sign up for the mailing list @
http://www.distributedcomputing4kids.com
43. Office Hours?
● Is there interest in Spark office hours tomorrow?
○ If so tweet @ me @holdenkarau
○ Otherwise I’m going to hang on a beach w/coffee and “work”
44. And some upcoming talks:
● Twilio Conference (SF) - October
● PyCon CA (Toronto) - November
● Big Data Spain - November
● Scala eXchange (London, UK) - December
45. (Time Permitted) Optional Demo Of Failures!
● One of the few demos today where success is failure
● We can totally skip this and go into Q&A
● Pick your favourite failure:
○ Serialization
○ OOM
○ Key-Skew
○ WiFi :p
● Or if we’re short on time come talk to me after
46. k thnx bye!
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
Will tweet results “eventually” @holdenkarau
Pssst: Have feedback on the presentation? Give me a shout (holden@pigscanfly.ca) if you feel comfortable doing so :)