Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Beyond shuffling - Strata London 2016


Published on

Holden Karau walks attendees through a number of common mistakes that can keep your Spark programs from scaling and examines solutions and general techniques useful for moving beyond a proof of concept to production.

Topics include:

Working with key/value data
Replacing groupByKey for awesomeness
Key skew: your data probably has it and how to survive
Effective caching and checkpointing
Considerations for noisy clusters
Functional transformations with Spark Datasets: getting the benefits of Catalyst with the ease of functional development
How to make our code testable

Published in: Data & Analytics
  • Login to see the comments

Beyond shuffling - Strata London 2016

  1. 1. Beyond Shuffling tips & tricks for scaling Apache Spark Reviside May 2016
  2. 2. Who am I? My name is Holden Karau Prefered pronouns are she/her I’m a Principal Software Engineer at IBM’s Spark Technology Center previously Alpine, Databricks, Google, Foursquare & Amazon co-author of Learning Spark & Fast Data processing with Spark co-author of a new book focused on Spark performance coming out next year* @holdenkarau Slide share
  3. 3. What is going to be covered: What I think I might know about you RDD re-use (caching, persistence levels, and checkpointing) Working with key/value data Why group key is evil and what we can do about it When Spark SQL can be amazing and wonderful A brief introduction to Datasets (new in Spark 1.6) Iterator-to-Iterator transformations (or yet another way to go OOM in the night) How to test your Spark code :) Torsten Reuschling
  4. 4. Or…. Huang Yun Chung
  5. 5. The different pieces of Spark Apache Spark SQL & DataFrames Streaming Language APIs Scala, Java, Python, & R Graph Tools Spark ML bagel & Graph X MLLib Community Packages Jon Ross
  6. 6. Who I think you wonderful humans are? Nice* people Don’t mind pictures of cats Know some Apache Spark Want to scale your Apache Spark jobs Don’t overly mind a grab-bag of topics Lori Erickson
  7. 7. Cat photo from Photo from Cocoa Dream
  8. 8. Lets look at some old stand bys: val words = rdd.flatMap(_.split(" ")) val wordPairs =, 1)) val grouped = wordPairs.groupByKey() grouped.mapValues(_.sum) val warnings = rdd.filter(_.toLower.contains("error")).count() Tomomi
  9. 9. RDD re-use - sadly not magic If we know we are going to re-use the RDD what should we do? If it fits nicely in memory caching in memory persisting at another level MEMORY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER checkpointing Noisey clusters _2 & checkpointing can help persist first for checkpointing Richard Gillin
  10. 10. Considerations for Key/Value Data What does the distribution of keys look like? What type of aggregations do we need to do? Do we want our data in any particular order? Are we joining with another RDD? Whats our partitioner? If we don’t have an explicit one: what is the partition structure? eleda 1
  11. 11. What is key skew and why do we care? Keys aren’t evenly distributed Sales by zip code, or records by city, etc. groupByKey will explode (but it's pretty easy to break) We can have really unbalanced partitions If we have enough key skew sortByKey could even fail Stragglers (uneven sharding can make some tasks take much longer) Mitchell Joyce
  12. 12. groupByKey - just how evil is it? Pretty evil Groups all of the records with the same key into a single record Even if we immediately reduce it (e.g. sum it or similar) This can be too big to fit in memory, then our job fails Unless we are in SQL then happy pandas PROgeckoam
  13. 13. So what does that look like? (94110, A, B) (94110, A, C) (10003, D, E) (94110, E, F) (94110, A, R) (10003, A, R) (94110, D, R) (94110, E, R) (94110, E, R) (67843, T, R) (94110, T, R) (94110, T, R) (67843, T, R)(10003, A, R) (94110, [(A, B), (A, C), (E, F), (A, R), (D, R), (E, R), (E, R), (T, R) (T, R)] Tomomi
  14. 14. Let’s revisit wordcount with groupByKey val words = rdd.flatMap(_.split(" ")) val wordPairs =, 1)) val grouped = wordPairs.groupByKey() grouped.mapValues(_.sum) Tomomi
  15. 15. And now back to the “normal” version val words = rdd.flatMap(_.split(" ")) val wordPairs =, 1)) val wordCounts = wordPairs.reduceByKey(_ + _) wordCounts
  16. 16. Let’s see what it looks like when we run the two Quick pastebin of the code for the two: val rdd = sc.textFile("python/pyspark/*.py", 20) // Make sure we have many partitions // Evil group by key version val words = rdd.flatMap(_.split(" ")) val wordPairs =, 1)) val grouped = wordPairs.groupByKey() val evilWordCounts = grouped.mapValues(_.sum) evilWordCounts.take(5) // Less evil version val wordCounts = wordPairs.reduceByKey(_ + _) wordCounts.take(5)
  17. 17. GroupByKey
  18. 18. reduceByKey
  19. 19. So what did we do instead? reduceByKey Works when the types are the same (e.g. in our summing version) aggregateByKey Doesn’t require the types to be the same (e.g. computing stats model or similar) Allows Spark to pipeline the reduction & skip making the list We also got a map-side reduction (note the difference in shuffled read)
  20. 20. So why did we read in python/*.py If we just read in the standard file there aren’t enough duplicated keys for the reduceByKey & groupByKey difference to be really apparent Which is why groupByKey can be safe sometimes
  21. 21. Can just the shuffle cause problems? Sorting by key can put all of the records in the same partition We can run into partition size limits (around 2GB) Or just get bad performance So we can handle data like the above we can add some “junk” to our key (94110, A, B) (94110, A, C) (10003, D, E) (94110, E, F) (94110, A, R) (10003, A, R) (94110, D, R) (94110, E, R) (94110, E, R) (67843, T, R) (94110, T, R) (94110, T, R) PROTodd Klassy
  22. 22. Shuffle explosions :( (94110, A, B) (94110, A, C) (10003, D, E) (94110, E, F) (94110, A, R) (10003, A, R) (94110, D, R) (94110, E, R) (94110, E, R) (67843, T, R) (94110, T, R) (94110, T, R) (94110, A, B) (94110, A, C) (94110, E, F) (94110, A, R) (94110, D, R) (94110, E, R) (94110, E, R) (94110, T, R) (94110, T, R) (67843, T, R)(10003, A, R) (10003, D, E) javier_artiles
  23. 23. 100% less explosions (94110, A, B) (94110, A, C) (10003, D, E) (94110, E, F) (94110, A, R) (10003, A, R) (94110, D, R) (94110, E, R) (94110, E, R) (67843, T, R) (94110, T, R) (94110, T, R) (94110_A, A, B) (94110_A, A, C) (94110_A, A, R) (94110_D, D, R) (94110_T, T, R) (10003_A, A, R) (10003_D, D, E) (67843_T, T, R) (94110_E, E, R) (94110_E, E, R) (94110_E, E, F) (94110_T, T, R)
  24. 24. Well there is a bit of magic in the shuffle…. We can reuse shuffle files But it can (and does) explode* Sculpture by Flaming Lotus Girls Photo by Zaskoda
  25. 25. Where can Spark SQL benefit perf? Structured or semi-structured data OK with having less* complex operations available to us We may only need to operate on a subset of the data The fastest data to process isn’t even read Remember that non-magic cat? Its got some magic** now In part from peeking inside of boxes non-JVM (aka Python & R) users: saved from double serialization cost! :) **Magic may cause stack overflow. Not valid in all states. Consult local magic bureau before attempting magic Matti Mattila
  26. 26. Why is Spark SQL good for those things? Space efficient columnar cached representation Able to push down operations to the data store Optimizer is able to look inside of our operations Regular spark can’t see inside our operations to spot the difference between (min(_, _)) and (append(_, _)) Matti Mattila
  27. 27. How much faster can it be? Andrew Skudder
  28. 28. But where will it explode? Iterative algorithms - large plans Some push downs are sad pandas :( Default shuffle size is too small for big data (200 partitions WTF?) Default partition size when reading in is also WTF
  29. 29. How to avoid lineage explosions: /** * Cut the lineage of a DataFrame which has too long a query plan. */ def cutLineage(df: DataFrame): DataFrame = { val sqlCtx = df.sqlContext //tag::cutLineage[] val rdd = df.rdd rdd.cache() sqlCtx.createDataFrame(rdd, df.schema) //end::cutLineage[] } karmablue
  30. 30. Introducing Datasets New in Spark 1.6 - coming to more languages in 2.0 Provide templated compile time strongly typed version of DataFrames Make it easier to intermix functional & relational code Do you hate writing UDFS? So do I! Still an experimental component (API will change in future versions) Although the next major version seems likely to be 2.0 anyways so lots of things may change regardless Houser Wolf
  31. 31. Using Datasets to mix functional & relational style: val ds: Dataset[RawPanda] = ... val happiness = ds.toDF().filter($"happy" === true).as[RawPanda]. select($"attributes"(0).as[Double]). reduce((x, y) => x + y)
  32. 32. So what was that? ds.toDF().filter($"happy" === true).as[RawPanda]. select($"attributes"(0).as[Double]). reduce((x, y) => x + y) convert a Dataset to a DataFrame to access more DataFrame functions. Shouldn’t be needed in 2.0 + & almost free anyways Convert DataFrame back to a Dataset A typed query (specifies the return type).Traditional functional reduction: arbitrary scala code :)
  33. 33. And functional style maps: /** * Functional map + Dataset, sums the positive attributes for the pandas */ def funMap(ds: Dataset[RawPanda]): Dataset[Double] = {{rp => rp.attributes.filter(_ > 0).sum} }
  34. 34. Photo by Christian Heilmann
  35. 35. Iterator to Iterator transformations Iterator to Iterator transformations are super useful They allow Spark to spill to disk if reading an entire partition is too much Not to mention better pipelining when we put multiple transformations together Most of the default transformations are already set up for this map, filter, flatMap, etc. But when we start working directly with the iterators Sometimes to save setup time on expensive objects e.g. mapPartitions, mapPartitionsWithIndex etc. Christian Heilmann
  36. 36. Making our code testable Before we can refactor our code we need to have good tests Testing individual functions is easy if we factor them out - less lambdas :( Testing our Spark transformations is a bit trickier, but there are a variety of tools Property checking can save us from having to come up lots of test cases
  37. 37. A simple unit test with spark-testing-base class SampleRDDTest extends FunSuite with SharedSparkContext { test("really simple transformation") { val input = List("hi", "hi holden", "bye") val expected = List(List("hi"), List("hi", "holden"), List("bye")) assert(SampleRDD.tokenize(sc.parallelize(input)).collect().toList === expected) } }
  38. 38. Ok but what about problems @ scale Maybe our program works fine on our local sized input If we are using Spark our actual workload is probably huge How do we test workloads too large for a single machine? we can’t just use parallelize and collect Qfamily
  39. 39. Distributed “set” operations to the rescue* Pretty close - already built into Spark Doesn’t do so well with floating points :( damn floating points keep showing up everywhere :p Doesn’t really handle duplicates very well {“coffee”, “coffee”, “panda”} != {“panda”, “coffee”} but with set operations... Matti Mattila
  40. 40. Or use RDDComparisions: def compareWithOrderSamePartitioner[T: ClassTag](expected: RDD[T], result: RDD[T]): Option[(T, T)] = {{case (x, y) => x != y}.take(1).headOption } Matti Mattila
  41. 41. Or use RDDComparisions: def compare[T: ClassTag](expected: RDD[T], result: RDD[T]): Option[(T, Int, Int)] = { val expectedKeyed = => (x, 1)).reduceByKey(_ + _) val resultKeyed = => (x, 1)).reduceByKey(_ + _) expectedKeyed.cogroup(resultKeyed).filter{case (_, (i1, i2)) => i1.isEmpty || i2.isEmpty || i1.head != i2.head}.take(1).headOption. map{case (v, (i1, i2)) => (v, i1.headOption.getOrElse(0), i2.headOption.getOrElse(0))} } Matti Mattila
  42. 42. But where do we get the data for those tests? If you have production data you can sample you are lucky! If possible you can try and save in the same format If our data is a bunch of Vectors or Doubles Spark’s got tools :) Coming up with good test data can take a long time Lori Rielly
  43. 43. QuickCheck / ScalaCheck QuickCheck generates tests data under a set of constraints Scala version is ScalaCheck - supported by the two unit testing libraries for Spark sscheck Awesome people*, supports generating DStreams too! spark-testing-base Also Awesome people*, generates more pathological (e.g. empty partitions etc.) RDDs PROtara hunt
  44. 44. With spark-testing-base test("map should not change number of elements") { val property = forAll(RDDGenerator.genRDD[String](sc) (Arbitrary.arbitrary[String])) { rdd => == rdd.count() } check(property) }
  45. 45. With spark-testing-base & a million entries test("map should not change number of elements") { implicit val generatorDrivenConfig = PropertyCheckConfig(minSize = 0, maxSize = 1000000) val property = forAll(RDDGenerator.genRDD[String](sc) (Arbitrary.arbitrary[String])) { rdd => == rdd.count() } check(property) }
  46. 46. Additional Spark Testing Resources Libraries Scala: spark-testing-base (scalacheck & unit) sscheck (scalacheck) example-spark (unit) Java: spark-testing-base (unit) Python: spark-testing-base (unittest2), pyspark.test (pytest) Strata San Jose Talk (up on YouTube) Blog posts Unit Testing Spark with Java by Jesse Anderson Making Apache Spark Testing Easy with Spark Testing Base Unit testing Apache Spark with py.test raider of gin
  47. 47. Additional Spark Resources Programming guide (along with JavaDoc, PyDoc, ScalaDoc, etc.) Kay Ousterhout’s work Books Videos Spark Office Hours Normally in the bay area - will do Google Hangouts ones soon follow me on twitter for future ones - raider of gin
  48. 48. Learning Spark Fast Data Processing with Spark (Out of Date) Fast Data Processing with Spark (2nd edition) Advanced Analytics with Spark
  49. 49. Learning Spark Fast Data Processing with Spark (Out of Date) Fast Data Processing with Spark (2nd edition) Advanced Analytics with Spark Coming soon: Spark in Action Coming soon: High Performance Spark
  50. 50. And the next book….. First four chapters are available in “Early Release”*: Buy from O’Reilly - Book signing Friday June 3rd 10:45** @ O’Reilly booth Get notified when updated & finished: * Early Release means extra mistakes, but also a chance to help us make a more awesome book. **Normally there are about 25 printed copies first come first serve and when we run out its just me, a stuffed animal, and a bud light lime (or local equivalent)
  51. 51. Spark Videos Apache Spark Youtube Channel My Spark videos on YouTube - Spark Summit 2014 training Paco’s Introduction to Apache Spark
  52. 52. k thnx bye! If you care about Spark testing and don’t hate surveys: Will tweet results “eventually” @holdenkarau PySpark Users: Have some simple UDFs you wish ran faster you are willing to share?: