
Spark Summit EU talk by Nimbus Goehausen


Automatic Checkpointing in Spark



  1. Automatic Checkpointing in Spark
     Nimbus Goehausen, Spark Platform Engineer, ngoehausen@bloomberg.net
     Copyright 2016 Bloomberg L.P. All rights reserved.
  2. A Data Pipeline
     Diagram: source → Process → output
     output = Process(source)
  3. Spark development cycle
     • Make code change
     • Deploy jar (1m)
     • Start spark-submit / spark shell / notebook (1m)
     • Wait for execution (1m to several hours)
     • Observe results
  4. This should work, right?
     Diagram: source → Process → output
     output = Process(source)
  5. What have I done to deserve this?
     Diagram: source → Process → (exception)
     java.lang.HotGarbageException: You messed up
       at some.obscure.library.Util$$anonfun$morefun$1.apply(Util.scala:eleventyBillion)
       at another.library.InputStream$$somuchfun$mcV$sp.read(InputStream.scala:42)
       at org.apache.spark.serializer.PotatoSerializer.readPotato(PotatoSerializer.scala:2792)
       …
  6. Split it up
     Diagram: source → Stage1 → Stage1 Output → Stage2 → output, split across two jobs (Main1 and Main2)
     Main1:
       stage1Out = stage1(source)
       stage1Out.save("stage1Path")
     Main2:
       stage1Out = load("stage1Path")
       output = stage2(stage1Out)
  7. Automatic checkpoint
     Diagram: source → Stage1 → Stage1 Output → Stage2 → output
     stage1Out = stage1(source).checkpoint()
     output = stage2(stage1Out)
  8. First run
     Diagram: source → Stage1 → Stage1 Output → Stage2 → output (Stage1 Output does not exist yet, so it is computed and saved)
     stage1Out = stage1(source).checkpoint()
     output = stage2(stage1Out)
  9. Second run
     Diagram: source → Stage1 → Stage1 Output → Stage2 → output (Stage1 Output already exists, so it is loaded instead of recomputed)
     stage1Out = stage1(source).checkpoint()
     output = stage2(stage1Out)
  10. Change logic
     Diagram: source → Stage1 (modified) → new Stage1 Output → Stage2 (modified) → output; the old Stage1 Output is left untouched
     stage1Out = stage1(source).checkpoint()
     output = stage2(stage1Out)
  11. How do we do this?
     • Need some way of automatically generating a path to save to or load from
     • Should stay the same when we want to reload the same data
     • Should change when we want to generate new data
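
     A minimal sketch of that load-or-compute behavior (illustrative only, not spark-flow's implementation; the base path argument and the use of Spark object files are assumptions made for the example):

       import org.apache.hadoop.fs.{FileSystem, Path}
       import org.apache.spark.SparkContext
       import org.apache.spark.rdd.RDD
       import scala.reflect.ClassTag

       object CheckpointSketch {
         // Derive the storage path from the dataset's logical signature:
         // the same signature reloads old data, a new signature writes new data.
         def loadOrCompute[T: ClassTag](sc: SparkContext,
                                        basePath: String,
                                        signature: String,
                                        compute: => RDD[T]): RDD[T] = {
           val path = s"$basePath/$signature"
           val fs = FileSystem.get(sc.hadoopConfiguration)
           if (fs.exists(new Path(path))) {
             sc.objectFile[T](path)            // already materialized: just reload it
           } else {
             compute.saveAsObjectFile(path)    // not materialized yet: compute and save it
             sc.objectFile[T](path)            // downstream stages read the saved copy
           }
         }
       }
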
  12. Refresher: what defines an RDD
     Dependencies → Logic → Result
  13. Refresher: what defines an RDD
     Dependencies: [1,2,3,4,5…]
     Logic: map(x => x*2)
     Result: [2,4,6,8,10…]
  14. A logical signature
     result = logic(dependencies)
     signature(result) = hash(signature(dependencies) + hash(logic))
  15. A logical signature
     val rdd2 = rdd1.map(f)
     signature(rdd2) == hash(signature(rdd1) + hash(f))

     val sourceRDD = sc.textFile(uri)
     signature(sourceRDD) == hash("textFile:" + uri)
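
     As a concrete illustration of the rule above (a hypothetical helper, not the library's API), the composition can be written as a small hashing object:

       import java.security.MessageDigest

       object Signatures {
         // Hex-encoded SHA-1 of a string; stands in for "hash" on the slide.
         def hash(s: String): String =
           MessageDigest.getInstance("SHA-1")
             .digest(s.getBytes("UTF-8"))
             .map(b => f"$b%02x").mkString

         // signature(rdd2) == hash(signature(rdd1) + hash(f))
         def mapSignature(parentSignature: String, logicHash: String): String =
           hash(parentSignature + logicHash)

         // signature(sourceRDD) == hash("textFile:" + uri)
         def textFileSignature(uri: String): String =
           hash("textFile:" + uri)
       }
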
  16. Wait, how do you hash a function?
     val f = (x: Int) => x * 2
     f compiles to a class file of JVM bytecode; hash those bytes.
     Change f to val f = (x: Int) => x * 3 and the class file will change.
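
     One way to realize this, sketched below, is to locate the function's class file on the classpath and hash its bytes. This assumes the lambda compiles to its own anonymous class file, as with Scala 2.11 (the version Spark 2.0 targets); it is an illustration, not spark-flow's actual hashing code.

       import java.security.MessageDigest

       object FunctionHash {
         // Load the bytecode of the class this function object belongs to and
         // hash it. If the function body changes, the emitted class file changes,
         // and so does the hash. Returns None when no class file can be found
         // (e.g. lambdas that are not compiled to a dedicated class).
         def hashFunctionClass(f: AnyRef): Option[String] = {
           val resource = "/" + f.getClass.getName.replace('.', '/') + ".class"
           Option(f.getClass.getResourceAsStream(resource)).map { in =>
             try {
               val bytes = Iterator.continually(in.read()).takeWhile(_ != -1).map(_.toByte).toArray
               MessageDigest.getInstance("SHA-1").digest(bytes).map(b => f"$b%02x").mkString
             } finally in.close()
           }
         }
       }

       // Changing `* 2` to `* 3` changes the class file, so the hash changes.
       val f = (x: Int) => x * 2
       println(FunctionHash.hashFunctionClass(f))
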
  17. What if the function references other functions or static methods?
     • f = (x: Int) => Values.v + staticMethod(x) + g(x)
     • This is the hard part
     • Must follow dependencies recursively and identify all class files the function depends on
     • Make a combined hash of the class files and any runtime values
  18. Putting this together
     Diagram: source → Stage1 → Stage1 Output → Stage2 → output
     stage1Out = stage1(source).checkpoint()
     output = stage2(stage1Out)
  19. How to switch in checkpoints
     • One option: change Spark.
     • Other option: build on top of Spark.
  20. Distributed Collection (DC)
     • Mostly the same API as RDD: maps, flatMaps, and so on
     • Can be instantiated without a SparkContext
     • Pass in a SparkContext to get the corresponding RDD
     • Computes its logical signature upon definition, before materialization
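
     A stripped-down sketch of that idea (illustrative only, not spark-flow's actual classes; it reuses the Signatures helper sketched above): the transformation and its signature are captured at definition time, and an RDD only exists once a SparkContext is supplied.

       import org.apache.spark.SparkContext
       import org.apache.spark.rdd.RDD
       import scala.reflect.ClassTag

       // A lazily defined collection that knows its logical signature before
       // anything is materialized. Hashing the function's class name here is a
       // simplification of the bytecode hashing described earlier in the talk.
       abstract class DC[T: ClassTag](val signature: String) { self =>
         def getRDD(sc: SparkContext): RDD[T]

         def map[U: ClassTag](f: T => U): DC[U] =
           new DC[U](Signatures.hash(self.signature + Signatures.hash(f.getClass.getName))) {
             def getRDD(sc: SparkContext): RDD[U] = self.getRDD(sc).map(f)
           }

         def filter(p: T => Boolean): DC[T] =
           new DC[T](Signatures.hash(self.signature + Signatures.hash(p.getClass.getName))) {
             def getRDD(sc: SparkContext): RDD[T] = self.getRDD(sc).filter(p)
           }
       }

       // Source DC built without a SparkContext; the RDD is created on demand.
       def parallelize[T: ClassTag](data: Seq[T]): DC[T] =
         new DC[T](Signatures.hash("parallelize:" + data.toString)) {
           def getRDD(sc: SparkContext): RDD[T] = sc.parallelize(data)
         }
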
  21. RDD vs DC
     DC:
       val numbers: DC[Int] = parallelize(1 to 10)
       val filtered: DC[Int] = numbers.filter(_ < 5)
       val doubled: DC[Int] = filtered.map(_ * 2)
       val doubledRDD: RDD[Int] = doubled.getRDD(sc)
     RDD:
       val numbers: RDD[Int] = sc.parallelize(1 to 10)
       val filtered: RDD[Int] = numbers.filter(_ < 5)
       val doubled: RDD[Int] = filtered.map(_ * 2)
  22. Easy checkpoint
     val numbers: DC[Int] = parallelize(1 to 10)
     val filtered: DC[Int] = numbers.filter(_ < 5).checkpoint()
     val doubled: DC[Int] = filtered.map(_ * 2)
  23. Problem: Non-lazy definition
     val numbers: RDD[Int] = sc.parallelize(1 to 10)
     val doubles: RDD[Double] = numbers.map(x => x.toDouble)
     val sum: Double = doubles.sum()
     val normalized: RDD[Double] = doubles.map(x => x / sum)
     We can't get a signature of `normalized` without computing the sum!
  24. Solution: Deferred Result
     • Defined by a function on an RDD
     • Like a DC, but contains only one element
     • Pass in a SparkContext to compute the result
     • Also has a logical signature
     • We can now express entirely lazy pipelines
  25. Deferred Result example
     val numbers: DC[Int] = parallelize(1 to 10)
     val doubles: DC[Double] = numbers.map(x => x.toDouble)
     val sum: DR[Double] = doubles.sum()
     val normalized: DC[Double] = doubles.withResult(sum).map { case (x, s) => x / s }
     We can compute a signature for `normalized` without computing the sum.
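
     In the same spirit, a deferred result can be sketched as a one-element counterpart of a DC (again an illustration rather than the actual spark-flow types; it builds on the DC and Signatures sketches above, and the helper names are made up for the example):

       import org.apache.spark.SparkContext
       import org.apache.spark.rdd.RDD
       import scala.reflect.ClassTag

       // A single deferred value with its own logical signature.
       abstract class DR[T](val signature: String) {
         def get(sc: SparkContext): T
       }

       // A sum over a DC, defined lazily: nothing runs until get() is called.
       def sumOf(dc: DC[Double]): DR[Double] =
         new DR[Double](Signatures.hash(dc.signature + ":sum")) {
           def get(sc: SparkContext): Double = dc.getRDD(sc).sum()
         }

       // Pair every element with a deferred result, keeping the pipeline lazy;
       // the combined signature depends on both inputs, so it can be computed
       // without running the sum.
       def withResult[T: ClassTag, S](dc: DC[T], dr: DR[S]): DC[(T, S)] =
         new DC[(T, S)](Signatures.hash(dc.signature + dr.signature)) {
           def getRDD(sc: SparkContext): RDD[(T, S)] = {
             val s = dr.get(sc)                 // evaluated only at materialization time
             dc.getRDD(sc).map(x => (x, s))
           }
         }
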
  26. Improving the workflow
     • Avoid adding complexity in order to speed things up.
     • Get the benefits of avoiding redundant computation without the costs of managing intermediate results.
     • Further automatic behaviors to explore, such as automatic failure traps and generation of common statistics.
  27. Can I use this? Yes! Check it out: https://github.com/bloomberg/spark-flow
     • Requires Spark 2.0
     • Still early; must currently be built from source
     • Hopefully available on Spark Packages soon
  28. Questions?
     Nimbus Goehausen
     ngoehausen@bloomberg.net
     @nimbusgo on Twitter / GitHub
     https://github.com/bloomberg/spark-flow
     techatbloomberg.com
