Building Recoverable (and optionally async) Pipelines with Apache Spark (+ small revisions)


Have you ever had a Spark job fail in its second-to-last stage after a “trivial” update, or been partway through debugging a pipeline and wished you could look at its data, or had an “exploratory” notebook turn into something less exploratory? Come join me for a surprisingly simple adventure into how to build recoverable, more debuggable pipelines. Then join me on the adventure wherein we find out our “simple” solution has a bunch of hidden flaws, how to work around them, and end on a reminder of how important it is to test your code.


  1. Building Recoverable Pipelines With Apache Spark. Holden Karau, Open Source Developer Advocate @ Google
  2. Some links (slides & recordings will be at): http://bit.ly/2QMUaRc ^ Slides & Code (only after the talk because early is hard) Shkumbin Saneja
  3. Holden: ▪ Preferred pronouns are she/her ▪ Developer Advocate at Google ▪ Apache Spark PMC/Committer, contribute to many other projects ▪ previously IBM, Alpine, Databricks, Google, Foursquare & Amazon ▪ co-author of Learning Spark & High Performance Spark ▪ Twitter: @holdenkarau ▪ Slideshare: http://www.slideshare.net/hkarau ▪ Code review livestreams: https://www.twitch.tv/holdenkarau / https://www.youtube.com/user/holdenkarau ▪ Spark talk videos: http://bit.ly/holdenSparkVideos
  4. Who y’all are? ▪ Nice folk ▪ Like databases of a certain kind ▪ Occasionally have big data jobs on your big data fail mxmstryo
  5. What are we going to explore? ▪ Brief: what is Spark and why it’s related to this conference ▪ Also brief: some of the ways Spark can fail in hour 23 ▪ Less brief: a first stab at making it recoverable ▪ How that goes boom ▪ Repeat ? times until it stops going boom ▪ Summary and GitHub link Stuart
  6. What is Spark? • General purpose distributed system • With a really nice API including Python :) • Apache project (one of the most active) • Much faster than Hadoop Map/Reduce • Good when too big for a single machine • Built on top of two abstractions for distributed data: RDDs & Datasets
  7. The different pieces of Spark (diagram): Apache Spark core; SQL, DataFrames & Datasets; Structured Streaming; Streaming; Spark ML; MLlib; Bagel & GraphX; GraphFrames; APIs in Scala, Java, Python & R. Paul Hudson
  8. Why people come to Spark: Well this MapReduce job is going to take 16 hours - how long could it take to learn Spark? dougwoods
  9. Why people come to Spark: My DataFrame won’t fit in memory on my cluster anymore, let alone my MacBook Pro :( Maybe this Spark business will solve that... brownpau
  10. Big Data == Wordcount

      lines = sc.textFile(src)
      words = lines.flatMap(lambda x: x.split(" "))
      word_count = (words.map(lambda x: (x, 1))
                         .reduceByKey(lambda x, y: x + y))
      word_count.saveAsTextFile("output")

      Chris
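For context, a minimal end-to-end version of the word count slide above, assuming a local PySpark install; the input and output paths here are placeholders, not from the deck:

    from pyspark import SparkContext

    # Assumes a local PySpark installation; src is any text file you have handy.
    sc = SparkContext("local[*]", "wordcount-sketch")
    src = "input.txt"  # placeholder input path

    lines = sc.textFile(src)
    words = lines.flatMap(lambda x: x.split(" "))
    word_count = (words.map(lambda x: (x, 1))
                       .reduceByKey(lambda x, y: x + y))
    word_count.saveAsTextFile("output")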
  11. Big Data != Wordcount ▪ ETL (keeping your databases in sync) ▪ SQL on top of non-SQL (hey, what about if we added a SQL engine to this?) ▪ ML - everyone’s doing it, we should too ▪ DL - VCs won’t give us money for ML anymore so we changed its name ▪ But for this talk we’re just looking at Wordcount because it fits on a slide
  12. Ford Pinto, photo by Morven
  13. Why Spark fails & fails late ▪ Lazy evaluation can make predicting behaviour difficult ▪ Out of memory errors (from JVM heap to container limits) ▪ Errors in our own code ▪ Driver failure ▪ Data size increases without required tuning changes ▪ Key-skew (consistent partitioning is a great idea, right? Oh wait…) ▪ Serialization ▪ Limited type checking in non-JVM languages with ML pipelines ▪ etc.
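As a small illustration of the lazy-evaluation bullet above (a sketch, not from the deck): a bad record is accepted silently by a transformation and only blows up when an action finally forces the work, often deep into the job.

    # The map below is accepted instantly because nothing has run yet.
    rdd = sc.parallelize(["1", "2", "oops"])
    parsed = rdd.map(lambda x: int(x))  # no error here: Spark is lazy

    # Only an action forces evaluation, so the ValueError for "oops" surfaces
    # here as a task failure, not back where the bad lambda was written.
    parsed.collect()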
  14. Ford Pinto, photo by Morven / ayphen
  15. Why isn’t it recoverable? ▪ Separate jobs - no files, no VMs, only sadness ▪ If same job (e.g. notebook failure and retry): cache & files recovery Jennifer C.
  16. “Recoverable” Wordcount: Take 1

      lines = sc.textFile(src)
      words_raw = lines.flatMap(lambda x: x.split(" "))
      words_path = "words"
      if fs.exists(sc._jvm.org.apache.hadoop.fs.Path(words_path)):
          words = sc.textFile(words_path)
      else:
          words_raw.saveAsTextFile(words_path)
          words = words_raw
      # Continue with previous code

      KLMircea
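The fs handle on this slide isn’t defined in the deck; one plausible way to get it through py4j (an assumption on my part, and also why “sc._jvm is weird” comes up on the next slide) looks roughly like this:

    # Hypothetical helper for the fs/Path objects the slide relies on.
    def get_hadoop_fs(sc):
        hadoop_conf = sc._jsc.hadoopConfiguration()
        return sc._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)

    fs = get_hadoop_fs(sc)
    words_path = "words"
    # False until the words RDD has actually been saved once.
    print(fs.exists(sc._jvm.org.apache.hadoop.fs.Path(words_path)))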
  17. So what can we do better? ▪ Well, if the pipeline fails in certain ways this will fail ▪ We don’t have any clean up on success ▪ sc._jvm is weird ▪ Functions -- the future! ▪ Not async Jennifer C.
  18. “Recoverable” Wordcount: Take 2

      lines = sc.textFile(src)
      words_raw = lines.flatMap(lambda x: x.split(" "))
      words_path = "words"
      success_path = words_path + "/SUCCESS.txt"
      if fs.exists(sc._jvm.org.apache.hadoop.fs.Path(success_path)):
          words = sc.textFile(words_path)
      else:
          words_raw.saveAsTextFile(words_path)
          words = words_raw
      # Continue with previous code

      Susanne Nilsson
  19. So what can we do better? ▪ Well, if the pipeline fails in certain ways this will fail • Fixed ▪ We don’t have any clean up on success • …. ▪ sc._jvm is weird • Yeah, we’re not fixing this one unless we use Scala ▪ Functions -- the future! • Sure! ▪ Have to wait for the file to finish writing • Hold your horses ivva
  20. “Recoverable” [X]: Take 3

      def non_blocking_df_save_or_load(df, target):
          success_files = ["{0}/SUCCESS.txt", "{0}/_SUCCESS"]
          if any(fs.exists(hadoop_fs_path(t.format(target))) for t in success_files):
              print("Reusing")
              return session.read.load(target).persist()
          else:
              print("Saving")
              df.write.save(target)
              return df

      Jennifer C.
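A small usage sketch for the helper above (names like session and words_checkpoint are mine, not from the deck; src is the input path from the earlier slides, and hadoop_fs_path is assumed to wrap sc._jvm...Path as in the earlier sketch): the first run prints "Saving" and writes the checkpoint; a retry after a crash prints "Reusing" and loads it instead of recomputing.

    # session is an existing SparkSession; words_df is any DataFrame we would
    # rather not recompute after a mid-pipeline failure.
    words_df = session.read.text(src).selectExpr(
        "explode(split(value, ' ')) as word")

    words_df = non_blocking_df_save_or_load(words_df, "words_checkpoint")
    counts = words_df.groupBy("word").count()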
  21. So what can we do better? ▪ Try not to slow down our code on the happy path • async? ▪ Cleanup on success (damn, meant to do that earlier) hkase
  22. Adding async?

      def non_blocking_df_save(df, target):
          import threading
          def save_panda():
              df.write.mode("overwrite").save(target)
          thread = threading.Thread(target=save_panda)
          thread.start()
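One caveat worth sketching here (my addition, not from the deck): if the driver exits before the background thread finishes, the write can be cut short, so it helps to return the thread handle and join it before shutdown.

    import threading

    # Variant of the slide's helper that returns the thread so the caller can
    # join() it before the driver exits and cuts the write short.
    def non_blocking_df_save(df, target):
        def save_panda():
            df.write.mode("overwrite").save(target)
        thread = threading.Thread(target=save_panda)
        thread.start()
        return thread

    # save_thread = non_blocking_df_save(words_df, "words_checkpoint")
    # ... keep doing useful work ...
    # save_thread.join()  # make sure the checkpoint actually finished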
  23. What could go wrong? ▪ Turns out… a lot ▪ Multiple executions of the DAG are not super great (getting better, but…) ▪ How do we work around this?
  24. Spark’s (core) magic: the DAG ▪ In Spark most of our work is done by transformations • Things like map ▪ Transformations return new RDDs or DataFrames representing this data ▪ The RDD or DataFrame, however, doesn’t really “exist” ▪ RDDs & DataFrames are really just “plans” for how to make the data show up if we force Spark’s hand ▪ tl;dr - the data doesn’t exist until it “has” to Photo by Dan G
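To make the “plans” point concrete (a sketch of my own, with placeholder output paths; session and src as in the earlier sketches): without cache(), every action replays the whole lineage, which is exactly why a background save plus the rest of the pipeline can end up computing the same data twice.

    df = session.read.text(src).selectExpr("explode(split(value, ' ')) as word")

    df.count()               # runs the whole plan
    df.write.save("out_a")   # runs the whole plan again from scratch

    df.cache()
    df.count()               # materializes the data once...
    df.write.save("out_b")   # ...so this save reuses the cached partitions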
  25. The DAG / the query plan (diagram) Susanne Nilsson
  26. cache + sync count + async save

      def non_blocking_df_save_or_load(df, target):
          s = "{0}/SUCCESS.txt"
          if fs.exists(hadoop_fs_path(s.format(target))):
              return session.read.load(target).persist()
          else:
              print("Saving")
              df.cache()
              df.count()
              non_blocking_df_save(df, target)
              return df
  27. Well that was “fun”? ▪ Replace wordcount with your back-fill operation and it becomes less fun ▪ You also need to clean up the files ▪ Use job IDs to avoid stomping on other jobs
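A small sketch of the cleanup and job-ID advice (the naming scheme is mine, not from the deck): use an ID that is stable across retries of the same logical run, prefix every checkpoint with it so concurrent jobs don’t collide, and delete the prefix once the run succeeds.

    import sys

    # e.g. the scheduler passes "nightly-backfill-2019-08-21" as argv[1], so a
    # retry of the same run finds its checkpoints but other jobs never do.
    job_id = sys.argv[1] if len(sys.argv) > 1 else "wordcount-dev"
    checkpoint_root = "checkpoints/{0}".format(job_id)

    def checkpoint_path(name):
        return "{0}/{1}".format(checkpoint_root, name)

    # ... run the pipeline, checkpointing under checkpoint_path("words") etc. ...

    # On success, remove the whole prefix so later runs start fresh instead of
    # silently reusing stale data (fs / hadoop_fs_path as in the earlier sketch).
    fs.delete(hadoop_fs_path(checkpoint_root), True)  # True = recursive delete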
  28. Spark Videos ▪ Apache Spark YouTube Channel ▪ My Spark videos on YouTube • http://bit.ly/holdenSparkVideos ▪ Spark Summit 2014 training ▪ Paco’s Introduction to Apache Spark Paul Anderson
  29. Learning Spark; Fast Data Processing with Spark (out of date); Fast Data Processing with Spark (2nd edition); Advanced Analytics with Spark; Spark in Action; High Performance Spark; Learning PySpark
  30. I also have a book... High Performance Spark, it’s available today & the gift of the season. Unrelated to this talk, but if you have a corporate credit card (and/or care about distributed systems)… http://bit.ly/hkHighPerfSpark
  31. Cat wave photo by Quinn Dombrowski. k thnx bye! If you <3 Spark testing & want to fill out a survey: http://bit.ly/holdenTestingSpark Want to tell me (and/or my boss) how I’m doing? http://bit.ly/holdenTalkFeedback Want to e-mail me? Promise not to be creepy? Ok: holden@pigscanfly.ca

Published: Aug. 21, 2019

