SlideShare a Scribd company logo
1 of 32
Download to read offline
Building Recoverable Pipelines
With Apache Spark
Holden Karau
Open Source Developer Advocate @ Google
Some links (slides & recordings
will be at):
http://bit.ly/2QMUaRc
^ Slides & Code
(only after the talk because early is hard)
Shkumbin Saneja
Holden:
▪ Prefered pronouns are she/her
▪ Developer Advocate at Google
▪ Apache Spark PMC/Committer, contribute to many other projects
▪ previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
▪ co-author of Learning Spark & High Performance Spark
▪ Twitter: @holdenkarau
▪ Slide share http://www.slideshare.net/hkarau
▪ Code review livestreams: https://www.twitch.tv/holdenkarau /
https://www.youtube.com/user/holdenkarau
▪ Spark Talk Videos http://bit.ly/holdenSparkVideos
Who y’all are?
▪ Nice folk
▪ Like databases of a certain kind
▪ Occasionally have big data jobs on your big data fail
mxmstryo
What are we going to explore?
▪ Brief: what is Spark and why it’s related to this conference
▪ Also brief: Some of the ways Spark can fail in hour 23
▪ Less brief: a first stab at making it recoverable
▪ How that goes boom
▪ Repeat ? times until it stops going boom
▪ Summary and github link
Stuart
What is Spark?
• General purpose distributed system
• With a really nice API including Python :)
• Apache project (one of the most active)
• Must faster than Hadoop Map/Reduce
• Good when too big for a single machine
• Built on top of two abstractions for
distributed data: RDDs & Datasets
The different pieces of Spark
Apache Spark
SQL, DataFrames & Datasets
Structured
Streaming
Scala,
Java,
Python, &
R
Spark ML
bagel &
Graph X
MLLib
Scala,
Java,
PythonStreaming
Graph
Frames
Paul Hudson
Why people come to Spark:
Well this MapReduce job
is going to take 16 hours -
how long could it take to
learn Spark?
dougwoods
Why people come to Spark:
My DataFrame won’t fit in
memory on my cluster
anymore, let alone my
MacBook Pro :( Maybe
this Spark business will
solve that...
brownpau
Big Data == Wordcount
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count =
(words.map(lambda x: (x, 1))
.reduceByKey(lambda x, y: x+y))
word_count.saveAsTextFile(“output”)
Chris
Big Data != Wordcount
▪ ETL (keeping your databases in sync)
▪ SQL on top of non-SQL (hey what about if we added a SQL
engine to this?)
▪ ML - Everyone’s doing it, we should too
▪ DL - VC’s won’t give us money for ML anymore so we changed
its name
▪ But for this talk we’re just looking at Wordcount because it fits
on a slide
f ford Pinto by Morven
Why Spark fails & fails late
▪ Lazy evaluation can make predicting behaviour difficulty
▪ Out of memory errors (from JVM heap to container limits)
▪ Errors in our own code
▪ Driver failure
▪ Data size increases without required tuning changes
▪ Key-skew (consistent partitioning is a great idea right? Oh wait…)
▪ Serialization
▪ Limited type checking in non-JVM languages with ML pipelines
▪ etc.
f ford Pinto by Morven
ayphen
Why isn’t it recoverable?
▪ Seperate jobs - no files, no VMs, only sadness
▪ If same job (e.g. notebook failure and retry) cache & files
recovery
Jennifer C.
“Recoverable” Wordcount: Take 1
lines = sc.textFile(src)
words_raw = lines.flatMap(lambda x: x.split(" "))
words_path = "words"
if fs.exists(sc._jvm.org.apache.hadoop.fs.Path(words)):
words = sc.textFile(words_path)
else:
word.saveAsTextFile(words_path)
words = words_raw
# Continue with previous code
KLMircea
So what can we do better?
▪ Well if the pipeline fails in certain ways this will fail
▪ We don’t have any clean up on success
▪ sc._jvm is weird
▪ Functions -- the future!
▪ Not async
Jennifer C.
“Recoverable” Wordcount: Take 2
lines = sc.textFile(src)
words_raw = lines.flatMap(lambda x: x.split(" "))
words_path = "words/SUCCESS.txt"
if fs.exists(sc._jvm.org.apache.hadoop.fs.Path(words)):
words = sc.textFile(words_path)
else:
word.saveAsTextFile(words_path)
words = words_raw
# Continue with previous code
Susanne Nilsson
So what can we do better?
▪ Well if the pipeline fails in certain ways this will fail
• Fixed
▪ We don’t have any clean up on success
• ….
▪ sc._jvm is weird
• Yeah we’re not fixing this one unless we use scala
▪ Functions -- the future!
• sure!
▪ Have to wait to finish writing file
• Hold your horses
ivva
“Recoverable” [X]: Take 3
def non_blocking_df_save_or_load(df, target):
success_files = ["{0}/SUCCESS.txt", "{0}/_SUCCESS"]
if any(fs.exists(hadoop_fs_path(t.format(target))) for t in
success_files):
print("Reusing")
return session.read.load(target).persist()
else:
print("Saving")
df.save(target)
return df
Jennifer C.
So what can we do better?
▪ Try and not slow down our code on the happy path
• async?
▪ Cleanup on success (damn meant to do that earlier)
hkase
Adding async?
def non_blocking_df_save(df, target):
import threading
def save_panda():
df.write.mode("overwrite").save(target)
thread = threading.Thread(target=save_panda)
thread.start()
What could go wrong?
▪ Turns out… a lot
▪ Multiple executions on the DAG are not super great
(getting better but)
▪ How do we work around this?
Spark’s (core) magic: the DAG
▪ In Spark most of our work is done by transformations
• Things like map
▪ Transformations return new RDDs or DataFrames representing
this data
▪ The RDD or DataFrame however doesn’t really “exist”
▪ RDD & DataFrames are really just “plans” of how to make the
data show up if we force Spark’s hand
▪ tl;dr - the data doesn’t exist until it “has” to
Photo by Dan G
The DAG The query
plan Susanne Nilsson
cache + sync count + async save
def non_blocking_df_save_or_load(df, target):
s = "{0}/SUCCESS.txt"
if fs.exists(hadoop_fs_path(s.format(target))):
return session.read.load(target).persist()
else:
print("Saving")
df.cache()
df.count()
non_blocking_df_save(df, target)
return df
Well that was “fun”?
▪ Replace wordcount with your back-fill operation and it
becomes less fun
▪ You also need to clean up the files
▪ Use job IDS to avoid stomping on other jobs
Spark Videos
▪ Apache Spark Youtube Channel
▪ My Spark videos on YouTube -
• http://bit.ly/holdenSparkVideos
▪ Spark Summit 2014 training
▪ Paco’s Introduction to Apache Spark
Paul Anderson
Learning Spark
Fast Data Processing
with Spark
(Out of Date)
Fast Data Processing with
Spark (2nd edition)
Advanced
Analytics with
Spark
Spark in Action
High Performance
Spark
Learning PySpark
I also have a book...
High Performance Spark, it’s available today & the gift of
the season.
Unrelated to this talk, but if you have a corporate credit
card (and or care about distributed systems)….
http://bit.ly/hkHighPerfSpark
Cat wave photo by Quinn Dombrowski
k thnx bye!
If you <3 Spark testing & want to fill out
survey: http://bit.ly/holdenTestingSpark
Want to tell me (and or my boss) how
I’m doing?
http://bit.ly/holdenTalkFeedback
Want to e-mail me?
Promise not to be creepy? Ok:
holden@pigscanfly.ca

More Related Content

What's hot

Spark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New YorkSpark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New YorkHolden Karau
 
Contributing to Apache Spark 3
Contributing to Apache Spark 3Contributing to Apache Spark 3
Contributing to Apache Spark 3Holden Karau
 
Big data with Python on kubernetes (pyspark on k8s) - Big Data Spain 2018
Big data with Python on kubernetes (pyspark on k8s) - Big Data Spain 2018Big data with Python on kubernetes (pyspark on k8s) - Big Data Spain 2018
Big data with Python on kubernetes (pyspark on k8s) - Big Data Spain 2018Holden Karau
 
PySpark on Kubernetes @ Python Barcelona March Meetup
PySpark on Kubernetes @ Python Barcelona March MeetupPySpark on Kubernetes @ Python Barcelona March Meetup
PySpark on Kubernetes @ Python Barcelona March MeetupHolden Karau
 
Powering tensor flow with big data using apache beam, flink, and spark cern...
Powering tensor flow with big data using apache beam, flink, and spark   cern...Powering tensor flow with big data using apache beam, flink, and spark   cern...
Powering tensor flow with big data using apache beam, flink, and spark cern...Holden Karau
 
Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018Holden Karau
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Holden Karau
 
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...Holden Karau
 
Debugging Spark: Scala and Python - Super Happy Fun Times @ Data Day Texas 2018
Debugging Spark:  Scala and Python - Super Happy Fun Times @ Data Day Texas 2018Debugging Spark:  Scala and Python - Super Happy Fun Times @ Data Day Texas 2018
Debugging Spark: Scala and Python - Super Happy Fun Times @ Data Day Texas 2018Holden Karau
 
Big data beyond the JVM - DDTX 2018
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
Big data beyond the JVM - DDTX 2018Holden Karau
 
Keeping the fun in functional w/ Apache Spark @ Scala Days NYC
Keeping the fun in functional   w/ Apache Spark @ Scala Days NYCKeeping the fun in functional   w/ Apache Spark @ Scala Days NYC
Keeping the fun in functional w/ Apache Spark @ Scala Days NYCHolden Karau
 
Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark...
 Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark... Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark...
Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark...Databricks
 
Debugging Apache Spark - Scala & Python super happy fun times 2017
Debugging Apache Spark -   Scala & Python super happy fun times 2017Debugging Apache Spark -   Scala & Python super happy fun times 2017
Debugging Apache Spark - Scala & Python super happy fun times 2017Holden Karau
 
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018Holden Karau
 
Sharing (or stealing) the jewels of python with big data &amp; the jvm (1)
Sharing (or stealing) the jewels of python with big data &amp; the jvm (1)Sharing (or stealing) the jewels of python with big data &amp; the jvm (1)
Sharing (or stealing) the jewels of python with big data &amp; the jvm (1)Holden Karau
 
Yahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupYahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupHadoop User Group
 
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...Chris Fregly
 
Node.js: CAMTA Presentation
Node.js: CAMTA PresentationNode.js: CAMTA Presentation
Node.js: CAMTA PresentationRob Tweed
 
Whats wrong with postgres | PGConf EU 2019 | Craig Kerstiens
Whats wrong with postgres | PGConf EU 2019 | Craig KerstiensWhats wrong with postgres | PGConf EU 2019 | Craig Kerstiens
Whats wrong with postgres | PGConf EU 2019 | Craig KerstiensCitus Data
 

What's hot (20)

Spark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New YorkSpark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New York
 
Contributing to Apache Spark 3
Contributing to Apache Spark 3Contributing to Apache Spark 3
Contributing to Apache Spark 3
 
Big data with Python on kubernetes (pyspark on k8s) - Big Data Spain 2018
Big data with Python on kubernetes (pyspark on k8s) - Big Data Spain 2018Big data with Python on kubernetes (pyspark on k8s) - Big Data Spain 2018
Big data with Python on kubernetes (pyspark on k8s) - Big Data Spain 2018
 
PySpark on Kubernetes @ Python Barcelona March Meetup
PySpark on Kubernetes @ Python Barcelona March MeetupPySpark on Kubernetes @ Python Barcelona March Meetup
PySpark on Kubernetes @ Python Barcelona March Meetup
 
Powering tensor flow with big data using apache beam, flink, and spark cern...
Powering tensor flow with big data using apache beam, flink, and spark   cern...Powering tensor flow with big data using apache beam, flink, and spark   cern...
Powering tensor flow with big data using apache beam, flink, and spark cern...
 
Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018
 
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
 
Debugging Spark: Scala and Python - Super Happy Fun Times @ Data Day Texas 2018
Debugging Spark:  Scala and Python - Super Happy Fun Times @ Data Day Texas 2018Debugging Spark:  Scala and Python - Super Happy Fun Times @ Data Day Texas 2018
Debugging Spark: Scala and Python - Super Happy Fun Times @ Data Day Texas 2018
 
Big data beyond the JVM - DDTX 2018
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
Big data beyond the JVM - DDTX 2018
 
Keeping the fun in functional w/ Apache Spark @ Scala Days NYC
Keeping the fun in functional   w/ Apache Spark @ Scala Days NYCKeeping the fun in functional   w/ Apache Spark @ Scala Days NYC
Keeping the fun in functional w/ Apache Spark @ Scala Days NYC
 
Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark...
 Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark... Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark...
Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark...
 
Debugging Apache Spark - Scala & Python super happy fun times 2017
Debugging Apache Spark -   Scala & Python super happy fun times 2017Debugging Apache Spark -   Scala & Python super happy fun times 2017
Debugging Apache Spark - Scala & Python super happy fun times 2017
 
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
 
Sharing (or stealing) the jewels of python with big data &amp; the jvm (1)
Sharing (or stealing) the jewels of python with big data &amp; the jvm (1)Sharing (or stealing) the jewels of python with big data &amp; the jvm (1)
Sharing (or stealing) the jewels of python with big data &amp; the jvm (1)
 
Yahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupYahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user group
 
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...
 
Node.js: CAMTA Presentation
Node.js: CAMTA PresentationNode.js: CAMTA Presentation
Node.js: CAMTA Presentation
 
Os Whitaker
Os WhitakerOs Whitaker
Os Whitaker
 
Whats wrong with postgres | PGConf EU 2019 | Craig Kerstiens
Whats wrong with postgres | PGConf EU 2019 | Craig KerstiensWhats wrong with postgres | PGConf EU 2019 | Craig Kerstiens
Whats wrong with postgres | PGConf EU 2019 | Craig Kerstiens
 

Similar to Building Recoverable Pipelines With Apache Spark

Scylla Summit 2018: Building Recoverable (and optionally Async) Spark Pipelines
Scylla Summit 2018: Building Recoverable (and optionally Async) Spark PipelinesScylla Summit 2018: Building Recoverable (and optionally Async) Spark Pipelines
Scylla Summit 2018: Building Recoverable (and optionally Async) Spark PipelinesScyllaDB
 
A super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAMA super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAMHolden Karau
 
A fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFsA fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFsHolden Karau
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMHolden Karau
 
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Holden Karau
 
An introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckAn introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckData Con LA
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesHolden Karau
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastHolden Karau
 
From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...
From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...
From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...Alexander Dean
 
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark MeetupBeyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark MeetupHolden Karau
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache SparkMammoth Data
 
Introduction To Spark - Durham LUG 20150916
Introduction To Spark - Durham LUG 20150916Introduction To Spark - Durham LUG 20150916
Introduction To Spark - Durham LUG 20150916Ian Pointer
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Holden Karau
 
Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016Holden Karau
 
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018Holden Karau
 
Spark autotuning talk final
Spark autotuning talk finalSpark autotuning talk final
Spark autotuning talk finalRachel Warren
 
Introduction to and Extending Spark ML
Introduction to and Extending Spark MLIntroduction to and Extending Spark ML
Introduction to and Extending Spark MLHolden Karau
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkDatabricks
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in SparkDatabricks
 
Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016Holden Karau
 

Similar to Building Recoverable Pipelines With Apache Spark (20)

Scylla Summit 2018: Building Recoverable (and optionally Async) Spark Pipelines
Scylla Summit 2018: Building Recoverable (and optionally Async) Spark PipelinesScylla Summit 2018: Building Recoverable (and optionally Async) Spark Pipelines
Scylla Summit 2018: Building Recoverable (and optionally Async) Spark Pipelines
 
A super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAMA super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAM
 
A fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFsA fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFs
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVM
 
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
 
An introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckAn introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuck
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
 
From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...
From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...
From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...
 
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark MeetupBeyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Introduction To Spark - Durham LUG 20150916
Introduction To Spark - Durham LUG 20150916Introduction To Spark - Durham LUG 20150916
Introduction To Spark - Durham LUG 20150916
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
 
Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016
 
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
 
Spark autotuning talk final
Spark autotuning talk finalSpark autotuning talk final
Spark autotuning talk final
 
Introduction to and Extending Spark ML
Introduction to and Extending Spark MLIntroduction to and Extending Spark ML
Introduction to and Extending Spark ML
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016
 

Recently uploaded

✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663
✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663
✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663Call Girls Mumbai
 
Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.soniya singh
 
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...Sheetaleventcompany
 
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...tanu pandey
 
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...APNIC
 
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...tanu pandey
 
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRLLucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRLimonikaupta
 
Moving Beyond Twitter/X and Facebook - Social Media for local news providers
Moving Beyond Twitter/X and Facebook - Social Media for local news providersMoving Beyond Twitter/X and Facebook - Social Media for local news providers
Moving Beyond Twitter/X and Facebook - Social Media for local news providersDamian Radcliffe
 
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
How is AI changing journalism? (v. April 2024)
How is AI changing journalism? (v. April 2024)How is AI changing journalism? (v. April 2024)
How is AI changing journalism? (v. April 2024)Damian Radcliffe
 
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...SofiyaSharma5
 
Networking in the Penumbra presented by Geoff Huston at NZNOG
Networking in the Penumbra presented by Geoff Huston at NZNOGNetworking in the Penumbra presented by Geoff Huston at NZNOG
Networking in the Penumbra presented by Geoff Huston at NZNOGAPNIC
 
CALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service Onlineanilsa9823
 
Best VIP Call Girls Noida Sector 75 Call Me: 8448380779
Best VIP Call Girls Noida Sector 75 Call Me: 8448380779Best VIP Call Girls Noida Sector 75 Call Me: 8448380779
Best VIP Call Girls Noida Sector 75 Call Me: 8448380779Delhi Call girls
 
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445ruhi
 
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...Delhi Call girls
 
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.soniya singh
 

Recently uploaded (20)

✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663
✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663
✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663
 
Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.
 
Rohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No AdvanceRohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
 
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
 
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
 
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...
 
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
 
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
 
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRLLucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
 
Moving Beyond Twitter/X and Facebook - Social Media for local news providers
Moving Beyond Twitter/X and Facebook - Social Media for local news providersMoving Beyond Twitter/X and Facebook - Social Media for local news providers
Moving Beyond Twitter/X and Facebook - Social Media for local news providers
 
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
 
How is AI changing journalism? (v. April 2024)
How is AI changing journalism? (v. April 2024)How is AI changing journalism? (v. April 2024)
How is AI changing journalism? (v. April 2024)
 
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
 
Networking in the Penumbra presented by Geoff Huston at NZNOG
Networking in the Penumbra presented by Geoff Huston at NZNOGNetworking in the Penumbra presented by Geoff Huston at NZNOG
Networking in the Penumbra presented by Geoff Huston at NZNOG
 
CALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service Online
 
Best VIP Call Girls Noida Sector 75 Call Me: 8448380779
Best VIP Call Girls Noida Sector 75 Call Me: 8448380779Best VIP Call Girls Noida Sector 75 Call Me: 8448380779
Best VIP Call Girls Noida Sector 75 Call Me: 8448380779
 
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
 
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
 
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
 
Russian Call Girls in %(+971524965298 )# Call Girls in Dubai
Russian Call Girls in %(+971524965298  )#  Call Girls in DubaiRussian Call Girls in %(+971524965298  )#  Call Girls in Dubai
Russian Call Girls in %(+971524965298 )# Call Girls in Dubai
 

Building Recoverable Pipelines With Apache Spark

  • 1. Building Recoverable Pipelines With Apache Spark Holden Karau Open Source Developer Advocate @ Google
  • 2. Some links (slides & recordings will be at): http://bit.ly/2QMUaRc ^ Slides & Code (only after the talk because early is hard) Shkumbin Saneja
  • 3. Holden: ▪ Prefered pronouns are she/her ▪ Developer Advocate at Google ▪ Apache Spark PMC/Committer, contribute to many other projects ▪ previously IBM, Alpine, Databricks, Google, Foursquare & Amazon ▪ co-author of Learning Spark & High Performance Spark ▪ Twitter: @holdenkarau ▪ Slide share http://www.slideshare.net/hkarau ▪ Code review livestreams: https://www.twitch.tv/holdenkarau / https://www.youtube.com/user/holdenkarau ▪ Spark Talk Videos http://bit.ly/holdenSparkVideos
  • 4.
  • 5. Who y’all are? ▪ Nice folk ▪ Like databases of a certain kind ▪ Occasionally have big data jobs on your big data fail mxmstryo
  • 6. What are we going to explore? ▪ Brief: what is Spark and why it’s related to this conference ▪ Also brief: Some of the ways Spark can fail in hour 23 ▪ Less brief: a first stab at making it recoverable ▪ How that goes boom ▪ Repeat ? times until it stops going boom ▪ Summary and github link Stuart
  • 7. What is Spark? • General purpose distributed system • With a really nice API including Python :) • Apache project (one of the most active) • Must faster than Hadoop Map/Reduce • Good when too big for a single machine • Built on top of two abstractions for distributed data: RDDs & Datasets
  • 8. The different pieces of Spark Apache Spark SQL, DataFrames & Datasets Structured Streaming Scala, Java, Python, & R Spark ML bagel & Graph X MLLib Scala, Java, PythonStreaming Graph Frames Paul Hudson
  • 9. Why people come to Spark: Well this MapReduce job is going to take 16 hours - how long could it take to learn Spark? dougwoods
  • 10. Why people come to Spark: My DataFrame won’t fit in memory on my cluster anymore, let alone my MacBook Pro :( Maybe this Spark business will solve that... brownpau
  • 11. Big Data == Wordcount lines = sc.textFile(src) words = lines.flatMap(lambda x: x.split(" ")) word_count = (words.map(lambda x: (x, 1)) .reduceByKey(lambda x, y: x+y)) word_count.saveAsTextFile(“output”) Chris
  • 12. Big Data != Wordcount ▪ ETL (keeping your databases in sync) ▪ SQL on top of non-SQL (hey what about if we added a SQL engine to this?) ▪ ML - Everyone’s doing it, we should too ▪ DL - VC’s won’t give us money for ML anymore so we changed its name ▪ But for this talk we’re just looking at Wordcount because it fits on a slide
  • 13. f ford Pinto by Morven
  • 14. Why Spark fails & fails late ▪ Lazy evaluation can make predicting behaviour difficulty ▪ Out of memory errors (from JVM heap to container limits) ▪ Errors in our own code ▪ Driver failure ▪ Data size increases without required tuning changes ▪ Key-skew (consistent partitioning is a great idea right? Oh wait…) ▪ Serialization ▪ Limited type checking in non-JVM languages with ML pipelines ▪ etc.
  • 15. f ford Pinto by Morven ayphen
  • 16. Why isn’t it recoverable? ▪ Seperate jobs - no files, no VMs, only sadness ▪ If same job (e.g. notebook failure and retry) cache & files recovery Jennifer C.
  • 17. “Recoverable” Wordcount: Take 1 lines = sc.textFile(src) words_raw = lines.flatMap(lambda x: x.split(" ")) words_path = "words" if fs.exists(sc._jvm.org.apache.hadoop.fs.Path(words)): words = sc.textFile(words_path) else: word.saveAsTextFile(words_path) words = words_raw # Continue with previous code KLMircea
  • 18. So what can we do better? ▪ Well if the pipeline fails in certain ways this will fail ▪ We don’t have any clean up on success ▪ sc._jvm is weird ▪ Functions -- the future! ▪ Not async Jennifer C.
  • 19. “Recoverable” Wordcount: Take 2 lines = sc.textFile(src) words_raw = lines.flatMap(lambda x: x.split(" ")) words_path = "words/SUCCESS.txt" if fs.exists(sc._jvm.org.apache.hadoop.fs.Path(words)): words = sc.textFile(words_path) else: word.saveAsTextFile(words_path) words = words_raw # Continue with previous code Susanne Nilsson
  • 20. So what can we do better? ▪ Well if the pipeline fails in certain ways this will fail • Fixed ▪ We don’t have any clean up on success • …. ▪ sc._jvm is weird • Yeah we’re not fixing this one unless we use scala ▪ Functions -- the future! • sure! ▪ Have to wait to finish writing file • Hold your horses ivva
  • 21. “Recoverable” [X]: Take 3 def non_blocking_df_save_or_load(df, target): success_files = ["{0}/SUCCESS.txt", "{0}/_SUCCESS"] if any(fs.exists(hadoop_fs_path(t.format(target))) for t in success_files): print("Reusing") return session.read.load(target).persist() else: print("Saving") df.save(target) return df Jennifer C.
  • 22. So what can we do better? ▪ Try and not slow down our code on the happy path • async? ▪ Cleanup on success (damn meant to do that earlier) hkase
  • 23. Adding async? def non_blocking_df_save(df, target): import threading def save_panda(): df.write.mode("overwrite").save(target) thread = threading.Thread(target=save_panda) thread.start()
  • 24. What could go wrong? ▪ Turns out… a lot ▪ Multiple executions on the DAG are not super great (getting better but) ▪ How do we work around this?
  • 25. Spark’s (core) magic: the DAG ▪ In Spark most of our work is done by transformations • Things like map ▪ Transformations return new RDDs or DataFrames representing this data ▪ The RDD or DataFrame however doesn’t really “exist” ▪ RDD & DataFrames are really just “plans” of how to make the data show up if we force Spark’s hand ▪ tl;dr - the data doesn’t exist until it “has” to Photo by Dan G
  • 26. The DAG The query plan Susanne Nilsson
  • 27. cache + sync count + async save def non_blocking_df_save_or_load(df, target): s = "{0}/SUCCESS.txt" if fs.exists(hadoop_fs_path(s.format(target))): return session.read.load(target).persist() else: print("Saving") df.cache() df.count() non_blocking_df_save(df, target) return df
  • 28. Well that was “fun”? ▪ Replace wordcount with your back-fill operation and it becomes less fun ▪ You also need to clean up the files ▪ Use job IDS to avoid stomping on other jobs
  • 29. Spark Videos ▪ Apache Spark Youtube Channel ▪ My Spark videos on YouTube - • http://bit.ly/holdenSparkVideos ▪ Spark Summit 2014 training ▪ Paco’s Introduction to Apache Spark Paul Anderson
  • 30. Learning Spark Fast Data Processing with Spark (Out of Date) Fast Data Processing with Spark (2nd edition) Advanced Analytics with Spark Spark in Action High Performance Spark Learning PySpark
  • 31. I also have a book... High Performance Spark, it’s available today & the gift of the season. Unrelated to this talk, but if you have a corporate credit card (and or care about distributed systems)…. http://bit.ly/hkHighPerfSpark
  • 32. Cat wave photo by Quinn Dombrowski k thnx bye! If you <3 Spark testing & want to fill out survey: http://bit.ly/holdenTestingSpark Want to tell me (and or my boss) how I’m doing? http://bit.ly/holdenTalkFeedback Want to e-mail me? Promise not to be creepy? Ok: holden@pigscanfly.ca