Apache Spark as a gateway drug to functional programming
Concepts taught & broken
@holdenkarau
Holden:
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google
● Apache Spark PMC, Beam contributor
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & High Performance Spark
● Twitter: @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Code review livestreams: https://www.twitch.tv/holdenkarau /
https://www.youtube.com/user/holdenkarau
● Spark Talk Videos http://bit.ly/holdenSparkVideos
Who is Boo?
● Boo uses she/her pronouns (as I told the Texas house committee)
● Best doge
● Lots of experience barking at computers to make them go faster
● Author of “Learning to Bark” & “High Performance Barking”
○ Currently out of print, discussing a reprint re-run with my wife
● On twitter @BooProgrammer
Why Google Cloud cares about Spark?
● Lots of data!
○ We mostly use different, although similar FP inspired, tools internally
● We have two hosted solutions for using Spark (dataproc & GKE)
○ I have a blog post on how to try out custom/new versions of Spark if you want to help us test the current RC - https://cloud.google.com/blog/big-data/2018/03/testing-future-apache-spark-releases-and-changes-on-google-kubernetes-engine-and-cloud-dataproc
Who do I think y’all are?
● Friendly[ish] people
● Don’t mind pictures of cats or stuffed animals
● Like functional programming
● Want to keep growing the functional programming community
Lori Erickson
What will be covered?
● What is Spark (super brief) & how it’s helped drive FP to enterprise
● Wordcount example (as required by license)
● What concepts Spark does a good job of teaching folks
● What concepts Spark does a so-so job of teaching folks
● Some examples
● Format is going to be happy-sad (repeating) so we can end on a downer
● But with lots of cat pictures I promise
What is Spark?
● General purpose distributed system
○ Built in Scala with an FP inspired API
● Apache project (one of the most active)
● Much faster than Hadoop Map/Reduce
● Good when too big for a single machine
● Built on top of two abstractions for distributed data: RDDs & Datasets
When we say distributed we mean...
Why people come to Spark:
Well this MapReduce job is going to take 16 hours - how long could it take to learn Spark?
dougwoods
Why people come to Spark:
My DataFrame won’t fit in memory on my cluster anymore, let alone my MacBook Pro :( Maybe this Spark business will solve that...
brownpau
Plus a little magic :)
Steven Saus
What is the “magic” of Spark?
● Automatically distributed functional programming :)
● DAG / “query plan” is the root of much of it
● Optimizer to combine steps
● Resiliency: recover from failures rather than protecting
from failures.
● “In-memory” + “spill-to-disk”
● Functional programming to build the DAG for “free”
● Select operations without deserialization
● The best way to trick people into learning functional
programming
Richard Gillin
The different pieces of Spark
Apache Spark
SQL, DataFrames & Datasets
Structured Streaming
Scala, Java, Python, & R
Spark ML
Bagel & GraphX
MLlib
Scala, Java, Python
Streaming
GraphFrames
Paul Hudson
What Spark got right (for Scala/FP):
● Strong enforced[ish] requirement for immutable data
○ Spark recomputes on failure, so immutability is a core part of the logic
● Functional operators (map, filter, flatMap, etc.)
● Lambdas for everyone!
○ Sometimes too many….
● Solved a “business need”
○ Even if that need was imaginary
● Made it hard to have side effects against external variables without being very
explicit & verbose
○ Even then discouraged strongly through lack of documentation :p
Stuart
What Spark got … less right (for Scala/FP):
● Serialization… complications
○ Makes people think closures are more limited than they can be
● Lots of Map[String, String] (equivalent) settings
○ Hey buddy can you spare a type checker?
● Hard to debug, which can get confused with Scala itself being hard to debug
○ Not completely unjustified sometimes
● New ML & SQL APIs without “any” types (initially)
● Heavy mutation focused API for Machine Learning
● Internals are…. very not functional programming best practices
indamage
Before the technical details: positioning
● “100x faster than hadoop” sounds nice
● wc(wc) < 17 also sounds nice if your current wordcount doesn’t fit on a page
● “Integrated solution” - sounds nice if you have to write your data to disk
between querying and training
● “Works with your existing JVM or Python code base.”
● Part of the Apache Software Foundation
○ e.g. you already run software from these folks, don’t worry about lock-in they’ve got it handled
● Like all positioning, each of these has some implied “*”s associated with
it
● Focused on the benefits made possible with FP rather than the fact it was FP
● Led to a large commercial install base of folks being exposed to functional
programming
Hello World (Word count) - 1 of ?
val lines = sc.textFile("boop")
val words = lines.flatMap(line => line.split(" "))
val wordPairs = words.map(word => (word, 1))
val wordCounts = wordPairs.reduceByKey((c1, c2) => c1 + c2)
wordCounts.saveAsTextFile("snoop")
Photo By: Will Keightley
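What the three operators in the word count do can be sketched without a cluster. This is a plain-Python stand-in, not Spark code: `flat_map` and `reduce_by_key` are hypothetical helpers written for this illustration, and the two input lines are made up in place of the "boop" file.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain


def flat_map(f, items):
    """Apply f to each item and flatten the resulting iterables."""
    return chain.from_iterable(map(f, items))


def reduce_by_key(f, pairs):
    """Group (key, value) pairs by key, then fold each group's values with f."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(f, values) for key, values in grouped.items()}


# Made-up input standing in for sc.textFile("boop")
lines = ["boop boop snoop", "snoop boop"]
words = flat_map(lambda line: line.split(" "), lines)
word_pairs = map(lambda word: (word, 1), words)
word_counts = reduce_by_key(lambda c1, c2: c1 + c2, word_pairs)
print(word_counts)
```

The shape is the same as the Scala version: split, pair, then combine per key. What Spark adds on top is running each step across machines.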
Spark & Lazy Evaluation: True BFFs
● Allows Spark to build a graph of operations and use it for recomputing on
failure
● Fewer passes over the data without thinking*
● Except…. Debates around limiting it due to developer confusion with
debugging
● And library level rather than macro or compiler level means needing to
kcxd
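A toy model of the idea, assuming nothing about Spark's internals: `LazyPipeline` is a hypothetical class that only records operations and runs them in one pass when `collect()` is called. It is a sketch of lazy evaluation in general, not of how Spark builds or optimizes its DAG.

```python
class LazyPipeline:
    """Records transformations; nothing runs until collect() is called."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, f):
        # Return a new pipeline with one more recorded step (no work done yet)
        return LazyPipeline(self._source, self._steps + (("map", f),))

    def filter(self, pred):
        return LazyPipeline(self._source, self._steps + (("filter", pred),))

    def plan(self):
        """The recorded graph of operations, in order."""
        return [kind for kind, _ in self._steps]

    def collect(self):
        """Run every recorded step in a single pass over the source."""
        out = []
        for item in self._source:
            keep = True
            for kind, f in self._steps:
                if kind == "map":
                    item = f(item)
                elif not f(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []


def noisy_double(x):
    calls.append(x)  # lets us observe when the function actually runs
    return x * 2


pipeline = LazyPipeline(range(5)).map(noisy_double).filter(lambda x: x > 2)
calls_before_collect = list(calls)  # still empty: only the plan exists
result = pipeline.collect()
print(pipeline.plan(), calls_before_collect, result)
```

Because the pipeline keeps the plan rather than the data, calling `collect()` again simply re-runs the recorded steps, which is the same property that makes recompute-on-failure possible.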
Spark & Immutability: Really good friends
● Many systems enforce some level of immutability but...
● Scala allows mutability (vars & friends), but Spark gets not so happy
● Execution model is recompute on failure, and mutation would break this
● Provides a clear benefit (e.g. “Instead of writing data out to 3 network disks
we just recompute on failure leading to 100x improvement”*)
Bartlomiej Mostek
Even if you try to use it, no mutation for you*
scala> var x = 0
x: Int = 0
scala> val rdd = sc.parallelize(1.to(100))
scala> val added = rdd.map{e => x += e}
scala> added.count()
res0: Long = 100
scala> x
res1: Int = 0
Jennifer C.
Well…. only local mutation
scala> val result = rdd.map{e => x += e; x}
scala> result.collect()
res1: Array[Int] = Array(1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, 78, 91, 105, 120,
136, 153, 171, 190, 210, 231, 253, 276, 300, 325, 26, 53, 81, 110, 140, 171, 203,
236, 270, 305, 341, 378, 416, 455, 495, 536, 578, 621, 665, 710, 756, 803, 851,
900, 950, 51, 103, 156, 210, 265, 321, 378, 436, 495, 555, 616, 678, 741, 805,
870, 936, 1003, 1071, 1140, 1210, 1281, 1353, 1426, 1500, 1575, 76, 153, 231,
310, 390, 471, 553, 636, 720, 805, 891, 978, 1066, 1155, 1245, 1336, 1428, 1521,
1615, 1710, 1806, 1903, 2001, 2100, 2200)
Raita Futo
Well…. only very local mutation
scala> val rdd = sc.parallelize(1.to(100), 10)
scala> val result = rdd.map{e => x += e; x}
scala> result.collect()
res2: Array[Int] = Array(1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 11, 23, 36, 50, 65, 81,
98, 116, 135, 155, 21, 43, 66, 90, 115, 141, 168, 196, 225, 255, 31, 63, 96, 130,
165, 201, 238, 276, 315, 355, 41, 83, 126, 170, 215, 261, 308, 356, 405, 455, 51,
103, 156, 210, 265, 321, 378, 436, 495, 555, 61, 123, 186, 250, 315, 381, 448,
516, 585, 655, 71, 143, 216, 290, 365, 441, 518, 596, 675, 755, 81, 163, 246, 330,
415, 501, 588, 676, 765, 855, 91, 183, 276, 370, 465, 561, 658, 756, 855, 955)
Susanne Nilsson
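The REPL outputs above can be reproduced with a small simulation in which each task mutates its own copy of the captured variable. This is a plain-Python sketch, not Spark: `run_partition` and `split` are hypothetical helpers, `copy.deepcopy` stands in for shipping a serialized closure, and the 4-partition case is an inference from the first output (the running sum resets every 25 elements), not something the slides state.

```python
import copy


def run_partition(closure_vars, partition):
    """Simulate one task: it works on its own copy of the captured variables."""
    local = copy.deepcopy(closure_vars)  # stand-in for serializing the closure
    out = []
    for e in partition:
        local["x"] += e  # mutates the task's copy only
        out.append(local["x"])
    return out


def split(data, num_partitions):
    """Cut data into equal contiguous chunks (assumes it divides evenly)."""
    size = len(data) // num_partitions
    return [data[i * size:(i + 1) * size] for i in range(num_partitions)]


driver_vars = {"x": 0}
data = list(range(1, 101))

for num_partitions in (4, 10):
    result = [value
              for part in split(data, num_partitions)
              for value in run_partition(driver_vars, part)]
    print(num_partitions, "partitions:", result[:3], "...", result[-1])

print("x on the driver:", driver_vars["x"])
```

The running sum restarts at every partition boundary and the driver's `x` never changes, which matches the "only very local mutation" behaviour shown in the sessions above.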
Spark & Immutability: A few small challenges
● Accumulators:
○ Accumulators “leak” on recompute and partial compute (another Spark trick)
● You _can_ mutate state inside of the worker or master, it just won’t propagate
○ No error messages here, just unexpected future “happiness”
Caroline
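How a counter can "leak" under recompute and partial compute, as a toy model: `Counter` and `lazy_map` are hypothetical stand-ins written for this sketch, not Spark's accumulator API. The only assumption carried over from the slides is that a lazy step is re-run whenever its result is needed again.

```python
from itertools import islice


class Counter:
    """Stand-in for an accumulator: a side-effecting running total."""

    def __init__(self):
        self.value = 0

    def add(self, n):
        self.value += n


def lazy_map(f, data):
    """Return a thunk; each call re-runs f lazily over data."""
    return lambda: (f(x) for x in data)


records = Counter()


def parse(x):
    records.add(1)  # side effect inside a transformation
    return x * 2


parsed = lazy_map(parse, range(10))

first_two = list(islice(parsed(), 2))  # partial compute: only 2 elements touched
after_partial = records.value

full = list(parsed())  # a full pass adds 10 more
again = list(parsed())  # a second action (or a retry) recomputes and adds 10 again
print(after_partial, records.value)
```

The data has 10 elements, but the counter reads 2 after the partial pass and 22 at the end, with no error raised anywhere along the way.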
Spark & Lambdas: Everyone gets a lambda!
● Maps, flatMaps, filters, reducers oh my (and that’s just wordcount :p )
● Writing in Java 7?
○ You get a weird looking lambda!
Mark Jensen
Spark & Lambdas - closures & the dark side
● Python & Scala lambda serialization is (understandably) cautious
● Referencing a variable in the class brings the whole class with you
● Can create the impression closures are limited to serializable data
● ClosureCleaner.scala - “You weren’t using that class right?” & CloudPickle -
“Welllll…. What about if we stored this differently?”
● pre-Java 8 (new FunctionX<A, B, C>... ugh)
● SQL API custom aggregates don’t currently support lambdas :(
David Goehring
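The "referencing a variable in the class brings the whole class with you" problem can be sketched with the standard library's `pickle`. This is not CloudPickle or Spark's ClosureCleaner, and `job` and `try_ship` are hypothetical names; the dictionaries model the variables a closure would capture.

```python
import pickle
import threading
from types import SimpleNamespace

# Hypothetical object: one small field we need, one field that cannot be shipped
job = SimpleNamespace(factor=3, lock=threading.Lock())


def try_ship(captured):
    """Return True if the captured variables survive a pickle round trip."""
    try:
        pickle.loads(pickle.dumps(captured))
        return True
    except (TypeError, pickle.PicklingError):
        return False


# A closure that reads job.factor captures the whole job, lock included
whole_object = try_ship({"job": job})

# Copying the field to a local first means only the int is captured
factor = job.factor
just_the_field = try_ship({"factor": factor})

print(whole_object, just_the_field)
```

Copying the needed field into a local variable before building the lambda is the usual workaround. It also shows why closures can look more limited than they are: the data that was needed was serializable all along.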
Oh right: serialization :(
● Many things in JVM & Python are not well designed to be rehydrated on
another VM let alone another machine
● Space efficiency: well…. At least it’s not XML?
● Oh wait, we have to parse Python serialized data in the JVM? Ruh roh!
● Oh hey let’s make it configurable… oh wait...
Reaching non-“traditional” FP languages
● Spark works in many languages - Python, R, Java, etc.
○ Allows Spark to have a wider audience that doesn’t have the time (or tooling) to learn another
language
○ e.g. Numerical Scala libraries are… rough
● Meeting developers where they are
● Some overhead (often) pushes systemy folks towards learning Scala for
performance
● Sometimes we could do a better job of working with the existing FP tools
○ e.g. we could do more to make an RDD look like “normal” Python and maybe I’d stop typing
.flatMap on Python itrs
Reaching non-“traditional” FP languages
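import sys
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PythonWordCount").getOrCreate()
# Read the input file and pull the single string column out of each Row
lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
counts = (lines.flatMap(lambda x: x.split(' '))
               .map(lambda x: (x, 1))
               .reduceByKey(add))
output = counts.collect()
for (word, count) in output:
    print("%s: %i" % (word, count))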
photobom
Reaching non-“traditional” FP languages
Petful
ML: Enough getters and setters for the 90s!
● Took inspiration from scikit-learn
○ Which is a cool system, just not super functionally oriented
● Added hidden metadata, which got dropped in lots of places (not quite as bad
as global state buuuut….)
● Threw away compile time type information
● I really don’t have a + for this one in the FP teaching column
Basic Dataprep pipeline for “ML”
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
// Combines a list of double input features into a vector
val assembler = new VectorAssembler().setInputCols(Array("age",
  "education-num")).setOutputCol("features")
// String indexer converts a set of strings into doubles
val indexer = new StringIndexer().setInputCol("category")
  .setOutputCol("category-index")
// Can be used to combine pipeline components together
val pipeline = new Pipeline().setStages(Array(assembler, indexer))
Huang Yun Chung
Adding some ML (no longer cool -- DL)
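import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
// Specify model
val dt = new DecisionTreeClassifier()
  .setLabelCol("category-index")
  .setFeaturesCol("features")
// Add it to the pipeline (setStages takes an Array of stages)
val pipeline_and_model = new Pipeline().setStages(
  Array(assembler, indexer, dt))
// Fitting runs each stage in order and returns a PipelineModel
val pipeline_model = pipeline_and_model.fit(df)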
The internals… are fairly imperative
● Not everywhere and not without reason (at the time they were created)
● No macros for historic performance reasons
○ Code generation is done by concatenating strings of Java code
● This would matter less, but Spark is pretty aggressive at trying to hide things,
enough that some major projects end up having to peek a little bit at the code
● Can feel like “do what I say, not what I do” sometimes
○ Especially in ML, lots of hacks to make things go a little bit faster using internal APIs
ivva
And folks need to access them :(
ivva
Let’s end on a happy-ish note
● Yes Spark isn’t perfect, but its APIs are teaching a generation of “big data”
developers & data scientists functional programming without calling it that (at
the start)
● Once we get them hooked we can show them cool things!
● If we can find other areas where we make tools that expose FP apis to more
non-FP folks (and not just for the sake of FP) we can make better software
● Or we can all go back to writing 90s style enterprise Java
Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Spark in Action
High Performance Spark
Learning PySpark
High Performance Spark!
Available today! A great second Spark book to read (but
please buy it first)
You can buy it from that scrappy Seattle bookstore, Jeff
Bezos needs another newspaper and I want a cup of
coffee.
http://bit.ly/hkHighPerfSpark
And some upcoming talks:
● July
○ OSCON Portland & meetup
● August
○ JupyterCon NYC
● September
○ Strata NYC
○ Strangeloop STL
● October
○ Spark Summit London
○ Reversim Tel Aviv
k thnx bye :)
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
Will tweet results
“eventually” @holdenkarau
Pssst: Have feedback on the presentation? Give me a shout
(holden@pigscanfly.ca or http://bit.ly/holdenTalkFeedback ) if
you feel comfortable doing so :)
Feedback (if you are so inclined):
http://bit.ly/holdenTalkFeedback

 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Paul Calvano
 
Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITMgdsc13
 
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一Fs
 
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Dana Luther
 
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一Fs
 
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一Fs
 
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一Fs
 
PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationLinaWolf1
 

Recently uploaded (20)

Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja VipCall Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
 
Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170
 
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
 
Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...
 
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
 
Magic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMagic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptx
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
 
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
 
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24
 
Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITM
 
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
 
Model Call Girl in Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in  Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in  Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝
 
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
 
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
 
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
 
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
 
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
 
PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 Documentation
 
Hot Sexy call girls in Rk Puram 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in  Rk Puram 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in  Rk Puram 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Rk Puram 🔝 9953056974 🔝 Delhi escort Service
 

Apache Spark as a gateway to functional concepts

  • 1. Apache Spark as a gateway drug to functional programming Concepts taught & broken @holdenkarau
  • 2. Holden: ● My name is Holden Karau ● Preferred pronouns are she/her ● Developer Advocate at Google ● Apache Spark PMC, Beam contributor ● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon ● co-author of Learning Spark & High Performance Spark ● Twitter: @holdenkarau ● Slide share http://www.slideshare.net/hkarau ● Code review livestreams: https://www.twitch.tv/holdenkarau / https://www.youtube.com/user/holdenkarau ● Spark Talk Videos http://bit.ly/holdenSparkVideos
  • 3.
  • 4. Who is Boo? ● Boo uses she/her pronouns (as I told the Texas house committee) ● Best doge ● Lots of experience barking at computers to make them go faster ● Author of “Learning to Bark” & “High Performance Barking” ○ Currently out of print, discussing a reprint re-run with my wife ● On twitter @BooProgrammer
  • 5. Why Google Cloud cares about Spark? ● Lots of data! ○ We mostly use different, although similar FP inspired, tools internally ● We have two hosted solutions for using Spark (dataproc & GKE) ○ I have a blog post on how to try out custom/new versions of Spark if you want to help us test the current RC - https://cloud.google.com/blog/big-data/2018/03/testing-future-apache-spark- releases-and-changes-on-google-kubernetes-engine-and-cloud-dataproc
  • 6. Who do I think y’all are? ● Friendly[ish] people ● Don’t mind pictures of cats or stuffed animals ● Like functional programming ● Want to keep growing the functional programming community Lori Erickson
  • 7. What will be covered? ● What is Spark (super brief) & how it’s helped drive FP to enterprise ● Wordcount example (as required by license) ● What concepts Spark does a good job of teaching folks ● What concepts Spark does a so-so job of teaching folks ● Some examples ● Format is going to be happy-sad (repeating) so we can end on a downer ● But with lots of cat pictures I promise
  • 8. What is Spark? ● General purpose distributed system ○ Built in Scala with an FP inspired API ● Apache project (one of the most active) ● Much faster than Hadoop Map/Reduce ● Good when too big for a single machine ● Built on top of two abstractions for distributed data: RDDs & Datasets
  • 9. When we say distributed we mean...
  • 10. Why people come to Spark: Well this MapReduce job is going to take 16 hours - how long could it take to learn Spark? dougwoods
  • 11. Why people come to Spark: My DataFrame won’t fit in memory on my cluster anymore, let alone my MacBook Pro :( Maybe this Spark business will solve that... brownpau
  • 12. Plus a little magic :) Steven Saus
  • 13. What is the “magic” of Spark? ● Automatically distributed functional programming :) ● DAG / “query plan” is the root of much of it ● Optimizer to combine steps ● Resiliency: recover from failures rather than protecting from failures. ● “In-memory” + “spill-to-disk” ● Functional programming to build the DAG for “free” ● Select operations without deserialization ● The best way to trick people into learning functional programming Richard Gillin
  • 14. The different pieces of Spark Apache Spark SQL, DataFrames & Datasets Structured Streaming Scala, Java, Python, & R Spark ML bagel & Graph X MLLib Scala, Java, PythonStreaming Graph Frames Paul Hudson
  • 15. What Spark got right (for Scala/FP): ● Strong enforced[ish] requirement for immutable data ○ Recompute is used for failure recovery, so it is a core part of the logic ● Functional operators (map, filter, flatMap, etc.) ● Lambdas for everyone! ○ Sometimes too many…. ● Solved a “business need” ○ Even if that need was imaginary ● Made it hard to have side effects against external variables without being very explicit & verbose ○ Even then discouraged strongly through lack of documentation :p Stuart
  • 16. What Spark got … less right (for Scala/FP): ● Serialization… complications ○ Makes people think closures are more limited than they can be ● Lots of Map[String, String] (equivalent) settings ○ Hey buddy can you spare a type checker? ● Hard to debug, which can be confused with Scala being hard to debug ○ Not completely unjustified sometimes ● New ML & SQL APIs without “any” types (initially) ● Heavy mutation focused API for Machine Learning ● Internals are…. very not functional programming best practices indamage
  • 17. Before the technical details: positioning ● “100x faster than hadoop” sounds nice ● wc(wc) < 17 also sounds nice if your current wordcount doesn’t fit on a page ● “Integrated solution” - sounds nice if you have to write your data to disk between querying and training ● “Works with your existing JVM or Python code base.” ● Part of the Apache Software Foundation ○ e.g. you already run software from these folks, don’t worry about lock-in they’ve got it handled ● Like all positioning each of these has some implied “*”s associated with it ● Focused on the benefits made possible with FP rather than the fact it was FP ● Led to a large commercial install base of folks being exposed to functional programming
  • 18. Hello World (Word count) - 1 of ? val lines = sc.textFile("boop") val words = lines.flatMap(line => line.split(" ")) val wordPairs = words.map(word => (word, 1)) val wordCounts = wordPairs.reduceByKey((c1, c2) => c1 + c2) wordCounts.saveAsTextFile("snoop") Photo By: Will Keightley
  • 19. Spark & Lazy Evaluation: True BFFs ● Allows Spark to build a graph of operations and use it for recomputing on failure ● Fewer passes over the data without thinking* ● Except…. Debates around limiting it due to developer confusion with debugging ● And library level rather than macro or compiler level means needing to kcxd
  • 20. Spark & Immutability: Really good friends ● Many systems enforce some level of immutability but... ● Scala allows mutability (vars & friends), but Spark gets not so happy ● Execution model is recompute on failure, and mutation would break this ● Provides a clear benefit (e.g. “Instead of writing data out to 3 network disks we just recompute on failure leading to 100x improvement”*) Bartlomiej Mostek
  • 21. Even if you try to use it, no mutation for you* scala> var x = 0 x: Int = 0 scala> val rdd = sc.parallelize(1.to(100)) scala> val added = rdd.map{e => x += e} scala> added.count() res0: Long = 100 scala> x res1: Int = 0 PROJennifer C.
  • 22. Well…. only local mutation scala> val result = rdd.map{e => x += e; x} scala> result.collect() res1: Array[Int] = Array(1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, 78, 91, 105, 120, 136, 153, 171, 190, 210, 231, 253, 276, 300, 325, 26, 53, 81, 110, 140, 171, 203, 236, 270, 305, 341, 378, 416, 455, 495, 536, 578, 621, 665, 710, 756, 803, 851, 900, 950, 51, 103, 156, 210, 265, 321, 378, 436, 495, 555, 616, 678, 741, 805, 870, 936, 1003, 1071, 1140, 1210, 1281, 1353, 1426, 1500, 1575, 76, 153, 231, 310, 390, 471, 553, 636, 720, 805, 891, 978, 1066, 1155, 1245, 1336, 1428, 1521, 1615, 1710, 1806, 1903, 2001, 2100, 2200) Raita Futo
  • 23. Well…. only very local mutation scala> val rdd = sc.parallelize(1.to(100), 10) scala> val result = rdd.map{e => x += e; x} scala> result.collect() res2: Array[Int] = Array(1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 11, 23, 36, 50, 65, 81, 98, 116, 135, 155, 21, 43, 66, 90, 115, 141, 168, 196, 225, 255, 31, 63, 96, 130, 165, 201, 238, 276, 315, 355, 41, 83, 126, 170, 215, 261, 308, 356, 405, 455, 51, 103, 156, 210, 265, 321, 378, 436, 495, 555, 61, 123, 186, 250, 315, 381, 448, 516, 585, 655, 71, 143, 216, 290, 365, 441, 518, 596, 675, 755, 81, 163, 246, 330, 415, 501, 588, 676, 765, 855, 91, 183, 276, 370, 465, 561, 658, 756, 855, 955) PROSusanne Nilsson
  • 24. Spark & Immutability: A few small challenges ● Accumulators: ○ Accumulators “leak” on recompute and partial compute (another Spark trick) ● You _can_ mutate state inside of the worker or master it just won’t propagate ○ No error messages here, just unexpected future “happiness” Caroline
  • 25. Spark & Lambdas: Everyone gets a lambda! ● Maps, flatMaps, filters, reducers oh my (and that’s just wordcount :p ) ● Writing in Java 7? ○ You get a weird looking lambda! Mark Jensen
  • 26. Spark & Lambdas - closures & the dark side ● Python & Scala lambda serialization is (understandably) cautious ● Referencing a variable in the class brings the whole class with you ● Can create the impression closures are limited to serializable data ● ClosureCleaner.scala - “You weren’t using that class right?” & CloudPickle - “Welllll…. What about if we stored this differently?” ● pre-Java 8 (new FunctionX<A, B, C>... ugh) ● SQL API custom aggregates don’t currently support lambdas :( David Goehring
  • 27. Oh right: serialization :( ● Many things in JVM & Python are not well designed to be rehydrated on another VM let alone another machine ● Space efficiency: well…. At least it’s not XML? ● Oh wait, we have to parse Python serialized data in the JVM? Ruh roh! ● Oh hey let’s make it configurable… oh wait...
  • 28. Reaching non-“traditional” FP languages ● Spark works in many languages - Python, R, Java, etc. ○ Allows Spark to have a wider audience that doesn’t have the time (or tooling) to learn another language ○ e.g. Numerical Scala libraries are… rough ● Meeting developers where they are ● Some overhead (often) pushes systemy folks towards learning Scala for performance ● Sometimes we could do a better job of working with the existing FP tools ○ e.g. we could do more to make an RDD look like “normal” Python and maybe I’d stop typing .flatMap on Python itrs
  • 29. Reaching non-“traditional” FP languages lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0]) counts = lines.flatMap(lambda x: x.split(' ')) .map(lambda x: (x, 1)) .reduceByKey(add) output = counts.collect() for (word, count) in output: print("%s: %i" % (word, count)) photobom
  • 31. ML: Enough getters and setters for the 90s! ● Took inspiration from scikit-learn ○ Which is a cool system, just not super functionally oriented ● Added hidden metadata, which got dropped in lots of places (not quite as bad as global state buuuut….) ● Threw away compile time type information ● I really don’t have a + for this one in the FP column teaching
  • 32. Basic Dataprep pipeline for “ML” // Combines a list of double input features into a vector val assembler = new VectorAssembler().setInputCols(Array("age", "education-num")).setOutputCol("features") // String indexer converts a set of strings into doubles val indexer = new StringIndexer().setInputCol("category") .setOutputCol("category-index") // Can be used to combine pipeline components together val pipeline = new Pipeline().setStages(Array(assembler, indexer)) Huang Yun Chung
  • 33. Adding some ML (no longer cool -- DL) // Specify model val dt = new DecisionTreeClassifier() .setLabelCol("category-index") .setFeaturesCol("features") // Add it to the pipeline val pipeline_and_model = new Pipeline().setStages( Array(assembler, indexer, dt)) val pipeline_model = pipeline_and_model.fit(df)
  • 34. The internals… are fairly imperative ● Not everywhere and not without reason (at the time they were created) ● No macros for historic performance reasons ○ Code generation is done by concatenating strings of Java code ● This would matter less, but Spark is pretty aggressive at trying to hide things, enough that some major projects end up having to peek a little bit at the code ● Can feel like “do what I say not what I do” sometimes ○ Especially in ML, lots of hacks to make things go a little bit faster using internal APIs ivva
  • 35. And folks need to access them :( ivva
  • 36. Let’s end on a happy-ish note ● Yes Spark isn’t perfect, but its APIs are teaching a generation of “big data” developers & data scientists functional programming without calling it that (at the start) ● Once we get them hooked we can show them cool things! ● If we can find other areas where we make tools that expose FP APIs to more non-FP folks (and not just for the sake of FP) we can make better software ● Or we can all go back to writing 90s style enterprise Java
  • 37. Learning Spark Fast Data Processing with Spark (Out of Date) Fast Data Processing with Spark (2nd edition) Advanced Analytics with Spark Spark in Action High Performance Spark Learning PySpark
  • 38. High Performance Spark! Available today! A great second Spark book to read (but please buy it first) You can buy it from that scrappy Seattle bookstore, Jeff Bezos needs another newspaper and I want a cup of coffee. http://bit.ly/hkHighPerfSpark
  • 39. And some upcoming talks: ● July ○ OSCON Portland & meetup ● August ○ JupyterCon NYC ● September ○ Strata NYC ○ Strangeloop STL ● October ○ Spark Summit London ○ Reversim Tel Aviv
  • 40. k thnx bye :) If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark Will tweet results “eventually” @holdenkarau Pssst: Have feedback on the presentation? Give me a shout (holden@pigscanfly.ca or http://bit.ly/holdenTalkFeedback ) if you feel comfortable doing so :) Feedback (if you are so inclined): http://bit.ly/holdenTalkFeedback
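
The "only local mutation" behaviour in slides 21 to 23 can be imitated without a cluster. The following is a plain-Python analogy, not Spark code: the helper names (`split_into_partitions`, `run_task`, `simulate_map`) are hypothetical, and it assumes each task receives its own copy of the captured variable and that the data divides evenly into contiguous partitions. The four-partition case is an assumption read off the output in slide 22.

```python
# Plain-Python analogy (no Spark) for the partition-local mutation in slides 21-23.
# All function names here are illustrative, not part of any Spark API.


def split_into_partitions(data, num_partitions):
    """Split data into contiguous chunks (assumes len(data) divides evenly)."""
    size = len(data) // num_partitions
    return [data[i * size:(i + 1) * size] for i in range(num_partitions)]


def run_task(partition, captured_x):
    """Simulate one task: it mutates only its own copy of the captured variable."""
    x = captured_x  # local copy; updates never travel back to the driver
    out = []
    for e in partition:
        x += e
        out.append(x)
    return out


def simulate_map(data, num_partitions, driver_x):
    """Run every partition as a separate task and concatenate the results."""
    results = []
    for partition in split_into_partitions(data, num_partitions):
        results.extend(run_task(partition, driver_x))
    return results


driver_x = 0
data = list(range(1, 101))

four = simulate_map(data, 4, driver_x)   # assumed default of 4 partitions
ten = simulate_map(data, 10, driver_x)   # explicit 10 partitions, as in slide 23

print(four[24:27])  # running sum restarts at the partition boundary
print(ten[9:12])    # restarts every 10 elements
print(driver_x)     # the driver's variable is untouched
```

The running sums restart at each partition boundary, and the driver-side variable never changes, which is the same shape as the REPL output on the slides.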

Editor's Notes

  1. introduce spark rdds, purple blog diagrams
  2. https://www.flickr.com/photos/jon_a_ross/2679856182/in/photolist-55NXSW-4UZZHe-e1Ubar-8oA19X-4V2hrU-4UX6dT-4HpqVm-58CV9k-ardHmQ-72uLB3-6p6gqL-58gez2-hjhDoA-4MqZrU-8ZMidf-4NFd8N-4NFcMQ-9R6Dr6-55JQDr-rxeWPU-oDVKTS-arbcbX-arbbTp-aVNBqi-47TCvC-4NFctq-b4BE3p-7WcAGh-9w8FFR-6HYNpP-662zun-5LX51n-5BWeR2-oZc3Xk-ewax6c-7Z3vKE-e5W5AJ-bi3HtM-bEBTUZ-s1c3gw-qMbK5K-6heJzF-g6YbwT-aoRa8z-kNDkqL-YRwm-4BESNo-iRhKvk-ib7bUU-nmuxdF
  3. We can examine how RDDs work in practice with the traditional word count example. If you’ve taken another intro to big data class, or just worked with MapReduce, you’ll notice that this is a lot less code than we normally have to write. https://www.flickr.com/photos/feverblue/1166368091/in/photolist-2M4WAF-HKi9Wb-bPUuHF-bAZRBu-bAZRJs-cPgNi9-cPgM5y-ecdXE4-qiy3fi-ece5nR-8DSTJT-ekXn2J-ekXkQA-ekRMVx-p5EdCT-424qe7-41ZhFv-cHJ84A-ekRLzc-cS67Hu-cS6sKq-cEcmt5-9ae42r-eoqvBm-HH9A1M-846gJQ-dk6J-nqxTYN-J1PZUH-5q2iLC-HVDkA7-g6xvD3-96MiqV-hbZene-46uZse-6SUYhp-dGaXzD-oUC1ic-DayAdv-aoPJSZ-73L8Lm-5qVCiG-5qVC5q-3Kyyzj-Bf5UzP-4CbKo9-9ae8KM-5CXLKZ-pjFpPT-eccLSR
  4. The 3rd cat hiding in the background is serialization
  5. The 3rd cat hiding in the background is serialization
  6. The 3rd cat hiding in the background is serialization
  7. The 3rd cat hiding in the background is serialization
  8. Red panda image is public domain https://www.flickr.com/photos/mathiasappel/25901445745/in/photolist-FsPFqM-8V1c5-EfF7ct-P6AMfB-oUV39B-EenZCo-oVbE3K-DDMMtE-295N7-dtVySZ-dtVzLP-7yXK47-6DEn2P-oY28m2-DnHaur-qtt3m9-DzTW6Q-E6JQmA-7yNdac-8gAHMU-8469Y8-pCUWfm-qLtvZk-E1BDYY-A5UiNX-bEw84A-yXwV8w-dAyief-z7erYS-BgXdnN-E2yyjM-fhXmyG-Dvyp78-qW9LEZ-qDHaZv-B9BzuN-nL11Wn-C4spLb-8wnAdN-6S1ECS-ntw1gj-7zbXAA-sanjgW-Ci72XP-oNYs5j-rBErd-awBvQG-4YDCuv-G82zkD-zwcfSu
  9. https://www.flickr.com/photos/wapiko57/6514540899/in/photolist-82QaA6-aVfJNM-oX8Dp7-aVEJ3F-qTG9ni-97uBZ7-97SVrH-qWFs4R-cgE8rJ-a9mSXv-qm83Bv-cUPhgC-988EVA-kUgwo-4sqj48-8e6MB6-apVrgH-3KAUyx-5F373J-qyD7E9-j17GZ-eakbAD-VrPk79-4GSqUt-9Kwe3v/
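
For readers without a Spark install, the word count mentioned in note 3 (slide 18) can be mirrored step by step in plain Python. This is only a sketch of the same flatMap / map / reduceByKey shape on a single machine; the function name `word_count` and the sample input are made up for illustration, and nothing here is distributed or lazy in the way Spark is.

```python
from itertools import chain


def word_count(lines):
    """Single-machine analogue of the flatMap -> map -> reduceByKey word count."""
    # "flatMap": split each line into words and flatten into one stream
    words = chain.from_iterable(line.split(" ") for line in lines)
    # "map": pair every word with a count of 1
    word_pairs = ((word, 1) for word in words)
    # "reduceByKey": add up the counts for each distinct word
    counts = {}
    for word, n in word_pairs:
        counts[word] = counts.get(word, 0) + n
    return counts


# Hypothetical input standing in for the lines of a text file
sample = ["boo barks", "boo sleeps"]
print(word_count(sample))
```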