Apache Spark has driven a lot of adoption of both Scala and functional programming concepts in non-traditional industries. Many programmers in the big data world come looking for a solution to scaling their code, quickly find themselves dealing with immutable data structures and lambdas, and those who love it stay. However, there is a dark side (of escape): much of Spark’s functional programming is changing, and even though it encourages functional programming, it does so in a variety of languages with different expectations (in-line XML as a valid part of your language is fun!). This talk will look at how Spark does a good job of introducing folks to concepts like immutability, but also places where we maybe don’t do a great job of setting up developers for a life of functional programming. Things like accumulators, our three different models for streaming data, and an “interesting” approach to closures (come to find out what the ClosureCleaner does, stay to find out why). The talk will close out with a look at how the functional-inspired API is exposed in the different languages, and how this impacts the kind of code written (Scala, Java, and Python – other languages are supported by Spark but I don’t want to re-learn JavaScript or learn R just for this talk). Pictures of cute animals will be included in the slides to distract from the sad parts.
Video: https://www.youtube.com/watch?v=EDJfpkDpoE4
Apache Spark as a gateway to functional concepts
1. Apache Spark as a gateway drug to functional programming
Concepts taught & broken
@holdenkarau
2. Holden:
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google
● Apache Spark PMC, Beam contributor
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & High Performance Spark
● Twitter: @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Code review livestreams: https://www.twitch.tv/holdenkarau / https://www.youtube.com/user/holdenkarau
● Spark Talk Videos http://bit.ly/holdenSparkVideos
3.
4. Who is Boo?
● Boo uses she/her pronouns (as I told the Texas house committee)
● Best doge
● Lots of experience barking at computers to make them go faster
● Author of “Learning to Bark” & “High Performance Barking”
○ Currently out of print, discussing a reprint re-run with my wife
● On twitter @BooProgrammer
5. Why does Google Cloud care about Spark?
● Lots of data!
○ We mostly use different, although similar FP inspired, tools internally
● We have two hosted solutions for using Spark (dataproc & GKE)
○ I have a blog post on how to try out custom/new versions of Spark if you want to help us test the current RC - https://cloud.google.com/blog/big-data/2018/03/testing-future-apache-spark-releases-and-changes-on-google-kubernetes-engine-and-cloud-dataproc
6. Who do I think y’all are?
● Friendly[ish] people
● Don’t mind pictures of cats or stuffed animals
● Like functional programming
● Want to keep growing the functional programming community
Lori Erickson
7. What will be covered?
● What is Spark (super brief) & how it’s helped drive FP to enterprise
● Wordcount example (as required by license)
● What concepts Spark does a good job of teaching folks
● What concepts Spark does a so-so job of teaching folks
● Some examples
● Format is going to be happy-sad (repeating) so we can end on a downer
● But with lots of cat pictures I promise
8. What is Spark?
● General purpose distributed system
○ Built in Scala with an FP inspired API
● Apache project (one of the most active)
● Much faster than Hadoop Map/Reduce
● Good when too big for a single machine
● Built on top of two abstractions for distributed data: RDDs & Datasets
10. Why people come to Spark:
Well this MapReduce job is going to take 16 hours - how long could it take to learn Spark?
dougwoods
11. Why people come to Spark:
My DataFrame won’t fit in memory on my cluster anymore, let alone my MacBook Pro :( Maybe this Spark business will solve that...
brownpau
13. What is the “magic” of Spark?
● Automatically distributed functional programming :)
● DAG / “query plan” is the root of much of it
● Optimizer to combine steps
● Resiliency: recover from failures rather than protecting from failures.
● “In-memory” + “spill-to-disk”
● Functional programming to build the DAG for “free” (see the sketch below)
● Select operations without deserialization
● The best way to trick people into learning functional programming
Richard Gillin
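A minimal sketch of the “DAG for free” point, assuming a spark-shell with sc and spark in scope ("boop" is a stand-in path from the word count example): the functional transformations only describe the computation, and the lineage/plan they build can be inspected before anything runs.
val counts = sc.textFile("boop")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
// Nothing has executed yet; toDebugString prints the lineage (the DAG) Spark will use
// both for scheduling and for recomputation on failure.
println(counts.toDebugString)
// On the Dataset/DataFrame side, the optimizer's plan is visible the same way.
spark.read.textFile("boop").groupBy("value").count().explain()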
14. The different pieces of Spark
[Diagram: Apache Spark core with SQL, DataFrames & Datasets, Structured Streaming, Streaming, Spark ML, MLlib, GraphX & Bagel, and GraphFrames; APIs in Scala, Java, Python, & R]
Paul Hudson
15. What Spark got right (for Scala/FP):
● Strongly enforced[ish] requirement for immutable data
○ Uses recompute for failure, so it’s a core part of the logic
● Functional operators (map, filter, flatMap, etc.)
● Lambdas for everyone!
○ Sometimes too many….
● Solved a “business need”
○ Even if that need was imaginary
● Made it hard to have side effects against external variables without being very explicit & verbose
○ Even then discouraged strongly through lack of documentation :p
Stuart
16. What Spark got … less right (for Scala/FP):
● Serialization… complications
○ Makes people think closures are more limited than they can be
● Lots of Map[String, String] (equivalent) settings
○ Hey buddy can you spare a type checker? (see the sketch below)
● Hard to debug, which can get confused with Scala being hard to debug
○ Not completely unjustified sometimes
● New ML & SQL APIs without “any” types (initially)
● Heavy mutation focused API for Machine Learning
● Internals are…. very not functional programming best practices
indamage
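To illustrate the stringly-typed settings complaint, a tiny hedged sketch (assumes a SparkSession named spark; the typo’d key is deliberate and illustrative):
// Values are strings even when they are really numbers...
spark.conf.set("spark.sql.shuffle.partitions", "10")
// ...and a typo'd key is (typically) accepted silently - no type checker to save us.
spark.conf.set("spark.sql.shufle.partitions", "10")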
17. Before the technical details: positioning
● “100x faster than hadoop” sounds nice
● wc(wc) < 17 also sounds nice if your current wordcount doesn’t fit on a page
● “Integrated solution” - sounds nice if you have to write your data to disk between querying and training
● “Works with your existing JVM or Python code base.”
● Part of the Apache Software Foundation
○ e.g. you already run software from these folks, don’t worry about lock-in they’ve got it handled
● Like all positioning each of these has some implied “*”s associated with them
● Focused on the benefits made possible with FP rather than the fact it was FP
● Led to a large commercial install base of folks being exposed to functional programming
18. Hello World (Word count) - 1 of ?
val lines = sc.textFile("boop")
val words = lines.flatMap(line => line.split(" "))
val wordPairs = words.map(word => (word, 1))
val wordCounts = wordPairs.reduceByKey((c1, c2) => c1 + c2)
wordCounts.saveAsTextFile("snoop")
Photo By: Will Keightley
19. Spark & Lazy Evaluation: True BFFs
● Allows Spark to build a graph of operations and use it for recomputing on failure
● Fewer passes over the data without thinking* (see the sketch below)
● Except…. Debates around limiting it due to developer confusion with debugging
● And library level rather than macro or compiler level means needing to
kcxd
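A minimal sketch of the laziness itself (spark-shell, sc in scope; "boop" is a stand-in path): transformations return immediately, and only an action triggers a pass over the data.
val words = sc.textFile("boop").flatMap(_.split(" "))  // returns immediately, nothing read yet
val nonEmpty = words.filter(_.nonEmpty)                // still just building up the graph
nonEmpty.count()   // first action: now Spark actually runs a job
nonEmpty.take(5)   // second action: re-reads and recomputes from the file again,
                   // unless we explicitly cache() / persist() first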
20. Spark & Immutability: Really good friends
● Many systems enforce some level of immutability but...
● Scala allows mutability (vars & friends), but Spark gets not so happy
● Execution model is recompute on failure, and mutation would break this
● Provides a clear benefit (e.g. “Instead of writing data out to 3 network disks we just recompute on failure leading to 100x improvement”*)
Bartlomiej Mostek
21. Even if you try to use it, no mutation for you*
scala> var x = 0
x: Int = 0
scala> val rdd = sc.parallelize(1.to(100))
scala> val added = rdd.map{e => x += e}
scala> added.count()
res0: Long = 100
scala> x
res1: Int = 0
Jennifer C.
24. Spark & Immutability: A few small challenges
● Accumulators:
○ Accumulators “leak” on recompute and partial compute (another Spark trick; see the sketch below)
● You _can_ mutate state inside of the worker or master it just won’t propagate
○ No error messages here, just unexpected future “happiness”
Caroline
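A hedged sketch of the accumulator caveat (spark-shell, sc in scope; "boop" and the counter name are illustrative):
val seen = sc.longAccumulator("records-seen")
val parsed = sc.textFile("boop").map { line =>
  seen.add(1)      // side effect inside a transformation
  line.toUpperCase
}
parsed.count()     // seen.value is now roughly the number of lines...
parsed.count()     // ...but a second action re-runs the map and double counts,
                   // and a partition recomputed after a failure is counted again too.
// Updates are only guaranteed exactly-once when applied inside actions (e.g. foreach).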
25. Spark & Lambdas: Everyone gets a lambda!
● Maps, flatMaps, filters, reducers oh my (and that’s just wordcount :p )
● Writing in Java 7?
○ You get a weird looking lambda!
Mark Jensen
26. Spark & Lambdas - closures & the dark side
● Python & Scala lambda serialization is (understandably) cautious
● Referencing a variable in the class brings the whole class with you (see the sketch below)
● Can create the impression closures are limited to serializable data
● ClosureCleaner.scala - “You weren’t using that class right?” & CloudPickle - “Welllll…. What about if we stored this differently?”
● pre-Java 8 (new FunctionX<A, B, C>... ugh)
● SQL API custom aggregates don’t currently support lambdas :(
David Goehring
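A minimal sketch of the “brings the whole class with you” problem and the usual workaround (class and field names are illustrative; assumes an RDD[String] of words):
class WordFilter(stopWords: Set[String]) {   // plain class, not Serializable
  def clean(words: org.apache.spark.rdd.RDD[String]) = {
    // words.filter(w => !stopWords.contains(w))  // captures `this` -> Task not serializable
    val localStop = stopWords                     // copy the field into a local val...
    words.filter(w => !localStop.contains(w))     // ...so only the Set gets shipped
  }
}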
27. Oh right: serialization :(
● Many things in JVM & Python are not well designed to be rehydrated on another VM let alone another machine
● Space efficiency: well…. At least it’s not XML?
● Oh wait, we have to parse Python serialized data in the JVM? Ruh roh!
● Oh hey let’s make it configurable… oh wait... (see the sketch below)
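One concrete flavour of the configurability: swapping in Kryo on the JVM side. A hedged sketch (the Panda case class is illustrative; the config keys are real):
import org.apache.spark.SparkConf
case class Panda(name: String, happy: Boolean)
val conf = new SparkConf()
  .setAppName("serialization-demo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")  // fail fast on unregistered classes
  .registerKryoClasses(Array(classOf[Panda]))
// ...which only helps the JVM side; Python rows still round-trip through pickle.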
28. Reaching non-“traditional” FP languages
● Spark works in many languages - Python, R, Java, etc.
○ Allows Spark to have a wider audience that doesn’t have the time (or tooling) to learn another language
○ e.g. Numerical Scala libraries are… rough
● Meeting developers where they are
● Some overhead (often) pushes systemy folks towards learning Scala for performance
● Sometimes we could do a better job of working with the existing FP tools
○ e.g. we could do more to make an RDD look like “normal” Python and maybe I’d stop typing .flatMap on Python iterators
31. ML: Enough getters and setters for the 90s!
● Took inspiration from scikit-learn
○ Which is a cool system, just not super functionally oriented
● Added hidden metadata, which got dropped in lots of places (not quite as bad as global state buuuut….)
● Threw away compile time type information
● I really don’t have a plus for this one in the FP teaching column
32. Basic Dataprep pipeline for “ML”
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
// Combines a list of double input features into a vector
val assembler = new VectorAssembler().setInputCols(Array("age",
  "education-num")).setOutputCol("features")
// String indexer converts a set of strings into doubles
val indexer = new StringIndexer().setInputCol("category")
  .setOutputCol("category-index")
// Can be used to combine pipeline components together
val pipeline = new Pipeline().setStages(Array(assembler, indexer))
Huang Yun Chung
33. Adding some ML (no longer cool -- DL)
import org.apache.spark.ml.classification.DecisionTreeClassifier
// Specify model
val dt = new DecisionTreeClassifier()
  .setLabelCol("category-index")
  .setFeaturesCol("features")
// Add it to the pipeline
val pipeline_and_model = new Pipeline().setStages(
  Array(assembler, indexer, dt))
val pipeline_model = pipeline_and_model.fit(df)
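Once fitted, the PipelineModel is itself just another transformer. A small usage sketch (test_df is an assumed held-out DataFrame with the same columns as df):
val predictions = pipeline_model.transform(test_df)
predictions.select("category-index", "prediction").show(5)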
34. The internals… are fairly imperative
● Not everywhere and not without reason (at the time they were created)
● No macros for historic performance reasons
○ Code generation is done by concatenating strings of Java code
● This would matter less, but Spark is pretty aggressive at trying to hide things, enough that some major projects end up having to peek a little bit at the code
● Can feel like a “do what I say not what I do” sometimes
○ Especially in ML, lots of hacks to make things go a little bit faster using internal APIs
ivva
36. Let’s end on a happy-ish note
● Yes, Spark isn’t perfect, but its APIs are teaching a generation of “big data” developers & data scientists functional programming without calling it that (at the start)
● Once we get them hooked we can show them cool things!
● If we can find other areas where we make tools that expose FP APIs to more non-FP folks (and not just for the sake of FP) we can make better software
● Or we can all go back to writing 90s style enterprise Java
37. Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Spark in Action
High Performance Spark
Learning PySpark
38. High Performance Spark!
Available today! A great second Spark book to read (but please buy it first)
You can buy it from that scrappy Seattle bookstore, Jeff Bezos needs another newspaper and I want a cup of coffee.
http://bit.ly/hkHighPerfSpark
39. And some upcoming talks:
● July
○ OSCON Portland & meetup
● August
○ JupyterCon NYC
● September
○ Strata NYC
○ Strangeloop STL
● October
○ Spark Summit London
○ Reversim Tel Aviv
40. k thnx bye :)
If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark
Will tweet results “eventually” @holdenkarau
Pssst: Have feedback on the presentation? Give me a shout (holden@pigscanfly.ca or http://bit.ly/holdenTalkFeedback) if you feel comfortable doing so :)
Feedback (if you are so inclined):
http://bit.ly/holdenTalkFeedback
We can examine how RDDs work in practice with the traditional word count example. If you’ve taken another intro to big data class, or just worked with MapReduce, you’ll notice that this is a lot less code than we normally have to write.
https://www.flickr.com/photos/feverblue/1166368091/in/photolist-2M4WAF-HKi9Wb-bPUuHF-bAZRBu-bAZRJs-cPgNi9-cPgM5y-ecdXE4-qiy3fi-ece5nR-8DSTJT-ekXn2J-ekXkQA-ekRMVx-p5EdCT-424qe7-41ZhFv-cHJ84A-ekRLzc-cS67Hu-cS6sKq-cEcmt5-9ae42r-eoqvBm-HH9A1M-846gJQ-dk6J-nqxTYN-J1PZUH-5q2iLC-HVDkA7-g6xvD3-96MiqV-hbZene-46uZse-6SUYhp-dGaXzD-oUC1ic-DayAdv-aoPJSZ-73L8Lm-5qVCiG-5qVC5q-3Kyyzj-Bf5UzP-4CbKo9-9ae8KM-5CXLKZ-pjFpPT-eccLSR
The 3rd cat hiding in the background is serialization
Red panda image is public domain https://www.flickr.com/photos/mathiasappel/25901445745/in/photolist-FsPFqM-8V1c5-EfF7ct-P6AMfB-oUV39B-EenZCo-oVbE3K-DDMMtE-295N7-dtVySZ-dtVzLP-7yXK47-6DEn2P-oY28m2-DnHaur-qtt3m9-DzTW6Q-E6JQmA-7yNdac-8gAHMU-8469Y8-pCUWfm-qLtvZk-E1BDYY-A5UiNX-bEw84A-yXwV8w-dAyief-z7erYS-BgXdnN-E2yyjM-fhXmyG-Dvyp78-qW9LEZ-qDHaZv-B9BzuN-nL11Wn-C4spLb-8wnAdN-6S1ECS-ntw1gj-7zbXAA-sanjgW-Ci72XP-oNYs5j-rBErd-awBvQG-4YDCuv-G82zkD-zwcfSu