Apache Spark has driven a lot of adoption of both Scala and functional programming concepts in non-traditional industries. Many programmers in the big data world come looking for a solution to scaling their code, quickly find themselves dealing with immutable data structures and lambdas, and those who love it stay. However, there is a dark side (of escape): much of Spark’s functional programming is changing, and even though it encourages functional programming, it does so in a variety of languages with different expectations (in-line XML as a valid part of your language is fun!). This talk will look at how Spark does a good job of introducing folks to concepts like immutability, but also places where we maybe don’t do a great job of setting up developers for a life of functional programming. Things like accumulators, our three different models for streaming data, and an “interesting” approach to closures (come to find out what the ClosureCleaner does, stay to find out why). The talk will close out with a look at how the functional-inspired API is exposed in the different languages, and how this impacts the kind of code written (Scala, Java, and Python – other languages are supported by Spark but I don’t want to re-learn JavaScript or learn R just for this talk). Pictures of cute animals will be included in the slides to distract from the sad parts.
Video: https://www.youtube.com/watch?v=EDJfpkDpoE4
Apache Spark as a gateway to functional concepts
1. Apache Spark as a gateway drug to functional programming
Concepts taught & broken
@holdenkarau
2. Holden:
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google
● Apache Spark PMC, Beam contributor
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & High Performance Spark
● Twitter: @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Code review livestreams: https://www.twitch.tv/holdenkarau / https://www.youtube.com/user/holdenkarau
● Spark Talk Videos http://bit.ly/holdenSparkVideos
3.
4. Who is Boo?
● Boo uses she/her pronouns (as I told the Texas house committee)
● Best doge
● Lots of experience barking at computers to make them go faster
● Author of “Learning to Bark” & “High Performance Barking”
○ Currently out of print, discussing a reprint re-run with my wife
● On twitter @BooProgrammer
5. Why does Google Cloud care about Spark?
● Lots of data!
○ We mostly use different, although similar FP inspired, tools internally
● We have two hosted solutions for using Spark (dataproc & GKE)
○ I have a blog post on how to try out custom/new versions of Spark if you want to help us test the current RC - https://cloud.google.com/blog/big-data/2018/03/testing-future-apache-spark-releases-and-changes-on-google-kubernetes-engine-and-cloud-dataproc
6. Who do I think y’all are?
● Friendly[ish] people
● Don’t mind pictures of cats or stuffed animals
● Like functional programming
● Want to keep growing the functional programming community
Lori Erickson
7. What will be covered?
● What is Spark (super brief) & how it’s helped drive FP to enterprise
● Wordcount example (as required by license)
● What concepts Spark does a good job of teaching folks
● What concepts Spark does a so-so job of teaching folks
● Some examples
● Format is going to be happy-sad (repeating) so we can end on a downer
● But with lots of cat pictures I promise
8. What is Spark?
● General purpose distributed system
○ Built in Scala with an FP inspired API
● Apache project (one of the most active)
● Much faster than Hadoop Map/Reduce
● Good when too big for a single machine
● Built on top of two abstractions for distributed data: RDDs & Datasets
10. Why people come to Spark:
Well this MapReduce job is going to take 16 hours - how long could it take to learn Spark?
dougwoods
11. Why people come to Spark:
My DataFrame won’t fit in memory on my cluster anymore, let alone my MacBook Pro :( Maybe this Spark business will solve that...
brownpau
13. What is the “magic” of Spark?
● Automatically distributed functional programming :)
● DAG / “query plan” is the root of much of it
● Optimizer to combine steps
● Resiliency: recover from failures rather than protecting from failures.
● “In-memory” + “spill-to-disk”
● Functional programming to build the DAG for “free” (see the sketch below)
● Select operations without deserialization
● The best way to trick people into learning functional programming
Richard Gillin
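A minimal sketch of the “DAG for free” point, assuming a spark-shell with sc and spark in scope ("boop" is a stand-in path from the word count example): the functional transformations only describe the computation, and the lineage/plan they build can be inspected before anything runs.
val counts = sc.textFile("boop")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
// Nothing has executed yet; toDebugString prints the lineage (the DAG) Spark will use
// both for scheduling and for recomputation on failure.
println(counts.toDebugString)
// On the Dataset/DataFrame side, the optimizer's plan is visible the same way.
spark.read.textFile("boop").groupBy("value").count().explain()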
14. The different pieces of Spark
[Diagram: Apache Spark core with SQL, DataFrames & Datasets, Structured Streaming, Streaming, Spark ML, MLlib, GraphX & Bagel, and GraphFrames; APIs in Scala, Java, Python, & R]
Paul Hudson
15. What Spark got right (for Scala/FP):
● Strongly enforced[ish] requirement for immutable data
○ Uses recompute for failure, so it’s a core part of the logic
● Functional operators (map, filter, flatMap, etc.)
● Lambdas for everyone!
○ Sometimes too many….
● Solved a “business need”
○ Even if that need was imaginary
● Made it hard to have side effects against external variables without being very explicit & verbose
○ Even then discouraged strongly through lack of documentation :p
Stuart
16. What Spark got … less right (for Scala/FP):
● Serialization… complications
○ Makes people think closures are more limited than they can be
● Lots of Map[String, String] (equivalent) settings
○ Hey buddy can you spare a type checker? (see the sketch below)
● Hard to debug, which can get confused with Scala being hard to debug
○ Not completely unjustified sometimes
● New ML & SQL APIs without “any” types (initially)
● Heavy mutation focused API for Machine Learning
● Internals are…. very not functional programming best practices
indamage
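To illustrate the stringly-typed settings complaint, a tiny hedged sketch (assumes a SparkSession named spark; the typo’d key is deliberate and illustrative):
// Values are strings even when they are really numbers...
spark.conf.set("spark.sql.shuffle.partitions", "10")
// ...and a typo'd key is (typically) accepted silently - no type checker to save us.
spark.conf.set("spark.sql.shufle.partitions", "10")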
17. Before the technical details: positioning
● “100x faster than hadoop” sounds nice
● wc(wc) < 17 also sounds nice if your current wordcount doesn’t fit on a page
● “Integrated solution” - sounds nice if you have to write your data to disk between querying and training
● “Works with your existing JVM or Python code base.”
● Part of the Apache Software Foundation
○ e.g. you already run software from these folks, don’t worry about lock-in they’ve got it handled
● Like all positioning each of these has some implied “*”s associated with them
● Focused on the benefits made possible with FP rather than the fact it was FP
● Led to a large commercial install base of folks being exposed to functional programming
18. Hello World (Word count) - 1 of ?
val lines = sc.textFile("boop")
val words = lines.flatMap(line => line.split(" "))
val wordPairs = words.map(word => (word, 1))
val wordCounts = wordPairs.reduceByKey((c1, c2) => c1 + c2)
wordCounts.saveAsTextFile("snoop")
Photo By: Will Keightley
19. Spark & Lazy Evaluation: True BFFs
● Allows Spark to build a graph of operations and use it for recomputing on failure
● Fewer passes over the data without thinking* (see the sketch below)
● Except…. Debates around limiting it due to developer confusion with debugging
● And library level rather than macro or compiler level means needing to
kcxd
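A minimal sketch of the laziness itself (spark-shell, sc in scope; "boop" is a stand-in path): transformations return immediately, and only an action triggers a pass over the data.
val words = sc.textFile("boop").flatMap(_.split(" "))  // returns immediately, nothing read yet
val nonEmpty = words.filter(_.nonEmpty)                // still just building up the graph
nonEmpty.count()   // first action: now Spark actually runs a job
nonEmpty.take(5)   // second action: re-reads and recomputes from the file again,
                   // unless we explicitly cache() / persist() first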
20. Spark & Immutability: Really good friends
● Many systems enforce some level of immutability but...
● Scala allows mutability (vars & friends), but Spark gets not so happy
● Execution model is recompute on failure, and mutation would break this
● Provides a clear benefit (e.g. “Instead of writing data out to 3 network disks we just recompute on failure leading to 100x improvement”*)
Bartlomiej Mostek
21. Even if you try to use it, no mutation for you*
scala> var x = 0
x: Int = 0
scala> val rdd = sc.parallelize(1.to(100))
scala> val added = rdd.map{e => x += e}
scala> added.count()
res0: Long = 100
scala> x
res1: Int = 0
Jennifer C.
24. Spark & Immutability: A few small challenges
● Accumulators:
○ Accumulators “leak” on recompute and partial compute (another Spark trick; see the sketch below)
● You _can_ mutate state inside of the worker or master it just won’t propagate
○ No error messages here, just unexpected future “happiness”
Caroline
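A hedged sketch of the accumulator caveat (spark-shell, sc in scope; "boop" and the counter name are illustrative):
val seen = sc.longAccumulator("records-seen")
val parsed = sc.textFile("boop").map { line =>
  seen.add(1)      // side effect inside a transformation
  line.toUpperCase
}
parsed.count()     // seen.value is now roughly the number of lines...
parsed.count()     // ...but a second action re-runs the map and double counts,
                   // and a partition recomputed after a failure is counted again too.
// Updates are only guaranteed exactly-once when applied inside actions (e.g. foreach).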
25. Spark & Lambdas: Everyone gets a lambda!
● Maps, flatMaps, filters, reducers oh my (and that’s just wordcount :p )
● Writing in Java 7?
○ You get a weird looking lambda!
Mark Jensen
26. Spark & Lambdas - closures & the dark side
● Python & Scala lambda serialization is (understandably) cautious
● Referencing a variable in the class brings the whole class with you (see the sketch below)
● Can create the impression closures are limited to serializable data
● ClosureCleaner.scala - “You weren’t using that class right?” & CloudPickle - “Welllll…. What about if we stored this differently?”
● pre-Java 8 (new FunctionX<A, B, C>... ugh)
● SQL API custom aggregates don’t currently support lambdas :(
David Goehring
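A minimal sketch of the “brings the whole class with you” problem and the usual workaround (class and field names are illustrative; assumes an RDD[String] of words):
class WordFilter(stopWords: Set[String]) {   // plain class, not Serializable
  def clean(words: org.apache.spark.rdd.RDD[String]) = {
    // words.filter(w => !stopWords.contains(w))  // captures `this` -> Task not serializable
    val localStop = stopWords                     // copy the field into a local val...
    words.filter(w => !localStop.contains(w))     // ...so only the Set gets shipped
  }
}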
27. Oh right: serialization :(
● Many things in JVM & Python are not well designed to be rehydrated on another VM let alone another machine
● Space efficiency: well…. At least it’s not XML?
● Oh wait, we have to parse Python serialized data in the JVM? Ruh roh!
● Oh hey let’s make it configurable… oh wait... (see the sketch below)
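One concrete flavour of the configurability: swapping in Kryo on the JVM side. A hedged sketch (the Panda case class is illustrative; the config keys are real):
import org.apache.spark.SparkConf
case class Panda(name: String, happy: Boolean)
val conf = new SparkConf()
  .setAppName("serialization-demo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")  // fail fast on unregistered classes
  .registerKryoClasses(Array(classOf[Panda]))
// ...which only helps the JVM side; Python rows still round-trip through pickle.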
28. Reaching non-“traditional” FP languages
● Spark works in many languages - Python, R, Java, etc.
○ Allows Spark to have a wider audience that doesn’t have the time (or tooling) to learn another language
○ e.g. Numerical Scala libraries are… rough
● Meeting developers where they are
● Some overhead (often) pushes systemy folks towards learning Scala for performance
● Sometimes we could do a better job of working with the existing FP tools
○ e.g. we could do more to make an RDD look like “normal” Python and maybe I’d stop typing .flatMap on Python iterators
31. ML: Enough getters and setters for the 90s!
● Took inspiration from scikit-learn
○ Which is a cool system, just not super functionally oriented
● Added hidden metadata, which got dropped in lots of places (not quite as bad as global state buuuut….)
● Threw away compile time type information
● I really don’t have a plus for this one in the FP teaching column
32. Basic Dataprep pipeline for “ML”
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
// Combines a list of double input features into a vector
val assembler = new VectorAssembler().setInputCols(Array("age",
  "education-num")).setOutputCol("features")
// String indexer converts a set of strings into doubles
val indexer = new StringIndexer().setInputCol("category")
  .setOutputCol("category-index")
// Can be used to combine pipeline components together
val pipeline = new Pipeline().setStages(Array(assembler, indexer))
Huang Yun Chung
33. Adding some ML (no longer cool -- DL)
import org.apache.spark.ml.classification.DecisionTreeClassifier
// Specify model
val dt = new DecisionTreeClassifier()
  .setLabelCol("category-index")
  .setFeaturesCol("features")
// Add it to the pipeline
val pipeline_and_model = new Pipeline().setStages(
  Array(assembler, indexer, dt))
val pipeline_model = pipeline_and_model.fit(df)
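Once fitted, the PipelineModel is itself just another transformer. A small usage sketch (test_df is an assumed held-out DataFrame with the same columns as df):
val predictions = pipeline_model.transform(test_df)
predictions.select("category-index", "prediction").show(5)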
34. The internals… are fairly imperative
● Not everywhere and not without reason (at the time they were created)
● No macros for historic performance reasons
○ Code generation is done by concatenating strings of Java code
● This would matter less, but Spark is pretty aggressive at trying to hide things, enough that some major projects end up having to peek a little bit at the code
● Can feel like a “do what I say not what I do” sometimes
○ Especially in ML, lots of hacks to make things go a little bit faster using internal APIs
ivva
36. Let’s end on a happy-ish note
● Yes, Spark isn’t perfect, but its APIs are teaching a generation of “big data” developers & data scientists functional programming without calling it that (at the start)
● Once we get them hooked we can show them cool things!
● If we can find other areas where we make tools that expose FP APIs to more non-FP folks (and not just for the sake of FP) we can make better software
● Or we can all go back to writing 90s style enterprise Java
37. Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Spark in Action
High Performance Spark
Learning PySpark
38. High Performance Spark!
Available today! A great second Spark book to read (but please buy it first)
You can buy it from that scrappy Seattle bookstore, Jeff Bezos needs another newspaper and I want a cup of coffee.
http://bit.ly/hkHighPerfSpark
39. And some upcoming talks:
● July
○ OSCON Portland & meetup
● August
○ JupyterCon NYC
● September
○ Strata NYC
○ Strangeloop STL
● October
○ Spark Summit London
○ Reversim Tel Aviv
40. k thnx bye :)
If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark
Will tweet results “eventually” @holdenkarau
Pssst: Have feedback on the presentation? Give me a shout (holden@pigscanfly.ca or http://bit.ly/holdenTalkFeedback) if you feel comfortable doing so :)
Feedback (if you are so inclined):
http://bit.ly/holdenTalkFeedback
We can examine how RDDs work in practice with the traditional word count example. If you’ve taken another intro to big data class, or just worked with MapReduce, you’ll notice that this is a lot less code than we normally have to write.
https://www.flickr.com/photos/feverblue/1166368091/in/photolist-2M4WAF-HKi9Wb-bPUuHF-bAZRBu-bAZRJs-cPgNi9-cPgM5y-ecdXE4-qiy3fi-ece5nR-8DSTJT-ekXn2J-ekXkQA-ekRMVx-p5EdCT-424qe7-41ZhFv-cHJ84A-ekRLzc-cS67Hu-cS6sKq-cEcmt5-9ae42r-eoqvBm-HH9A1M-846gJQ-dk6J-nqxTYN-J1PZUH-5q2iLC-HVDkA7-g6xvD3-96MiqV-hbZene-46uZse-6SUYhp-dGaXzD-oUC1ic-DayAdv-aoPJSZ-73L8Lm-5qVCiG-5qVC5q-3Kyyzj-Bf5UzP-4CbKo9-9ae8KM-5CXLKZ-pjFpPT-eccLSR
The 3rd cat hiding in the background is serialization
Red panda image is public domain https://www.flickr.com/photos/mathiasappel/25901445745/in/photolist-FsPFqM-8V1c5-EfF7ct-P6AMfB-oUV39B-EenZCo-oVbE3K-DDMMtE-295N7-dtVySZ-dtVzLP-7yXK47-6DEn2P-oY28m2-DnHaur-qtt3m9-DzTW6Q-E6JQmA-7yNdac-8gAHMU-8469Y8-pCUWfm-qLtvZk-E1BDYY-A5UiNX-bEw84A-yXwV8w-dAyief-z7erYS-BgXdnN-E2yyjM-fhXmyG-Dvyp78-qW9LEZ-qDHaZv-B9BzuN-nL11Wn-C4spLb-8wnAdN-6S1ECS-ntw1gj-7zbXAA-sanjgW-Ci72XP-oNYs5j-rBErd-awBvQG-4YDCuv-G82zkD-zwcfSu