SlideShare a Scribd company logo
1 of 51
Download to read offline
A Fast Intro to Spark
Lightning fast cluster computing*
Now with
vectorized
UDFs! New in
Spark 2.3!
Who am I?
● My name is Holden Karau
● Prefered pronouns are she/her
● Developer Advocate at Google focused on OSS Big Data
● Apache Spark PMC (think committer with tenure)
● Contributor to a lot of other projects (including BEAM)
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of High Performance Spark & Learning Spark (+ more)
● Twitter: @holdenkarau
● Slideshare http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Github https://github.com/holdenk
● Related Spark Videos http://bit.ly/holdenSparkVideos
Who do I think you all are?
● Nice people*
● Getting started with Spark
○ Or wondering if you need it
● Famalrish-ish with Python
Amanda
What we are going to explore together!
● What is Spark?
● Getting Spark setup locally
● Spark primary distributed collection
● Word count in Spark
● Spark SQL / DataFrames
● Wordcount part 2: now with vectorized UDFs
Some things that may color my views:
● I’m on the Spark PMC -- Spark’s success => I can probably make more $s
● My employer cares about data
● I work primarily in Python & Scala these days
● I like functional programming
● Probably some others I’m forgetting
On the other hand:
● I’ve worked on Spark for a long time and know a lot of its faults
● My goals are pretty flexible
● I have x86 assembly code tattooed on my back
What is Spark?
● General purpose distributed system
○ With a really nice API including Python :)
● Apache project (one of the most
active)
● Must faster than Hadoop
Map/Reduce
● Good when too big for a single
machine
● Built on top of two abstractions for
distributed data: RDDs & Datasets
The different pieces of Spark
Apache Spark
SQL, DataFrames & Datasets
Structured
Streaming
Scala,
Java,
Python, &
R
Spark ML
bagel &
Graph X
MLLib
Scala,
Java,
PythonStreaming
Graph
Frames
Paul Hudson
Why people come to Spark:
Well this MapReduce
job is going to take
16 hours - how long
could it take to learn
Spark?
dougwoods
Why people come to Spark:
My DataFrame won’t fit
in memory on my cluster
anymore, let alone my
MacBook Pro :( Maybe
this Spark business will
solve that...
brownpau
Companion (optional!) notebook funtimes:
http://bit.ly/sparkDocs
http://bit.ly/sparkPyDocs
http://bit.ly/PySparkIntroExamples (has a notebook!)
● Did you know? You can run Spark on Dataproc there by
giving my employer money. You can also run it
elsewhere.
http://bit.ly/learningSparkExamples (lots of code files)
http://bit.ly/hkMLExample (has a notebook, ML focused)
David DeHetre
SparkContext: entry to the world
● Can be used to create RDDs from many input sources
○ Native collections, local & remote FS
○ Any Hadoop Data Source
● Also create counters & accumulators
● Automatically created in the shells (called sc)
● Specify master & app name when creating
○ Master can be local[*], spark:// , yarn, etc.
○ app name should be human readable and make sense
● etc.
Petfu
l
RDDs: Spark’s Primary abstraction
RDD (Resilient Distributed Dataset)
● Distributed collection
● Recomputed on node failure
● Distributes data & work across the cluster
● Lazily evaluated (transformations & actions)
Helen Olney
Word count (in python)
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count =
(words.map(lambda x: (x, 1))
.reduceByKey(lambda x, y: x+y))
word_count.saveAsTextFile(“output”)
Photo By: Will
Keightley
Why laziness is cool (and not)
● Pipelining (can put maps, filter, flatMap together)
● Can do interesting optimizations by delaying work
● We use the DAG to recompute on failure
○ (writing data out to 3 disks on different machines is so last season)
○ Or the DAG puts the R is Resilient RDD, except DAG doesn’t have an
R :(
How it hurts:
● Debugging is confusing
● Re-using data - lazyness only sees up to the first action
● Some people really hate immutability
Matthew Hurst
Word count (in python)
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count =
(words.map(lambda x: (x, 1))
.reduceByKey(lambda x, y: x+y))
word_count.saveAsTextFile("output")
No data is read or
processed until after
this line
This is an “action”
which forces spark to
evaluate the RDD
daniilr
RDD re-use - sadly not magic
● If we know we are going to re-use the RDD what should we do?
○ If it fits nicely in memory caching in memory
○ persisting at another level
■ MEMORY, MEMORY_ONLY_SER, MEMORY_AND_DISK,
MEMORY_AND_DISK_SER
○ checkpointing
● Noisey clusters
○ _2 & checkpointing can help
● persist first for checkpointing
Richard Gillin
Some common transformations & actions
Transformations (lazy)
● map
● filter
● flatMap
● reduceByKey
● join
● cogroup
Actions (eager)
● count
● reduce
● collect
● take
● saveAsTextFile
● saveAsHadoop
● countByValue
Photo by Steve
Photo by Dan G
This can feel like magic* sometimes :)
Steven Saus
*I mean not good magic.
Magic has it’s limits: key-skew + black boxes
● There is a worse way to do WordCount
● We can use the seemingly safe thing called groupByKey
● Then compute the sum...
_torne
Bad word count RDD :(
words = rdd.flatMap(lambda x: x.split(" "))
wordPairs = words.map(lambda w: (w, 1))
grouped = wordPairs.groupByKey()
counted_words = grouped.mapValues(lambda counts: sum(counts))
counted_words.saveAsTextFile("boop")
Tomomi
f ford Pinto by Morven
f ford Pinto by Morven
ayphen
Why should we consider Datasets?
● Performance
○ Smart optimizer
○ More efficient storage
○ Faster serialization
● Simplicity
○ Windowed operations
○ Multi-column & multi-type aggregates
Rikki's Refuge
Why are Datasets so awesome?
● Easier to mix functional style and relational style
○ No more hive UDFs!
● Nice performance of Spark SQL flexibility of RDDs
○ Tungsten (better serialization)
○ Equivalent of Sortable trait
● Strongly typed
● The future (ML, Graph, etc.)
● Potential for better language interop
○ Something like Arrow has a much better chance with Datasets
○ Cross-platform libraries are easier to make & use
Will Folsom
What is the performance like?
Andrew Skudder
How is it so fast?
● Optimizer has more information (schema & operations)
● More efficient storage formats
● Faster serialization
● Some operations directly on serialized data formats
● non-JVM languages: does more computation in the JVM
Andrew Skudder
Word count w/Dataframes
df = spark.read.load(src)
# Returns an RDD
words = df.select("text").flatMap(lambda x: x.text.split(" "))
words_df = words.map(
lambda x: Row(word=x, cnt=1)).toDF()
word_count = words_df.groupBy("word").sum()
word_count.write.format("parquet").save("wc.parquet")
Still have the double
serialization here :(
What can the optimizer do now?
● Sort on the serialized data
● Understand the aggregate (“partial aggregates”)
○ Could sort of do this before but not as awesomely, and only if we used
reduceByKey - not groupByKey
● Pack them bits nice and tight
So whats this new groupBy?
● No longer causes explosions like RDD groupBy
○ Able to introspect and pipeline the aggregation
● Returns a GroupedData (or GroupedDataset)
● Makes it easy to perform multiple aggregations
● Built in shortcuts for aggregates like avg, min, max
● Longer list at
http://spark.apache.org/docs/latest/api/scala/index.html#
org.apache.spark.sql.functions$
● Allows the optimizer to see what aggregates are being
performed
Sherrie Thai
Computing some aggregates by age code:
df.groupBy("age").min("hours-per-week")
OR
import org.apache.spark.sql.catalyst.expressions.aggregate._
df.groupBy("age").agg(min("hours-per-week"))
Easily compute multiple aggregates:
df.groupBy("age").agg(min("hours-per-week"),
avg("hours-per-week"),
max("capital-gain"))
PhotoAtelier
But where DataFrames explode?
● Iterative algorithms - large plans
○ Use your escape hatch to RDDs!
● Some push downs are sad pandas :(
● Default shuffle size is sometimes too small for big data
(200 partitions)
● Default partition size when reading in is also sad
Spark ML pipelines
Tokenizer HashingTF String Indexer Naive Bayes
Tokenizer HashingTF
Streaming
String Indexer
Streaming
Naive Bayes
fit(df)
Estimator
Transformer
● Sci-Kit Learn Inspired
● Consist of Estimators and Transformers
Data prep
# Combines a list of double input features into a vector
assembler = VectorAssembler(inputCols=["age", "education-num"],
outputCol="feautres")
# String indexer converts a set of strings into doubles
indexer =
StringIndexer(inputCol="category")
.setOutputCol("category-index")
# Can be used to combine pipeline components together
pipeline = Pipeline().setStages([assembler, indexer])
Huang
Yun
Chung
And its not complete without a tree :p
dt = DecisionTreeClassifier(labelCol = "category-index",
featuresCol="feautres")
pipeline = Pipeline().setStages([assembler, indexer, dt])
model = pipeline.fit(df)
Diethelm Scheidereit
Spark in Scala, how does PySpark work?
● Py4J + pickling + JSON and magic
○ This can be kind of slow sometimes
● Distributed collections are often collections of pickled
objects
● Spark SQL (and DataFrames) avoid some of this
○ Sometimes we can make them go fast and compile them to the JVM
● Features aren’t automatically exposed, but exposing
them is normally simple.
● SparkR depends on similar magic
kristin klein
So what does that look like?
Driver
py4j
Worker 1
Worker K
pipe
pipe
The present*: faster interchange
● By present I mean in a release this month, so with lots of bugs
● Unifying our cross-language experience
○ And not just “normal” languages, CUDA counts yo
Tambako The Jaguar
Andrew Skudder
*Arrow: likely the future. I really hope so. Spark 2.3 and beyond!
* *
With early work happening to
support GPUs/ TF.
What does the future look like?*
*Source: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html.
*Vendor
benchmark.
Trust but verify.
What does the future look like - in code
@pandas_udf("integer", PandasUDFType.SCALAR)
def add_one(x):
return x + 1
What does the future look like - in code
@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def normalize(pdf):
v = pdf.v
return pdf.assign(v=(v - v.mean()) / v.std())
What does the future look like - in code
@pandas_udf("word string", PandasUDFType.GROUPED_MAP)
def special_tokenize(s):
if s.strings is not None:
return pandas.DataFrame(reduce(list.__add__, map(lambda x:
x.split(' '), s.strings)))
# This is a little ugly because currently the scalar transform
# doesn't allow flat map like behaviour only map like.
grouped = df.groupby("strings")
tokenized = grouped.apply(special_tokenize)
tokenized.show()
More
Word
Count!
What’s the general state of non-JVM world?
Most of the tools are built in the JVM, so how do we play together?
● Pickling, Strings, JSON, XML, oh my!
● Unix pipes
● Sockets
What about if we don’t want to copy the data all the time? Dataframe Api + Arrow
● Or standalone “pure”* re-implementations of everything
○ Reasonable option for things like Kafka where you would have the I/O regardless.
○ Also cool projects like dask (pure python) -- but hard to talk to existing ecosystem
David Brown
Dask: a new beginning?
● Pure* python implementation
● Provides real enough DataFrame interface for distributed data
● Also your standard-ish distributed collections
● Multiple backends
● Primary challenge: interacting with the rest of the big data ecosystem
○ Arrow & friends might make this better with time too, buuut….
● See https://dask.pydata.org/en/latest/ &
http://dask.pydata.org/en/latest/spark.html
● Also similar ray
Lisa Zins
BEAM Beyond the JVM
● Non JVM BEAM doesn’t work outside of Google’s environment yet, so I’m
going to skip the details.
● tl;dr : uses grpc / protobuf
● But exciting new plans to unify the runners and ease the support of different
languages (called SDKS)
○ See https://beam.apache.org/contribute/portability/
● If this is exciting, you can come join me on making BEAM work in Python3
○ Yes we still don’t have that :(
○ But we're getting closer!
What about just multiprocessing?
● Does your data fit on a single machine?
● Will it continue to do so for as long as you plan to work on this team?
High Performance Spark!
You can buy it today! Several copies!
Only one chapter on non-JVM stuff, I’m sorry.
Cats love it*
*Or at least the box it comes in. If buying for a cat, get print
rather than e-book.
And some upcoming talks:
● April
○ Flink Forward
○ Dataworks Summit Berlin
○ Kafka Summit London
○ PyData London - Keynote :D
● May
○ Stata London:
https://conferences.oreilly.com/strata/strata-eu/public/schedule/detail/6
4759
k thnx bye :)
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
I need to give a testing talk in a few
months, help a “friend” out.
Will tweet results
“eventually” @holdenkarau
Do you want more realistic
benchmarks? Share your UDFs!
http://bit.ly/pySparkUDF
It’s performance review season, so help a friend out and
fill out this survey with your talk feedback
http://bit.ly/holdenTalkFeedback

More Related Content

What's hot

Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016Holden Karau
 
Extending spark ML for custom models now with python!
Extending spark ML for custom models  now with python!Extending spark ML for custom models  now with python!
Extending spark ML for custom models now with python!Holden Karau
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauSpark Summit
 
Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016Holden Karau
 
A super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAMA super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAMHolden Karau
 
Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017Holden Karau
 
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018Holden Karau
 
Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Holden Karau
 
Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016Holden Karau
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastHolden Karau
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Holden Karau
 
Getting started contributing to Apache Spark
Getting started contributing to Apache SparkGetting started contributing to Apache Spark
Getting started contributing to Apache SparkHolden Karau
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMHolden Karau
 
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...Holden Karau
 
Introduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopIntroduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopHolden Karau
 
Streaming ML on Spark: Deprecated, experimental and internal ap is galore!
Streaming ML on Spark: Deprecated, experimental and internal ap is galore!Streaming ML on Spark: Deprecated, experimental and internal ap is galore!
Streaming ML on Spark: Deprecated, experimental and internal ap is galore!Holden Karau
 
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark MeetupBeyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark MeetupHolden Karau
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Holden Karau
 
Holden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom ModelsHolden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom Modelssparktc
 
Beyond shuffling - Scala Days Berlin 2016
Beyond shuffling - Scala Days Berlin 2016Beyond shuffling - Scala Days Berlin 2016
Beyond shuffling - Scala Days Berlin 2016Holden Karau
 

What's hot (20)

Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
 
Extending spark ML for custom models now with python!
Extending spark ML for custom models  now with python!Extending spark ML for custom models  now with python!
Extending spark ML for custom models now with python!
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden Karau
 
Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016
 
A super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAMA super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAM
 
Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017
 
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
 
Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016
 
Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016
 
Getting started contributing to Apache Spark
Getting started contributing to Apache SparkGetting started contributing to Apache Spark
Getting started contributing to Apache Spark
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVM
 
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
 
Introduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopIntroduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines Workshop
 
Streaming ML on Spark: Deprecated, experimental and internal ap is galore!
Streaming ML on Spark: Deprecated, experimental and internal ap is galore!Streaming ML on Spark: Deprecated, experimental and internal ap is galore!
Streaming ML on Spark: Deprecated, experimental and internal ap is galore!
 
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark MeetupBeyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
 
Holden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom ModelsHolden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom Models
 
Beyond shuffling - Scala Days Berlin 2016
Beyond shuffling - Scala Days Berlin 2016Beyond shuffling - Scala Days Berlin 2016
Beyond shuffling - Scala Days Berlin 2016
 

Similar to A fast introduction to PySpark with a quick look at Arrow based UDFs

Big data beyond the JVM - DDTX 2018
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
Big data beyond the JVM - DDTX 2018Holden Karau
 
Keeping the fun in functional w/ Apache Spark @ Scala Days NYC
Keeping the fun in functional   w/ Apache Spark @ Scala Days NYCKeeping the fun in functional   w/ Apache Spark @ Scala Days NYC
Keeping the fun in functional w/ Apache Spark @ Scala Days NYCHolden Karau
 
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018Holden Karau
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesHolden Karau
 
Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?Holden Karau
 
An introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckAn introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckData Con LA
 
A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...Holden Karau
 
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Holden Karau
 
The magic of (data parallel) distributed systems and where it all breaks - Re...
The magic of (data parallel) distributed systems and where it all breaks - Re...The magic of (data parallel) distributed systems and where it all breaks - Re...
The magic of (data parallel) distributed systems and where it all breaks - Re...Holden Karau
 
Making the big data ecosystem work together with python apache arrow, spark,...
Making the big data ecosystem work together with python  apache arrow, spark,...Making the big data ecosystem work together with python  apache arrow, spark,...
Making the big data ecosystem work together with python apache arrow, spark,...Holden Karau
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache SparkMammoth Data
 
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...Holden Karau
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...spinningmatt
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in SparkDatabricks
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
Lightning Fast Dataframes with Polars
Lightning Fast Dataframes with PolarsLightning Fast Dataframes with Polars
Lightning Fast Dataframes with PolarsAlberto Danese
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
Introduction To Spark - Durham LUG 20150916
Introduction To Spark - Durham LUG 20150916Introduction To Spark - Durham LUG 20150916
Introduction To Spark - Durham LUG 20150916Ian Pointer
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015Robbie Strickland
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...Inhacking
 

Similar to A fast introduction to PySpark with a quick look at Arrow based UDFs (20)

Big data beyond the JVM - DDTX 2018
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
Big data beyond the JVM - DDTX 2018
 
Keeping the fun in functional w/ Apache Spark @ Scala Days NYC
Keeping the fun in functional   w/ Apache Spark @ Scala Days NYCKeeping the fun in functional   w/ Apache Spark @ Scala Days NYC
Keeping the fun in functional w/ Apache Spark @ Scala Days NYC
 
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
 
Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?
 
An introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckAn introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuck
 
A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...
 
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
 
The magic of (data parallel) distributed systems and where it all breaks - Re...
The magic of (data parallel) distributed systems and where it all breaks - Re...The magic of (data parallel) distributed systems and where it all breaks - Re...
The magic of (data parallel) distributed systems and where it all breaks - Re...
 
Making the big data ecosystem work together with python apache arrow, spark,...
Making the big data ecosystem work together with python  apache arrow, spark,...Making the big data ecosystem work together with python  apache arrow, spark,...
Making the big data ecosystem work together with python apache arrow, spark,...
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Lightning Fast Dataframes with Polars
Lightning Fast Dataframes with PolarsLightning Fast Dataframes with Polars
Lightning Fast Dataframes with Polars
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Introduction To Spark - Durham LUG 20150916
Introduction To Spark - Durham LUG 20150916Introduction To Spark - Durham LUG 20150916
Introduction To Spark - Durham LUG 20150916
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
 

Recently uploaded

Magic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMagic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMartaLoveguard
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Paul Calvano
 
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Dana Luther
 
Top 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxTop 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxDyna Gilbert
 
Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Sonam Pathan
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)Christopher H Felton
 
Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITMgdsc13
 
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja VipCall Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja VipCall Girls Lucknow
 
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一z xss
 
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一Fs
 
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作ys8omjxb
 
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一Fs
 
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa494f574xmv
 
Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Excelmac1
 
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012rehmti665
 
PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationLinaWolf1
 
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Sonam Pathan
 
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书zdzoqco
 

Recently uploaded (20)

Magic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMagic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptx
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24
 
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
 
Top 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxTop 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptx
 
Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
 
Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITM
 
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja VipCall Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
 
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
 
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
 
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
 
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
 
Hot Sexy call girls in Rk Puram 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in  Rk Puram 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in  Rk Puram 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Rk Puram 🔝 9953056974 🔝 Delhi escort Service
 
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
 
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa
 
Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...
 
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
 
PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 Documentation
 
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
 
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
 

A fast introduction to PySpark with a quick look at Arrow based UDFs

  • 1. A Fast Intro to Spark Lightning fast cluster computing* Now with vectorized UDFs! New in Spark 2.3!
  • 2. Who am I? ● My name is Holden Karau ● Prefered pronouns are she/her ● Developer Advocate at Google focused on OSS Big Data ● Apache Spark PMC (think committer with tenure) ● Contributor to a lot of other projects (including BEAM) ● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon ● co-author of High Performance Spark & Learning Spark (+ more) ● Twitter: @holdenkarau ● Slideshare http://www.slideshare.net/hkarau ● Linkedin https://www.linkedin.com/in/holdenkarau ● Github https://github.com/holdenk ● Related Spark Videos http://bit.ly/holdenSparkVideos
  • 3.
  • 4. Who do I think you all are? ● Nice people* ● Getting started with Spark ○ Or wondering if you need it ● Famalrish-ish with Python Amanda
  • 5. What we are going to explore together! ● What is Spark? ● Getting Spark setup locally ● Spark primary distributed collection ● Word count in Spark ● Spark SQL / DataFrames ● Wordcount part 2: now with vectorized UDFs
  • 6. Some things that may color my views: ● I’m on the Spark PMC -- Spark’s success => I can probably make more $s ● My employer cares about data ● I work primarily in Python & Scala these days ● I like functional programming ● Probably some others I’m forgetting On the other hand: ● I’ve worked on Spark for a long time and know a lot of its faults ● My goals are pretty flexible ● I have x86 assembly code tattooed on my back
  • 7. What is Spark? ● General purpose distributed system ○ With a really nice API including Python :) ● Apache project (one of the most active) ● Must faster than Hadoop Map/Reduce ● Good when too big for a single machine ● Built on top of two abstractions for distributed data: RDDs & Datasets
  • 8. The different pieces of Spark Apache Spark SQL, DataFrames & Datasets Structured Streaming Scala, Java, Python, & R Spark ML bagel & Graph X MLLib Scala, Java, PythonStreaming Graph Frames Paul Hudson
  • 9. Why people come to Spark: Well this MapReduce job is going to take 16 hours - how long could it take to learn Spark? dougwoods
  • 10. Why people come to Spark: My DataFrame won’t fit in memory on my cluster anymore, let alone my MacBook Pro :( Maybe this Spark business will solve that... brownpau
  • 11. Companion (optional!) notebook funtimes: http://bit.ly/sparkDocs http://bit.ly/sparkPyDocs http://bit.ly/PySparkIntroExamples (has a notebook!) ● Did you know? You can run Spark on Dataproc there by giving my employer money. You can also run it elsewhere. http://bit.ly/learningSparkExamples (lots of code files) http://bit.ly/hkMLExample (has a notebook, ML focused) David DeHetre
  • 12. SparkContext: entry to the world ● Can be used to create RDDs from many input sources ○ Native collections, local & remote FS ○ Any Hadoop Data Source ● Also create counters & accumulators ● Automatically created in the shells (called sc) ● Specify master & app name when creating ○ Master can be local[*], spark:// , yarn, etc. ○ app name should be human readable and make sense ● etc. Petfu l
  • 13. RDDs: Spark’s Primary abstraction RDD (Resilient Distributed Dataset) ● Distributed collection ● Recomputed on node failure ● Distributes data & work across the cluster ● Lazily evaluated (transformations & actions) Helen Olney
  • 14. Word count (in python) lines = sc.textFile(src) words = lines.flatMap(lambda x: x.split(" ")) word_count = (words.map(lambda x: (x, 1)) .reduceByKey(lambda x, y: x+y)) word_count.saveAsTextFile(“output”) Photo By: Will Keightley
  • 15. Why laziness is cool (and not) ● Pipelining (can put maps, filter, flatMap together) ● Can do interesting optimizations by delaying work ● We use the DAG to recompute on failure ○ (writing data out to 3 disks on different machines is so last season) ○ Or the DAG puts the R is Resilient RDD, except DAG doesn’t have an R :( How it hurts: ● Debugging is confusing ● Re-using data - lazyness only sees up to the first action ● Some people really hate immutability Matthew Hurst
  • 16. Word count (in python) lines = sc.textFile(src) words = lines.flatMap(lambda x: x.split(" ")) word_count = (words.map(lambda x: (x, 1)) .reduceByKey(lambda x, y: x+y)) word_count.saveAsTextFile("output") No data is read or processed until after this line This is an “action” which forces spark to evaluate the RDD daniilr
  • 17. RDD re-use - sadly not magic ● If we know we are going to re-use the RDD what should we do? ○ If it fits nicely in memory caching in memory ○ persisting at another level ■ MEMORY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER ○ checkpointing ● Noisey clusters ○ _2 & checkpointing can help ● persist first for checkpointing Richard Gillin
  • 18. Some common transformations & actions Transformations (lazy) ● map ● filter ● flatMap ● reduceByKey ● join ● cogroup Actions (eager) ● count ● reduce ● collect ● take ● saveAsTextFile ● saveAsHadoop ● countByValue Photo by Steve Photo by Dan G
  • 19. This can feel like magic* sometimes :) Steven Saus *I mean not good magic.
  • 20. Magic has it’s limits: key-skew + black boxes ● There is a worse way to do WordCount ● We can use the seemingly safe thing called groupByKey ● Then compute the sum... _torne
  • 21. Bad word count RDD :( words = rdd.flatMap(lambda x: x.split(" ")) wordPairs = words.map(lambda w: (w, 1)) grouped = wordPairs.groupByKey() counted_words = grouped.mapValues(lambda counts: sum(counts)) counted_words.saveAsTextFile("boop") Tomomi
  • 22. f ford Pinto by Morven
  • 23. f ford Pinto by Morven ayphen
  • 24. Why should we consider Datasets? ● Performance ○ Smart optimizer ○ More efficient storage ○ Faster serialization ● Simplicity ○ Windowed operations ○ Multi-column & multi-type aggregates Rikki's Refuge
  • 25. Why are Datasets so awesome? ● Easier to mix functional style and relational style ○ No more hive UDFs! ● Nice performance of Spark SQL flexibility of RDDs ○ Tungsten (better serialization) ○ Equivalent of Sortable trait ● Strongly typed ● The future (ML, Graph, etc.) ● Potential for better language interop ○ Something like Arrow has a much better chance with Datasets ○ Cross-platform libraries are easier to make & use Will Folsom
  • 26. What is the performance like? Andrew Skudder
  • 27. How is it so fast? ● Optimizer has more information (schema & operations) ● More efficient storage formats ● Faster serialization ● Some operations directly on serialized data formats ● non-JVM languages: does more computation in the JVM Andrew Skudder
  • 28. Word count w/Dataframes df = spark.read.load(src) # Returns an RDD words = df.select("text").flatMap(lambda x: x.text.split(" ")) words_df = words.map( lambda x: Row(word=x, cnt=1)).toDF() word_count = words_df.groupBy("word").sum() word_count.write.format("parquet").save("wc.parquet") Still have the double serialization here :(
  • 29. What can the optimizer do now? ● Sort on the serialized data ● Understand the aggregate (“partial aggregates”) ○ Could sort of do this before but not as awesomely, and only if we used reduceByKey - not groupByKey ● Pack them bits nice and tight
  • 30. So whats this new groupBy? ● No longer causes explosions like RDD groupBy ○ Able to introspect and pipeline the aggregation ● Returns a GroupedData (or GroupedDataset) ● Makes it easy to perform multiple aggregations ● Built in shortcuts for aggregates like avg, min, max ● Longer list at http://spark.apache.org/docs/latest/api/scala/index.html# org.apache.spark.sql.functions$ ● Allows the optimizer to see what aggregates are being performed Sherrie Thai
  • 31. Computing some aggregates by age code: df.groupBy("age").min("hours-per-week") OR import org.apache.spark.sql.catalyst.expressions.aggregate._ df.groupBy("age").agg(min("hours-per-week"))
  • 32. Easily compute multiple aggregates: df.groupBy("age").agg(min("hours-per-week"), avg("hours-per-week"), max("capital-gain")) PhotoAtelier
  • 33. But where DataFrames explode? ● Iterative algorithms - large plans ○ Use your escape hatch to RDDs! ● Some push downs are sad pandas :( ● Default shuffle size is sometimes too small for big data (200 partitions) ● Default partition size when reading in is also sad
  • 34. Spark ML pipelines Tokenizer HashingTF String Indexer Naive Bayes Tokenizer HashingTF Streaming String Indexer Streaming Naive Bayes fit(df) Estimator Transformer ● Sci-Kit Learn Inspired ● Consist of Estimators and Transformers
  • 35. Data prep # Combines a list of double input features into a vector assembler = VectorAssembler(inputCols=["age", "education-num"], outputCol="feautres") # String indexer converts a set of strings into doubles indexer = StringIndexer(inputCol="category") .setOutputCol("category-index") # Can be used to combine pipeline components together pipeline = Pipeline().setStages([assembler, indexer]) Huang Yun Chung
  • 36. And its not complete without a tree :p dt = DecisionTreeClassifier(labelCol = "category-index", featuresCol="feautres") pipeline = Pipeline().setStages([assembler, indexer, dt]) model = pipeline.fit(df) Diethelm Scheidereit
  • 37. Spark in Scala, how does PySpark work? ● Py4J + pickling + JSON and magic ○ This can be kind of slow sometimes ● Distributed collections are often collections of pickled objects ● Spark SQL (and DataFrames) avoid some of this ○ Sometimes we can make them go fast and compile them to the JVM ● Features aren’t automatically exposed, but exposing them is normally simple. ● SparkR depends on similar magic kristin klein
  • 38. So what does that look like? Driver py4j Worker 1 Worker K pipe pipe
  • 39. The present*: faster interchange ● By present I mean in a release this month, so with lots of bugs ● Unifying our cross-language experience ○ And not just “normal” languages, CUDA counts yo Tambako The Jaguar
  • 40. Andrew Skudder *Arrow: likely the future. I really hope so. Spark 2.3 and beyond! * * With early work happening to support GPUs/ TF.
  • 41. What does the future look like?* *Source: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html. *Vendor benchmark. Trust but verify.
  • 42. What does the future look like - in code @pandas_udf("integer", PandasUDFType.SCALAR) def add_one(x): return x + 1
  • 43. What does the future look like - in code @pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP) def normalize(pdf): v = pdf.v return pdf.assign(v=(v - v.mean()) / v.std())
  • 44. What does the future look like - in code @pandas_udf("word string", PandasUDFType.GROUPED_MAP) def special_tokenize(s): if s.strings is not None: return pandas.DataFrame(reduce(list.__add__, map(lambda x: x.split(' '), s.strings))) # This is a little ugly because currently the scalar transform # doesn't allow flat map like behaviour only map like. grouped = df.groupby("strings") tokenized = grouped.apply(special_tokenize) tokenized.show() More Word Count!
  • 45. What’s the general state of non-JVM world? Most of the tools are built in the JVM, so how do we play together? ● Pickling, Strings, JSON, XML, oh my! ● Unix pipes ● Sockets What about if we don’t want to copy the data all the time? Dataframe Api + Arrow ● Or standalone “pure”* re-implementations of everything ○ Reasonable option for things like Kafka where you would have the I/O regardless. ○ Also cool projects like dask (pure python) -- but hard to talk to existing ecosystem David Brown
  • 46. Dask: a new beginning? ● Pure* python implementation ● Provides real enough DataFrame interface for distributed data ● Also your standard-ish distributed collections ● Multiple backends ● Primary challenge: interacting with the rest of the big data ecosystem ○ Arrow & friends might make this better with time too, buuut…. ● See https://dask.pydata.org/en/latest/ & http://dask.pydata.org/en/latest/spark.html ● Also similar ray Lisa Zins
  • 47. BEAM Beyond the JVM ● Non JVM BEAM doesn’t work outside of Google’s environment yet, so I’m going to skip the details. ● tl;dr : uses grpc / protobuf ● But exciting new plans to unify the runners and ease the support of different languages (called SDKS) ○ See https://beam.apache.org/contribute/portability/ ● If this is exciting, you can come join me on making BEAM work in Python3 ○ Yes we still don’t have that :( ○ But we're getting closer!
  • 48. What about just multiprocessing? ● Does your data fit on a single machine? ● Will it continue to do so for as long as you plan to work on this team?
  • 49. High Performance Spark! You can buy it today! Several copies! Only one chapter on non-JVM stuff, I’m sorry. Cats love it* *Or at least the box it comes in. If buying for a cat, get print rather than e-book.
  • 50. And some upcoming talks: ● April ○ Flink Forward ○ Dataworks Summit Berlin ○ Kafka Summit London ○ PyData London - Keynote :D ● May ○ Stata London: https://conferences.oreilly.com/strata/strata-eu/public/schedule/detail/6 4759
  • 51. k thnx bye :) If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark I need to give a testing talk in a few months, help a “friend” out. Will tweet results “eventually” @holdenkarau Do you want more realistic benchmarks? Share your UDFs! http://bit.ly/pySparkUDF It’s performance review season, so help a friend out and fill out this survey with your talk feedback http://bit.ly/holdenTalkFeedback