A Fast Intro to Spark
And a glance at BEAM
Lightning fast cluster computing*
Who am I?
● So this is kind of a long shot, but American TV gets everywhere so….
● I’m not a doctor but I did stay at an IHG property last night
● Which is like a fancy version of Holiday Inn Express
○ I’m honestly not sure if this makes me more or less qualified
● And I did get my IHG points restored
● Ok but for real
Who am I?
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google focused on OSS Big Data
● Apache Spark PMC
● Contributor to a lot of other projects (including BEAM)
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of High Performance Spark & Learning Spark (+ more)
● Twitter: @holdenkarau
● Slideshare
● Linkedin
● Github
● Related Spark Videos
Who do I think you all are?
● Nice people*
● Getting started with Spark or BEAM
○ Or wondering if you need it
● Familiar-ish with Scala or Java or Python
What we are going to explore together!
● What is Spark?
● Getting Spark set up locally
● Spark’s primary distributed collection
● Word count in Spark
● Spark SQL / DataFrames
Then a glance at BEAM
● What is BEAM & what’s its current state
● Streaming wordcount because of course
Some things that may color my views:
● I’m on the Spark PMC -- Spark’s success => I can probably make more $s
● My employer cares about BEAM (and Spark and other things)
● I work primarily in Python & Scala these days
● I like functional programming
● Probably some others I’m forgetting
On the other hand:
● I’ve worked on Spark for a long time and know a lot of its faults
● My goals are pretty flexible
● I have x86 assembly code tattooed on my back
What is Spark?
● General purpose distributed system
○ With a really nice API including Python :)
● Apache project (one of the most active)
● Much faster than Hadoop MapReduce
● Good when too big for a single machine
● Built on top of two abstractions for
distributed data: RDDs & Datasets
The different pieces of Spark
Apache Spark core, with components built on top: SQL, DataFrames & Datasets; Spark ML; Bagel & GraphX; language APIs including Python
Paul Hudson
Why people come to Spark:
Well this MapReduce job is going to take 16 hours - how long could it take to learn Spark?
Why people come to Spark:
My DataFrame won’t fit in memory on my cluster anymore, let alone my MacBook Pro :( Maybe this Spark business will solve that...
Companion (optional!) notebook funtimes: (has a notebook!)
● Did you know? You can run Spark on Dataproc, thereby
giving my employer money. You can also run it
elsewhere. (lots of code files) (has a notebook, ML focused)
David DeHetre
SparkContext: entry to the world
● Can be used to create RDDs from many input sources
○ Native collections, local & remote FS
○ Any Hadoop Data Source
● Also create counters & accumulators
● Automatically created in the shells (called sc)
● Specify master & app name when creating
○ Master can be local[*], spark:// , yarn, etc.
○ app name should be human readable and make sense
● etc.
RDDs: Spark’s Primary abstraction
RDD (Resilient Distributed Dataset)
● Distributed collection
● Recomputed on node failure
● Distributes data & work across the cluster
● Lazily evaluated (transformations & actions)
Helen Olney
Word count (in python)
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x+y))
Photo By: Will
Why laziness is cool (and not)
● Pipelining (can put maps, filter, flatMap together)
● Can do interesting optimizations by delaying work
● We use the DAG to recompute on failure
○ (writing data out to 3 disks on different machines is so last season)
○ Or the DAG puts the R in Resilient RDD, except DAG doesn’t have an R :(
How it hurts:
● Debugging is confusing
● Re-using data - laziness only sees up to the first action
● Some people really hate immutability
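The pipelining and "nothing runs until an action" points above can be mimicked in plain Python with generators. This is only an analogy, not Spark code: `read_lines`, `LOG`, and the sample data are hypothetical names made up for this sketch.

```python
# Plain-Python analogy for lazy transformations (not Spark itself).
LOG = []  # records when work actually happens


def read_lines(data):
    """Hypothetical source: yields one line at a time, logging each read."""
    for line in data:
        LOG.append("read: " + line)
        yield line


data = ["a b", "b c"]

# "Transformations": building the pipeline does no work yet.
lines = read_lines(data)
words = (w for line in lines for w in line.split(" "))  # like flatMap
pairs = ((w, 1) for w in words)                         # like map

work_before_action = len(LOG)  # nothing has been read so far

# "Action": consuming the pipeline pulls each record through every stage,
# one record at a time, without materializing the intermediate collections.
result = list(pairs)
work_after_action = len(LOG)   # each input line was read exactly once
```

The debugging pain mentioned above shows up here too: an error inside `read_lines` would only surface at the `list(pairs)` line, far from where the pipeline was defined.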
Matthew Hurst
Word count (in python)
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
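word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x+y))
# No data is read or processed until after the line above
word_count.saveAsTextFile(output)
# saveAsTextFile is an “action” which forces Spark to evaluate the RDD (output is a placeholder path)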
RDD re-use - sadly not magic
● If we know we are going to re-use the RDD what should we do?
○ If it fits nicely in memory: cache it in memory
○ Otherwise: persist at another storage level
○ Or checkpoint
● Noisy clusters
○ _2 (replicated) storage levels & checkpointing can help
● persist first for checkpointing
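Why re-use is "not magic" can be sketched with a toy class that remembers its lineage and recomputes it on every action unless told to cache. This is a plain-Python stand-in, not the RDD API: `LazyDataset`, `expensive`, and the `computations` counter are hypothetical names used only for illustration.

```python
class LazyDataset:
    """Toy stand-in for an RDD: it only remembers how to compute itself."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function producing a list
        self._cached = None
        self._use_cache = False
        self.computations = 0     # how many times the lineage was re-run

    def cache(self):
        """Ask for the result of the first action to be kept around."""
        self._use_cache = True
        return self

    def _materialize(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.computations += 1
        data = self._compute()
        if self._use_cache:
            self._cached = data
        return data

    # Two "actions"
    def count(self):
        return len(self._materialize())

    def collect(self):
        return list(self._materialize())


def expensive():
    """Pretend this is a long chain of transformations."""
    return [w for line in ["a b", "b c"] for w in line.split(" ")]


uncached = LazyDataset(expensive)
uncached.count()
uncached.collect()    # the whole lineage runs a second time

cached = LazyDataset(expensive).cache()
cached.count()
cached.collect()      # served from the cached copy
```

After both actions, `uncached.computations` is 2 while `cached.computations` is 1, which is the cost that caching, persisting, or checkpointing is meant to avoid.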
Richard Gillin
Some common transformations & actions
Transformations (lazy)
● map
● filter
● flatMap
● reduceByKey
● join
● cogroup
Actions (eager)
● count
● reduce
● collect
● take
● saveAsTextFile
● saveAsHadoopFile
● countByValue
Photo by Steve
Photo by Dan G
This can feel like magic* sometimes :)
Steven Saus
*I mean not good magic.
Magic has its limits: key-skew + black boxes
● There is a worse way to do WordCount
● We can use the seemingly safe thing called groupByKey
● Then compute the sum...
Bad word count RDD :(
words = rdd.flatMap(lambda x: x.split(" "))
wordPairs = words.map(lambda w: (w, 1))
grouped = wordPairs.groupByKey()
counted_words = grouped.mapValues(lambda counts: sum(counts))
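To see why the groupByKey version hurts with skewed keys, here is a plain-Python simulation of the two strategies over two made-up partitions. It is a sketch of the idea only (no Spark involved); the function names and the sample data are hypothetical.

```python
from collections import Counter, defaultdict

# Two hypothetical input partitions with a skewed key ("the").
partitions = [
    ["the", "the", "the", "cat"],
    ["the", "the", "dog"],
]


def group_by_key_style(parts):
    """Ship every (word, 1) pair, then sum: one record per occurrence."""
    shuffled = [(w, 1) for part in parts for w in part]
    grouped = defaultdict(list)
    for word, one in shuffled:
        grouped[word].append(one)  # the whole group is held in memory
    largest_group = max(len(v) for v in grouped.values())
    counts = {w: sum(v) for w, v in grouped.items()}
    return counts, len(shuffled), largest_group


def reduce_by_key_style(parts):
    """Combine inside each partition first, then merge the partial sums."""
    partials = [Counter(part) for part in parts]  # map-side combine
    shuffled = [(w, c) for p in partials for w, c in p.items()]
    counts = Counter()
    for word, c in shuffled:
        counts[word] += c
    return dict(counts), len(shuffled)


g_counts, g_shuffled, g_largest = group_by_key_style(partitions)
r_counts, r_shuffled = reduce_by_key_style(partitions)
```

Both strategies produce the same counts, but the grouping version moves 7 records and buffers a group of 5 for the hot key, while the combining version moves 4 partial sums. With real data the hot key's group can grow past what one executor can hold.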
f ford Pinto by Morven
Why should we consider Datasets?
● Performance
○ Smart optimizer
○ More efficient storage
○ Faster serialization
● Simplicity
○ Windowed operations
○ Multi-column & multi-type aggregates
Rikki's Refuge
Why are Datasets so awesome?
● Easier to mix functional style and relational style
○ No more hive UDFs!
● Nice performance of Spark SQL + flexibility of RDDs
○ Tungsten (better serialization)
○ Equivalent of Sortable trait
● Strongly typed
● The future (ML, Graph, etc.)
● Potential for better language interop
○ Something like Arrow has a much better chance with Datasets
○ Cross-platform libraries are easier to make & use
Will Folsom
What is the performance like?
Andrew Skudder
How is it so fast?
● Optimizer has more information (schema & operations)
● More efficient storage formats
● Faster serialization
● Some operations directly on serialized data formats
● non-JVM languages: does more computation in the JVM
Andrew Skudder
Word count w/Dataframes
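from pyspark.sql import Row
# sqlCtx is assumed to be an existing SQLContext and src an input path
df = sqlCtx.read.load(src)
# Returns an RDD
words = df.select("text").rdd.flatMap(lambda x: x.text.split(" "))
words_df = words.map(
    lambda x: Row(word=x, cnt=1)).toDF()
# Still have the double serialization here :(
word_count = words_df.groupBy("word").sum()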
Word count w/Datasets
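// Assumes an existing sqlCtx and src, plus import sqlCtx.implicits._ and org.apache.spark.sql.functions.count
val df = sqlCtx.read.load(src).select("text")
// If it’s a simple type we don’t have to define a case class
val ds = df.as[String]
// Returns a Dataset! Can’t push down filters from here
val words = ds.flatMap(x => x.split(" "))
val grouped = words.groupBy("value")
val word_count = grouped.agg(count("*") as "count")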
What can the optimizer do now?
● Sort on the serialized data
● Understand the aggregate (“partial aggregates”)
○ Could sort of do this before but not as awesomely, and only if we used
reduceByKey - not groupByKey
● Pack them bits nice and tight
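The "partial aggregates" idea can be illustrated without Spark: each partition keeps a small mergeable state per key, and only those states are combined. The sketch below is plain Python under that assumption; `partial`, `merge`, and the sample rows are hypothetical, and it says nothing about how Spark's optimizer implements this internally.

```python
from functools import reduce

# Hypothetical rows of (key, value) split across two partitions.
partitions = [
    [("a", 1.0), ("a", 3.0), ("b", 10.0)],
    [("a", 5.0), ("b", 20.0)],
]


def partial(rows):
    """Per-partition state: key -> (sum, count, min, max)."""
    state = {}
    for key, v in rows:
        s, c, lo, hi = state.get(key, (0.0, 0, v, v))
        state[key] = (s + v, c + 1, min(lo, v), max(hi, v))
    return state


def merge(left, right):
    """Merge two partial states; associative, so merge order does not matter."""
    out = dict(left)
    for key, (s, c, lo, hi) in right.items():
        if key in out:
            s0, c0, lo0, hi0 = out[key]
            out[key] = (s0 + s, c0 + c, min(lo0, lo), max(hi0, hi))
        else:
            out[key] = (s, c, lo, hi)
    return out


merged = reduce(merge, (partial(p) for p in partitions))
result = {
    k: {"avg": s / c, "min": lo, "max": hi}
    for k, (s, c, lo, hi) in merged.items()
}
```

Note that the average is carried as a (sum, count) pair rather than as an average, since averages of averages are not mergeable. An engine can only plan this way when it can see which aggregate is being computed, which an opaque lambda hides.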
So what’s this new groupBy?
● No longer causes explosions like RDD groupBy
○ Able to introspect and pipeline the aggregation
● Returns a GroupedData (or GroupedDataset)
● Makes it easy to perform multiple aggregations
● Built in shortcuts for aggregates like avg, min, max
● Longer list at
● Allows the optimizer to see what aggregates are being performed
Sherrie Thai
Computing some aggregates by age code:
import org.apache.spark.sql.catalyst.expressions.aggregate._
Easily compute multiple aggregates:
Using Datasets to mix functional & relational style:
val ds: Dataset[RawPanda] = ...
val happiness = ds.toDF().filter($"happy" === true).as[RawPanda].
reduce((x, y) => x + y)
So what was that?
ds.toDF().filter($"happy" === true).as[RawPanda].
reduce((x, y) => x + y)
● Convert a Dataset to a DataFrame to access more DataFrame functions (pre-2.0)
● Convert DataFrame back to a Dataset
● A typed query (specifies the return type)
● Traditional functional style (the reduce): arbitrary Scala code :)
And functional style maps:
/** Functional map + Dataset, sums the positive attributes for the pandas */
def funMap(ds: Dataset[RawPanda]): Dataset[Double] = {
  ds.map{rp => rp.attributes.filter(_ > 0).sum}
}
Chris Isherwood
But where do DataFrames explode?
● Iterative algorithms - large plans
○ Use your escape hatch to RDDs!
● Some push downs are sad pandas :(
● Default shuffle size is sometimes too small for big data
(200 partitions)
● Default partition size when reading in is also sad
Our ever growing ecosystem:
General purpose eating the world
● Operations overhead
● Moving data from System 1 to System 2 (sqoop and friends)
● We still have specialized tools, but being built on top of general frameworks
○ e.g. see mahout on Spark
○ Less closely tied things like Hive/Pig on Spark
○ TF.Transform etc.
Photo by D Coetzee
Even then, lots of general purpose tools:
Resulting in:
And language silos (Scala, Python, Go, etc.!)
Photo by: photobom Photo: Fritz Schuman (ScalaDays CPH)
And cloud silos….
Photo By: Zechariah Judy
And the proliferation of pagers :(
Photo by: Hades2k
Mike Knell
What’s the state of non-JVM big data?
Most of the tools are built in the JVM, so how do we play together?
● Pickling, Strings, JSON, XML, oh my!
● Unix pipes
● Sockets
What about if we don’t want to copy the data all the time?
● Or standalone “pure”* re-implementations of everything
○ Reasonable option for things like Kafka where you would have the I/O regardless.
○ Also cool projects like dask -- but hard to talk to existing ecosystem
David Brown
Spark in Scala, how does PySpark work?
● Py4J + pickling + JSON and magic
○ This can be kind of slow sometimes
● Distributed collections are often collections of pickled objects
● Spark SQL (and DataFrames) avoid some of this
○ Sometimes we can make them go fast and compile them to the JVM
● Features aren’t automatically exposed, but exposing
them is normally simple.
● SparkR depends on similar magic
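The cost of the pickling approach is easier to picture with a small round trip. The sketch below uses only the standard `pickle` module to imitate what happens around a single Python `map`: the data is held as opaque bytes, handed to Python, unpickled, transformed, and pickled again. `python_worker` and the sample rows are hypothetical, and the real PySpark plumbing (Py4J, sockets, batching) is deliberately left out.

```python
import pickle

rows = [("panda", 1), ("spark", 2), ("beam", 3)]  # hypothetical records


def python_worker(payload: bytes) -> bytes:
    """Stand-in for a Python worker: unpickle, apply the function, re-pickle."""
    batch = pickle.loads(payload)
    transformed = [(word.upper(), n * 2) for word, n in batch]
    return pickle.dumps(transformed)


# "JVM side": the batch is only ever seen as opaque bytes.
to_worker = pickle.dumps(rows)
from_worker = python_worker(to_worker)

# Reading the result back costs one more deserialization.
result = pickle.loads(from_worker)
bytes_moved = len(to_worker) + len(from_worker)
```

Every record is serialized twice and deserialized twice for one trivial transformation, which is the overhead that keeping work inside the JVM (Spark SQL, DataFrames) tries to avoid.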
kristin klein
So what does that look like?
Worker 1
Worker K
The “future”*: faster interchange
● By future I mean availability starting in the next 3-6 months (with more
improvements after).
○ Yes much of this code exists, it just isn’t released yet so I’m sure we’ll find all sorts of bugs
and ways to improve.
○ Relatedly you can help us test in Spark 2.3 when we start the RC process to catch bugs early!
● Unifying our cross-language experience
○ And not just “normal” languages, CUDA counts yo
Tambako The Jaguar
Andrew Skudder
*Arrow: likely the future. I really hope so. Spark 2.3 and beyond!
With early work happening to
support GPUs/ TF.
BEAM backends:
● BEAM nominally supports*
○ Dataflow
○ Flink*
○ Spark*
○ IBM Streams, etc.
● Goal of more than just lowest-common-denominator, think of it like a compiler
*Supports as in early-stage, but we’re working on it (and we’d love your help!)
**But you know, in the same sense I compare Spark Streams to pandas coming down a wooden slide.
BEAM Languages
● JVM: Scala, Java, etc.
● non-JVM: Python w/Go and more coming
BEAM Beyond the JVM
● This part doesn’t work outside of Google’s hosted environment yet, so I’m
going to be light on the details
● tl;dr : uses grpc / protobuf
● But exciting new plans (w/ some early code) to unify the runners and ease the
support of different languages (called SDKs)
○ See
What do the different APIs look like?
● Everyone's favourite: Streaming Word Count Example
● And then windowed wordcount!
● (And also a peek at TensorFlow in case anyone is trying to raise a series A)
Spark wordcount (Python*) - “pure” relational
from pyspark.sql.functions import explode, split
# Split the lines into words
words = lines.select(
    # explode turns each item in an array into a separate row
    explode(
        split(lines.value, ' ')
    ).alias('word')
)
# Generate running word count
wordCounts = words.groupBy('word').count()
BEAM wordcount (Java)
p.apply("ExtractWords", ParDo.of(new DoFn<String, String>() {
  @ProcessElement
  public void processElement(ProcessContext c) {
    for (String word : c.element().split("\\W+")) { c.output(word); }
  }
}));
What about windowed word count?
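As a rough picture of what a windowed word count computes, here is a plain-Python sketch of fixed (tumbling) windows keyed by event time. It is neither the Spark nor the BEAM API, just the underlying bucketing logic; the 60-second window size, the `windowed_word_count` helper, and the sample events are assumptions made for illustration.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 60  # assumed window size for this sketch

# Hypothetical (event_time_seconds, line) records.
events = [
    (5, "spark beam"),
    (42, "spark"),
    (61, "beam beam"),
    (130, "spark"),
]


def windowed_word_count(records, window=WINDOW_SECONDS):
    """Count words per fixed window, keyed by the window's start time."""
    counts = defaultdict(Counter)
    for ts, line in records:
        window_start = (ts // window) * window  # bucket by event time
        for word in line.split(" "):
            counts[window_start][word] += 1
    return {start: dict(c) for start, c in counts.items()}


result = windowed_word_count(events)
```

A real streaming engine also has to decide when a window is complete and what to do with late records, which this batch-style sketch ignores.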
Trish Hamme
Christer van der Meeren
What else might happen?
● One execution engine becomes super amazing at everything
● Instead of a compiler like unifier we see something like streaming SQL
become our unifier
○ Relatedly BEAM, Spark & Flink, and Kafka all have streaming SQL implementations.
● People realize their big data problem is actually three small data problems in
a trench coat
And some upcoming talks:
● Jan
○ If interest tomorrow: Office Hours? Tweet me @holdenkarau
○ LinuxConf AU - next week
○ Sydney Spark meetup 23rd
○ Data Day Texas - Nate will be there too!
● Feb
○ FOSDEM - One on testing one on scaling
○ JFokus in Stockholm - Adding deep learning to Spark
○ I disappear for a week and pretend computers work
● March
○ Strata San Jose - Big Data Beyond the JVM
Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Coming soon: Spark in Action
High Performance Spark
Coming soon: Learning PySpark
High Performance Spark!
Available today!
You can buy it from that scrappy Seattle bookstore, Jeff
Bezos needs another newspaper and I want a cup of coffee.
Cat wave photo by Quinn Dombrowski
k thnx bye!
If you <3 testing & want to fill out
Want to tell me (and or my boss) how
I’m doing?
Want to e-mail me? Promise not to be
creepy? Ok:

A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)Christopher H Felton
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa494f574xmv
Top 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxTop 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxDyna Gilbert
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Paul Calvano
Elevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New OrleansElevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New Orleanscorenetworkseo
NSX-T and Service Interfaces presentation
NSX-T and Service Interfaces presentationNSX-T and Service Interfaces presentation
NSX-T and Service Interfaces presentationMarko4394
Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Sonam Pathan
PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationLinaWolf1
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一z xss
Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Excelmac1

Recently uploaded (20)

Magic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMagic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptx
Contact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New DelhiContact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New Delhi
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa
Top 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxTop 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptx
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24
Elevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New OrleansElevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New Orleans
NSX-T and Service Interfaces presentation
NSX-T and Service Interfaces presentationNSX-T and Service Interfaces presentation
NSX-T and Service Interfaces presentation
Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170
PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 Documentation
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...

A Fast Intro to Spark and BEAM

  • 1. A Fast Intro to Spark And a glance at BEAM Lightning fast cluster computing*
  • 3. Who am I? ● So this is kind of a long shot, but American TV gets everywhere so…. ● I’m not a doctor but I did stay at an IHG property last night ● Which is like a fancy version of Holiday Inn Express ○ I’m honestly not sure if this makes me more or less qualified ● And I did get my IHG points restored ● Ok but for real
  • 4. Who am I? ● My name is Holden Karau ● Preferred pronouns are she/her ● Developer Advocate at Google focused on OSS Big Data ● Apache Spark PMC ● Contributor to a lot of other projects (including BEAM) ● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon ● co-author of High Performance Spark & Learning Spark (+ more) ● Twitter: @holdenkarau ● Slideshare ● Linkedin ● Github ● Related Spark Videos
  • 5.
  • 6. Who do I think you all are? ● Nice people* ● Getting started with Spark or BEAM ○ Or wondering if you need it ● Familiar-ish with Scala or Java or Python Amanda
  • 7. What we are going to explore together! ● What is Spark? ● Getting Spark setup locally ● Spark primary distributed collection ● Word count in Spark ● Spark SQL / DataFrames Then a glance at BEAM ● What is BEAM & what’s its current state ● Streaming wordcount because of course
  • 8. Some things that may color my views: ● I’m on the Spark PMC -- Spark’s success => I can probably make more $s ● My employer cares about BEAM (and Spark and other things) ● I work primarily in Python & Scala these days ● I like functional programming ● Probably some others I’m forgetting On the other hand: ● I’ve worked on Spark for a long time and know a lot of its faults ● My goals are pretty flexible ● I have x86 assembly code tattooed on my back
  • 9. What is Spark? ● General purpose distributed system ○ With a really nice API including Python :) ● Apache project (one of the most active) ● Much faster than Hadoop Map/Reduce ● Good when too big for a single machine ● Built on top of two abstractions for distributed data: RDDs & Datasets
  • 10. The different pieces of Spark [diagram of components: Apache Spark; Spark SQL, DataFrames & Datasets; Structured Streaming; Streaming; Spark ML; MLlib; Bagel & GraphX; GraphFrames; language labels: "Scala, Java, Python, & R" and "Scala, Java, Python"] Paul Hudson
  • 11. Why people come to Spark: Well this MapReduce job is going to take 16 hours - how long could it take to learn Spark? dougwoods
  • 12. Why people come to Spark: My DataFrame won’t fit in memory on my cluster anymore, let alone my MacBook Pro :( Maybe this Spark business will solve that... brownpau
  • 13. Companion (optional!) notebook funtimes: (has a notebook!) ● Did you know? You can run Spark on Dataproc, thereby giving my employer money. You can also run it elsewhere. (lots of code files) (has a notebook, ML focused) David DeHetre
  • 14. SparkContext: entry to the world ● Can be used to create RDDs from many input sources ○ Native collections, local & remote FS ○ Any Hadoop Data Source ● Also create counters & accumulators ● Automatically created in the shells (called sc) ● Specify master & app name when creating ○ Master can be local[*], spark:// , yarn, etc. ○ app name should be human readable and make sense ● etc. Petful
  • 15. RDDs: Spark’s Primary abstraction RDD (Resilient Distributed Dataset) ● Distributed collection ● Recomputed on node failure ● Distributes data & work across the cluster ● Lazily evaluated (transformations & actions) Helen Olney
  • 16. Word count (in python)
        lines = sc.textFile(src)
        words = lines.flatMap(lambda x: x.split(" "))
        word_count = (words.map(lambda x: (x, 1))
                      .reduceByKey(lambda x, y: x + y))
        word_count.saveAsTextFile("output")
      Photo By: Will Keightley
  • 17. Why laziness is cool (and not) ● Pipelining (can put maps, filter, flatMap together) ● Can do interesting optimizations by delaying work ● We use the DAG to recompute on failure ○ (writing data out to 3 disks on different machines is so last season) ○ Or the DAG puts the R in Resilient RDD, except DAG doesn’t have an R :( How it hurts: ● Debugging is confusing ● Re-using data - laziness only sees up to the first action ● Some people really hate immutability Matthew Hurst
  • 18. Word count (in python)
        lines = sc.textFile(src)
        words = lines.flatMap(lambda x: x.split(" "))
        word_count = (words.map(lambda x: (x, 1))
                      .reduceByKey(lambda x, y: x + y))  # No data is read or processed until after this line
        word_count.saveAsTextFile("output")  # This is an “action” which forces Spark to evaluate the RDD
      daniilr
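The lazy-until-action behaviour on this slide can be imitated without a cluster. The following is a minimal plain-Python sketch, not Spark itself: generators stand in for transformations, and the names `text_file`, `build_pipeline`, and `run_action` are hypothetical helpers invented for illustration.

```python
from collections import Counter

reads = []  # records every line the "source" actually hands out


def text_file(lines):
    """Lazily yield lines, logging each read (a stand-in for sc.textFile)."""
    for line in lines:
        reads.append(line)
        yield line


def build_pipeline(lines):
    """Chain lazy steps; nothing runs until the result is consumed."""
    source = text_file(lines)
    words = (w for line in source for w in line.split(" "))  # like flatMap
    pairs = ((w, 1) for w in words)                          # like map
    return pairs


def run_action(pairs):
    """The "action": consuming the pipeline forces the reads to happen."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


pipeline = build_pipeline(["a b a", "b c"])
reads_before_action = len(reads)   # only a plan exists so far
word_count = run_action(pipeline)
reads_after_action = len(reads)    # the action pulled every line through
```

Building the pipeline touches no input; only the final loop in `run_action` drives the reads, which is roughly the role `saveAsTextFile` plays in the slide.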
  • 19. RDD re-use - sadly not magic ● If we know we are going to re-use the RDD what should we do? ○ If it fits nicely in memory: cache it in memory ○ Otherwise persist at another level ■ MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER ○ checkpointing ● Noisy clusters ○ _2 (replicated) storage levels & checkpointing can help ● persist first for checkpointing Richard Gillin
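Why re-use is "not magic" can be shown with a toy model. This is a sketch under stated assumptions: `LazyData` is a hypothetical stand-in for an RDD (it is not a Spark class), and `parse` is an invented, deliberately counted transformation. Without caching, every action replays the whole recipe.

```python
calls = {"parse": 0}


def parse(record):
    """Pretend-expensive step; counts how often it runs."""
    calls["parse"] += 1
    return int(record)


class LazyData:
    """Hypothetical stand-in for an RDD: holds a recipe, not results."""

    def __init__(self, compute):
        self._compute = compute
        self._cached = None
        self._use_cache = False

    def cache(self):
        """Ask for results to be kept after the first action."""
        self._use_cache = True
        return self

    def _materialize(self):
        if self._use_cache:
            if self._cached is None:
                self._cached = list(self._compute())
            return self._cached
        return list(self._compute())  # recomputed on every action

    def count(self):
        return len(self._materialize())

    def total(self):
        return sum(self._materialize())


raw = ["1", "2", "3"]

uncached = LazyData(lambda: (parse(r) for r in raw))
uncached.count()
uncached.total()
uncached_parse_calls = calls["parse"]  # two actions, two full passes

calls["parse"] = 0
cached = LazyData(lambda: (parse(r) for r in raw)).cache()
cached.count()
cached.total()
cached_parse_calls = calls["parse"]    # one pass, then served from memory
```

The second action on the uncached data repeats all the parsing, which is the cost that caching, persisting, or checkpointing is meant to avoid.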
  • 20. Some common transformations & actions Transformations (lazy) ● map ● filter ● flatMap ● reduceByKey ● join ● cogroup Actions (eager) ● count ● reduce ● collect ● take ● saveAsTextFile ● saveAsHadoop ● countByValue Photo by Steve Photo by Dan G
  • 21. This can feel like magic* sometimes :) Steven Saus *I mean not good magic.
  • 22. Magic has its limits: key-skew + black boxes ● There is a worse way to do WordCount ● We can use the seemingly safe thing called groupByKey ● Then compute the sum... _torne
  • 23. Bad word count RDD :(
        words = rdd.flatMap(lambda x: x.split(" "))
        wordPairs = words.map(lambda w: (w, 1))
        grouped = wordPairs.groupByKey()
        counted_words = grouped.mapValues(lambda counts: sum(counts))
        counted_words.saveAsTextFile("boop")
      Tomomi
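To see why the groupByKey version is the worse one, here is a rough plain-Python simulation of what crosses the shuffle in each style. It is only a sketch: the two functions and the toy partitions are invented for illustration, and real Spark shuffles involve far more machinery.

```python
from collections import Counter, defaultdict


def shuffle_group_by_key(partitions):
    """groupByKey-style: every (word, 1) pair crosses the shuffle."""
    shuffled = [pair for part in partitions for pair in part]
    grouped = defaultdict(list)
    for word, one in shuffled:
        grouped[word].append(one)  # a skewed key piles up in one list
    counts = {word: sum(ones) for word, ones in grouped.items()}
    return counts, len(shuffled)


def shuffle_reduce_by_key(partitions):
    """reduceByKey-style: combine inside each partition before shuffling."""
    shuffled = []
    for part in partitions:
        local = Counter()
        for word, one in part:
            local[word] += one
        shuffled.extend(local.items())  # one partial sum per word per partition
    counts = Counter()
    for word, partial in shuffled:
        counts[word] += partial
    return dict(counts), len(shuffled)


# Toy, skewed input: "the" dominates both partitions.
partitions = [
    [("the", 1)] * 1000 + [("spark", 1)],
    [("the", 1)] * 1000 + [("beam", 1)],
]
grouped_counts, grouped_records = shuffle_group_by_key(partitions)
reduced_counts, reduced_records = shuffle_reduce_by_key(partitions)
```

Both styles produce the same counts, but the grouped version moves every pair and holds all values for the hot key together, while the reduced version moves only a handful of partial sums.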
  • 24. Ford Pinto by Morven
  • 25. Ford Pinto by Morven ayphen
  • 26. Why should we consider Datasets? ● Performance ○ Smart optimizer ○ More efficient storage ○ Faster serialization ● Simplicity ○ Windowed operations ○ Multi-column & multi-type aggregates Rikki's Refuge
  • 27. Why are Datasets so awesome? ● Easier to mix functional style and relational style ○ No more hive UDFs! ● Nice performance of Spark SQL + flexibility of RDDs ○ Tungsten (better serialization) ○ Equivalent of Sortable trait ● Strongly typed ● The future (ML, Graph, etc.) ● Potential for better language interop ○ Something like Arrow has a much better chance with Datasets ○ Cross-platform libraries are easier to make & use Will Folsom
  • 28. What is the performance like? Andrew Skudder
  • 29. How is it so fast? ● Optimizer has more information (schema & operations) ● More efficient storage formats ● Faster serialization ● Some operations directly on serialized data formats ● non-JVM languages: does more computation in the JVM Andrew Skudder
  • 30. Word count w/DataFrames
        df = spark.read.load(src)
        # Returns an RDD
        words = df.select("text").rdd.flatMap(lambda x: x.text.split(" "))
        words_df = words.map(lambda x: Row(word=x, cnt=1)).toDF()
        word_count = words_df.groupBy("word").sum()
        word_count.write.format("parquet").save("wc.parquet")
      Still have the double serialization here :(
  • 31. Word count w/Datasets
        val df = spark.read.load(src).select("text")
        val ds = df.as[String] // Returns a Dataset!
        val words = ds.flatMap(x => x.split(" "))
        val grouped = words.groupBy("value")
        val word_count = grouped.agg(count("*") as "count")
        word_count.write.format("parquet").save("wc")
      Can’t push down filters from here
      If it’s a simple type we don’t have to define a case class
  • 32. What can the optimizer do now? ● Sort on the serialized data ● Understand the aggregate (“partial aggregates”) ○ Could sort of do this before but not as awesomely, and only if we used reduceByKey - not groupByKey ● Pack them bits nice and tight
  • 33. So what's this new groupBy? ● No longer causes explosions like RDD groupBy ○ Able to introspect and pipeline the aggregation ● Returns a GroupedData (or GroupedDataset) ● Makes it easy to perform multiple aggregations ● Built in shortcuts for aggregates like avg, min, max ● Longer list at org.apache.spark.sql.functions$ ● Allows the optimizer to see what aggregates are being performed Sherrie Thai
  • 34. Computing some aggregates by age code:
        df.groupBy("age").min("hours-per-week")
      OR
        import org.apache.spark.sql.functions._
        df.groupBy("age").agg(min("hours-per-week"))
  • 35. Easily compute multiple aggregates: df.groupBy("age").agg(min("hours-per-week"), avg("hours-per-week"), max("capital-gain")) PhotoAtelier
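What a multi-aggregate `groupBy(...).agg(...)` computes can be spelled out in plain Python. This is a sketch, not Spark: `aggregate_by_age` is a hypothetical helper and the three rows are made-up sample data that reuse the column names from the slide.

```python
def aggregate_by_age(rows):
    """One pass; per-age running state is (min_hours, sum_hours, n, max_gain)."""
    state = {}
    for row in rows:
        age = row["age"]
        hours = row["hours-per-week"]
        gain = row["capital-gain"]
        if age not in state:
            state[age] = (hours, hours, 1, gain)
        else:
            lo, total, n, hi = state[age]
            state[age] = (min(lo, hours), total + hours, n + 1, max(hi, gain))
    return {
        age: {"min_hours": lo, "avg_hours": total / n, "max_gain": hi}
        for age, (lo, total, n, hi) in state.items()
    }


# Made-up rows, for illustration only.
rows = [
    {"age": 30, "hours-per-week": 40, "capital-gain": 0},
    {"age": 30, "hours-per-week": 50, "capital-gain": 2000},
    {"age": 45, "hours-per-week": 35, "capital-gain": 500},
]
result = aggregate_by_age(rows)
```

Because each aggregate only needs a small running state per key, all of them can be updated in the same pass, which is the property that lets an optimizer pipeline them when it can see which aggregates are requested.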
  • 36. Using Datasets to mix functional & relational style:
        val ds: Dataset[RawPanda] = ...
        val happiness = ds.toDF().filter($"happy" === true).as[RawPanda].
          select($"attributes"(0).as[Double]).
          reduce((x, y) => x + y)
  • 37. So what was that?
        ds.toDF()                              // Convert a Dataset to a DataFrame to access more DataFrame functions (pre-2.0)
          .filter($"happy" === true)
          .as[RawPanda]                        // Convert DataFrame back to a Dataset
          .select($"attributes"(0).as[Double]) // A typed query (specifies the return type)
          .reduce((x, y) => x + y)             // Traditional functional reduction: arbitrary Scala code :)
  • 38. And functional style maps:
        /** Functional map + Dataset, sums the positive attributes for the pandas */
        def funMap(ds: Dataset[RawPanda]): Dataset[Double] = {
          ds.map{rp => rp.attributes.filter(_ > 0).sum}
        }
      Chris Isherwood
  • 39. But where do DataFrames explode? ● Iterative algorithms - large plans ○ Use your escape hatch to RDDs! ● Some push downs are sad pandas :( ● Default shuffle size is sometimes too small for big data (200 partitions) ● Default partition size when reading in is also sad
  • 40. Our ever growing ecosystem:
  • 41. General purpose eating the world ● Operations overhead ● Moving data from System 1 to System 2 (sqoop and friends) ● We still have specialized tools, but they are being built on top of general frameworks ○ e.g. see mahout on Spark ○ Less closely tied things like Hive/Pig on Spark ○ TF.Transform etc. Photo by D Coetzee
  • 42. Even then, lots of general purpose tools:
  • 44. And language silos (Scala, Python, Go, etc.!) Photo by: photobom Photo: Fritz Schuman (ScalaDays CPH)
  • 45. And cloud silos…. Photo By: Zechariah Judy
  • 46. And the proliferation of pagers :( Photo by: Hades2k Mike Knell
  • 47. What’s the state of non-JVM big data? Most of the tools are built in the JVM, so how do we play together? ● Pickling, Strings, JSON, XML, oh my! ● Unix pipes ● Sockets What about if we don’t want to copy the data all the time? ● Or standalone “pure”* re-implementations of everything ○ Reasonable option for things like Kafka where you would have the I/O regardless. ○ Also cool projects like dask -- but hard to talk to existing ecosystem David Brown
  • 48. Spark in Scala, how does PySpark work? ● Py4J + pickling + JSON and magic ○ This can be kind of slow sometimes ● Distributed collections are often collections of pickled objects ● Spark SQL (and DataFrames) avoid some of this ○ Sometimes we can make them go fast and compile them to the JVM ● Features aren’t automatically exposed, but exposing them is normally simple. ● SparkR depends on similar magic kristin klein
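The "collections of pickled objects" point can be made concrete with the standard `pickle` module. This is a simplified sketch, not PySpark's actual code path: the three helper functions are hypothetical, and real PySpark batches records and adds Py4J and JVM hops on top.

```python
import pickle


def send_to_worker(records):
    """Driver side: turn each record into bytes before it crosses the pipe."""
    return [pickle.dumps(record) for record in records]


def worker_map(pickled_records, func):
    """Worker side: unpickle, apply the Python function, pickle the result."""
    return [pickle.dumps(func(pickle.loads(blob))) for blob in pickled_records]


def collect(pickled_records):
    """Back on the driver: unpickle the results into Python objects."""
    return [pickle.loads(blob) for blob in pickled_records]


records = [("panda", 1), ("spark", 2)]
on_the_wire = send_to_worker(records)
doubled = collect(worker_map(on_the_wire, lambda kv: (kv[0], kv[1] * 2)))
all_bytes = all(isinstance(blob, bytes) for blob in on_the_wire)
```

Every record is serialized and deserialized on each side of the boundary, which is one reason the slide calls this path "kind of slow sometimes" and why keeping work inside Spark SQL avoids some of it.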
  • 49. So what does that look like? Driver py4j Worker 1 Worker K pipe pipe
  • 50. The “future”*: faster interchange ● By future I mean availability starting in the next 3-6 months (with more improvements after). ○ Yes much of this code exists, it just isn’t released yet so I’m sure we’ll find all sorts of bugs and ways to improve. ○ Relatedly you can help us test in Spark 2.3 when we start the RC process to catch bugs early! ● Unifying our cross-language experience ○ And not just “normal” languages, CUDA counts yo Tambako The Jaguar
  • 51. *Arrow: likely the future. I really hope so. Spark 2.3 and beyond!** **With early work happening to support GPUs/TF. Andrew Skudder
  • 52. BEAM backends: ● BEAM nominally supports* ○ Dataflow ○ Flink* ○ Spark* ○ IBM Streams, etc. ● Goal of more than just lowest-common-denominator, think of it like a compiler** *Supports as in early-stage, but we’re working on it (and we’d love your help!) **But you know, in the same sense I compare Spark Streams to pandas coming down a wooden slide.
  • 53. BEAM Languages ● JVM: Scala, Java, etc. ● non-JVM: Python w/Go and more coming
  • 54. BEAM Beyond the JVM ● This part doesn’t work outside of Google’s hosted environment yet, so I’m going to be light on the details ● tl;dr : uses grpc / protobuf ● But exciting new plans (w/ some early code) to unify the runners and ease the support of different languages (called SDKs) ○ See
  • 55. What do the different APIs look like? ● Everyone's favourite: Streaming Word Count Example ● And then windowed wordcount! ● (And also a peek at TensorFlow in case anyone is trying to raise a series A)
  • 56. Spark wordcount (Python*) - “pure” relational
        # Split the lines into words
        words = lines.select(
            # explode turns each item in an array into a separate row
            explode(split(lines.value, ' ')).alias('word')
        )
        # Generate running word count
        wordCounts = words.groupBy('word').count()
  • 57. BEAM wordcount (Java)
        p.apply("ExtractWords", ParDo.of(new DoFn<String, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
              // Split on runs of non-word characters and emit each word
              for (String word : c.element().split("\\W+")) {
                c.output(word);
              }
            }
          }))
         .apply(Count.<String>perElement())
  • 58. What about windowed word count? Trish Hamme Christer van der Meeren
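As a rough idea of what "windowed" adds to word count, here is a plain-Python sketch of fixed (tumbling) windows. It is neither Spark nor BEAM code: `windowed_word_count` and the sample events are invented for illustration, and it ignores late data, watermarks, and triggers, which the real engines have to handle.

```python
from collections import Counter, defaultdict


def windowed_word_count(events, window_seconds):
    """Count words per fixed window.

    events: iterable of (timestamp_seconds, line) pairs.
    Returns {window_start_seconds: {word: count}}.
    """
    windows = defaultdict(Counter)
    for timestamp, line in events:
        # Assign the event to the window that contains its timestamp.
        window_start = (timestamp // window_seconds) * window_seconds
        for word in line.split():
            windows[window_start][word] += 1
    return {start: dict(counter) for start, counter in windows.items()}


# Made-up events: (event time in seconds, text).
events = [(1, "spark beam"), (4, "spark"), (12, "beam beam")]
counts = windowed_word_count(events, window_seconds=10)
```

The only change from plain word count is that the grouping key becomes the pair (window, word) instead of just the word.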
  • 59. What else might happen? ● One execution engine becomes super amazing at everything ● Instead of a compiler like unifier we see something like streaming SQL become our unifier ○ Relatedly BEAM, Spark & Flink, and Kafka all have streaming SQL implementations. ● People realize their big data problem is actually three small data problems in a trench coat bnilsen
  • 60. And some upcoming talks: ● Jan ○ If interest tomorrow: Office Hours? Tweet me @holdenkarau ○ LinuxConf AU - next week ○ Sydney Spark meetup 23rd ○ Data Day Texas - Nate will be there too! ● Feb ○ FOSDEM - One on testing one on scaling ○ JFokus in Stockholm - Adding deep learning to Spark ○ I disappear for a week and pretend computers work ● March ○ Strata San Jose - Big Data Beyond the JVM
  • 61. Learning Spark; Fast Data Processing with Spark (Out of Date); Fast Data Processing with Spark (2nd edition); Advanced Analytics with Spark; Coming soon: Spark in Action; High Performance Spark; Coming Soon: Learning PySpark
  • 62. High Performance Spark! Available today! You can buy it from that scrappy Seattle bookstore, Jeff Bezos needs another newspaper and I want a cup of coffee.
  • 63. Cat wave photo by Quinn Dombrowski k thnx bye! If you <3 testing & want to fill out survey: Want to tell me (and or my boss) how I’m doing? Want to e-mail me? Promise not to be creepy? Ok: