Are General Purpose Systems Eating
the Big Data World?
with your friend
@holdenkarau
Holden:
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google
● Apache Spark PMC, Beam contributor
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & High Performance Spark
● Twitter: @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Code review livestreams & live coding: https://www.twitch.tv/holdenkarau /
https://www.youtube.com/user/holdenkarau
● Spark Talk Videos http://bit.ly/holdenSparkVideos
● Talk feedback (if you are so inclined): http://bit.ly/holdenTalkFeedback
Some things that may color my views:
● I’m a Spark committer -- if Spark continues to win I can probably make more $s
● My employer cares about BEAM (and Spark and other things)
● I work primarily in Python & Scala these days
● I like functional programming
● Probably some others I’m forgetting
On the other hand:
● I’ve worked on Spark for a long time and know a lot of its faults
● My goals are pretty flexible
● I have x86 assembly code tattooed on my back
Who is Boo?
● Boo uses she/her pronouns (as I told the Texas house committee)
● Best doge
● Lots of experience barking at computers to make them go faster
● Author of “Learning to Bark” & “High Performance Barking”
● On twitter @BooProgrammer
● Wishes EU cookie warnings were about real cookies
● Also confused about GDPR
Who I think you wonderful humans are?
● Nice enough people
● Don’t mind pictures of cats
● Think the JVM is kind of cool
Lori Erickson
What are we going to talk about?
● The state of the open big data ecosystem
● Why general purpose matters
○ Where it doesn’t* matter**
● The different kinds of unification and why it matters
● Evolution of Spark/Flink and friends
● What BEAM aims to do
○ Where we are today
○ Where we need to be tomorrow
● Other ways the data world can be unified
● A limited number of GDPR jokes if you opt-in today
● And of course lots of wordcount examples
Our ever growing ecosystem:
http://mattturck.com/bigdata2017/
Enter: Spark
● Integrated different tools which traditionally required different systems
○ Mahout, hive, etc. Not great at all of these but “good enough”
● e.g. can use same system to do ML and SQL
*Often written in Python!
Apache Spark (component diagram):
● SQL, DataFrames & Datasets
● Structured Streaming
● Spark ML
● MLlib
● Streaming
● Bagel & GraphX
● GraphFrames
● Language APIs: Scala, Java, Python & R for some components; Scala, Java & Python for others
Paul Hudson
General purpose systems probably are eating the world
● Operations overhead
● Moving data from System 1 to System 2 (sqoop and friends)
○ Interchange: write everything out to disk (or HDFS/GCS/S3/etc.)
● We still have specialized tools, but they are being built on top of general frameworks
○ e.g. see mahout on Spark
○ Less closely tied things like Hive/Pig on Spark
○ TF.Transform etc.
Tambako The Jaguar
Even then, lots of general purpose tools:
Resulting in:
flink/
h2o/
hdfs/
integration/
mr/
spark/
viennacl/
And language silos (Scala, Java, Python, Go, etc.!)
Photo by: photobom Photo: Fritz Schuman (ScalaDays CPH)
And cloud silos….
Photo By: Zechariah Judy
& We’re getting better with cloud silos
● EMR (Elastic MapReduce)-like services on AWS, Azure & Google
○ EMR
○ Azure HDInsight
○ Google Dataproc
● You can move your Spark job super easily
● Vendors are becoming cross-platform too (see Databricks)
torne (where's my lens ca
We’re getting better with data silos
● Customers demand access to their data in many platforms
● Vendors seem to listen
○ Notable exceptions: formats without a vendor, or vendors with a closed-source, locked-in platform
● e.g. even if your data is a mixture of Parquet & BigQuery (or Redshift) you can access both
We’re starting to recognize these problems:
● Spark is adding more runners (see SparkR) and integrating Arrow
● Small projects (like Sparkling ML) work to expose Python libs to JVM and vice
versa
● Apache Arrow is focused around improved formats for interop
○ Included in Spark 2.3 to vastly improve Python performance
○ Being used by GPU vendors too!
● Apache BEAM is adding a unified runner SDK API
○ Currently each language has to support each backend
○ Common core
● Apache BEAM is attempting to unify streaming APIs
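A rough counting sketch of why a common core matters when "each language has to support each backend". The SDK and runner names are taken from elsewhere in this deck; the `integrations_needed` helper is a hypothetical name for illustration, and this is a back-of-the-envelope argument, not how the Beam project tracks its work.

```python
def integrations_needed(num_sdks, num_runners, shared_layer):
    """Count the integrations to build and maintain.

    Without a shared layer every SDK must target every runner (SDKs x runners).
    With a shared layer each SDK and each runner targets the layer once (SDKs + runners).
    """
    if shared_layer:
        return num_sdks + num_runners
    return num_sdks * num_runners


sdks = ["Java", "Python", "Go"]
runners = ["Dataflow", "Flink", "Spark"]

pairwise = integrations_needed(len(sdks), len(runners), shared_layer=False)
unified = integrations_needed(len(sdks), len(runners), shared_layer=True)
```

With three SDKs and three runners the gap is small, but the pairwise count grows multiplicatively as either list gets longer.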
What’s the state of non-JVM big data?
Most of the tools are built in the JVM, so how do we play together?
● Pickling, Strings, JSON, XML, oh my!
● Unix pipes
● Sockets
What about if we don’t want to copy the data all the time?
● Or standalone “pure”* re-implementations of everything
○ Reasonable option for things like Kafka where you would have the I/O regardless.
○ Also cool projects like dask -- but hard to talk to existing ecosystem
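A minimal sketch of the "pickling + pipes" style of interop listed above, using only the Python standard library. The worker script and the `run_in_worker` helper are hypothetical names made up for illustration; this is not how any particular system implements it, just the same general idea in miniature.

```python
import pickle
import subprocess
import sys

# Hypothetical worker program: reads one pickled list from stdin,
# upper-cases each string, and writes the pickled result to stdout.
WORKER_SOURCE = """
import pickle, sys
records = pickle.load(sys.stdin.buffer)
pickle.dump([r.upper() for r in records], sys.stdout.buffer)
"""


def run_in_worker(records):
    """Send records to a separate Python process over a pipe and read the reply."""
    payload = pickle.dumps(records)        # copy 1: serialize in the caller
    proc = subprocess.run(
        [sys.executable, "-c", WORKER_SOURCE],
        input=payload,                     # copy 2: bytes cross the pipe
        stdout=subprocess.PIPE,
        check=True,
    )
    return pickle.loads(proc.stdout)       # copy 3: deserialize the reply
```

Every record is serialized, copied across a process boundary, and deserialized in each direction, which is the overhead the "don't want to copy the data all the time" question is about.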
David Brown
Spark in Scala, how does PySpark work?
● Py4J + pickling + JSON and magic
○ This can be kind of slow sometimes
● Distributed collections are often collections of pickled
objects
● Spark SQL (and DataFrames) avoid some of this
○ Sometimes we can make them go fast and compile them to the JVM
● Features aren’t automatically exposed, but exposing
them is normally simple.
● SparkR depends on similar magic
kristin klein
So what does that look like?
Diagram: Driver (connected via py4j), Worker 1 … Worker K (each connected via a pipe)
*For a small price of your fun libraries. Bad idea.
That was a bad idea, buuut…..
● Work going on in Scala land to translate simple Scala
into SQL expressions - need the Dataset API
○ Maybe we can try similar approaches with Python?
● POC: use Jython for simple UDFs (e.g. 2.7 compat & no native libraries) - SPARK-15369
○ Early benchmarking w/word count 5% slower than native Scala UDF,
close to 2x faster than regular Python
● Willing to share your Python UDFs for benchmarking? -
http://bit.ly/pySparkUDF
*The future may or may not have better performance than today. But bun-bun the bunny has some lettuce so it’s ok!
The “future”*: faster interchange
● By future I mean availability starting in the next 3-6 months (with more
improvements after).
○ Yes much of this code exists, it just isn’t released yet so I’m sure we’ll find all sorts of bugs
and ways to improve.
○ Relatedly, you can help us test Spark 2.3 when we start the RC process to catch bugs early!
● Unifying our cross-language experience
○ And not just “normal” languages, CUDA counts yo
Tambako The Jaguar
Andrew Skudder
*Arrow: likely the future. I really hope so. Spark 2.3 and beyond!
With early work happening to
support GPUs/ TF.
What does the future look like?*
*Source: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html.
*Vendor benchmark. Be careful.
Arrow (a poorly drawn big data view)
Logos trademarks of their respective projects
Juha Kettunen
*ish
BEAM backends:
● BEAM nominally supports*
○ Dataflow
○ Flink*
○ Spark*
○ IBM Streams, etc.
● Goal of more than just lowest-common-denominator, think of it like a compiler**
*Supports as in early-stage, but we’re working on it (and we’d love your help!)
**But you know, in the same sense I compare Spark Streams to pandas coming down a wooden slide.
BEAM Languages
● JVM: Scala*, Java, etc.
● non-JVM: Python w/Go and more coming
BEAM Beyond the JVM
● This part doesn’t work outside of Google’s hosted environment yet, so I’m
going to be light on the details
● tl;dr : uses grpc / protobuf
● But exciting new plans (w/ some early code) to unify the runners and ease the support of different languages (called SDKs)
○ See https://beam.apache.org/contribute/portability/
BEAM Beyond the JVM: Random collab branch
*ish
Nick
portability
*ish
What do (some of) the different APIs look like?
● Everyone's favourite: Streaming Word Count Example
● And then windowed wordcount!
● (And also a peek at TensorFlow in case anyone is trying to raise a series A)
Spark wordcount (Python*)
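from pyspark.sql.functions import explode, split

# `lines` is assumed to be a DataFrame with a string column named `value`
# Split the lines into words
words = lines.select(
    # explode turns each item in an array into a separate row
    explode(
        split(lines.value, ' ')
    ).alias('word')
)
# Generate running word count
wordCounts = words.groupBy('word').count()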
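To make the shape of that pipeline concrete, here is a plain-Python sketch of what the split / explode / groupBy / count steps compute on a small static batch. It does not use Spark at all; `batch_word_count` is a hypothetical helper for illustration, and the streaming version keeps updating the counts as new lines arrive rather than returning once.

```python
from collections import Counter


def batch_word_count(lines):
    """Count words in a list of lines, mirroring split -> explode -> groupBy -> count."""
    counts = Counter()
    for line in lines:
        # line.split(' ') plays the role of split(lines.value, ' ');
        # iterating over the pieces plays the role of explode (one row per word).
        for word in line.split(' '):
            # Incrementing per word plays the role of groupBy('word').count().
            counts[word] += 1
    return dict(counts)
```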
Flink wordcount (Java)
DataStream<String> text = ...
DataStream<Tuple2<String, Integer>> counts =
  // lower-case each line and split it on runs of non-word characters
  text.map(line -> line.toLowerCase().split("\\W+"))
    .flatMap((String[] tokens, Collector<Tuple2<String, Integer>> out) -> {
      Arrays.stream(tokens)
        .forEach(t -> out.collect(new Tuple2<>(t, 1)));
    })
    // group by the tuple field "0" and sum up tuple field "1"
    .keyBy(0)
    .sum(1);
BEAM wordcount (Java)
p.apply("ExtractWords", ParDo.of(new DoFn<String, String>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
      // split each line on runs of non-word characters
      for (String word : c.element().split("\\W+")) {
        c.output(word);
      }
    }
  }))
  .apply(Count.<String>perElement())
Go Wordcount (BEAM)
func CountWords(s beam.Scope, lines beam.PCollection) beam.PCollection {
	s = s.Scope("CountWords")
	// Convert lines of text into individual words.
	col := beam.ParDo(s, extractFn, lines)
	// Count the number of times each word occurs.
	return stats.Count(s, col)
}
Ada Doglace
Go Wordcount (BEAM) continued
p := beam.NewPipeline()
s := p.Root()
lines := textio.Read(s, *input)
counted := CountWords(s, lines)
formatted := beam.ParDo(s, formatFn, counted)
textio.Write(s, *output, formatted)
What about windowed word count?
Trish Hamme
Christer van der Meeren
Flink windowed wordcount (scala)
val counts: DataStream[(String, Int)] = text
  // split up the lines in pairs (2-tuple) containing: (word,1)
  .flatMap(_.toLowerCase.split("\\W+"))
  .map((_, 1))
  .keyBy(0)
  // create windows of windowSize records that slide every slideSize records
  .countWindow(windowSize, slideSize)
  // sum up tuple field "1"
  .sum(1)
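A plain-Python sketch of what a per-key sliding count window does, under my reading of the semantics: for each word, fire every `slide_size` occurrences and sum over at most the last `window_size` occurrences. This is an assumption-laden illustration, not Flink code, and `sliding_count_windows` is a hypothetical helper; check the Flink window documentation for the exact behaviour.

```python
from collections import defaultdict, deque


def sliding_count_windows(words, window_size, slide_size):
    """Yield (word, count) each time a word's sliding count window fires."""
    # Per-key buffer that keeps only the most recent window_size records.
    buffers = defaultdict(lambda: deque(maxlen=window_size))
    # Per-key number of records seen so far, used to decide when to fire.
    seen = defaultdict(int)
    for word in words:
        buffers[word].append(1)
        seen[word] += 1
        if seen[word] % slide_size == 0:
            # Sum up the "1" field of every record still in the window.
            yield word, sum(buffers[word])


results = list(sliding_count_windows(["a", "b", "a", "a", "a", "b"], 3, 2))
```

Here `results` holds one entry per firing, so a frequent word shows up several times with overlapping counts, which is the main difference from the un-windowed running count.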
Multi-language pipelines?!?
● Totally! I mean kind-of. Spark is ok right?
○ Beam is working on it :) #comejoinus
○ Easier for Python calling Java
○ If SQL counts for sure. If not, yes*.
● Technically what happens when you use Scala Spark + Tensorflow
● And same with Beam, etc.
● Right now painful to write in practice in OSS land outside of libraries
(commercial vendors have some solutions but I stick to OSS when I can)
● So let’s focus on libraries (and look ahead at more general approaches)
Jennifer C.
A proof of concept: Sparkling ML
● A place for useful Spark ML pipeline stages to live
○ Including both feature transformers and estimators
● The why
○ Spark ML can’t keep up with every new algorithm
○ Spark ML is largely Scala based, cool Python tools for NLP exist as well
● Lots of cool ML on Spark tools exist, but many don’t play nice with Spark ML
or together.
● We make it easier to expose Python transformers into Scala land and vice
versa.
● Our repo is at: https://github.com/sparklingpandas/sparklingml
Supporting non-English languages
With spaCy, so you know more than English*
def inner(inputString):
    nlp = SpacyMagic.get(lang)

    def spacyTokenToDict(token):
        """Convert the input token into a dictionary"""
        return dict(map(lookup_field_or_none, fields))

    return list(map(spacyTokenToDict,
                    list(nlp(inputString))))
test_t51
And from the JVM:
val transformer = new SpacyTokenizePython()
transformer.setLang("en")
val input = spark.createDataset(
List(InputData("hi boo"), InputData("boo")))
transformer.setInputCol("input")
transformer.setOutputCol("output")
val result = transformer.transform(input).collect()
Alexy Khrabrov
How might this all play out?
● A portability layer like Beam could save us
○ And commoditize the execution engine layer
● One execution engine becomes good enough at everything
● Instead of a compiler-like unifier we see something like streaming SQL become our unifier
○ Personally you can pry my functions out of my cold dead hands but that’s all right
● People realize their big data problem is actually three small data problems in a trench coat
Shawn Carpenter
What does this mean for special systems?
● More special systems will run on top of the same general purpose engine
○ Important to work with general purpose engines to make sure they will serve the future special
needs.
● Think about inputs and outputs and how to share them nicely and efficiently
with other systems (please not just files)
● Some of them will become more like libraries
○ This is perhaps “less cool” than being a stand alone system, but maybe better
● More collaboration hopefully
What does “sharing” look like?
val sparkSql = ...
// Start hive on the engine
HiveThriftServer2.startWithContext(sparkSql)
// Start snappy data
val snappy = new org.apache.spark.sql.SnappySession(spark.sparkContext)
// Do cool things with both
...
Want to help?
● Lots of places to contribute
● We’d love help on Spark and/or BEAM
○ Improved execution engine support in BEAM
○ More data sources & specialized format support (everywhere)
○ Better Python & non-JVM language support
● I’m sure Flink would too, but I’m less involved with them
Learning Spark
Fast Data
Processing with
Spark
(Out of Date)
Fast Data
Processing with
Spark
(2nd edition)
Advanced
Analytics with
Spark
Coming soon:
Spark in Action
High Performance Spark
Learning PySpark
High Performance Spark!
Available today!
You can buy it from that scrappy Seattle bookstore, Jeff
Bezos needs another newspaper and I want a cup of
coffee.
http://bit.ly/hkHighPerfSpark
And some upcoming talks:
● June
○ Spark Summit SF (Accelerating Tensorflow & Accelerating Python +
Dependencies)
○ Scala Days NYC
○ FOSS Back Stage & BBuzz (Berlin)
● July
○ Curry On Amsterdam
○ OSCON Portland
● September
○ Strata NYC
● October
○ Reversim (Tel Aviv)
That’s all Folks!
Happy GDPR Friday! - May your last minute deploys
not break too much!
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
Pssst: Have feedback on the presentation? Give me a
shout (holden@pigscanfly.ca) if you feel comfortable doing
so :)

Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 

Recently uploaded (20)

(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 

Are general purpose big data systems eating the world?

  • 1. Are General Purpose Systems Eating the Big Data World? with your friend @holdenkarau
  • 2. Holden: ● My name is Holden Karau ● Preferred pronouns are she/her ● Developer Advocate at Google ● Apache Spark PMC, Beam contributor ● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon ● co-author of Learning Spark & High Performance Spark ● Twitter: @holdenkarau ● Slide share http://www.slideshare.net/hkarau ● Code review livestreams & live coding: https://www.twitch.tv/holdenkarau / https://www.youtube.com/user/holdenkarau ● Spark Talk Videos http://bit.ly/holdenSparkVideos ● Talk feedback (if you are so inclined): http://bit.ly/holdenTalkFeedback
  • 3.
  • 4. Some things that may color my views: ● I’m a Spark committer -- if Spark continues to win I can probably make more $s ● My employer cares about BEAM (and Spark and other things) ● I work primarily in Python & Scala these days ● I like functional programming ● Probably some others I’m forgetting On the other hand: ● I’ve worked on Spark for a long time and know a lot of its faults ● My goals are pretty flexible ● I have x86 assembly code tattooed on my back
  • 5. Who is Boo? ● Boo uses she/her pronouns (as I told the Texas house committee) ● Best doge ● Lots of experience barking at computers to make them go faster ● Author of “Learning to Bark” & “High Performance Barking” ● On twitter @BooProgrammer ● Wishes EU cookie warnings were about real cookies ● Also confused about GDPR
  • 6. Who I think you wonderful humans are? ● Nice enough people ● Don’t mind pictures of cats ● Think the JVM is kind of cool Lori Erickson
  • 7. What are we going to talk about? ● The state of the open big data ecosystem ● Why general purpose matters ○ Where it doesn’t* matter** ● The different kinds of unification and why it matters ● Evolution of Spark/Flink and friends ● What BEAM aims to do ○ Where we are today ○ Where we need to be tomorrow ● Other ways the data world can be unified ● A limited number of GDPR jokes if you opt-in today ● And of course lots of wordcount examples
  • 8. Our ever growing ecosystem: http://mattturck.com/bigdata2017/
  • 9. Enter: Spark ● Integrated different tools which traditionally required different systems ○ Mahout, Hive, etc. Not great at all of these but “good enough” ● e.g. can use same system to do ML and SQL *Often written in Python! [Diagram labels: Apache Spark; SQL, DataFrames & Datasets; Structured Streaming; Scala, Java, Python, & R; Spark ML; bagel & Graph X; MLLib; Scala, Java, Python; Streaming; Graph Frames] Paul Hudson
  • 10. General purpose probably are eating the world ● Operations overhead ● Moving data from System 1 to System 2 (sqoop and friends) ○ Interchange: write everything out to disk (or HDFS/GCS/S3/etc.) ● We still have specialized tools, but being built on top of general frameworks ○ e.g. see mahout on Spark ○ Less closely tied things like Hive/Pig on Spark ○ TF.Transform etc. Tambako The Jaguar
  • 11. Even then, lots of general purpose tools:
  • 13. And language silos (Scala, Java, Python, Go, etc.!) Photo by: photobom Photo: Fritz Schuman (ScalaDays CPH)
  • 14. And cloud silos…. Photo By: Zechariah Judy
  • 15. & We’re getting better with cloud silos ● EMR (elastic map reduce) like services on AWS, Azure & Google ○ EMR ○ Azure HDInsight ○ Google Dataproc ● You can move your Spark job super easily ● Vendors are becoming cross-platform too (see Databricks) torne (where's my lens ca
  • 16. We’re getting better with data silos ● Customers demand access to their data in many platforms ● Vendors seem to listen ○ Notable exceptions: formats without a vendor, or a vendor with a closed-source, locked-in platform ● e.g. Even if your data is a mixture of Parquet & BigQuery (or Redshift) you can access both
  • 17. We’re starting to recognize these problems: ● Spark is adding more runners (see SparkR) and integrating Arrow ● Small projects (like Sparkling ML) work to expose Python libs to JVM and vice versa ● Apache Arrow is focused around improved formats for interop ○ Included in Spark 2.3 to vastly improve Python performance ○ Being used by GPU vendors too! ● Apache BEAM is adding a unified runner SDK API ○ Currently each language has to support each backend ○ Common core ● Apache BEAM is attempting to unify streaming APIs
  • 18. What’s the state of non-JVM big data? Most of the tools are built in the JVM, so how do we play together? ● Pickling, Strings, JSON, XML, oh my! ● Unix pipes ● Sockets What about if we don’t want to copy the data all the time? ● Or standalone “pure”* re-implementations of everything ○ Reasonable option for things like Kafka where you would have the I/O regardless. ○ Also cool projects like dask -- but hard to talk to existing ecosystem David Brown
  • 19. Spark in Scala, how does PySpark work? ● Py4J + pickling + JSON and magic ○ This can be kind of slow sometimes ● Distributed collections are often collections of pickled objects ● Spark SQL (and DataFrames) avoid some of this ○ Sometimes we can make them go fast and compile them to the JVM ● Features aren’t automatically exposed, but exposing them is normally simple. ● SparkR depends on similar magic kristin klein
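The pickling cost mentioned on this slide can be illustrated without Spark at all. The following is a minimal sketch, not PySpark's actual code path: it only round-trips a batch of Python records through the standard `pickle` module, which is the kind of copying that happens whenever records cross a process boundary. The record layout and the helper names `to_wire` and `from_wire` are hypothetical.

```python
import pickle


def to_wire(records):
    """Serialize a batch of Python objects into one bytes payload."""
    return pickle.dumps(records, protocol=pickle.HIGHEST_PROTOCOL)


def from_wire(payload):
    """Rebuild the Python objects on the receiving side."""
    return pickle.loads(payload)


# Hypothetical records standing in for one partition of a distributed collection.
records = [{"word": "boo", "count": i} for i in range(1000)]

payload = to_wire(records)
restored = from_wire(payload)

# Every record was copied: equal in value, but a distinct object in memory.
print(len(payload), "bytes for", len(restored), "records")
```

Each hop pays for a full serialize-and-rebuild of every record, which is one reason the slide calls this "kind of slow sometimes".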
  • 20. So what does that look like? Driver py4j Worker 1 Worker K pipe pipe
  • 21. *For a small price of your fun libraries. Bad idea.
  • 22. That was a bad idea, buuut….. ● Work going on in Scala land to translate simple Scala into SQL expressions - need the Dataset API ○ Maybe we can try similar approaches with Python? ● POC use Jython for simple UDFs (e.g. 2.7 compat & no native libraries) - SPARK-15369 ○ Early benchmarking w/word count 5% slower than native Scala UDF, close to 2x faster than regular Python ● Willing to share your Python UDFs for benchmarking? - http://bit.ly/pySparkUDF *The future may or may not have better performance than today. But bun-bun the bunny has some lettuce so it’s OK!
  • 23.
  • 24. The “future”*: faster interchange ● By future I mean availability starting in the next 3-6 months (with more improvements after). ○ Yes much of this code exists, it just isn’t released yet so I’m sure we’ll find all sorts of bugs and ways to improve. ○ Relatedly you can help us test in Spark 2.3 when we start the RC process to catch bugs early! ● Unifying our cross-language experience ○ And not just “normal” languages, CUDA counts yo Tambako The Jaguar
  • 25. Andrew Skudder *Arrow: likely the future. I really hope so. Spark 2.3 and beyond! * * With early work happening to support GPUs/ TF.
  • 26. What does the future look like?* *Source: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html. *Vendor benchmark. Be careful.
  • 27. Arrow (a poorly drawn big data view) Logos trademarks of their respective projects Juha Kettunen *ish
  • 28. BEAM backends: ● BEAM nominally supports* ○ Dataflow ○ Flink* ○ Spark* ○ IBM Streams, etc. ● Goal of more than just lowest-common-denominator, think of it like a compiler** *Supports as in early-stage, but we’re working on it (and we’d love your help!) **But you know, in the same sense I compare Spark Streams to pandas coming down a wooden slide.
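The "think of it like a compiler" idea can be sketched in a few lines: describe the pipeline once as data, then let different backends decide how to execute it. This is a toy illustration only and does not use or imitate the real Beam API; `PIPELINE`, `apply_steps`, `run_eager` and `run_streaming` are hypothetical names invented for this example.

```python
from collections import Counter

# The "pipeline" is plain data: an ordered list of (kind, function) steps.
PIPELINE = [
    ("flat_map", lambda line: line.lower().split()),
    ("filter", lambda word: word.isalpha()),
]


def apply_steps(pipeline, items):
    """Run every step of the pipeline over a list of items."""
    for kind, fn in pipeline:
        if kind == "flat_map":
            items = [out for item in items for out in fn(item)]
        elif kind == "filter":
            items = [item for item in items if fn(item)]
        else:
            raise ValueError(f"unknown step kind: {kind}")
    return items


def run_eager(pipeline, lines):
    """Toy backend 1: process the whole input as one batch."""
    return Counter(apply_steps(pipeline, list(lines)))


def run_streaming(pipeline, lines):
    """Toy backend 2: push one input line at a time through all steps."""
    counts = Counter()
    for line in lines:
        counts.update(apply_steps(pipeline, [line]))
    return counts


lines = ["Boo barks", "boo barks at 2 computers"]
eager_counts = run_eager(PIPELINE, lines)
streaming_counts = run_streaming(PIPELINE, lines)
print(eager_counts == streaming_counts)
```

The point of the sketch is that the pipeline definition never mentions how it is executed, so swapping the backend does not require rewriting the user's code.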
  • 29. BEAM Languages ● JVM: Scala*, Java, etc. ● non-JVM: Python w/Go and more coming
  • 30. BEAM Beyond the JVM ● This part doesn’t work outside of Google’s hosted environment yet, so I’m going to be light on the details ● tl;dr : uses grpc / protobuf ● But exciting new plans (w/ some early code) to unify the runners and ease the support of different languages (called SDKS) ○ See https://beam.apache.org/contribute/portability/
  • 31. BEAM Beyond the JVM: Random collab branch *ish *ish *ish Nick portability *ish
  • 32. What do (some of) the different APIs look like? ● Everyone's favourite: Streaming Word Count Example ● And then windowed wordcount! ● (And also a peek at TensorFlow in case anyone is trying to raise a series A)
  • 33. Spark wordcount (Python*)
        # Split the lines into words
        words = lines.select(
            # explode turns each item in an array into a separate row
            explode(
                split(lines.value, ' ')
            ).alias('word')
        )
        # Generate running word count
        wordCounts = words.groupBy('word').count()
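To make explicit what a "running word count" computes, here is a plain-Python sketch with no Spark involved. It is an illustration of the result, not of how Structured Streaming executes the query; the function name `running_word_count` is hypothetical, and it splits on a single space to mirror `split(lines.value, ' ')`.

```python
from collections import Counter


def running_word_count(lines):
    """Yield a snapshot of the word counts after each incoming line."""
    counts = Counter()
    for line in lines:
        # Split on single spaces, then "explode" the resulting list
        # into individual words, one count update per word.
        for word in line.split(" "):
            counts[word] += 1
        yield dict(counts)


snapshots = list(running_word_count(["hi boo", "boo"]))
for snapshot in snapshots:
    print(snapshot)
```

Each snapshot plays the role of the updated result table after a new batch of lines arrives.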
  • 34. Flink wordcount (Java)
        DataStream<String> text = ...
        DataStream<Tuple2<String, Integer>> counts = text
            // lower-case each line and split it on runs of non-word characters
            .map(line -> line.toLowerCase().split("\\W+"))
            .flatMap((String[] tokens, Collector<Tuple2<String, Integer>> out) -> {
                Arrays.stream(tokens)
                    .forEach(t -> out.collect(new Tuple2<>(t, 1)));
            })
            // group by the tuple field "0" and sum up tuple field "1"
            .keyBy(0)
            .sum(1);
  • 35. BEAM wordcount (Java)
        p.apply("ExtractWords", ParDo.of(new DoFn<String, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                // split each line on runs of non-word characters
                for (String word : c.element().split("\\W+")) {
                    c.output(word);
                }
            }
        }))
        .apply(Count.<String>perElement())
  • 36. Go Wordcount (BEAM)
        func CountWords(s beam.Scope, lines beam.PCollection) beam.PCollection {
            s = s.Scope("CountWords")
            // Convert lines of text into individual words.
            col := beam.ParDo(s, extractFn, lines)
            // Count the number of times each word occurs.
            return stats.Count(s, col)
        }
        Ada Doglace
  • 37. Go Wordcount (BEAM) continued
        p := beam.NewPipeline()
        s := p.Root()
        lines := textio.Read(s, *input)
        counted := CountWords(s, lines)
        formatted := beam.ParDo(s, formatFn, counted)
        textio.Write(s, *output, formatted)
  • 38. What about windowed word count? Trish Hamme Christer van der Meeren
  • 39. Flink windowed wordcount (Scala)
        val counts: DataStream[(String, Int)] = text
          // split up the lines in pairs (2-tuple) containing: (word,1)
          .flatMap(_.toLowerCase.split("\\W+"))
          .map((_, 1))
          .keyBy(0)
          // create windows of windowSize records slided every slideSize records
          .countWindow(windowSize, slideSize)
          // sum up tuple field "1"
          .sum(1)
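The comment "windows of windowSize records slided every slideSize records" is easier to follow with a small model. The sketch below is a simplified plain-Python approximation of count-based sliding windows per key, written under the assumption that a key's window fires after every `slide_size` elements and covers at most the last `window_size` elements for that key; it is not Flink's implementation, and `sliding_count_windows` is a hypothetical name.

```python
from collections import defaultdict, deque


def sliding_count_windows(words, window_size, slide_size):
    """Yield (word, count) each time a key's window fires.

    Simplified model: per key, fire after every `slide_size` elements and
    sum over at most the last `window_size` elements seen for that key.
    """
    recent = defaultdict(lambda: deque(maxlen=window_size))
    seen = defaultdict(int)
    for word in words:
        recent[word].append(1)  # the (word, 1) pair, keyed by word
        seen[word] += 1
        if seen[word] % slide_size == 0:
            yield word, sum(recent[word])


words = "boo hi boo boo hi boo boo boo".split()
results = list(sliding_count_windows(words, window_size=4, slide_size=2))
print(results)
```

Because windows are counted per key, a rare word may wait a long time before its window fires, which is one practical difference from time-based windows.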
  • 40. Multi-language pipelines?!? ● Totally! I mean kind-of. Spark is ok right? ○ Beam is working on it :) #comejoinus ○ Easier for Python calling Java ○ If SQL counts, for sure. If not, yes*. ● Technically what happens when you use Scala Spark + Tensorflow ● And same with Beam, etc. ● Right now painful to write in practice in OSS land outside of libraries (commercial vendors have some solutions but I stick to OSS when I can) ● So let’s focus on libraries (and look ahead at the more general case) Jennifer C.
  • 41. A proof of concept: Sparkling ML ● A place for useful Spark ML pipeline stages to live ○ Including both feature transformers and estimators ● The why ○ Spark ML can’t keep up with every new algorithm ○ Spark ML is largely Scala based, cool Python tools for NLP exist as well ● Lots of cool ML on Spark tools exist, but many don’t play nice with Spark ML or together. ● We make it easier to expose Python transformers into Scala land and vice versa. ● Our repo is at: https://github.com/sparklingpandas/sparklingml
  • 42. Supporting non-English languages With Spacy, so you know more than English*
        def inner(inputString):
            nlp = SpacyMagic.get(lang)
            def spacyTokenToDict(token):
                """Convert the input token into a dictionary"""
                return dict(map(lookup_field_or_none, fields))
            return list(map(spacyTokenToDict, list(nlp(inputString))))
        test_t51
  • 43. And from the JVM:
        val transformer = new SpacyTokenizePython()
        transformer.setLang("en")
        val input = spark.createDataset(
          List(InputData("hi boo"), InputData("boo")))
        transformer.setInputCol("input")
        transformer.setOutputCol("output")
        val result = transformer.transform(input).collect()
        Alexy Khrabrov
  • 44. How might this all play out? ● A portability layer like Beam could save us ○ And commoditize the execution engine layer ● One execution engine becomes good enough at everything ● Instead of a compiler-like unifier we see something like streaming SQL become our unifier ○ Personally you can pry my functions out of my cold dead hands but that’s all right ● People realize their big data problem is actually three small data problems in a trench coat Shawn Carpenter
  • 45. What does this mean for special systems? ● More special systems will run on top of the same general purpose engine ○ Important to work with general purpose engines to make sure they will serve the future special needs. ● Think about inputs and outputs and how to share them nicely and efficiently with other systems (please not just files) ● Some of them will become more like libraries ○ This is perhaps “less cool” than being a stand alone system, but maybe better ● More collaboration hopefully
  • 46. What does “sharing” look like?
        val sparkSql = ...
        // Start hive on the engine
        HiveThriftServer2.startWithContext(sparkSql)
        // Start snappy data
        val snappy = new org.apache.spark.sql.SnappySession(spark.sparkContext)
        // Do cool things with both
        ...
  • 47. Want to help? ● Lots of places to contribute ● We’d love help on Spark and/or BEAM ○ Improved execution engine support in BEAM ○ More data sources & specialized format support (everywhere) ○ Better Python & non-JVM language support ● I’m sure Flink would too, but I’m less involved with them
  • 48. Learning Spark; Fast Data Processing with Spark (Out of Date); Fast Data Processing with Spark (2nd edition); Advanced Analytics with Spark; Coming soon: Spark in Action; High Performance Spark; Learning PySpark
  • 49. High Performance Spark! Available today! You can buy it from that scrappy Seattle bookstore, Jeff Bezos needs another newspaper and I want a cup of coffee. http://bit.ly/hkHighPerfSpark
  • 50. And some upcoming talks: ● June ○ Spark Summit SF (Accelerating Tensorflow & Accelerating Python + Dependencies) ○ Scala Days NYC ○ FOSS Back Stage & BBuzz (Berlin) ● July ○ Curry On Amsterdam ○ OSCON Portland ● September ○ Strata NYC ● October ○ Reversim (Tel Aviv)
  • 51. That’s all Folks! Happy GDPR Friday! - May your last minute deploys not break too much! If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark Pssst: Have feedback on the presentation? Give me a shout (holden@pigscanfly.ca) if you feel comfortable doing so :)