SlideShare a Scribd company logo
1 of 46
Download to read offline
Accelerating
Big Data Beyond the JVM
Engine noises go here
@holdenkarau
Holden:
● My name is Holden Karau
● Prefered pronouns are she/her
● Developer Advocate at Google
● Apache Spark PMC :)
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & High Performance Spark
● @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Github https://github.com/holdenk
● Spark Videos http://bit.ly/holdenSparkVideos
Who I think you wonderful humans are?
● Nice enough people
● Don’t mind pictures of cats
● Might know some the different distributed systems talked about
● Possibly know some Python or R
● Or are tired of scala/ java
Lori Erickson
What are these systems for?
● “Big Data”
○ Which varies from: it doesn’t fit on your macbook to
the largest AWS/GCP server you can rent)
● Involve both the distribution of Data & Processing
● Most are written in the JVM (but not all)
Fat Cat Moe
What will be covered?
● A more detailed look at the current state of PySpark
● Why it isn’t good enough
● Why things are finally changing
● A brief tour of options for non-JVM languages in the Big Data space
● My even less subtle attempts to get you to buy my new book
● Pictures of cats & stuffed animals
● tl;dr - We’ve* made some bad** choices historically, and projects like Arrow &
friends can save us from some of these (yay!)
What’s the state of non-JVM big data?
Most of the tools are built in the JVM, so how do we play together?
● Pickling, Strings, JSON, XML, oh my!
● Unix pipes
● Sockets
What about if we don’t want to copy the data all the time?
● Or standalone “pure”* re-implementations of everything
○ Reasonable option for things like Kafka where you would have the I/O regardless.
○ Also cool projects like dask (pure python) -- but hard to talk to existing ecosystem
David Brown
PySpark:
● The Python interface to Spark
● Same general technique used as the bases for the C#, R, Julia, etc.
interfaces to Spark
● Fairly mature, integrates well-ish into the ecosystem, less a Pythonrific API
● Has some serious performance hurdles from the design
Yes, we have wordcount! :p
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count =
(words.map(lambda x: (x, 1))
.reduceByKey(lambda x, y: x+y))
word_count.saveAsTextFile(output)
No data is read or
processed until after
this line
This is an “action”
which forces spark to
evaluate the RDD
These are still
combined and
executed in
one python
executor
Trish Hamme
A quick detour into PySpark’s internals
+ + JSON
Spark in Scala, how does PySpark work?
● Py4J + pickling + JSON and magic
○ Py4j in the driver
○ Pipes to start python process from java exec
○ cloudPickle to serialize data between JVM and python executors
(transmitted via sockets)
○ Json for dataframe schema
● Data from Spark worker serialized and piped to Python
worker --> then piped back to jvm
○ Multiple iterator-to-iterator transformations are still pipelined :)
○ So serialization happens only once per stage
● Spark SQL (and DataFrames) avoid some of this
kristin klein
So what does that look like?
Driver
py4j
Worker 1
Worker K
pipe
pipe
And in flink….
Driver
custom
Worker 1
Worker K
mmap
mmap
So how does that impact PySpark?
● Double serialization cost makes everything more
expensive
● Python worker startup takes a bit of extra time
● Python memory isn’t controlled by the JVM - easy to go
over container limits if deploying on YARN or similar
● Error messages make ~0 sense
● Spark Features aren’t automatically exposed, but
exposing them is normally simple
Our saviour from serialization: DataFrames
● For the most part keeps data in the JVM
○ Notable exception is UDFs written in Python
● Takes our python calls and turns it into a query plan if
we need more than the native operations in Spark’s
DataFrames
● be wary of Distributed Systems bringing claims of
usability….
Andy
Blackledge
So what are Spark DataFrames?
● More than SQL tables
● Not Pandas or R DataFrames
● Semi-structured (have schema information)
● tabular
● work on expression as well as lambdas
○ e.g. df.filter(df.col(“happy”) == true) instead of rdd.filter(lambda x:
x.happy == true))
● Not a subset of Spark “Datasets” - since Dataset API
isn’t exposed in Python yet :(
Quinn Dombrowski
Word count w/Dataframes
df = sqlCtx.read.load(src)
# Returns an RDD
words = df.select("text").flatMap(lambda x: x.text.split(" "))
words_df = words.map(
lambda x: Row(word=x, cnt=1)).toDF()
word_count = words_df.groupBy("word").sum()
word_count.write.format("parquet").save("wc.parquet")
Still have the double
serialization here :(
We can see the difference easily:
Andrew Skudder
*
*Vendor
benchmark.
Trust but verify.
The “future”*: faster interchange
● By future I mean availability starting in the next 3-6 months (with more
improvements after).
○ Yes much of this code exists, it just isn’t released yet so I’m sure we’ll find all sorts of bugs
and ways to improve.
○ Relatedly you can help us test in Spark 2.3 when we start the RC process to catch bug early!
● Unifying our cross-language experience
○ And not just “normal” languages, CUDA counts yo
Tambako The Jaguar
Andrew Skudder
*Arrow: likely the future. I really hope so. Spark 2.3 and beyond!
* *
What does the future look like?*
*Source: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html.
*Vendor
benchmark.
Trust but verify.
Arrow (a poorly drawn big data view)
Logos trademarks of their respective projects
Rewriting your code because why not
spark.catalog.registerFunction(
"add", lambda x, y: x + y, IntegerType())
=>
add = pandas_udf(lambda x, y: x + y, IntegerType())
Hadoop “streaming” (Python/R)
● Unix pipes!
● Involves a data copy, formats get sad
● But the overhead of a Map/Reduce task is pretty high anyways...
Lisa Larsson
What do the rest of the systems do?
● Spoiler: mostly it’s not better
● Different tradeoffs, maybe better for your use case but all tradeoffs
Kate Neilan
Kafka: re-implement all the things
● Multiple options for connecting to Kafka from outside of the JVM (yay!)
● They implement the protocol to talk to Kafka (yay!)
● This involves duplicated client work, and sometimes the clients can be slow
(solution, FFI bindings to C instead of Java)
● Buuuut -- we can’t access all of the cool Kafka business (like Kafka Streams)
and features depend on client libraries implementing them (easy to slip below
parity)
Smokey Combs
Dask: a new beginning?
● Pure* python implementation
● Provides real enough DataFrame interface for distributed data
● Also your standard-ish distributed collections
● Multiple backends
● Primary challenge: interacting with the rest of the big data ecosystem
○ Arrow & friends might make this better with time too, buuut….
● See https://dask.pydata.org/en/latest/ &
http://dask.pydata.org/en/latest/spark.html
Lisa Zins
BEAM Beyond the JVM
● Non JVM BEAM doesn’t work outside of Google’s environment yet
● tl;dr : uses grpc / protobuf
○ Similar to the common design but with more efficient representations (often)
● But exciting new plans to unify the runners and ease the support of different
languages (called SDKS)
○ See https://beam.apache.org/contribute/portability/
● If this is exciting, you can come join me on making BEAM work in Python3
○ Yes we still don’t have that :(
○ But we're getting closer & you can come join us on BEAM-2874 :D
References
● Apache Arrow: https://arrow.apache.org/
● Brian (IBM) on initial Spark + Arrow
https://arrow.apache.org/blog/2017/07/26/spark-arrow/
● Li Jin (two sigma)
https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspar
k.html
● Bill Maimone
https://blogs.nvidia.com/blog/2017/06/27/gpu-computation-visualization/
PROR. Crap Mariner
Learning Spark
Fast Data
Processing with
Spark
(Out of Date)
Fast Data
Processing with
Spark
(2nd edition)
Advanced
Analytics with
Spark
Spark in Action
High Performance SparkLearning PySpark
High Performance Spark!
You can buy it today, the O’Reilly folks have it upstairs (&
so does Amazon).
Only one chapter on non-JVM and nothing on Arrow, I’m
sorry.
Cats love it*
*Or at least the box it comes in. If buying for a cat, get print
rather than e-book.
Spark Videos
● Apache Spark Youtube Channel
● My Spark videos on YouTube -
○ http://bit.ly/holdenSparkVideos
● Spark Summit 2014 training
● Paco’s Introduction to Apache Spark
And some upcoming talks:
● Maybe office hours @ 6pm today if interest?*
○ Tweet at me (@holdenkarau) if interested+free, otherwise I’ll go find
chocolate :)
● JFokus - Surprisingly mostly talking about Python….
● Strata San Jose
● Strata London
● QCon Brasil
● QCon AI SF
● Know of interesting conferences/webinar things that
should be on my radar? Let me know!
k thnx bye :)
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
I need to give a testing talk next
month, help a “friend” out.
Will tweet results
“eventually” @holdenkarau
Do you want more realistic
benchmarks? Share your UDFs!
http://bit.ly/pySparkUDF
Pssst: Have feedback on the presentation? Give me a
shout (holden@pigscanfly.ca) if you feel comfortable doing
so :)
Give feedback on this presentation
http://bit.ly/holdenTalkFeedback
Beyond wordcount: depencies?
● Your machines probably already have pandas
○ But maybe an old version
● But they might not have “special_business_logic”
○ Very special business logic, no one wants change fortran code*.
● Option 1: Talk to your vendor**
● Option 2: Try some open source software
● Option 3: Containers containers containers***
● Option 4: Any of this mornings talks
● We’re going to focus on option 2!
*Because it’s perfect, it is fortran after all.
** I don’t like this option because the vendor I work for doesn’t have an answer.
*** Great for your resume!
coffee_boat to the rescue*
# You can tell it's alpha cause were installing from github
!pip install --upgrade
git+https://github.com/nteract/coffee_boat.git
# Use the coffee boat
from coffee_boat import Captain
captain = Captain(accept_conda_license=True)
captain.add_pip_packages("pyarrow", "edtf")
captain.launch_ship()
sc = SparkContext(master="yarn")
# You can now use pyarrow & edtf
captain.add_pip_packages("yourmagic")
# You can now use yourmagic in transformations!
Bonus Slides
Maybe you ask a question and we go here :)
Why now?
● There’s been better formats/options for a long time
● JVM devs want to use libraries in other languages with lots of data
○ e.g. startup + Deep Learning + ? => profit
● Arrow has solved the chicken-egg problem by building not just the chicken &
the egg, but also a hen house
Andrew Mager
What’s still going to hurt?
● Per-record streaming
○ Arrow is probably less awesome for serialization
○ But it’s still better than we had before
● Debugging is just going to get worse
● Custom data formats
○ Time to bust out the C++ code and a bottle of scotch / matte as appropriate
○ Or just accept the “legacy” performance
bryanbug
We can do that w/Kafka streams..
● Why bother learning from our mistakes?
● Or more seriously, the mistakes weren’t that bad...
Our “special” business logic
def transform(input):
"""
Transforms the supplied input.
"""
return str(len(input))
Pargon
Let’s pretend all the world is a string:
override def transform(value: String): String = {
// WARNING: This may summon cuthuluhu
dataOut.writeInt(value.getBytes.size)
dataOut.write(value.getBytes)
dataOut.flush()
val resultSize = dataIn.readInt()
val result = new Array[Byte](resultSize)
dataIn.readFully(result)
// Assume UTF8, what could go wrong? :p
new String(result)
}
From https://github.com/holdenk/kafka-streams-python-cthulhu
Then make an instance to use it...
val testFuncFile =
"kafka_streams_python_cthulhu/strlen.py"
stream.transformValues(
PythonStringValueTransformerSupplier(testFuncFile))
// Or we could wrap this in the bridge but thats effort.
From https://github.com/holdenk/kafka-streams-python-cthulhu
Let’s pretend all the world is a string:
def main(socket):
while (True):
input_length = _read_int(socket)
data = socket.read(input_length)
result = transform(data)
resultBytes = result.encode()
_write_int(len(resultBytes), socket)
socket.write(resultBytes)
socket.flush()
From https://github.com/holdenk/kafka-streams-python-cthulhu
What does that let us do?
● You can add a map stage with your data scientists
Python code in the middle
● You’re limited to strings*
● Still missing the “driver side” integration (e.g. the
interface requires someone to make a Scala class at
some point)
What about things other than strings?
Use another system
● Like Spark! (oh wait) or BEAM* or FLINK*?
Write it in a format Python can understand:
● Pickling (from Java)
● JSON
● XML
Purely Python solutions
● Currently roll-your-own (but not that bad)
*These are also JVM based solutions calling into Python. I’m not saying they will also summon Cuthulhu, I’m just saying hang onto

More Related Content

What's hot

Building Recoverable (and optionally async) Pipelines with Apache Spark (+ s...
Building Recoverable (and optionally async) Pipelines with Apache Spark  (+ s...Building Recoverable (and optionally async) Pipelines with Apache Spark  (+ s...
Building Recoverable (and optionally async) Pipelines with Apache Spark (+ s...Holden Karau
 
Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016Holden Karau
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Holden Karau
 
Extending spark ML for custom models now with python!
Extending spark ML for custom models  now with python!Extending spark ML for custom models  now with python!
Extending spark ML for custom models now with python!Holden Karau
 
The magic of (data parallel) distributed systems and where it all breaks - Re...
The magic of (data parallel) distributed systems and where it all breaks - Re...The magic of (data parallel) distributed systems and where it all breaks - Re...
The magic of (data parallel) distributed systems and where it all breaks - Re...Holden Karau
 
Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016Holden Karau
 
Getting started contributing to Apache Spark
Getting started contributing to Apache SparkGetting started contributing to Apache Spark
Getting started contributing to Apache SparkHolden Karau
 
A super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAMA super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAMHolden Karau
 
Scaling with apache spark (a lesson in unintended consequences) strange loo...
Scaling with apache spark (a lesson in unintended consequences)   strange loo...Scaling with apache spark (a lesson in unintended consequences)   strange loo...
Scaling with apache spark (a lesson in unintended consequences) strange loo...Holden Karau
 
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018Holden Karau
 
Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017Holden Karau
 
Big data beyond the JVM - DDTX 2018
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
Big data beyond the JVM - DDTX 2018Holden Karau
 
Spark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New YorkSpark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New YorkHolden Karau
 
Holden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom ModelsHolden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom Modelssparktc
 
PySpark on Kubernetes @ Python Barcelona March Meetup
PySpark on Kubernetes @ Python Barcelona March MeetupPySpark on Kubernetes @ Python Barcelona March Meetup
PySpark on Kubernetes @ Python Barcelona March MeetupHolden Karau
 
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016Holden Karau
 
Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016Holden Karau
 
Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Holden Karau
 
Powering tensorflow with big data (apache spark, flink, and beam) dataworks...
Powering tensorflow with big data (apache spark, flink, and beam)   dataworks...Powering tensorflow with big data (apache spark, flink, and beam)   dataworks...
Powering tensorflow with big data (apache spark, flink, and beam) dataworks...Holden Karau
 
Validating big data pipelines - Scala eXchange 2018
Validating big data pipelines -  Scala eXchange 2018Validating big data pipelines -  Scala eXchange 2018
Validating big data pipelines - Scala eXchange 2018Holden Karau
 

What's hot (20)

Building Recoverable (and optionally async) Pipelines with Apache Spark (+ s...
Building Recoverable (and optionally async) Pipelines with Apache Spark  (+ s...Building Recoverable (and optionally async) Pipelines with Apache Spark  (+ s...
Building Recoverable (and optionally async) Pipelines with Apache Spark (+ s...
 
Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016
 
Extending spark ML for custom models now with python!
Extending spark ML for custom models  now with python!Extending spark ML for custom models  now with python!
Extending spark ML for custom models now with python!
 
The magic of (data parallel) distributed systems and where it all breaks - Re...
The magic of (data parallel) distributed systems and where it all breaks - Re...The magic of (data parallel) distributed systems and where it all breaks - Re...
The magic of (data parallel) distributed systems and where it all breaks - Re...
 
Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016
 
Getting started contributing to Apache Spark
Getting started contributing to Apache SparkGetting started contributing to Apache Spark
Getting started contributing to Apache Spark
 
A super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAMA super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAM
 
Scaling with apache spark (a lesson in unintended consequences) strange loo...
Scaling with apache spark (a lesson in unintended consequences)   strange loo...Scaling with apache spark (a lesson in unintended consequences)   strange loo...
Scaling with apache spark (a lesson in unintended consequences) strange loo...
 
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
 
Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017
 
Big data beyond the JVM - DDTX 2018
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
Big data beyond the JVM - DDTX 2018
 
Spark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New YorkSpark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New York
 
Holden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom ModelsHolden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom Models
 
PySpark on Kubernetes @ Python Barcelona March Meetup
PySpark on Kubernetes @ Python Barcelona March MeetupPySpark on Kubernetes @ Python Barcelona March Meetup
PySpark on Kubernetes @ Python Barcelona March Meetup
 
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
 
Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016
 
Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016
 
Powering tensorflow with big data (apache spark, flink, and beam) dataworks...
Powering tensorflow with big data (apache spark, flink, and beam)   dataworks...Powering tensorflow with big data (apache spark, flink, and beam)   dataworks...
Powering tensorflow with big data (apache spark, flink, and beam) dataworks...
 
Validating big data pipelines - Scala eXchange 2018
Validating big data pipelines -  Scala eXchange 2018Validating big data pipelines -  Scala eXchange 2018
Validating big data pipelines - Scala eXchange 2018
 

Similar to Accelerating Big Data beyond the JVM - Fosdem 2018

Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?Holden Karau
 
Powering tensor flow with big data using apache beam, flink, and spark cern...
Powering tensor flow with big data using apache beam, flink, and spark   cern...Powering tensor flow with big data using apache beam, flink, and spark   cern...
Powering tensor flow with big data using apache beam, flink, and spark cern...Holden Karau
 
Sharing (or stealing) the jewels of python with big data & the jvm (1)
Sharing (or stealing) the jewels of python with big data & the jvm (1)Sharing (or stealing) the jewels of python with big data & the jvm (1)
Sharing (or stealing) the jewels of python with big data & the jvm (1)Holden Karau
 
Keeping the fun in functional w/ Apache Spark @ Scala Days NYC
Keeping the fun in functional   w/ Apache Spark @ Scala Days NYCKeeping the fun in functional   w/ Apache Spark @ Scala Days NYC
Keeping the fun in functional w/ Apache Spark @ Scala Days NYCHolden Karau
 
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...confluent
 
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018Holden Karau
 
Simplifying training deep and serving learning models with big data in python...
Simplifying training deep and serving learning models with big data in python...Simplifying training deep and serving learning models with big data in python...
Simplifying training deep and serving learning models with big data in python...Holden Karau
 
An introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckAn introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckData Con LA
 
Debugging PySpark - Spark Summit East 2017
Debugging PySpark - Spark Summit East 2017Debugging PySpark - Spark Summit East 2017
Debugging PySpark - Spark Summit East 2017Holden Karau
 
Powering Tensorflow with big data using Apache Beam, Flink, and Spark - OSCON...
Powering Tensorflow with big data using Apache Beam, Flink, and Spark - OSCON...Powering Tensorflow with big data using Apache Beam, Flink, and Spark - OSCON...
Powering Tensorflow with big data using Apache Beam, Flink, and Spark - OSCON...Holden Karau
 
Contributing to Apache Spark 3
Contributing to Apache Spark 3Contributing to Apache Spark 3
Contributing to Apache Spark 3Holden Karau
 
Getting Started Contributing to Apache Spark – From PR, CR, JIRA, and Beyond
Getting Started Contributing to Apache Spark – From PR, CR, JIRA, and BeyondGetting Started Contributing to Apache Spark – From PR, CR, JIRA, and Beyond
Getting Started Contributing to Apache Spark – From PR, CR, JIRA, and BeyondDatabricks
 
Intro - End to end ML with Kubeflow @ SignalConf 2018
Intro - End to end ML with Kubeflow @ SignalConf 2018Intro - End to end ML with Kubeflow @ SignalConf 2018
Intro - End to end ML with Kubeflow @ SignalConf 2018Holden Karau
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMHolden Karau
 
Lightning Fast Dataframes with Polars
Lightning Fast Dataframes with PolarsLightning Fast Dataframes with Polars
Lightning Fast Dataframes with PolarsAlberto Danese
 
Debugging Apache Spark - Scala & Python super happy fun times 2017
Debugging Apache Spark -   Scala & Python super happy fun times 2017Debugging Apache Spark -   Scala & Python super happy fun times 2017
Debugging Apache Spark - Scala & Python super happy fun times 2017Holden Karau
 
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Holden Karau
 
Introduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopIntroduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopHolden Karau
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark MeetupBeyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark MeetupHolden Karau
 

Similar to Accelerating Big Data beyond the JVM - Fosdem 2018 (20)

Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?
 
Powering tensor flow with big data using apache beam, flink, and spark cern...
Powering tensor flow with big data using apache beam, flink, and spark   cern...Powering tensor flow with big data using apache beam, flink, and spark   cern...
Powering tensor flow with big data using apache beam, flink, and spark cern...
 
Sharing (or stealing) the jewels of python with big data & the jvm (1)
Sharing (or stealing) the jewels of python with big data & the jvm (1)Sharing (or stealing) the jewels of python with big data & the jvm (1)
Sharing (or stealing) the jewels of python with big data & the jvm (1)
 
Keeping the fun in functional w/ Apache Spark @ Scala Days NYC
Keeping the fun in functional   w/ Apache Spark @ Scala Days NYCKeeping the fun in functional   w/ Apache Spark @ Scala Days NYC
Keeping the fun in functional w/ Apache Spark @ Scala Days NYC
 
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
 
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
 
Simplifying training deep and serving learning models with big data in python...
Simplifying training deep and serving learning models with big data in python...Simplifying training deep and serving learning models with big data in python...
Simplifying training deep and serving learning models with big data in python...
 
An introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckAn introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuck
 
Debugging PySpark - Spark Summit East 2017
Debugging PySpark - Spark Summit East 2017Debugging PySpark - Spark Summit East 2017
Debugging PySpark - Spark Summit East 2017
 
Powering Tensorflow with big data using Apache Beam, Flink, and Spark - OSCON...
Powering Tensorflow with big data using Apache Beam, Flink, and Spark - OSCON...Powering Tensorflow with big data using Apache Beam, Flink, and Spark - OSCON...
Powering Tensorflow with big data using Apache Beam, Flink, and Spark - OSCON...
 
Contributing to Apache Spark 3
Contributing to Apache Spark 3Contributing to Apache Spark 3
Contributing to Apache Spark 3
 
Getting Started Contributing to Apache Spark – From PR, CR, JIRA, and Beyond
Getting Started Contributing to Apache Spark – From PR, CR, JIRA, and BeyondGetting Started Contributing to Apache Spark – From PR, CR, JIRA, and Beyond
Getting Started Contributing to Apache Spark – From PR, CR, JIRA, and Beyond
 
Intro - End to end ML with Kubeflow @ SignalConf 2018
Intro - End to end ML with Kubeflow @ SignalConf 2018Intro - End to end ML with Kubeflow @ SignalConf 2018
Intro - End to end ML with Kubeflow @ SignalConf 2018
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVM
 
Lightning Fast Dataframes with Polars
Lightning Fast Dataframes with PolarsLightning Fast Dataframes with Polars
Lightning Fast Dataframes with Polars
 
Debugging Apache Spark - Scala & Python super happy fun times 2017
Debugging Apache Spark -   Scala & Python super happy fun times 2017Debugging Apache Spark -   Scala & Python super happy fun times 2017
Debugging Apache Spark - Scala & Python super happy fun times 2017
 
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
 
Introduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopIntroduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines Workshop
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark MeetupBeyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
 

Recently uploaded

Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作ys8omjxb
 
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书rnrncn29
 
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Sonam Pathan
 
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一z xss
 
Top 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxTop 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxDyna Gilbert
 
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书zdzoqco
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Paul Calvano
 
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa494f574xmv
 
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书rnrncn29
 
PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationLinaWolf1
 
NSX-T and Service Interfaces presentation
NSX-T and Service Interfaces presentationNSX-T and Service Interfaces presentation
NSX-T and Service Interfaces presentationMarko4394
 
Q4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptxQ4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptxeditsforyah
 
SCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is prediSCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is predieusebiomeyer
 
Contact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New DelhiContact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New Delhimiss dipika
 
Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Sonam Pathan
 

Recently uploaded (17)

Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
 
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
 
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
 
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
 
Top 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxTop 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptx
 
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
 
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24
 
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa
 
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
 
PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 Documentation
 
NSX-T and Service Interfaces presentation
NSX-T and Service Interfaces presentationNSX-T and Service Interfaces presentation
NSX-T and Service Interfaces presentation
 
Hot Sexy call girls in Rk Puram 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in  Rk Puram 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in  Rk Puram 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Rk Puram 🔝 9953056974 🔝 Delhi escort Service
 
Q4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptxQ4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptx
 
SCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is prediSCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is predi
 
Contact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New DelhiContact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New Delhi
 
Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170
 

Accelerating Big Data beyond the JVM - Fosdem 2018

  • 1. Accelerating Big Data Beyond the JVM Engine noises go here @holdenkarau
  • 2. Holden: ● My name is Holden Karau ● Prefered pronouns are she/her ● Developer Advocate at Google ● Apache Spark PMC :) ● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon ● co-author of Learning Spark & High Performance Spark ● @holdenkarau ● Slide share http://www.slideshare.net/hkarau ● Linkedin https://www.linkedin.com/in/holdenkarau ● Github https://github.com/holdenk ● Spark Videos http://bit.ly/holdenSparkVideos
  • 3.
  • 4. Who I think you wonderful humans are? ● Nice enough people ● Don’t mind pictures of cats ● Might know some the different distributed systems talked about ● Possibly know some Python or R ● Or are tired of scala/ java Lori Erickson
  • 5. What are these systems for? ● “Big Data” ○ Which varies from: it doesn’t fit on your macbook to the largest AWS/GCP server you can rent) ● Involve both the distribution of Data & Processing ● Most are written in the JVM (but not all) Fat Cat Moe
  • 6. What will be covered? ● A more detailed look at the current state of PySpark ● Why it isn’t good enough ● Why things are finally changing ● A brief tour of options for non-JVM languages in the Big Data space ● My even less subtle attempts to get you to buy my new book ● Pictures of cats & stuffed animals ● tl;dr - We’ve* made some bad** choices historically, and projects like Arrow & friends can save us from some of these (yay!)
  • 7. What’s the state of non-JVM big data? Most of the tools are built in the JVM, so how do we play together? ● Pickling, Strings, JSON, XML, oh my! ● Unix pipes ● Sockets What about if we don’t want to copy the data all the time? ● Or standalone “pure”* re-implementations of everything ○ Reasonable option for things like Kafka where you would have the I/O regardless. ○ Also cool projects like dask (pure python) -- but hard to talk to existing ecosystem David Brown
  • 8. PySpark: ● The Python interface to Spark ● Same general technique used as the bases for the C#, R, Julia, etc. interfaces to Spark ● Fairly mature, integrates well-ish into the ecosystem, less a Pythonrific API ● Has some serious performance hurdles from the design
  • 9. Yes, we have wordcount! :p lines = sc.textFile(src) words = lines.flatMap(lambda x: x.split(" ")) word_count = (words.map(lambda x: (x, 1)) .reduceByKey(lambda x, y: x+y)) word_count.saveAsTextFile(output) No data is read or processed until after this line This is an “action” which forces spark to evaluate the RDD These are still combined and executed in one python executor Trish Hamme
  • 10. A quick detour into PySpark’s internals + + JSON
  • 11. Spark in Scala, how does PySpark work? ● Py4J + pickling + JSON and magic ○ Py4j in the driver ○ Pipes to start python process from java exec ○ cloudPickle to serialize data between JVM and python executors (transmitted via sockets) ○ Json for dataframe schema ● Data from Spark worker serialized and piped to Python worker --> then piped back to jvm ○ Multiple iterator-to-iterator transformations are still pipelined :) ○ So serialization happens only once per stage ● Spark SQL (and DataFrames) avoid some of this kristin klein
  • 12. So what does that look like? Driver py4j Worker 1 Worker K pipe pipe
  • 14. So how does that impact PySpark? ● Double serialization cost makes everything more expensive ● Python worker startup takes a bit of extra time ● Python memory isn’t controlled by the JVM - easy to go over container limits if deploying on YARN or similar ● Error messages make ~0 sense ● Spark Features aren’t automatically exposed, but exposing them is normally simple
  • 15. Our saviour from serialization: DataFrames ● For the most part keeps data in the JVM ○ Notable exception is UDFs written in Python ● Takes our python calls and turns it into a query plan if we need more than the native operations in Spark’s DataFrames ● be wary of Distributed Systems bringing claims of usability…. Andy Blackledge
  • 16. So what are Spark DataFrames? ● More than SQL tables ● Not Pandas or R DataFrames ● Semi-structured (have schema information) ● tabular ● work on expression as well as lambdas ○ e.g. df.filter(df.col(“happy”) == true) instead of rdd.filter(lambda x: x.happy == true)) ● Not a subset of Spark “Datasets” - since Dataset API isn’t exposed in Python yet :( Quinn Dombrowski
  • 17. Word count w/Dataframes df = sqlCtx.read.load(src) # Returns an RDD words = df.select("text").flatMap(lambda x: x.text.split(" ")) words_df = words.map( lambda x: Row(word=x, cnt=1)).toDF() word_count = words_df.groupBy("word").sum() word_count.write.format("parquet").save("wc.parquet") Still have the double serialization here :(
  • 18. We can see the difference easily: Andrew Skudder * *Vendor benchmark. Trust but verify.
  • 19. The “future”*: faster interchange ● By future I mean availability starting in the next 3-6 months (with more improvements after). ○ Yes much of this code exists, it just isn’t released yet so I’m sure we’ll find all sorts of bugs and ways to improve. ○ Relatedly you can help us test in Spark 2.3 when we start the RC process to catch bug early! ● Unifying our cross-language experience ○ And not just “normal” languages, CUDA counts yo Tambako The Jaguar
  • 20. Andrew Skudder *Arrow: likely the future. I really hope so. Spark 2.3 and beyond! * *
  • 21. What does the future look like?* *Source: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html. *Vendor benchmark. Trust but verify.
  • 22. Arrow (a poorly drawn big data view) Logos trademarks of their respective projects
  • 23. Rewriting your code because why not spark.catalog.registerFunction( "add", lambda x, y: x + y, IntegerType()) => add = pandas_udf(lambda x, y: x + y, IntegerType())
  • 24. Hadoop “streaming” (Python/R) ● Unix pipes! ● Involves a data copy, formats get sad ● But the overhead of a Map/Reduce task is pretty high anyways... Lisa Larsson
  • 25. What do the rest of the systems do? ● Spoiler: mostly it’s not better ● Different tradeoffs, maybe better for your use case but all tradeoffs Kate Neilan
  • 26. Kafka: re-implement all the things ● Multiple options for connecting to Kafka from outside of the JVM (yay!) ● They implement the protocol to talk to Kafka (yay!) ● This involves duplicated client work, and sometimes the clients can be slow (solution, FFI bindings to C instead of Java) ● Buuuut -- we can’t access all of the cool Kafka business (like Kafka Streams) and features depend on client libraries implementing them (easy to slip below parity) Smokey Combs
  • 27. Dask: a new beginning? ● Pure* python implementation ● Provides real enough DataFrame interface for distributed data ● Also your standard-ish distributed collections ● Multiple backends ● Primary challenge: interacting with the rest of the big data ecosystem ○ Arrow & friends might make this better with time too, buuut…. ● See https://dask.pydata.org/en/latest/ & http://dask.pydata.org/en/latest/spark.html Lisa Zins
  • 28. BEAM Beyond the JVM ● Non JVM BEAM doesn’t work outside of Google’s environment yet ● tl;dr : uses grpc / protobuf ○ Similar to the common design but with more efficient representations (often) ● But exciting new plans to unify the runners and ease the support of different languages (called SDKS) ○ See https://beam.apache.org/contribute/portability/ ● If this is exciting, you can come join me on making BEAM work in Python3 ○ Yes we still don’t have that :( ○ But we're getting closer & you can come join us on BEAM-2874 :D
  • 29. References ● Apache Arrow: https://arrow.apache.org/ ● Brian (IBM) on initial Spark + Arrow https://arrow.apache.org/blog/2017/07/26/spark-arrow/ ● Li Jin (two sigma) https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspar k.html ● Bill Maimone https://blogs.nvidia.com/blog/2017/06/27/gpu-computation-visualization/ PROR. Crap Mariner
  • 30. Learning Spark Fast Data Processing with Spark (Out of Date) Fast Data Processing with Spark (2nd edition) Advanced Analytics with Spark Spark in Action High Performance SparkLearning PySpark
  • 31. High Performance Spark! You can buy it today, the O’Reilly folks have it upstairs (& so does Amazon). Only one chapter on non-JVM and nothing on Arrow, I’m sorry. Cats love it* *Or at least the box it comes in. If buying for a cat, get print rather than e-book.
  • 32. Spark Videos ● Apache Spark Youtube Channel ● My Spark videos on YouTube - ○ http://bit.ly/holdenSparkVideos ● Spark Summit 2014 training ● Paco’s Introduction to Apache Spark
  • 33. And some upcoming talks: ● Maybe office hours @ 6pm today if interest?* ○ Tweet at me (@holdenkarau) if interested+free, otherwise I’ll go find chocolate :) ● JFokus - Surprisingly mostly talking about Python…. ● Strata San Jose ● Strata London ● QCon Brasil ● QCon AI SF ● Know of interesting conferences/webinar things that should be on my radar? Let me know!
  • 34. k thnx bye :) If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark I need to give a testing talk next month, help a “friend” out. Will tweet results “eventually” @holdenkarau Do you want more realistic benchmarks? Share your UDFs! http://bit.ly/pySparkUDF Pssst: Have feedback on the presentation? Give me a shout (holden@pigscanfly.ca) if you feel comfortable doing so :) Give feedback on this presentation http://bit.ly/holdenTalkFeedback
  • 35. Beyond wordcount: depencies? ● Your machines probably already have pandas ○ But maybe an old version ● But they might not have “special_business_logic” ○ Very special business logic, no one wants change fortran code*. ● Option 1: Talk to your vendor** ● Option 2: Try some open source software ● Option 3: Containers containers containers*** ● Option 4: Any of this mornings talks ● We’re going to focus on option 2! *Because it’s perfect, it is fortran after all. ** I don’t like this option because the vendor I work for doesn’t have an answer. *** Great for your resume!
  • 36. coffee_boat to the rescue* # You can tell it's alpha cause were installing from github !pip install --upgrade git+https://github.com/nteract/coffee_boat.git # Use the coffee boat from coffee_boat import Captain captain = Captain(accept_conda_license=True) captain.add_pip_packages("pyarrow", "edtf") captain.launch_ship() sc = SparkContext(master="yarn") # You can now use pyarrow & edtf captain.add_pip_packages("yourmagic") # You can now use yourmagic in transformations!
  • 37. Bonus Slides Maybe you ask a question and we go here :)
  • 38. Why now? ● There’s been better formats/options for a long time ● JVM devs want to use libraries in other languages with lots of data ○ e.g. startup + Deep Learning + ? => profit ● Arrow has solved the chicken-egg problem by building not just the chicken & the egg, but also a hen house Andrew Mager
  • 39. What’s still going to hurt? ● Per-record streaming ○ Arrow is probably less awesome for serialization ○ But it’s still better than we had before ● Debugging is just going to get worse ● Custom data formats ○ Time to bust out the C++ code and a bottle of scotch / matte as appropriate ○ Or just accept the “legacy” performance bryanbug
  • 40. We can do that w/Kafka streams.. ● Why bother learning from our mistakes? ● Or more seriously, the mistakes weren’t that bad...
  • 41. Our “special” business logic def transform(input): """ Transforms the supplied input. """ return str(len(input)) Pargon
  • 42. Let’s pretend all the world is a string: override def transform(value: String): String = { // WARNING: This may summon cuthuluhu dataOut.writeInt(value.getBytes.size) dataOut.write(value.getBytes) dataOut.flush() val resultSize = dataIn.readInt() val result = new Array[Byte](resultSize) dataIn.readFully(result) // Assume UTF8, what could go wrong? :p new String(result) } From https://github.com/holdenk/kafka-streams-python-cthulhu
  • 43. Then make an instance to use it... val testFuncFile = "kafka_streams_python_cthulhu/strlen.py" stream.transformValues( PythonStringValueTransformerSupplier(testFuncFile)) // Or we could wrap this in the bridge but thats effort. From https://github.com/holdenk/kafka-streams-python-cthulhu
  • 44. Let’s pretend all the world is a string: def main(socket): while (True): input_length = _read_int(socket) data = socket.read(input_length) result = transform(data) resultBytes = result.encode() _write_int(len(resultBytes), socket) socket.write(resultBytes) socket.flush() From https://github.com/holdenk/kafka-streams-python-cthulhu
  • 45. What does that let us do? ● You can add a map stage with your data scientists Python code in the middle ● You’re limited to strings* ● Still missing the “driver side” integration (e.g. the interface requires someone to make a Scala class at some point)
  • 46. What about things other than strings? Use another system ● Like Spark! (oh wait) or BEAM* or FLINK*? Write it in a format Python can understand: ● Pickling (from Java) ● JSON ● XML Purely Python solutions ● Currently roll-your-own (but not that bad) *These are also JVM based solutions calling into Python. I’m not saying they will also summon Cuthulhu, I’m just saying hang onto