Every time a new piece of big data technology appears, we see many different specialized implementations of the same concepts, which eventually consolidate down to a few viable options and then frequently end up rolled into a larger project. This talk examines this trend in the big data ecosystem, looks at the exceptions to the "rule", and considers how better interchange formats like Apache Arrow could change this going forward. In addition to general vague happy feelings (or sad ones, depending on your ideas about how software should be made), this talk looks at some specific examples with deep learning, so if anyone is looking for a little bit of pixie dust to sprinkle on a failing business plan to take to Silicon Valley to raise a Series A, you'll get something out of this as well.
Video - https://www.youtube.com/watch?v=P_YKrLFZQJo
Are general purpose big data systems eating the world?
1. Are General Purpose Systems Eating the Big Data World?
with your friend
@holdenkarau
2. Holden:
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google
● Apache Spark PMC, Beam contributor
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & High Performance Spark
● Twitter: @holdenkarau
● SlideShare: http://www.slideshare.net/hkarau
● Code review livestreams & live coding: https://www.twitch.tv/holdenkarau /
https://www.youtube.com/user/holdenkarau
● Spark Talk Videos http://bit.ly/holdenSparkVideos
● Talk feedback (if you are so inclined): http://bit.ly/holdenTalkFeedback
3.
4. Some things that may color my views:
● I’m a Spark committer -- if Spark continues to win I can probably make more $s
● My employer cares about BEAM (and Spark and other things)
● I work primarily in Python & Scala these days
● I like functional programming
● Probably some others I’m forgetting
On the other hand:
● I’ve worked on Spark for a long time and know a lot of its faults
● My goals are pretty flexible
● I have x86 assembly code tattooed on my back
5. Who is Boo?
● Boo uses she/her pronouns (as I told the Texas house committee)
● Best doge
● Lots of experience barking at computers to make them go faster
● Author of “Learning to Bark” & “High Performance Barking”
● On twitter @BooProgrammer
● Wishes EU cookie warnings were about real cookies
● Also confused about GDPR
6. Who I think you wonderful humans are?
● Nice enough people
● Don’t mind pictures of cats
● Think the JVM is kind of cool
Lori Erickson
7. What are we going to talk about?
● The state of the open big data ecosystem
● Why general purpose matters
○ Where it doesn’t* matter**
● The different kinds of unification and why it matters
● Evolution of Spark/Flink and friends
● What BEAM aims to do
○ Where we are today
○ Where we need to be tomorrow
● Other ways the data world can be unified
● A limited number of GDPR jokes if you opt-in today
● And of course lots of wordcount examples
9. Enter: Spark
● Integrated different tools which traditionally required different systems
○ Mahout, Hive, etc. Not great at all of these, but “good enough”
● e.g. you can use the same system to do ML and SQL (see the sketch at the end of this slide)
*Often written in Python!
[Diagram: the Apache Spark stack: Spark SQL, DataFrames & Datasets; Structured Streaming; Streaming; Spark ML; MLlib; Bagel & GraphX; GraphFrames; with Scala, Java, Python & R APIs]
Paul Hudson
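As a hedged illustration of the “same system for ML and SQL” point above (not from the original deck; the column names and data are made up), a single SparkSession can run both:

from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer

spark = SparkSession.builder.appName("general-purpose").getOrCreate()
df = spark.createDataFrame([("hi boo", 1), ("boo", 0)], ["text", "label"])

# SQL on the data...
df.createOrReplaceTempView("docs")
spark.sql("SELECT label, count(*) FROM docs GROUP BY label").show()

# ...and an ML pipeline stage, without leaving the engine.
Tokenizer(inputCol="text", outputCol="words").transform(df).show()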
10. General purpose systems probably are eating the world
● Operations overhead
● Moving data from System 1 to System 2 (sqoop and friends)
○ Interchange: write everything out to disk (or HDFS/GCS/S3/etc.)
● We still have specialized tools, but they’re being built on top of general frameworks
○ e.g. see mahout on Spark
○ Less closely tied things like Hive/Pig on Spark
○ TF.Transform etc.
Tambako The Jaguar
15. & We’re getting better with cloud silos
● EMR (Elastic MapReduce)-like services on AWS, Azure & Google
○ EMR
○ Azure HDInsight
○ Google Dataproc
● You can move your Spark job super easily
● Vendors are becoming cross-platform too (see Databricks)
torne (where's my lens ca
16. We’re getting better with data silos
● Customers demand access to their data in many platforms
● Vendors seem to listen
○ Notable exceptions: formats without a vendor, or a vendor with a closed-source, locked-in platform
● e.g. Even if your data is a mixture of Parquet & BigQuery (or Redshift) you can access both
17. We’re starting to recognize these problems:
● Spark is adding more runners (see SparkR) and integrating Arrow
● Small projects (like Sparkling ML) work to expose Python libs to JVM and vice
versa
● Apache Arrow is focused on improved formats for interop
○ Included in Spark 2.3 to vastly improve Python performance (see the sketch after this list)
○ Being used by GPU vendors too!
● Apache BEAM is adding a unified runner SDK API
○ Currently each language has to support each backend
○ Common core
● Apache BEAM is attempting to unify streaming APIs
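A minimal sketch of the Arrow integration mentioned above, assuming Spark 2.3+ with pyarrow installed and an existing SparkSession `spark` (this snippet is illustrative, not from the deck):

# One config flag switches toPandas() to Arrow's columnar transfer
# between the JVM and Python.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
pdf = spark.range(1 << 20).toPandas()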
18. What’s the state of non-JVM big data?
Most of the tools are built on the JVM, so how do we play together?
● Pickling, Strings, JSON, XML, oh my!
● Unix pipes
● Sockets
What about if we don’t want to copy the data all the time?
● Or standalone “pure”* re-implementations of everything
○ Reasonable option for things like Kafka where you would have the I/O regardless.
○ Also cool projects like dask -- but hard to talk to existing ecosystem
David Brown
19. Spark in Scala, how does PySpark work?
● Py4J + pickling + JSON and magic
○ This can be kind of slow sometimes
● Distributed collections are often collections of pickled
objects
● Spark SQL (and DataFrames) avoid some of this
○ Sometimes we can make them go fast and compile them to the JVM
● Features aren’t automatically exposed, but exposing
them is normally simple.
● SparkR depends on similar magic
kristin klein
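A rough sketch (not from the deck) of the pickling path described above: a plain Python UDF is serialized, and rows are shipped to a Python worker process and back.

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("hi boo",), ("boo",)], ["text"])

# Each row is pickled, sent to a Python worker, evaluated, and the
# result pickled back to the JVM.
word_count = udf(lambda s: len(s.split()), IntegerType())
df.select(word_count("text").alias("words")).show()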
20. So what does that look like?
[Diagram: the Python driver talks to the JVM driver over Py4J; Workers 1..K exchange data with Python worker processes over pipes]
21. *For a small price of your fun libraries. Bad idea.
22. That was a bad idea, buuut…..
● Work going on in Scala land to translate simple Scala
into SQL expressions - need the Dataset API
○ Maybe we can try similar approaches with Python?
● POC use Jython for simple UDFs (e.g. 2.7 compat & no
native libraries) - SPARK-15369
○ Early benchmarking w/word count 5% slower than native Scala UDF,
close to 2x faster than regular Python
● Willing to share your Python UDFs for benchmarking? -
http://bit.ly/pySparkUDF
*The future may or may not have better performance than today. But bun-bun the bunny has some lettuce so it’s ok!
23.
24. The “future”*: faster interchange
● By future I mean availability starting in the next 3-6 months (with more
improvements after).
○ Yes much of this code exists, it just isn’t released yet so I’m sure we’ll find all sorts of bugs
and ways to improve.
○ Relatedly, you can help us test Spark 2.3 when we start the RC process, to catch bugs early!
● Unifying our cross-language experience
○ And not just “normal” languages, CUDA counts yo
Tambako The Jaguar
25. Andrew Skudder
*Arrow: likely the future. I really hope so. Spark 2.3 and beyond!
With early work happening to support GPUs / TF.
26. What does the future look like?*
*Source: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html.
*Vendor benchmark. Be careful.
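A hedged sketch of the vectorized (pandas) UDFs the benchmark refers to, assuming Spark 2.3+, pyarrow installed, and an existing SparkSession `spark`: the UDF operates on pandas Series in batches instead of one pickled row at a time.

from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import LongType

@pandas_udf(LongType())
def plus_one(v):
    # v is a pandas Series; the whole batch arrives via Arrow.
    return v + 1

spark.range(10).select(plus_one("id")).show()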
27. Arrow (a poorly drawn big data view)
Logos trademarks of their respective projects
Juha Kettunen
*ish
28. BEAM backends:
● BEAM nominally supports*
○ Dataflow
○ Flink*
○ Spark*
○ IBM Streams, etc.
● Goal of more than just lowest-common-denominator, think of it like a compiler**
*Supports as in early-stage, but we’re working on it (and we’d love your help!)
**But you know, in the same sense that I compare Spark Streaming to pandas coming down a
wooden slide.
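To make the “same pipeline, different runners” idea concrete, here is a minimal sketch using the Beam Python SDK (assumes apache-beam is installed; the runner name is the only thing that changes per backend):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Swap "DirectRunner" for "DataflowRunner", "FlinkRunner", etc. --
# how complete each runner is varies, as noted above.
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(["a wooden slide", "a slide"])
     | beam.FlatMap(lambda line: line.split())
     | beam.combiners.Count.PerElement()
     | beam.Map(print))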
30. BEAM Beyond the JVM
● This part doesn’t work outside of Google’s hosted environment yet, so I’m
going to be light on the details
● tl;dr : uses grpc / protobuf
● But exciting new plans (w/ some early code) to unify the runners and ease the
support of different languages (called SDKS)
○ See https://beam.apache.org/contribute/portability/
31. BEAM Beyond the JVM: Random collab branch
[Diagram: early (“*ish”) support across the language SDKs and runners via the portability framework]
Nick
32. What do (some of) the different APIs look like?
● Everyone's favourite: Streaming Word Count Example
● And then windowed wordcount!
● (And also a peek at TensorFlow in case anyone is trying to raise a Series A)
33. Spark wordcount (Python*)
from pyspark.sql.functions import explode, split

# Split the lines into words
words = lines.select(
    # explode turns each item in an array into a separate row
    explode(
        split(lines.value, ' ')
    ).alias('word')
)
# Generate running word count
wordCounts = words.groupBy('word').count()
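The slide assumes a streaming DataFrame called lines already exists; a minimal, hedged sketch of how it might be created and how the counts could be printed (the socket source and console sink here are just for illustration):

lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999)
         .load())

# ... build words / wordCounts as on the slide ...

query = (wordCounts.writeStream
         .outputMode("complete")
         .format("console")
         .start())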
34. Flink wordcount (Java)
DataStream<String> text = ...
DataStream<Tuple2<String, Integer>> counts =
    text.map(line -> line.toLowerCase().split("\\W+"))
        .flatMap((String[] tokens, Collector<Tuple2<String, Integer>> out) -> {
            Arrays.stream(tokens)
                  .forEach(t -> out.collect(new Tuple2<>(t, 1)));
        })
        // group by the tuple field "0" and sum up tuple field "1"
        .keyBy(0)
        .sum(1);
35. BEAM wordcount (Java)
p.apply("ExtractWords", ParDo.of(new DoFn<String, String>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        for (String word : c.element().split("\\W+")) {
            c.output(word);
        }
    }
}))
 .apply(Count.<String>perElement());
36. Go Wordcount (BEAM)
func CountWords(s beam.Scope, lines beam.PCollection) beam.PCollection {
    s = s.Scope("CountWords")
    // Convert lines of text into individual words.
    col := beam.ParDo(s, extractFn, lines)
    // Count the number of times each word occurs.
    return stats.Count(s, col)
}
Ada Doglace
37. Go Wordcount (BEAM) continued
p := beam.NewPipeline()
s := p.Root()
lines := textio.Read(s, *input)
counted := CountWords(s, lines)
formatted := beam.ParDo(s, formatFn, counted)
textio.Write(s, *output, formatted)
39. Flink windowed wordcount (scala)
val counts: DataStream[(String, Int)] = text
  // split up the lines into pairs (2-tuples) containing: (word, 1)
  .flatMap(_.toLowerCase.split("\\W+"))
  .map((_, 1))
  .keyBy(0)
  // create windows of windowSize records, sliding every slideSize records
  .countWindow(windowSize, slideSize)
  // sum up tuple field "1"
  .sum(1)
40. Multi-language pipelines?!?
● Totally! I mean kind-of. Spark is ok right?
○ Beam is working on it :) #comejoinus
○ Easier for Python calling Java
○ If SQL counts for sure. If not, yes*.
● Technically what happens when you use Scala Spark + Tensorflow
● And same with Beam, etc.
● Right now painful to write in practice in OSS land outside of libraries
(commercial vendors have some solutions but I stick to OSS when I can)
● So let’s focus on libraries (and look ahead at more general approaches)
Jennifer C.
41. A proof of concept: Sparkling ML
● A place for useful Spark ML pipeline stages to live
○ Including both feature transformers and estimators
● The why
○ Spark ML can’t keep up with every new algorithm
○ Spark ML is largely Scala based, cool Python tools for NLP exist as well
● Lots of cool ML on Spark tools exist, but many don’t play nice with Spark ML
or together.
● We make it easier to expose Python transformers into Scala land and vice
versa.
● Our repo is at: https://github.com/sparklingpandas/sparklingml
42. Supporting non-English languages
With spaCy, so you can handle more than English*
# SpacyMagic, lang, fields, and lookup_field_or_none are defined in the
# surrounding SparklingML code (see the repo linked on the previous slide).
def inner(inputString):
    nlp = SpacyMagic.get(lang)

    def spacyTokenToDict(token):
        """Convert the input token into a dictionary"""
        return dict(map(lookup_field_or_none, fields))

    return list(map(spacyTokenToDict, list(nlp(inputString))))
43. And from the JVM:
val transformer = new SpacyTokenizePython()
transformer.setLang("en")
val input = spark.createDataset(
List(InputData("hi boo"), InputData("boo")))
transformer.setInputCol("input")
transformer.setOutputCol("output")
val result = transformer.transform(input).collect()
Alexy Khrabrov
44. How might this all play out?
● A portability layer like Beam could save us
○ And commoditize the execution engine layer
● One execution engine becomes good enough at everything
● Instead of a compiler-like unifier, we see something like streaming SQL become our unifier
○ Personally you can pry my functions out of my cold dead hands but that’s all right
● People realize their big data problem is actually three small data problems in a trench coat
Shawn Carpenter
45. What does this mean for special systems?
● More special systems will run on top of the same general purpose engine
○ Important to work with general purpose engines to make sure they will serve future specialized needs.
● Think about inputs and outputs and how to share them nicely and efficiently
with other systems (please not just files)
● Some of them will become more like libraries
○ This is perhaps “less cool” than being a standalone system, but maybe better
● More collaboration hopefully
46. What does “sharing” look like?
val sparkSql = ...
// Start hive on the engine
HiveThriftServer2.startWithContext(sparkSql)
// Start snappy data
val snappy = new org.apache.spark.sql.SnappySession(spark.sparkContext)
// Do cool things with both
...
47. Want to help?
● Lots of places to contribute
● We’d love help on Spark and/or BEAM
○ Improved execution engine support in BEAM
○ More data sources & specialized format support (everywhere)
○ Better Python & non-JVM language support
● I’m sure Flink would too, but I’m less involved with them
48. [Book covers: Learning Spark; Fast Data Processing with Spark (out of date); Fast Data Processing with Spark (2nd edition); Advanced Analytics with Spark; Learning PySpark; High Performance Spark; coming soon: Spark in Action]
49. High Performance Spark!
Available today!
You can buy it from that scrappy Seattle bookstore; Jeff Bezos needs another newspaper and I want a cup of coffee.
http://bit.ly/hkHighPerfSpark
50. And some upcoming talks:
● June
○ Spark Summit SF (Accelerating Tensorflow & Accelerating Python +
Dependencies)
○ Scala Days NYC
○ FOSS Back Stage & BBuzz (Berlin)
● July
○ Curry On Amsterdam
○ OSCON Portland
● September
○ Strata NYC
● October
○ Reversim (Tel Aviv)
51. That’s all Folks!
Happy GDPR Friday! May your last-minute deploys not break too much!
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
Pssst: Have feedback on the presentation? Give me a
shout (holden@pigscanfly.ca) if you feel comfortable doing
so :)