With the new Apache Arrow integration in PySpark 2.3, it is now starting to become reasonable to look to the Python world and ask “what else do we want to steal besides TensorFlow?”, or, as a Python developer, to look over and say “how can I get my code into production without it being rewritten into a mess of Java?”
Regardless of your specific side(s) in the JVM/Python divide, collaboration is getting a lot faster, so let’s learn how to share! In this brief talk we will examine sharing some of the wonders of spaCy with the Java world, which still has a somewhat lackluster set of options for NLP.
2. Holden:
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google
● Apache Spark PMC :)
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & High Performance Spark
● @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Github https://github.com/holdenk
● Spark Videos http://bit.ly/holdenSparkVideos
● Talk feedback: http://bit.ly/holdenTalkFeedback
3.
4. Who I think you wonderful humans are?
● Nice enough people
● I’m sure you love pictures of cats
● Possibly know some Apache Spark
● Interested in stealing from Python (or getting your Python code into
production faster)
Lori Erickson
5. What will be covered?
● A quick look at the current state of PySpark
● Looking at how to reverse this
● Using Arrow for fast Python UDFS with Spark
● Reversing this again
● Beam Outside the JVM
● Our even less subtle attempts to get you to buy my new book
● Pictures of cats & stuffed animals
● tl;dr - Java has poor NLP and limited DL options, but it doesn’t matter: we can steal them from Python
Photo by Dean Wampler
6. What’s the state of non-JVM big data?
Most of the tools are built in the JVM, so how do we play together?
● Pickling, Strings, JSON, XML, oh my! (tiny sketch below)
● Unix pipes
● Sockets
What if we don’t want to copy the data all the time? DataFrame API + Arrow
● Or standalone “pure”* re-implementations of everything
○ Reasonable option for things like Kafka where you would have the I/O regardless.
○ Also cool projects like dask (pure python) -- but hard to talk to existing ecosystem
David Brown
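A toy sketch of the “serialize everything and ship it” style from the list above (hypothetical framing, not any particular system’s wire protocol): pickle some records in Python and write them length-prefixed to stdout so whatever launched us (say, a JVM) can read them back.

import pickle
import struct
import sys

def write_records(records, out=sys.stdout.buffer):
    # Pickle the batch and length-prefix it so the reader knows how many bytes to expect.
    payload = pickle.dumps(records)
    out.write(struct.pack(">i", len(payload)))
    out.write(payload)
    out.flush()

write_records([{"word": "boo", "count": 1}])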
7. PySpark:
● The Python interface to Spark
● Fairly mature, integrates well-ish into the ecosystem, but less of a Pythonrific API
● Has some serious performance hurdles from the design
● Same general technique used as the basis for the other non-JVM implementations in Spark
○ C#
○ R
○ Julia
○ Javascript - surprisingly different
9. Spark in Scala, how does PySpark work?
● Py4J + pickling + JSON and magic
○ Py4j in the driver
○ Pipes to start the Python process from Java exec
○ cloudPickle to serialize functions/data between the JVM and Python executors (transmitted via sockets) - toy sketch below
○ JSON for the DataFrame schema
● Data from the Spark worker is serialized and piped to the Python worker --> then piped back to the JVM
○ Multiple iterator-to-iterator transformations are still pipelined :)
○ So serialization happens only once per stage
● Spark SQL (and DataFrames) avoid some of this
kristin klein
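To get a feel for the function-shipping part (a toy sketch, not Spark’s actual worker protocol): cloudpickle turns a closure into bytes that another Python process can load and call.

import pickle
import cloudpickle

def make_adder(n):
    return lambda x: x + n

# Driver side: serialize the closure (including its captured n).
payload = cloudpickle.dumps(make_adder(10))
# Worker side: plain pickle can load what cloudpickle wrote.
func = pickle.loads(payload)
print(func(32))  # 42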
10. So what does that look like?
[Diagram: Driver (py4j) connected to Worker 1 … Worker K, each piping to a Python process]
11. Ok so how do we use this from the JVM?
● Dirty dirty tricks
● Launch Python from the JVM
● Instead of launching context.py launch our own special
entry point (startup.py - I’m bad at names)
● Implement an interface matching a Python class we can
call to register Python functions by string names
○ Optional: implement Scala classes for each of the Python classes. But
that sounded like more work, so uhhh PRs welcome?
● Run it! Curse. Debug. Run it!
12. So what does that look like?
[Diagram: Py4J with Spark & a friend: startup.py, talking over shell/pipes]
Photo by Dean Wampler
13. So what does that look like?
[Diagram: request a UDF → register a UDF]
Photo by Dean Wampler
14. So what does that look like?
[Diagram: Driver connected to Worker 1 … Worker K, each piping to a Python process]
15. What is/why Sparkling ML
● A place for useful Spark ML pipeline stages to live
○ Including both feature transformers and estimators
● The why: Spark ML can’t keep up with every new algorithm
● Lots of cool ML on Spark tools exist, but many don’t play nice with Spark ML
or together.
● We make it easier to expose Python transformers into Scala land and vice
versa.
● Our repo is at: https://github.com/sparklingpandas/sparklingml
16. So what goes in startup.py?
● A class for our Java code to call with parameters &
request functions
● Code to take the Python UDFs and construct/return the underlying Java UDFs
● A main function to start up the Py4J gateway & Spark context so our functions get serialized in the way that is expected (hedged sketch after the code slides)
● Pretty much it’s just boilerplate but you can take a look if
you want.
Jennifer C.
17. So what goes in startup.py?
class PythonRegistrationProvider(object):
    class Java:
        package = "com.sparklingpandas.sparklingml.util.python"
        className = "PythonRegisterationProvider"
        implements = [package + "." + className]
Jennifer C.
18. So what goes in startup.py?
def registerFunction(self, ssc, jsession, function_name, params):
    setup_spark_context_if_needed()
    if function_name in functions_info:
        function_info = functions_info[function_name]
        evaledParams = ast.literal_eval(params)
        func = function_info.func(*evaledParams)
        # ret_type comes from the registered function's metadata (elided on the slide)
        udf = UserDefinedFunction(func, ret_type,
                                  make_registration_name())
        return udf._judf
    else:
        print("Could not find function")
Jennifer C.
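The slides elide the main function that starts the Py4J gateway; here is a hedged sketch of what that piece could look like (the real startup.py differs in details), assuming the PythonRegistrationProvider class from the previous slide:

import threading
from py4j.clientserver import ClientServer, JavaParameters, PythonParameters

def main():
    # Expose our provider as the Python entry point so the JVM can call registerFunction.
    provider = PythonRegistrationProvider()
    gateway = ClientServer(
        java_parameters=JavaParameters(),
        python_parameters=PythonParameters(),
        python_server_entry_point=provider)
    # Block forever; the JVM drives us through callbacks.
    threading.Event().wait()

if __name__ == "__main__":
    main()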
19. What’s the boilerplate in Java?
● Call Python
● A trait representing the Python entry point
● Wrapping the UDFs in Spark ML stages (optional buuut nice?)
● Also kind of boring, it’s in a few files if you want to look.
20. Enough boilerplate: counting words!
With spaCy, so you know more than English*
def inner(inputString):
    nlp = SpacyMagic.get(lang)
    def spacyTokenToDict(token):
        """Convert the input token into a dictionary"""
        return dict(map(lookup_field_or_none, fields))
    return list(map(spacyTokenToDict, list(nlp(inputString))))
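SpacyMagic isn’t shown on the slide; a minimal sketch of what a helper like that could do (an assumption, not the actual sparklingml code) is cache one spaCy pipeline per language so each executor only loads the model once:

import spacy

class SpacyMagic(object):
    _nlps = {}

    @classmethod
    def get(cls, lang):
        # Load the language model once and reuse it on later calls.
        if lang not in cls._nlps:
            cls._nlps[lang] = spacy.load(lang)
        return cls._nlps[lang]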
21. And from the JVM:
val transformer = new SpacyTokenizePython()
transformer.setLang("en")
val input = spark.createDataset(
List(InputData("hi boo"), InputData("boo")))
transformer.setInputCol("input")
transformer.setOutputCol("output")
val result = transformer.transform(input).collect()
Alexy Khrabrov
22. Ok but now it’s kind of slow….
● Well yeah
● Think back to that architecture diagram
● It’s not exactly a fast design
● We could try Jython?
23. *For a small price of your fun libraries. Bad idea.
24. That was a bad idea, buuut…..
● Work going on in Scala land to translate simple Scala
into SQL expressions - need the Dataset API
○ Maybe we can try similar approaches with Python?
● POC use Jython for simple UDFs (e.g. 2.7 compat & no
native libraries) - SPARK-15369
○ Early benchmarking w/word count 5% slower than native Scala UDF,
close to 2x faster than regular Python
● Willing to share your Python UDFs for benchmarking? -
http://bit.ly/pySparkUDF
*The future may or may not have better performance than today. But bun-bun the bunny has some lettuce so it’s ok!
27. What does the future look like?*
*Source: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html.
*Vendor benchmark. Trust but verify.
28. What does the future look like - in code
@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def normalize(pdf):
v = pdf.v
return pdf.assign(v=(v - v.mean()) / v.std())
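To actually run it you group and apply, as in the Databricks post linked later; a small usage sketch with toy data:

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0)], ("id", "v"))
# Each group's pandas DataFrame is handed to normalize and the results are reassembled.
df.groupby("id").apply(normalize).show()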
29. And we can share this with Java...:
With NLTK now! Sentiment is all the rage.
def inner(input_series):
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    sid = SentimentIntensityAnalyzer()
    result = input_series.apply(
        lambda sentence: sid.polarity_scores(sentence)['pos'])
    return result
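One plausible way to hand this to Spark yourself (not necessarily how sparklingml wires it up) is as a scalar pandas UDF:

from pyspark.sql.functions import PandasUDFType, pandas_udf

# inner takes and returns a pandas Series, which is exactly the scalar pandas UDF contract.
sentiment = pandas_udf(inner, "double", PandasUDFType.SCALAR)
# e.g. df.withColumn("positive", sentiment(df["sentence"]))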
30. And from the JVM:
val transformer = new NltkPosPython()
val input = spark.createDataset(
List(InputData("Boo is happy"), InputData("Boo is sad")))
transformer.setInputCol("input")
transformer.setOutputCol("output")
val result = transformer.transform(input).collect()
result.size shouldBe 2
result(0)(0) shouldBe "Boo is happy"
result(0)(1) shouldBe 0.649
Alexy Khrabrov
31. Everyone loves wordcount right?
With spaCy now! Non-English language support!
def inner(inputSeries):
    """Tokenize the input series using spaCy for the provided language."""
    nlp = SpacyMagic.get(lang)
    def tokenizeElem(elem):
        return list(map(lambda token: token.text,
                        list(nlp(unicode(elem)))))
    return inputSeries.apply(tokenizeElem)
32. BEAM Beyond the JVM
● Non JVM BEAM doesn’t work outside of Google’s environment yet, so I’m
going to skip the details.
● tl;dr : uses grpc / protobuf
● But exciting new plans to unify the runners and ease the support of different languages (called SDKs)
○ See https://beam.apache.org/contribute/portability/
● If this is exciting, you can come join me on making BEAM work in Python3
○ Yes we still don’t have that :(
○ But we're getting closer!
33. Why now?
● There have been better formats/options for a long time
● JVM devs want to use libraries in other languages with lots of data
○ e.g. startup + Deep Learning + ? => profit
● Arrow has solved the chicken-egg problem by building not just the chicken & the egg, but also a hen house (tiny sketch below)
Andrew Mager
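For the curious, the Arrow side really is just a library call; a tiny standalone sketch of putting a pandas DataFrame into Arrow’s columnar format:

import pandas as pd
import pyarrow as pa

pdf = pd.DataFrame({"word": ["boo", "bun-bun"], "count": [1, 2]})
# Convert to an Arrow table: the columnar form that can be handed across languages cheaply.
table = pa.Table.from_pandas(pdf)
print(table.schema)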
34. References
● Live Streaming of Working with Spark + Arrow: https://www.youtube.com/watch?v=EPvd5BhhevM&list=PLRLebp9QyZtYF46jlSnIu2x1NDBkKa2uw&index=5
● Sparkling ML: https://github.com/sparklingpandas/sparklingml
● Apache Arrow: https://arrow.apache.org/
● Brian (IBM) on initial Spark + Arrow
https://arrow.apache.org/blog/2017/07/26/spark-arrow/
● Li Jin (Two Sigma) https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
● Bill Maimone
https://blogs.nvidia.com/blog/2017/06/27/gpu-computation-visualization/
35. [Book covers]
● Learning Spark
● Fast Data Processing with Spark (Out of Date)
● Fast Data Processing with Spark (2nd edition)
● Advanced Analytics with Spark
● Spark in Action
● High Performance Spark
● Learning PySpark
36. High Performance Spark!
You can buy it today!
Only one chapter on non-JVM stuff, I’m sorry.
Cats love it*
*Or at least the box it comes in. If buying for a cat, get print
rather than e-book.
37. k thnx bye :)
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
I need to give a testing talk in a few
months, help a “friend” out.
Will tweet results
“eventually” @holdenkarau
Do you want more realistic
benchmarks? Share your UDFs!
http://bit.ly/pySparkUDF
It’s performance review season, so help a friend out and
fill out this survey with your talk feedback
http://bit.ly/holdenTalkFeedback
38. Beyond wordcount: dependencies?
● Your machines probably already have pandas
○ But maybe an old version
● But they might not have “special_business_logic”
○ Very special business logic, no one wants to change Fortran code*.
● Option 1: Talk to your vendor**
● Option 2: Try some sketchy open source software from
a hack day
● We’re going to focus on option 2!
*Because it’s perfect, it is Fortran after all.
** I don’t like this option because the vendor I work for doesn’t have an answer.
39. coffee_boat to the rescue*
# This is beta, be careful. It may screw up your venv
!pip install --upgrade coffee_boat
# Use the coffee boat
from coffee_boat import Captain
captain = Captain(accept_conda_license=True)
captain.add_pip_packages("pyarrow", "edtf")
captain.launch_ship()
sc = SparkContext(master="yarn")
# You can now use pyarrow & edtf
captain.add_pip_packages("yourmagic")
# You can now use your magic in transformations!
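A quick smoke test of the shipped environment (assumes the SparkContext from the snippet above): each task imports pyarrow on the worker side.

print(sc.parallelize(range(3), 3)
      .map(lambda x: __import__("pyarrow").__version__)
      .collect())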