SlideShare a Scribd company logo
1 of 48
Powering TensorFlow with
big data
With Apache Beam, Flink & Spark bonus
@holdenkarau
Holden:
● My name is Holden Karau
● Prefered pronouns are she/her
● Developer Advocate at Google
● Apache Spark PMC, Beam contributor
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & High Performance Spark
● Twitter: @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Code review livestreams: https://www.twitch.tv/holdenkarau /
https://www.youtube.com/user/holdenkarau
● Spark Talk Videos http://bit.ly/holdenSparkVideos
Who I think you wonderful humans are?
● Nice enough people
● Don’t mind pictures of cats
● Maybe somewhat familiar with Tensorflor?
● Maybe somewhat familiar with Beam or Spark or Flink?
Lori Erickson
What we did get to:
● TensorFlowOnSpark w/basic Apache Arrow
● TF Transform demo on Apache Flink via Apache Beam
● Python & Go on Beam on Flink prototype
○ Everyone loves wordcount right? right?....
● New Beam architecture allowing for better portability &
handling dependencies (like Tensorflow)
● DO NOT non-JVM BEAM on Flink IN PRODUCTION
Vladimir Pustovit
DO NOT USE THIS* IN PRODUCTION
TODAY
● I’m serious, I don’t want to die or cause the next
financial meltdown with software I’m a part of
● By Today I mean July 18th 2018, but it’s probably going
to not be great for at least a “little while”
*Portable Beam on Flink (including Python, so TFT)
Vladimir Pustovit
PROTambako The Jaguar
What will be covered?
Most likely:
● Where we are today in non-JVM support in Beam
● And why this matters for Tensorflow
● What the rest of Big Data ecosystem looks like going outside the JVM
● A partial TF Transform demo + TFMA links + possible system crash
If there is time (e.g. demo runs quickly, wifi works, several other happy unlikely
things):
● TensorFlowOnSpark
● Apache Arrow - How this changes “everything”*
So why do I need to power DL w/Big Data?
● Deep learning is most effective with large sample sets for training
● As much as some may say that no feature prep is required even if you’re
looking at mnist.csv you probably have _some_ feature prep
● Since we need big data for training we need to to do our feature prep it
● Even if your just trying to raise some VC money it's going to go a lot better if
you add some keywords about a large proprietary dataset
TensorFlow isn’t enough on its own
● Enter TFX & friends like Kubeflow
○ Current related TFX OSS components: TF.Transform TF.Serving (with
more coming)
● Alternatives: piles of custom code re-created at serving time.
○ Yay job security?
PROJennifer C.
How do I do feature prep? (old skool)
● Write custom preparation jobs in your favourite big data tool (mine is Apache
Spark, but maybe yours is something else)
● Run it, train on the prepared data
● Rewrite your feature prep code to run at serving time
○ Error prone and sad
Enter: TF.Transform
● For pre-processing of your data
○ e.g. where you spend 90% of your dev time anyways
● Integrates into serving time :D
● OSS
● Written in Python
● Runs on top of Apache Beam, but current release not yet support Python
outside of GCP
○ On master this can run on Flink, but has bugs currently.
○ Please don’t use this in production today unless your on
GCP/Dataflow
PROKathryn Yengel
Defining a Transform processing function
def preprocessing_fn(inputs):
x = inputs['x']
y = inputs['y']
s = inputs['s']
x_centered = x - tft.mean(x)
y_normalized = tft.scale_to_0_1(y)
s_int = tft.string_to_int(s)
return { 'x_centered': x_centered,
'y_normalized': y_normalized, 's_int': s_int}
mean stddev
normalize
multiply
quantiles
bucketize
Analyzers
Reduce (full pass)
Implemented as a distributed
data pipeline
Transforms
Instance-to-instance (don’t
change batch dimension)
Pure TensorFlow
Analyze
normalize
multiply
bucketize
constant
tensors
data
mean stddev
normalize
multiply
quantiles
bucketize
Scale to ... Bag of Words / N-Grams
Bucketization Feature Crosses
tft.ngrams
tft.string_to_int
tf.string_split
tft.scale_to_z_score
tft.apply_buckets
tft.quantiles
tft.string_to_int
tf.string_join
...
Some common use-cases...
BEAM Beyond the JVM: Current release
● Non JVM BEAM doesn’t work outside of Google’s environment yet
● tl;dr : uses grpc / protobuf
○ Similar to the common design but with more efficient representations (often)
● But exciting new plans to unify the runners and ease the support of different
languages (called SDKS)
○ See https://beam.apache.org/contribute/portability/
● If this is exciting, you can come join me on making BEAM work in Python3
○ Yes we still don’t have that :(
○ But we're getting closer & you can come join us on BEAM-2874 :D
Emma
BEAM Beyond the JVM: Master + Experiments
● Common interface for setting up jobs
● Portability framework allows SDK harnesses in arbitrary to be kicked off
● Runners ship in their own docker containers (goodbye dependency hell, hello
container hell)
○ Also for now rolling containers leaves something to be desired (e.g. edit docker file by hand)
● Hacked up Python SDK to sort of talk to the new interface
● Go SDK talks to the new interface, still missing some features
● Need permissions? Run on GKE or plumb through permissions file :(
Nick
BEAM Beyond the JVM: Master w/ experiments
*ish
*ish
*ish
Nick
portability
*ish
So what does that look like?
Driver
Worker 1
Docker
grpc
Worker K
Docker
grpc
So how TF does this relate to TF?
● Tensorflow is in Python (kind of)
● Once we finish the Python SDK on Beam on Flink adventure you can use all
sorts of cool libraries (like TFT/TFX) to do your tensorflow work
○ You can use them today too if your use case is on Dataflow
○ If you don’t mind bugs you can experiment with them on Flink too
● You will be able manage your dependencies
● You will be able to (in theory) re-use dataprep code at serving time
○ 80% less copy n’ paste code with slight mistakes that get out of date!**
● No that doesn’t work today
● Or tomorrow
● But… eventually
○ Standard OSS excuse “patches welcome” (sort of if you can find the branch :p)
**Not a guarantee, see your vendor for details.
Ooor from the chicago taxi data...
for key in taxi.DENSE_FLOAT_FEATURE_KEYS:
# Preserve this feature as a dense float, setting nan's to
the mean.
outputs[key] = transform.scale_to_z_score(inputs[key])
for key in taxi.VOCAB_FEATURE_KEYS:
# Build a vocabulary for this feature.
outputs[key] = transform.string_to_int(
inputs[key], top_k=taxi.VOCAB_SIZE,
num_oov_buckets=taxi.OOV_SIZE)
for key in taxi.BUCKET_FEATURE_KEYS:
outputs[key] = transform.bucketize(inputs[key],
Beam Demo Time!!!!!
● TFT!!!!!! So amazing!!!!!!!!!
○ Want to move to SF and raise a series A? Pay attention :p
● Based on my testing at noon there is a 50% chance it will finish without
crashing
● That’s your friendly reminder not to run any of this in production (yet)
● The demo code (holdenk/model-analysis) is forked from axelmagn/model-
analysis which is forked from tensorflow/model-analysis and runs on master
of Apache Beam w/Flink
BEAM Beyond the JVM: The “future”
E.g. not now
*ish
*ish
*ish
Nick
portability
*ish
*ish
Ok now what?
● Integrate this into your model serving pipeline of choice
○ Don’t have one or open to change? Checkout TFMA which can directly serve it
● There’s a guide (it doesn’t show Flink because not released yet) but steps are
similar
○ But you’re not using this in production today anyways?
○ Right?
Nick Perla
(Optional) Second Beam Demo Time!!!!!
● Word count!!!!!! So amazing!!!!!!!!!
● Based on my testing on Saturday there is a 2 in 3 chance this will hard lock
my computer
● That’s your friendly reminder not to run any of this in production
● Demo shell script of fun (go only) & python + go
What do the rest of the systems do?
● Spoiler: mostly it’s not better
○ Although it tends to be more finished
○ Sometimes it's different
● Different tradeoffs, maybe better for your use case but all tradeoffs
Kate Neilan
A quick detour into PySpark’s internals
+ + JSON
TimOve
PySpark
● The Python interface to Spark
● Same general technique used as the bases for the C#, R, Julia, etc.
interfaces to Spark
● Fairly mature, integrates well-ish into the ecosystem, less a Pythonrific API
● Has some serious performance hurdles from the design
So what does that look like?
Driver
py4j
Worker 1
Worker K
pipe
pipe
And in flink….
Driver
custom
Worker 1
Worker K
mmap
mmap
So how does that impact Py[X]
forall X in {Big Data}-{Native Python Big Data}
● Double serialization cost makes everything more
expensive
● Python worker startup takes a bit of extra time
● Python memory isn’t controlled by the JVM - easy to go
over container limits if deploying on YARN or similar
● Error messages make ~0 sense
● Dependency management makes limited sense
● features aren’t automatically exposed, but exposing
them is normally simple
TensorFlowOnSpark, everyone loves mnist!
cluster = TFCluster.run(sc, mnist_dist_dataset.map_fun, args,
args.cluster_size, num_ps, args.tensorboard,
TFCluster.InputMode.SPARK)
if args.mode == "train":
cluster.train(dataRDD, args.epochs)
Lida
The “future”*: faster interchange
● By future I mean availability today but running it in production is “adventurous”
● Unifying our cross-language experience
○ And not just “normal” languages, CUDA counts yo
Tambako The Jaguar
Andrew Skudder
*Arrow: Spark 2.3 and beyond & GPUs & R & Python & ….
* *
What does the future look like?*
*Source: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html.
*Vendor
benchmark.
Trust but verify.
Arrow (a poorly drawn big data view)
Logos trademarks of their respective projects
Juha Kettunen
*ish
Rewriting your code because why not
spark.catalog.registerFunction(
"add", lambda x, y: x + y, IntegerType())
=>
add = pandas_udf(lambda x, y: x + y, IntegerType())
Jennifer C.
And we can do this in TFOnSpark*:
unionRDD.foreachPartition(TFSparkNode.train(self.cluster_info,
self.cluster_meta, qname))
Will Transform Into something magical (aka fast but unreliable)
on the next slide!
Delaina Haslam
Which becomes
train_func = TFSparkNode.train(self.cluster_info,
self.cluster_meta, qname)
@pandas_udf("int")
def do_train(inputSeries1, inputSeries2):
# Sad hack for now
modified_series = map(lambda x: (x[0], x[1]),
zip(inputSeries1, inputSeries2))
train_func(modified_series)
return pandas.Series([0] * len(inputSeries1))
ljmacphee
And this now looks like:
Logos trademarks of their respective projects
Juha Kettunen
*ish
TFOnSpark Possible vNext+1?
● Avoid funneling the data through Python native types
○ For now the Spark Arrow UDFS aren’t perfect for this
○ But we can (and are) improving them
● mmapped Arrow?
● Skip Python on the workers handling data entirely (idk I’m lazy so probably
not)
Renars
References
● TFMA + TFT example guide -
https://www.tensorflow.org/tfx/model_analysis/examples/chicago_taxi
● Apache Beam github repo (w/early alpha portable Flink support)-
https://beam.apache.org/
● TFMA Example fork for use w/Beam on Flink -
● TensorFlowOnSpark -https://github.com/yahoo/TensorFlowOnSpark
● Spark Deep Learning Pipelines - https://github.com/databricks/spark-deep-
learning
● flink-tensorflow - https://github.com/FlinkML/flink-tensorflow
● TF.Transform - https://github.com/tensorflow/transform
● Beam portability design: https://beam.apache.org/contribute/portability/
● Beam on Flink + portability https://issues.apache.org/jira/browse/BEAM-2889
PROR. Crap Mariner
And some upcoming talks:
● August
○ Tentative Ottawa meetup!
○ JupyterCon NYC
● September
○ Strata NYC
○ Strangeloop STL
● October
○ Spark Summit London
○ Reversim Tel Aviv
● November
○ Big Data Spain
k thnx bye :)
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
I need to give a testing talk next
month, help a “friend” out.
Will tweet results
“eventually” @holdenkarau
Do you want more realistic
benchmarks? Share your UDFs!
http://bit.ly/pySparkUDF
Pssst: Have feedback on the presentation? Give me a
shout (holden@pigscanfly.ca) if you feel comfortable doing
so :)
Give feedback on this presentation
http://bit.ly/holdenTalkFeedback
What’s the rest of big data outside the JVM
look like?
Most of the tools are built in the JVM, so how do we play together?
● Pickling, Strings, JSON, XML, oh my!
● Unix pipes
● Sockets
What about if we don’t want to copy the data all the time?
● Or standalone “pure”* re-implementations of everything
○ Reasonable option for things like Kafka where you would have the I/O regardless.
○ Also cool projects like dask (pure python) -- but hard to talk to existing ecosystem
David Brown
Hadoop “streaming” (Python/R)
● Unix pipes!
● Involves a data copy, formats get sad
● But the overhead of a Map/Reduce task is pretty high anyways...
Lisa Larsson
Kafka: re-implement all the things
● Multiple options for connecting to Kafka from outside of the JVM (yay!)
● They implement the protocol to talk to Kafka (yay!)
● This involves duplicated client work, and sometimes the clients can be slow
(solution, FFI bindings to C instead of Java)
● Buuuut -- we can’t access all of the cool Kafka business (like Kafka Streams)
and features depend on client libraries implementing them (easy to slip below
parity)
Smokey Combs
Dask: a new beginning?
● Pure* python implementation
● Provides real enough DataFrame interface for distributed data
● Also your standard-ish distributed collections
● Multiple backends
● Primary challenge: interacting with the rest of the big data ecosystem
○ Arrow & friends might make this better with time too, buuut….
● See https://dask.pydata.org/en/latest/ &
http://dask.pydata.org/en/latest/spark.html
Lisa Zins

More Related Content

What's hot

Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Chris Fregly
 
Big data beyond the JVM - DDTX 2018
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
Big data beyond the JVM - DDTX 2018Holden Karau
 
Tensorflow on Android
Tensorflow on AndroidTensorflow on Android
Tensorflow on AndroidKoan-Sin Tan
 
MobileConf 2021 Slides: Let's build macOS CLI Utilities using Swift
MobileConf 2021 Slides:  Let's build macOS CLI Utilities using SwiftMobileConf 2021 Slides:  Let's build macOS CLI Utilities using Swift
MobileConf 2021 Slides: Let's build macOS CLI Utilities using SwiftDiego Freniche Brito
 
Functional Programming for Busy Object Oriented Programmers
Functional Programming for Busy Object Oriented ProgrammersFunctional Programming for Busy Object Oriented Programmers
Functional Programming for Busy Object Oriented ProgrammersDiego Freniche Brito
 
Using Spark ML on Spark Errors - What do the clusters tell us?
Using Spark ML on Spark Errors - What do the clusters tell us?Using Spark ML on Spark Errors - What do the clusters tell us?
Using Spark ML on Spark Errors - What do the clusters tell us?Holden Karau
 
A peek into Python's Metaclass and Bytecode from a Smalltalk User
A peek into Python's Metaclass and Bytecode from a Smalltalk UserA peek into Python's Metaclass and Bytecode from a Smalltalk User
A peek into Python's Metaclass and Bytecode from a Smalltalk UserKoan-Sin Tan
 
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...Databricks
 
Puppetizing Your Organization
Puppetizing Your OrganizationPuppetizing Your Organization
Puppetizing Your OrganizationRobert Nelson
 
Dark Silicon, Mobile Devices, and Possible Open-Source Solutions
Dark Silicon, Mobile Devices, and Possible Open-Source SolutionsDark Silicon, Mobile Devices, and Possible Open-Source Solutions
Dark Silicon, Mobile Devices, and Possible Open-Source SolutionsKoan-Sin Tan
 
Devel::NYTProf v3 - 200908 (OUTDATED, see 201008)
Devel::NYTProf v3 - 200908 (OUTDATED, see 201008)Devel::NYTProf v3 - 200908 (OUTDATED, see 201008)
Devel::NYTProf v3 - 200908 (OUTDATED, see 201008)Tim Bunce
 
From Python to smartphones: neural nets @ Saint-Gobain, François Sausset
From Python to smartphones: neural nets @ Saint-Gobain, François SaussetFrom Python to smartphones: neural nets @ Saint-Gobain, François Sausset
From Python to smartphones: neural nets @ Saint-Gobain, François SaussetPôle Systematic Paris-Region
 
Designing and coding for cloud-native applications using Python, Harjinder Mi...
Designing and coding for cloud-native applications using Python, Harjinder Mi...Designing and coding for cloud-native applications using Python, Harjinder Mi...
Designing and coding for cloud-native applications using Python, Harjinder Mi...Pôle Systematic Paris-Region
 
Golang 101
Golang 101Golang 101
Golang 101宇 傅
 
Golang workshop - Mindbowser
Golang workshop - MindbowserGolang workshop - Mindbowser
Golang workshop - MindbowserMindbowser Inc
 
TensorFlow meetup: Keras - Pytorch - TensorFlow.js
TensorFlow meetup: Keras - Pytorch - TensorFlow.jsTensorFlow meetup: Keras - Pytorch - TensorFlow.js
TensorFlow meetup: Keras - Pytorch - TensorFlow.jsStijn Decubber
 

What's hot (20)

Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
 
Big data beyond the JVM - DDTX 2018
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
Big data beyond the JVM - DDTX 2018
 
Tensorflow on Android
Tensorflow on AndroidTensorflow on Android
Tensorflow on Android
 
MobileConf 2021 Slides: Let's build macOS CLI Utilities using Swift
MobileConf 2021 Slides:  Let's build macOS CLI Utilities using SwiftMobileConf 2021 Slides:  Let's build macOS CLI Utilities using Swift
MobileConf 2021 Slides: Let's build macOS CLI Utilities using Swift
 
Functional Programming for Busy Object Oriented Programmers
Functional Programming for Busy Object Oriented ProgrammersFunctional Programming for Busy Object Oriented Programmers
Functional Programming for Busy Object Oriented Programmers
 
Using Spark ML on Spark Errors - What do the clusters tell us?
Using Spark ML on Spark Errors - What do the clusters tell us?Using Spark ML on Spark Errors - What do the clusters tell us?
Using Spark ML on Spark Errors - What do the clusters tell us?
 
A peek into Python's Metaclass and Bytecode from a Smalltalk User
A peek into Python's Metaclass and Bytecode from a Smalltalk UserA peek into Python's Metaclass and Bytecode from a Smalltalk User
A peek into Python's Metaclass and Bytecode from a Smalltalk User
 
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
 
Puppetizing Your Organization
Puppetizing Your OrganizationPuppetizing Your Organization
Puppetizing Your Organization
 
Dark Silicon, Mobile Devices, and Possible Open-Source Solutions
Dark Silicon, Mobile Devices, and Possible Open-Source SolutionsDark Silicon, Mobile Devices, and Possible Open-Source Solutions
Dark Silicon, Mobile Devices, and Possible Open-Source Solutions
 
Devel::NYTProf v3 - 200908 (OUTDATED, see 201008)
Devel::NYTProf v3 - 200908 (OUTDATED, see 201008)Devel::NYTProf v3 - 200908 (OUTDATED, see 201008)
Devel::NYTProf v3 - 200908 (OUTDATED, see 201008)
 
From Python to smartphones: neural nets @ Saint-Gobain, François Sausset
From Python to smartphones: neural nets @ Saint-Gobain, François SaussetFrom Python to smartphones: neural nets @ Saint-Gobain, François Sausset
From Python to smartphones: neural nets @ Saint-Gobain, François Sausset
 
Designing and coding for cloud-native applications using Python, Harjinder Mi...
Designing and coding for cloud-native applications using Python, Harjinder Mi...Designing and coding for cloud-native applications using Python, Harjinder Mi...
Designing and coding for cloud-native applications using Python, Harjinder Mi...
 
Golang 101
Golang 101Golang 101
Golang 101
 
Iron Sprog Tech Talk
Iron Sprog Tech TalkIron Sprog Tech Talk
Iron Sprog Tech Talk
 
CPAN Training
CPAN TrainingCPAN Training
CPAN Training
 
Golang workshop - Mindbowser
Golang workshop - MindbowserGolang workshop - Mindbowser
Golang workshop - Mindbowser
 
OpenMP
OpenMPOpenMP
OpenMP
 
TensorFlow meetup: Keras - Pytorch - TensorFlow.js
TensorFlow meetup: Keras - Pytorch - TensorFlow.jsTensorFlow meetup: Keras - Pytorch - TensorFlow.js
TensorFlow meetup: Keras - Pytorch - TensorFlow.js
 
Why learn python in 2017?
Why learn python in 2017?Why learn python in 2017?
Why learn python in 2017?
 

Similar to Powering Tensorflow with big data using Apache Beam, Flink, and Spark - OSCON PDX 2018

Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018Holden Karau
 
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...Holden Karau
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Holden Karau
 
Debugging PySpark - Spark Summit East 2017
Debugging PySpark - Spark Summit East 2017Debugging PySpark - Spark Summit East 2017
Debugging PySpark - Spark Summit East 2017Holden Karau
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauSpark Summit
 
Debugging PySpark - PyCon US 2018
Debugging PySpark -  PyCon US 2018Debugging PySpark -  PyCon US 2018
Debugging PySpark - PyCon US 2018Holden Karau
 
Keeping the fun in functional w/ Apache Spark @ Scala Days NYC
Keeping the fun in functional   w/ Apache Spark @ Scala Days NYCKeeping the fun in functional   w/ Apache Spark @ Scala Days NYC
Keeping the fun in functional w/ Apache Spark @ Scala Days NYCHolden Karau
 
Migrating Apache Spark ML Jobs to Spark + Tensorflow on Kubeflow
Migrating Apache Spark ML Jobs to Spark + Tensorflow on KubeflowMigrating Apache Spark ML Jobs to Spark + Tensorflow on Kubeflow
Migrating Apache Spark ML Jobs to Spark + Tensorflow on KubeflowDatabricks
 
Introducing Kubeflow (w. Special Guests Tensorflow and Apache Spark)
Introducing Kubeflow (w. Special Guests Tensorflow and Apache Spark)Introducing Kubeflow (w. Special Guests Tensorflow and Apache Spark)
Introducing Kubeflow (w. Special Guests Tensorflow and Apache Spark)DataWorks Summit
 
An introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckAn introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckData Con LA
 
Performance optimization techniques for Java code
Performance optimization techniques for Java codePerformance optimization techniques for Java code
Performance optimization techniques for Java codeAttila Balazs
 
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...confluent
 
How to Choose a Deep Learning Framework
How to Choose a Deep Learning FrameworkHow to Choose a Deep Learning Framework
How to Choose a Deep Learning FrameworkNavid Kalaei
 
Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?Holden Karau
 
Prototype4Production Presented at FOSSASIA2015 at Singapore
Prototype4Production Presented at FOSSASIA2015 at SingaporePrototype4Production Presented at FOSSASIA2015 at Singapore
Prototype4Production Presented at FOSSASIA2015 at SingaporeDhruv Gohil
 
TFLite NNAPI and GPU Delegates
TFLite NNAPI and GPU DelegatesTFLite NNAPI and GPU Delegates
TFLite NNAPI and GPU DelegatesKoan-Sin Tan
 
Introduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopIntroduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopHolden Karau
 
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Holden Karau
 
OpenLineage for Stream Processing | Kafka Summit London
OpenLineage for Stream Processing | Kafka Summit LondonOpenLineage for Stream Processing | Kafka Summit London
OpenLineage for Stream Processing | Kafka Summit LondonHostedbyConfluent
 

Similar to Powering Tensorflow with big data using Apache Beam, Flink, and Spark - OSCON PDX 2018 (20)

Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018
 
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018
 
Debugging PySpark - Spark Summit East 2017
Debugging PySpark - Spark Summit East 2017Debugging PySpark - Spark Summit East 2017
Debugging PySpark - Spark Summit East 2017
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden Karau
 
Debugging PySpark - PyCon US 2018
Debugging PySpark -  PyCon US 2018Debugging PySpark -  PyCon US 2018
Debugging PySpark - PyCon US 2018
 
Keeping the fun in functional w/ Apache Spark @ Scala Days NYC
Keeping the fun in functional   w/ Apache Spark @ Scala Days NYCKeeping the fun in functional   w/ Apache Spark @ Scala Days NYC
Keeping the fun in functional w/ Apache Spark @ Scala Days NYC
 
Migrating Apache Spark ML Jobs to Spark + Tensorflow on Kubeflow
Migrating Apache Spark ML Jobs to Spark + Tensorflow on KubeflowMigrating Apache Spark ML Jobs to Spark + Tensorflow on Kubeflow
Migrating Apache Spark ML Jobs to Spark + Tensorflow on Kubeflow
 
Introducing Kubeflow (w. Special Guests Tensorflow and Apache Spark)
Introducing Kubeflow (w. Special Guests Tensorflow and Apache Spark)Introducing Kubeflow (w. Special Guests Tensorflow and Apache Spark)
Introducing Kubeflow (w. Special Guests Tensorflow and Apache Spark)
 
An introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckAn introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuck
 
Performance optimization techniques for Java code
Performance optimization techniques for Java codePerformance optimization techniques for Java code
Performance optimization techniques for Java code
 
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
 
How to Choose a Deep Learning Framework
How to Choose a Deep Learning FrameworkHow to Choose a Deep Learning Framework
How to Choose a Deep Learning Framework
 
Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?
 
Prototype4Production Presented at FOSSASIA2015 at Singapore
Prototype4Production Presented at FOSSASIA2015 at SingaporePrototype4Production Presented at FOSSASIA2015 at Singapore
Prototype4Production Presented at FOSSASIA2015 at Singapore
 
TFLite NNAPI and GPU Delegates
TFLite NNAPI and GPU DelegatesTFLite NNAPI and GPU Delegates
TFLite NNAPI and GPU Delegates
 
Go fundamentals
Go fundamentalsGo fundamentals
Go fundamentals
 
Introduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopIntroduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines Workshop
 
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
 
OpenLineage for Stream Processing | Kafka Summit London
OpenLineage for Stream Processing | Kafka Summit LondonOpenLineage for Stream Processing | Kafka Summit London
OpenLineage for Stream Processing | Kafka Summit London
 

Recently uploaded

A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)Christopher H Felton
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Paul Calvano
 
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一Fs
 
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书rnrncn29
 
Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITMgdsc13
 
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一Fs
 
Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Excelmac1
 
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一Fs
 
Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Sonam Pathan
 
Magic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMagic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMartaLoveguard
 
Elevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New OrleansElevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New Orleanscorenetworkseo
 
Top 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxTop 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxDyna Gilbert
 
PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationLinaWolf1
 
SCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is prediSCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is predieusebiomeyer
 
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Dana Luther
 
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书rnrncn29
 
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012rehmti665
 
Q4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptxQ4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptxeditsforyah
 
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书zdzoqco
 

Recently uploaded (20)

A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24
 
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
 
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
 
Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITM
 
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
 
Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...
 
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
 
Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170
 
Magic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMagic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptx
 
Elevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New OrleansElevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New Orleans
 
Top 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxTop 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptx
 
PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 Documentation
 
SCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is prediSCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is predi
 
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
 
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
 
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
 
Q4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptxQ4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptx
 
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
 
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
 

Powering Tensorflow with big data using Apache Beam, Flink, and Spark - OSCON PDX 2018

  • 1. Powering TensorFlow with big data With Apache Beam, Flink & Spark bonus @holdenkarau
  • 2. Holden: ● My name is Holden Karau ● Prefered pronouns are she/her ● Developer Advocate at Google ● Apache Spark PMC, Beam contributor ● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon ● co-author of Learning Spark & High Performance Spark ● Twitter: @holdenkarau ● Slide share http://www.slideshare.net/hkarau ● Code review livestreams: https://www.twitch.tv/holdenkarau / https://www.youtube.com/user/holdenkarau ● Spark Talk Videos http://bit.ly/holdenSparkVideos
  • 3.
  • 4. Who I think you wonderful humans are? ● Nice enough people ● Don’t mind pictures of cats ● Maybe somewhat familiar with Tensorflor? ● Maybe somewhat familiar with Beam or Spark or Flink? Lori Erickson
  • 5. What we did get to: ● TensorFlowOnSpark w/basic Apache Arrow ● TF Transform demo on Apache Flink via Apache Beam ● Python & Go on Beam on Flink prototype ○ Everyone loves wordcount right? right?.... ● New Beam architecture allowing for better portability & handling dependencies (like Tensorflow) ● DO NOT non-JVM BEAM on Flink IN PRODUCTION Vladimir Pustovit
  • 6. DO NOT USE THIS* IN PRODUCTION TODAY ● I’m serious, I don’t want to die or cause the next financial meltdown with software I’m a part of ● By Today I mean July 18th 2018, but it’s probably going to not be great for at least a “little while” *Portable Beam on Flink (including Python, so TFT) Vladimir Pustovit PROTambako The Jaguar
  • 7. What will be covered? Most likely: ● Where we are today in non-JVM support in Beam ● And why this matters for Tensorflow ● What the rest of Big Data ecosystem looks like going outside the JVM ● A partial TF Transform demo + TFMA links + possible system crash If there is time (e.g. demo runs quickly, wifi works, several other happy unlikely things): ● TensorFlowOnSpark ● Apache Arrow - How this changes “everything”*
  • 8. So why do I need to power DL w/Big Data? ● Deep learning is most effective with large sample sets for training ● As much as some may say that no feature prep is required even if you’re looking at mnist.csv you probably have _some_ feature prep ● Since we need big data for training we need to to do our feature prep it ● Even if your just trying to raise some VC money it's going to go a lot better if you add some keywords about a large proprietary dataset
  • 9. TensorFlow isn’t enough on its own ● Enter TFX & friends like Kubeflow ○ Current related TFX OSS components: TF.Transform TF.Serving (with more coming) ● Alternatives: piles of custom code re-created at serving time. ○ Yay job security? PROJennifer C.
  • 10. How do I do feature prep? (old skool) ● Write custom preparation jobs in your favourite big data tool (mine is Apache Spark, but maybe yours is something else) ● Run it, train on the prepared data ● Rewrite your feature prep code to run at serving time ○ Error prone and sad
  • 11. Enter: TF.Transform ● For pre-processing of your data ○ e.g. where you spend 90% of your dev time anyways ● Integrates into serving time :D ● OSS ● Written in Python ● Runs on top of Apache Beam, but current release not yet support Python outside of GCP ○ On master this can run on Flink, but has bugs currently. ○ Please don’t use this in production today unless your on GCP/Dataflow PROKathryn Yengel
  • 12. Defining a Transform processing function def preprocessing_fn(inputs): x = inputs['x'] y = inputs['y'] s = inputs['s'] x_centered = x - tft.mean(x) y_normalized = tft.scale_to_0_1(y) s_int = tft.string_to_int(s) return { 'x_centered': x_centered, 'y_normalized': y_normalized, 's_int': s_int}
  • 13. mean stddev normalize multiply quantiles bucketize Analyzers Reduce (full pass) Implemented as a distributed data pipeline Transforms Instance-to-instance (don’t change batch dimension) Pure TensorFlow
  • 15. Scale to ... Bag of Words / N-Grams Bucketization Feature Crosses tft.ngrams tft.string_to_int tf.string_split tft.scale_to_z_score tft.apply_buckets tft.quantiles tft.string_to_int tf.string_join ... Some common use-cases...
  • 16. BEAM Beyond the JVM: Current release ● Non JVM BEAM doesn’t work outside of Google’s environment yet ● tl;dr : uses grpc / protobuf ○ Similar to the common design but with more efficient representations (often) ● But exciting new plans to unify the runners and ease the support of different languages (called SDKS) ○ See https://beam.apache.org/contribute/portability/ ● If this is exciting, you can come join me on making BEAM work in Python3 ○ Yes we still don’t have that :( ○ But we're getting closer & you can come join us on BEAM-2874 :D Emma
  • 17. BEAM Beyond the JVM: Master + Experiments ● Common interface for setting up jobs ● Portability framework allows SDK harnesses in arbitrary to be kicked off ● Runners ship in their own docker containers (goodbye dependency hell, hello container hell) ○ Also for now rolling containers leaves something to be desired (e.g. edit docker file by hand) ● Hacked up Python SDK to sort of talk to the new interface ● Go SDK talks to the new interface, still missing some features ● Need permissions? Run on GKE or plumb through permissions file :( Nick
  • 18. BEAM Beyond the JVM: Master w/ experiments *ish *ish *ish Nick portability *ish
  • 19. So what does that look like? Driver Worker 1 Docker grpc Worker K Docker grpc
  • 20. So how TF does this relate to TF? ● Tensorflow is in Python (kind of) ● Once we finish the Python SDK on Beam on Flink adventure you can use all sorts of cool libraries (like TFT/TFX) to do your tensorflow work ○ You can use them today too if your use case is on Dataflow ○ If you don’t mind bugs you can experiment with them on Flink too ● You will be able manage your dependencies ● You will be able to (in theory) re-use dataprep code at serving time ○ 80% less copy n’ paste code with slight mistakes that get out of date!** ● No that doesn’t work today ● Or tomorrow ● But… eventually ○ Standard OSS excuse “patches welcome” (sort of if you can find the branch :p) **Not a guarantee, see your vendor for details.
  • 21. Ooor from the chicago taxi data... for key in taxi.DENSE_FLOAT_FEATURE_KEYS: # Preserve this feature as a dense float, setting nan's to the mean. outputs[key] = transform.scale_to_z_score(inputs[key]) for key in taxi.VOCAB_FEATURE_KEYS: # Build a vocabulary for this feature. outputs[key] = transform.string_to_int( inputs[key], top_k=taxi.VOCAB_SIZE, num_oov_buckets=taxi.OOV_SIZE) for key in taxi.BUCKET_FEATURE_KEYS: outputs[key] = transform.bucketize(inputs[key],
  • 22. Beam Demo Time!!!!! ● TFT!!!!!! So amazing!!!!!!!!! ○ Want to move to SF and raise a series A? Pay attention :p ● Based on my testing at noon there is a 50% chance it will finish without crashing ● That’s your friendly reminder not to run any of this in production (yet) ● The demo code (holdenk/model-analysis) is forked from axelmagn/model- analysis which is forked from tensorflow/model-analysis and runs on master of Apache Beam w/Flink
  • 23. BEAM Beyond the JVM: The “future” E.g. not now *ish *ish *ish Nick portability *ish *ish
  • 24. Ok now what? ● Integrate this into your model serving pipeline of choice ○ Don’t have one or open to change? Checkout TFMA which can directly serve it ● There’s a guide (it doesn’t show Flink because not released yet) but steps are similar ○ But you’re not using this in production today anyways? ○ Right? Nick Perla
  • 25. (Optional) Second Beam Demo Time!!!!! ● Word count!!!!!! So amazing!!!!!!!!! ● Based on my testing on Saturday there is a 2 in 3 chance this will hard lock my computer ● That’s your friendly reminder not to run any of this in production ● Demo shell script of fun (go only) & python + go
  • 26. What do the rest of the systems do? ● Spoiler: mostly it’s not better ○ Although it tends to be more finished ○ Sometimes it's different ● Different tradeoffs, maybe better for your use case but all tradeoffs Kate Neilan
  • 27. A quick detour into PySpark’s internals + + JSON TimOve
  • 28. PySpark ● The Python interface to Spark ● Same general technique used as the bases for the C#, R, Julia, etc. interfaces to Spark ● Fairly mature, integrates well-ish into the ecosystem, less a Pythonrific API ● Has some serious performance hurdles from the design
  • 29. So what does that look like? Driver py4j Worker 1 Worker K pipe pipe
  • 31. So how does that impact Py[X] forall X in {Big Data}-{Native Python Big Data} ● Double serialization cost makes everything more expensive ● Python worker startup takes a bit of extra time ● Python memory isn’t controlled by the JVM - easy to go over container limits if deploying on YARN or similar ● Error messages make ~0 sense ● Dependency management makes limited sense ● features aren’t automatically exposed, but exposing them is normally simple
  • 32. TensorFlowOnSpark, everyone loves mnist! cluster = TFCluster.run(sc, mnist_dist_dataset.map_fun, args, args.cluster_size, num_ps, args.tensorboard, TFCluster.InputMode.SPARK) if args.mode == "train": cluster.train(dataRDD, args.epochs) Lida
  • 33. The “future”*: faster interchange ● By future I mean availability today but running it in production is “adventurous” ● Unifying our cross-language experience ○ And not just “normal” languages, CUDA counts yo Tambako The Jaguar
  • 34. Andrew Skudder *Arrow: Spark 2.3 and beyond & GPUs & R & Python & …. * *
  • 35. What does the future look like?* *Source: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html. *Vendor benchmark. Trust but verify.
  • 36. Arrow (a poorly drawn big data view) Logos trademarks of their respective projects Juha Kettunen *ish
  • 37. Rewriting your code because why not spark.catalog.registerFunction( "add", lambda x, y: x + y, IntegerType()) => add = pandas_udf(lambda x, y: x + y, IntegerType()) Jennifer C.
  • 38. And we can do this in TFOnSpark*: unionRDD.foreachPartition(TFSparkNode.train(self.cluster_info, self.cluster_meta, qname)) Will Transform Into something magical (aka fast but unreliable) on the next slide! Delaina Haslam
  • 39. Which becomes train_func = TFSparkNode.train(self.cluster_info, self.cluster_meta, qname) @pandas_udf("int") def do_train(inputSeries1, inputSeries2): # Sad hack for now modified_series = map(lambda x: (x[0], x[1]), zip(inputSeries1, inputSeries2)) train_func(modified_series) return pandas.Series([0] * len(inputSeries1)) ljmacphee
  • 40. And this now looks like: Logos trademarks of their respective projects Juha Kettunen *ish
  • 41. TFOnSpark Possible vNext+1? ● Avoid funneling the data through Python native types ○ For now the Spark Arrow UDFS aren’t perfect for this ○ But we can (and are) improving them ● mmapped Arrow? ● Skip Python on the workers handling data entirely (idk I’m lazy so probably not) Renars
  • 42. References ● TFMA + TFT example guide - https://www.tensorflow.org/tfx/model_analysis/examples/chicago_taxi ● Apache Beam github repo (w/early alpha portable Flink support)- https://beam.apache.org/ ● TFMA Example fork for use w/Beam on Flink - ● TensorFlowOnSpark -https://github.com/yahoo/TensorFlowOnSpark ● Spark Deep Learning Pipelines - https://github.com/databricks/spark-deep- learning ● flink-tensorflow - https://github.com/FlinkML/flink-tensorflow ● TF.Transform - https://github.com/tensorflow/transform ● Beam portability design: https://beam.apache.org/contribute/portability/ ● Beam on Flink + portability https://issues.apache.org/jira/browse/BEAM-2889 PROR. Crap Mariner
  • 43. And some upcoming talks: ● August ○ Tentative Ottawa meetup! ○ JupyterCon NYC ● September ○ Strata NYC ○ Strangeloop STL ● October ○ Spark Summit London ○ Reversim Tel Aviv ● November ○ Big Data Spain
  • 44. k thnx bye :) If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark I need to give a testing talk next month, help a “friend” out. Will tweet results “eventually” @holdenkarau Do you want more realistic benchmarks? Share your UDFs! http://bit.ly/pySparkUDF Pssst: Have feedback on the presentation? Give me a shout (holden@pigscanfly.ca) if you feel comfortable doing so :) Give feedback on this presentation http://bit.ly/holdenTalkFeedback
  • 45. What’s the rest of big data outside the JVM look like? Most of the tools are built in the JVM, so how do we play together? ● Pickling, Strings, JSON, XML, oh my! ● Unix pipes ● Sockets What about if we don’t want to copy the data all the time? ● Or standalone “pure”* re-implementations of everything ○ Reasonable option for things like Kafka where you would have the I/O regardless. ○ Also cool projects like dask (pure python) -- but hard to talk to existing ecosystem David Brown
  • 46. Hadoop “streaming” (Python/R) ● Unix pipes! ● Involves a data copy, formats get sad ● But the overhead of a Map/Reduce task is pretty high anyways... Lisa Larsson
  • 47. Kafka: re-implement all the things ● Multiple options for connecting to Kafka from outside of the JVM (yay!) ● They implement the protocol to talk to Kafka (yay!) ● This involves duplicated client work, and sometimes the clients can be slow (solution, FFI bindings to C instead of Java) ● Buuuut -- we can’t access all of the cool Kafka business (like Kafka Streams) and features depend on client libraries implementing them (easy to slip below parity) Smokey Combs
  • 48. Dask: a new beginning? ● Pure* python implementation ● Provides real enough DataFrame interface for distributed data ● Also your standard-ish distributed collections ● Multiple backends ● Primary challenge: interacting with the rest of the big data ecosystem ○ Arrow & friends might make this better with time too, buuut…. ● See https://dask.pydata.org/en/latest/ & http://dask.pydata.org/en/latest/spark.html Lisa Zins

Editor's Notes

  1. Photo from https://www.flickr.com/photos/lorika/4148361363/in/photolist-7jzriM-9h3my2-9Qn7iD-bp55TS-7YCJ4G-4pVTXa-7AFKbm-bkBfKJ-9Qn6FH-aniTRF-9LmYvZ-HD6w6-4mBo3t-8sekvz-mgpFzD-5z6BRK-de513-8dVhBu-bBZ22n-4Vi2vS-3g13dh-e7aPKj-b6iHHi-4ThGzv-7NcFNK-aniTU6-Kzqxd-7LPmYs-4ok2qy-dLY9La-Nvhey-Kte6U-74B7Ma-6VfnBK-6VjrY7-58kAY9-7qUeDK-4eoSxM-6Vjs5A-9v5Pvb-26mja-4scwq3-GHzAL-672eVr-nFUomD-4s8u8F-5eiQmQ-bxXXCc-5P9cCT-5GX8no
  2. introduce spark rdds, purple blog diagrams https://www.flickr.com/photos/pustovit/15867520885/in/photolist-qbac9i-9XLrR4-74scWq-bpnxfN-qYAD3D-e6u5Ej-oztsCu-qJMG4L-7b4y4a-gu8Wa-8MzgVR-b5gHki-djzdH3-82TowY-qJc99b-pC6yth-ifAkvP-mju1Ce-3ACPG6-F9aWR2-5QQL1U-4Hav3S-dGHvJj-jxQLth-djzdgd-dL24wn-8znjgb-aZxA6H-gDkWNo-djzcZF-22NYH5r-9amo58-apqLdG-fZhoSH-cDjpEQ-nLbBuK-6EuEN2-dAN5KN-asBjbL-Vx2zFR-djzdki-SRkhd1-djzcnF-Tc9FAf-qsduzQ-djzd4P-9wCiZT-8JALTP-eqbpop-R2S2FR
  3. Remind people not to use this in production
  4. https://www.flickr.com/photos/29638108@N06/26104346281/in/photolist-66Ky9n-5nW3TP-f4xVRt-sewjsA-BVmgy-FLKAFT-89kfzb-FSBrSp-puHhfg-xrXMpL-5fjZcs-G9DjaZ-eXvwfo-oUk4hz-7gmfLB-9s2gwi-bqRAKw-4CGf6X-5o24aR-25AijkV-njSsfw-4tYMke-FsvaQq-haUJv-6S2j5f-c19gY5-rdr7va-6qoirp-666Sgs-3bcTwj-7QoFUj-ayEq5k-2yduWy-Co2uwS-NKcKBY-eXvx2d-ZHnLQj-6Kk14A-rgBNGV-EXb2PG-dGg4Mk-23dGLzS-a7EshL-85r8fq-ix6nEM-6izGaR-9MT8Ee-oqhy96-CE4Sgs-5LKLdr/
  5. https://www.flickr.com/photos/29638108@N06/26104346281/in/photolist-66Ky9n-5nW3TP-f4xVRt-sewjsA-BVmgy-FLKAFT-89kfzb-FSBrSp-puHhfg-xrXMpL-5fjZcs-G9DjaZ-eXvwfo-oUk4hz-7gmfLB-9s2gwi-bqRAKw-4CGf6X-5o24aR-25AijkV-njSsfw-4tYMke-FsvaQq-haUJv-6S2j5f-c19gY5-rdr7va-6qoirp-666Sgs-3bcTwj-7QoFUj-ayEq5k-2yduWy-Co2uwS-NKcKBY-eXvx2d-ZHnLQj-6Kk14A-rgBNGV-EXb2PG-dGg4Mk-23dGLzS-a7EshL-85r8fq-ix6nEM-6izGaR-9MT8Ee-oqhy96-CE4Sgs-5LKLdr/
  6. https://www.flickr.com/photos/29638108@N06/26104346281/in/photolist-66Ky9n-5nW3TP-f4xVRt-sewjsA-BVmgy-FLKAFT-89kfzb-FSBrSp-puHhfg-xrXMpL-5fjZcs-G9DjaZ-eXvwfo-oUk4hz-7gmfLB-9s2gwi-bqRAKw-4CGf6X-5o24aR-25AijkV-njSsfw-4tYMke-FsvaQq-haUJv-6S2j5f-c19gY5-rdr7va-6qoirp-666Sgs-3bcTwj-7QoFUj-ayEq5k-2yduWy-Co2uwS-NKcKBY-eXvx2d-ZHnLQj-6Kk14A-rgBNGV-EXb2PG-dGg4Mk-23dGLzS-a7EshL-85r8fq-ix6nEM-6izGaR-9MT8Ee-oqhy96-CE4Sgs-5LKLdr/
  7. As just discussed, TFT provides utility functions that run analyzers as needed. These are data processing jobs that can be run in arbitrary environments using the Beam SDK.
  8. What happens behind the scene is that the analyzers run as a distributed data processing graph, and the result is put into the output graph as constants.
  9. Some of the common use cases: Just talked about ones on left: Scale scores Bucketize Text features: apply bag of words or n grams For Feature crosses: cross strings and generate vocabs of the result of those crosses As mentioned before, tf.Tranform is powerful in that you can chain these transformations.
  10. SparkContext uses Py4J to launch a JVM and create a JavaSparkContext. On The executors, java subprocess launches a python subprocess via a pipe. The data is serialized using cpickle and sent ot the python subprocess
  11. https://www.flickr.com/photos/29638108@N06/26104346281/in/photolist-66Ky9n-5nW3TP-f4xVRt-sewjsA-BVmgy-FLKAFT-89kfzb-FSBrSp-puHhfg-xrXMpL-5fjZcs-G9DjaZ-eXvwfo-oUk4hz-7gmfLB-9s2gwi-bqRAKw-4CGf6X-5o24aR-25AijkV-njSsfw-4tYMke-FsvaQq-haUJv-6S2j5f-c19gY5-rdr7va-6qoirp-666Sgs-3bcTwj-7QoFUj-ayEq5k-2yduWy-Co2uwS-NKcKBY-eXvx2d-ZHnLQj-6Kk14A-rgBNGV-EXb2PG-dGg4Mk-23dGLzS-a7EshL-85r8fq-ix6nEM-6izGaR-9MT8Ee-oqhy96-CE4Sgs-5LKLdr/
  12. https://www.flickr.com/photos/timove/2873619269/in/photolist-66Ky9n-5nW3TP-f4xVRt-sewjsA-BVmgy-FLKAFT-89kfzb-FSBrSp-puHhfg-xrXMpL-5fjZcs-G9DjaZ-eXvwfo-oUk4hz-7gmfLB-9s2gwi-bqRAKw-4CGf6X-5o24aR-25AijkV-njSsfw-4tYMke-FsvaQq-haUJv-6S2j5f-c19gY5-rdr7va-6qoirp-666Sgs-3bcTwj-7QoFUj-ayEq5k-2yduWy-Co2uwS-NKcKBY-eXvx2d-ZHnLQj-6Kk14A-rgBNGV-EXb2PG-dGg4Mk-23dGLzS-a7EshL-85r8fq-ix6nEM-6izGaR-9MT8Ee-oqhy96-CE4Sgs-5LKLdr/
  13. SparkContext uses Py4J to launch a JVM and create a JavaSparkContext. On The executors, java subprocess launches a python subprocess via a pipe. The data is serialized using cpickle and sent ot the python subprocess
  14. What does python memory isn’t controlled by the JVM mean? Double serialization from lanague transform
  15. https://www.flickr.com/photos/juhakettunen/17218305870/in/photolist-66Ky9n-5nW3TP-f4xVRt-sewjsA-BVmgy-FLKAFT-89kfzb-FSBrSp-puHhfg-xrXMpL-5fjZcs-G9DjaZ-eXvwfo-oUk4hz-7gmfLB-9s2gwi-bqRAKw-4CGf6X-5o24aR-25AijkV-njSsfw-4tYMke-FsvaQq-haUJv-6S2j5f-c19gY5-rdr7va-6qoirp-666Sgs-3bcTwj-7QoFUj-ayEq5k-2yduWy-Co2uwS-NKcKBY-eXvx2d-ZHnLQj-6Kk14A-rgBNGV-EXb2PG-dGg4Mk-23dGLzS-a7EshL-85r8fq-ix6nEM-6izGaR-9MT8Ee-oqhy96-CE4Sgs-5LKLdr/
  16. https://www.flickr.com/photos/29638108@N06/26104346281/in/photolist-66Ky9n-5nW3TP-f4xVRt-sewjsA-BVmgy-FLKAFT-89kfzb-FSBrSp-puHhfg-xrXMpL-5fjZcs-G9DjaZ-eXvwfo-oUk4hz-7gmfLB-9s2gwi-bqRAKw-4CGf6X-5o24aR-25AijkV-njSsfw-4tYMke-FsvaQq-haUJv-6S2j5f-c19gY5-rdr7va-6qoirp-666Sgs-3bcTwj-7QoFUj-ayEq5k-2yduWy-Co2uwS-NKcKBY-eXvx2d-ZHnLQj-6Kk14A-rgBNGV-EXb2PG-dGg4Mk-23dGLzS-a7EshL-85r8fq-ix6nEM-6izGaR-9MT8Ee-oqhy96-CE4Sgs-5LKLdr/
  17. https://www.flickr.com/photos/29638108@N06/26104346281/in/photolist-66Ky9n-5nW3TP-f4xVRt-sewjsA-BVmgy-FLKAFT-89kfzb-FSBrSp-puHhfg-xrXMpL-5fjZcs-G9DjaZ-eXvwfo-oUk4hz-7gmfLB-9s2gwi-bqRAKw-4CGf6X-5o24aR-25AijkV-njSsfw-4tYMke-FsvaQq-haUJv-6S2j5f-c19gY5-rdr7va-6qoirp-666Sgs-3bcTwj-7QoFUj-ayEq5k-2yduWy-Co2uwS-NKcKBY-eXvx2d-ZHnLQj-6Kk14A-rgBNGV-EXb2PG-dGg4Mk-23dGLzS-a7EshL-85r8fq-ix6nEM-6izGaR-9MT8Ee-oqhy96-CE4Sgs-5LKLdr/
  18. https://www.flickr.com/photos/juhakettunen/17218305870/in/photolist-66Ky9n-5nW3TP-f4xVRt-sewjsA-BVmgy-FLKAFT-89kfzb-FSBrSp-puHhfg-xrXMpL-5fjZcs-G9DjaZ-eXvwfo-oUk4hz-7gmfLB-9s2gwi-bqRAKw-4CGf6X-5o24aR-25AijkV-njSsfw-4tYMke-FsvaQq-haUJv-6S2j5f-c19gY5-rdr7va-6qoirp-666Sgs-3bcTwj-7QoFUj-ayEq5k-2yduWy-Co2uwS-NKcKBY-eXvx2d-ZHnLQj-6Kk14A-rgBNGV-EXb2PG-dGg4Mk-23dGLzS-a7EshL-85r8fq-ix6nEM-6izGaR-9MT8Ee-oqhy96-CE4Sgs-5LKLdr/
  19. https://www.flickr.com/photos/29638108@N06/26104346281/in/photolist-66Ky9n-5nW3TP-f4xVRt-sewjsA-BVmgy-FLKAFT-89kfzb-FSBrSp-puHhfg-xrXMpL-5fjZcs-G9DjaZ-eXvwfo-oUk4hz-7gmfLB-9s2gwi-bqRAKw-4CGf6X-5o24aR-25AijkV-njSsfw-4tYMke-FsvaQq-haUJv-6S2j5f-c19gY5-rdr7va-6qoirp-666Sgs-3bcTwj-7QoFUj-ayEq5k-2yduWy-Co2uwS-NKcKBY-eXvx2d-ZHnLQj-6Kk14A-rgBNGV-EXb2PG-dGg4Mk-23dGLzS-a7EshL-85r8fq-ix6nEM-6izGaR-9MT8Ee-oqhy96-CE4Sgs-5LKLdr/
  20. Is debugging also really hard?
  21. Or this one … also should we put these below python part?