ML Pipelines with Apache Spark & a little Apache Beam
Ottawa Reactive Meetup
2018
Hella-Legit
Who am I?
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google focused on OSS Big Data
● Apache Spark PMC (think committer with tenure)
● Contributor to a lot of other projects (including BEAM)
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of High Performance Spark & Learning Spark (+ more)
● Twitter: @holdenkarau
● Slideshare http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Github https://github.com/holdenk
● Related Spark Videos http://bit.ly/holdenSparkVideos
Who do I think you all are?
● Nice people*
● Mostly engineers
● Familiar with one of Java, Scala, or Python
● May or may not know Apache Spark
Amanda
What is in store for our adventure?
● Why train models on distributed systems
○ Have big data? It’s better than down sampling (more data often wins
over better algorithms)
○ Have small data? Enjoy fast hyperparameter tuning and taking an early coffee break (but not too long)
● Why Apache Spark and Apache Beam in general
○ Answer questions in minutes instead of days* (*some of the time)
● “Classic” ML pipelines & then some deep learning
Ada Doglace
What is out of scope for today:
● The details backing these algorithms
○ If you ask I’ll just say “gradient descent” and run away after throwing a
smoke bomb on the floor.
● Questions about stack traces (j/k)
Ada Doglace
So why might you be doing this?
● Maybe you’ve built a system with “hand tuned” weights
● Your static list of [X, Y, Z] no longer cuts it
● Your system is overwhelmed with abuse & your budget
for handling it is less than an intern.
● You want a new job and ML sounds nicer than Perl on the resume nowadays
Amanda
Why did I get into this?
● I built a few search systems
● We spent a lot of time… guessing… I mean tuning
● We hired some smart people from Google
● Added ML magic
● Things went downhill from there (and then uphill at
another local maximum later)
What tools are we going to use today?
● Apache Spark - Model Training (plus fits into your ETL)
● emacs/vim - looking at random output
● spark-testing-base - You still need unit tests
● (sort of) spark-validator - Validating your jobs
● csv files - hey at least it's not XML
● XML - ahhh crap
Demos will be in Scala (but I'll try and avoid the odd things like _s) & you can stop me if it's confusing.
Mohammed Mustafa
What is Spark?
● General purpose distributed system
○ Built in Scala with an FP inspired API
● Apache project (one of the most active)
● Much faster than Hadoop Map/Reduce
● Good when too big for a single machine
● Built on top of two abstractions for distributed data: RDDs & Datasets
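Not from the deck, but a tiny sketch of those two abstractions, assuming a SparkSession named spark and import spark.implicits._:
// RDD: a distributed collection of plain JVM objects.
val numbersRdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 4))
// Dataset: distributed too, but with a schema the optimizer can see.
val numbersDs = Seq(1, 2, 3, 4).toDS()
numbersDs.filter($"value" > 2).show()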
When we say distributed we mean...
Why people come to Spark:
Well this MapReduce
job is going to take
16 hours - how long
could it take to learn
Spark?
dougwoods
Why people come to Spark:
My DataFrame won’t fit
in memory on my cluster
anymore, let alone my
MacBook Pro :( Maybe
this Spark business will
solve that...
brownpau
Plus a little magic :)
Steven Saus
The different pieces of Spark
[Diagram: the Spark stack - Spark SQL / DataFrames & Datasets, Structured Streaming, Streaming, Spark ML, MLLib, GraphX (& the old Bagel API), and GraphFrames sitting on top of core Apache Spark, with Scala, Java, Python, and R bindings.]
Paul Hudson
Required: Word count (in python)
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x + y))
word_count.saveAsTextFile("output")
Photo By: Will Keightley
Companion notebook funtimes:
● Small companion Jupyter notebook to explore with:
○ Python: http://bit.ly/hkMLExample
○ Scala: http://bit.ly/sparkMLScalaNB
● If you want to use it you will need access to Apache Spark
○ Install from http://spark.apache.org
○ Or get access to one of the online notebook environments (Google Dataproc, DataBricks Cloud, Microsoft Spark HDInsights Cluster Notebook, etc.)
David DeHetre
Transformers, Estimators and Pipelines
● Transformers transform a DataFrame into another
● Estimators can be trained on a DataFrame to produce a
transformer
● Pipelines chain together multiple transformers and
estimators
A.Davey
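A minimal sketch of that contract (mine, not from the deck), assuming a DataFrame df with a string column "category":
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer
// Estimator: fit() learns from the data and returns a Transformer.
val indexer = new StringIndexer().setInputCol("category").setOutputCol("category-index")
val indexerModel = indexer.fit(df) // a StringIndexerModel, which is a Transformer
// Transformer: transform() maps one DataFrame to another.
val indexed = indexerModel.transform(df)
// Pipeline: an Estimator built out of a sequence of stages.
val pipelineModel = new Pipeline().setStages(Array(indexer)).fit(df)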
Let’s start with loading some data
● Genuine big data, doesn’t fit on a floppy disk
○ It’s ok if your inputs do fit on a floppy disk, buuuuut more data
generally works better
● In all seriousness, it's not a bad practice to down-sample first while you're building your pipeline so you can find your errors fast (Spark pipelines discard some type information -- sorry!)
Jess Johnson
Loading with sparkSQL & spark-csv
sqlContext.read returns a DataFrameReader
We can specify general properties & data specific options
● option("key", "value")
○ spark-csv ones we will use are header & inferSchema
● format("formatName")
○ built in formats include parquet, jdbc, etc.; today we will use com.databricks.spark.csv
● load("path")
Jess Johnson
Loading with sparkSQL & spark-csv
val df = sqlContext.read
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("resources/adult.data")
Jess Johnson
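Not in the deck, but worth doing right after the load: sanity-check what inferSchema actually decided (assumes the df from the snippet above):
df.printSchema()    // column names & inferred types
df.show(5)          // peek at a few rows
println(df.count()) // rough size check (cheap on a down-sampled input)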
Let's explore training a Decision Tree
● Step 1: Data loading (done!)
● Step 2a: Data prep (produce features, remove complete garbage, etc.)
● Step 2b: Data prep (select features, etc.)
● Step 3: Train
● Step 4: Predict
Data prep / cleaning
● We need to predict a double (can be 0.0, 1.0, but type
must be double)
● We need to train with a vector of features**
Huang Yun Chung
** There is work to allow images and other things too.
Data prep / cleaning continued
// Combines a list of double input features into a vector
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "education-num"))
  .setOutputCol("features")
// String indexer converts a set of strings into doubles
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("category-index")
// Can be used to combine pipeline components together
val pipeline = new Pipeline().setStages(Array(assembler, indexer))
Huang Yun Chung
So a bit more about that pipeline
● Each of our previous components has a "fit" & "transform" stage
● Constructing the pipeline this way makes it easier to work with (only need to call one fit & one transform)
● Can re-use the fitted model on future data
model = pipeline.fit(df)
prepared = model.transform(df)
Andrey
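"Can re-use the fitted model on future data" - one way to do that is Spark's built-in ML persistence. A hedged sketch (the path and newDf are placeholders):
import org.apache.spark.ml.PipelineModel
// Save the fitted prep pipeline...
model.write.overwrite().save("/tmp/prep-pipeline")
// ...and load it back later to transform new data without re-fitting.
val reloaded = PipelineModel.load("/tmp/prep-pipeline")
val preparedNew = reloaded.transform(newDf)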
What does our pipeline look like so far?
[Diagram: Input Data → VectorAssembler → Input Data + Vectors → StringIndexer → Input Data + Category ID + Vectors. The StringIndexer, while not an ML learning algorithm, still needs to be fit; the assembler is a regular transformer - no fitting required.]
Let's train a model on our prepared data:
# Specify model
dt = DecisionTreeClassifier(labelCol="category-index", featuresCol="features")
# Fit it
dt_model = dt.fit(prepared)
# Or as part of the pipeline
pipeline_and_model = Pipeline().setStages([assembler, indexer, dt])
pipeline_model = pipeline_and_model.fit(df)
And predict the results on the same data:
pipeline_model.transform(df).select("prediction", "category-index").take(20)
What does our tree look like?
val tree = pipeline_model.stages(2)
  .asInstanceOf[DecisionTreeClassificationModel]
println(tree.toDebugString)
What does our tree look like?
ooooh
[info] If (feature 1 <= 12.5)
[info] If (feature 0 <= 33.5)
[info] If (feature 0 <= 26.5)
[info] If (feature 0 <= 23.5)
[info] If (feature 0 <= 21.5)
[info] Predict: 0.0
[info] Else (feature 0 > 21.5)
Win G
And predict the results on new data
// Option 1: Add empty/place-holder label data
// (lit is org.apache.spark.sql.functions.lit)
pipeline_model.transform(df.withColumn("category", lit("dne"))).take(20)
I guess that looks ok? Let's serve it!
● Waaaaaait - why is evaluate only on a dataframe?
● Ewwww - embedding Spark local mode in our webapp
○ The jar conflicts: they burn! & the performance will burn "later" :p
Options:
● See if your company has a model server, write export
function to match that glorious 90s C/C++ code base
● Write our own serving code & copy n’ paste the predict
code
● Use someone else’s copy n’ paste project
Ambernectar 13
But wait Spark has PMML support right?
● Spark 2.4 timeframe for pipelines -- I'm sorry (but the code's in master)
● Limited support in both models & general data prep
○ No general whole pipeline export yet either
● Serving options: write your own, license something, or
AGPL code
The state of serving is generally a mess
● One project which aims to improve this is KubeFlow
○ Goal is unifying training & serving experiences
● Despite the name targeting more than just TensorFlow
● Doesn’t work with Spark yet, but it’s on my TODO list.
Pipeline API has many models:
● org.apache.spark.ml.classification
○ LogisticRegression, DecisionTreeClassifier, GBTClassifier, etc.
● org.apache.spark.ml.regression
○ DecisionTreeRegressor, GBTRegressor, IsotonicRegression, LinearRegression, etc.
● org.apache.spark.ml.recommendation
○ ALS
carterse
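Since all of these follow the same Estimator API, swapping models is mostly a one-line change. A hedged sketch (re-using the assembler and indexer from earlier) that drops in a RandomForestClassifier where the decision tree was:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
val rf = new RandomForestClassifier()
  .setLabelCol("category-index")
  .setFeaturesCol("features")
  .setNumTrees(50) // picked arbitrarily for the sketch
val rfModel = new Pipeline().setStages(Array(assembler, indexer, rf)).fit(df)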
It’s not always a standalone microservice:
● Linear regression is awesome because I can “serve”* it
inside as an embedding in my elasticsearch / solr query
● Batch prediction is pretty OK too for some things
○ Videos you may be interested in etc.
● Sometimes hybrid systems
○ Off-line expensive models + on-line inexpensive models
○ At this point you should probably hire a data scientist though
Cross-validation
because saving a test set is effort
● Automagically* fit your model params
○ Things like max tree depth, min info gain, and other regularization.
● Because thinking is effort
● org.apache.spark.ml.tuning has the tools
● If you're going to use this for auto-tuning please please save a test set
● Otherwise your models will look awesome and perform like a Ford Pinto
Jonathan Kotta
Cross-validation
because saving a test set is effort
// ParamGridBuilder constructs an Array of parameter combinations.
// (nb here is assumed to be a NaiveBayes stage inside the pipeline being tuned.)
val paramGrid: Array[ParamMap] = new ParamGridBuilder()
  .addGrid(nb.smoothing, Array(0.1, 0.5, 1.0, 2.0))
  .build()
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(new MulticlassClassificationEvaluator().setLabelCol("category-index"))
val cvModel = cv.fit(df)
val bestModel = cvModel.bestModel
Jonathan Kotta
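And the "please please save a test set" part in code - a minimal sketch (evaluator and column names assumed from the earlier decision-tree example):
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
// Hold out data the tuner never gets to see.
val Array(train, test) = df.randomSplit(Array(0.8, 0.2), seed = 42)
val tunedModel = cv.fit(train)
// Only now look at the held-out set.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("category-index")
  .setPredictionCol("prediction")
println(evaluator.evaluate(tunedModel.transform(test)))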
False sense of security:
● A/B test please even if CV says many many $s
● Rank based things can have training bias with previous
orders
● Non-displayed options: unlikely to be chosen
● Sometimes can find previous formulaic corrections
● Sometimes we can “experimentally” determine
● Other times we just hope it’s better than nothing
● Try and make sure your ML isn’t evil or re-encoding
human biases but stronger
TensorFlowOnSpark, everyone loves mnist!
cluster = TFCluster.run(sc, mnist_dist_dataset.map_fun, args, args.cluster_size,
                        num_ps, args.tensorboard, TFCluster.InputMode.SPARK)
if args.mode == "train":
    cluster.train(dataRDD, args.epochs)
Lida
Enter: TF.Transform
● For pre-processing of your data
○ e.g. where you spend 90% of your dev time anyways
● Integrates into serving time :D
● OSS
● Runs on top of Apache Beam, but current release not yet outside of GCP
○ On master this can run on Flink, but probably has bugs currently.
○ Please don't use this in production today unless you're on GCP/Dataflow
Kathryn Yengel
DO NOT USE THIS IN PRODUCTION TODAY
● I’m serious, I don’t want to die or cause the next
financial meltdown with software I’m a part of
● By today I mean August 15 2018, but it's probably not going to be great for at least a "little while"
Vladimir Pustovit
Tambako The Jaguar
Ooor from the chicago taxi data...
for key in taxi.DENSE_FLOAT_FEATURE_KEYS:
    # Preserve this feature as a dense float, setting nan's to the mean.
    outputs[key] = transform.scale_to_z_score(inputs[key])
for key in taxi.VOCAB_FEATURE_KEYS:
    # Build a vocabulary for this feature.
    outputs[key] = transform.string_to_int(
        inputs[key], top_k=taxi.VOCAB_SIZE, num_oov_buckets=taxi.OOV_SIZE)
for key in taxi.BUCKET_FEATURE_KEYS:
    outputs[key] = transform.bucketize(inputs[key],
Defining a Transform processing function
def preprocessing_fn(inputs):
    x = inputs['x']
    y = inputs['y']
    s = inputs['s']
    x_centered = x - tft.mean(x)
    y_normalized = tft.scale_to_0_1(y)
    s_int = tft.string_to_int(s)
    return {'x_centered': x_centered,
            'y_normalized': y_normalized,
            's_int': s_int}
[Diagram: tf.Transform phases. Analyzers (mean, stddev, quantiles) are full-pass reduces implemented as a distributed data pipeline; Transforms (normalize, multiply, bucketize) are instance-to-instance pure TensorFlow ops that don't change the batch dimension. After the Analyze pass, the analyzer results are baked into the serving graph as constant tensors.]
Some common use-cases...
● Scale to ...: tft.scale_to_z_score
● Bag of Words / N-Grams: tf.string_split, tft.string_to_int, tft.ngrams
● Bucketization: tft.quantiles, tft.apply_buckets
● Feature Crosses: tf.string_join, tft.string_to_int
● ...
BEAM Beyond the JVM: Current release
● Non JVM BEAM doesn’t work outside of Google’s environment yet
● tl;dr : uses grpc / protobuf
○ Similar to the common design but with more efficient representations (often)
● But exciting new plans to unify the runners and ease the support of different languages (called SDKs)
○ See https://beam.apache.org/contribute/portability/
● If this is exciting, you can come join me on making BEAM work in Python3
○ Yes we still don’t have that :(
○ But we're getting closer & you can come join us on BEAM-2874 :D
Emma
BEAM Beyond the JVM: Master w/ experiments
[Diagram: the portability framework on master - support across the different runners and language SDKs is experimental ("*ish") pretty much everywhere.]
Nick
So what does that look like?
[Diagram: a driver process talking over grpc to worker 1 through worker K, each running its SDK harness inside a Docker container.]
Updating your model
● The real world changes
● Online learning (streaming) is super cool, but hard to
version
● Iterative batches: automatically train on new data,
deploy model, and A/B test
● But A/B testing isn't enough -- bad data can result in wrong or even illegal results (ask me after a Bud Light Lime)
So why should you test & validate
Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark
Validation
● For now checking file sizes & execution time seem like the most common best
practice (from survey)
● spark-validator is still in early stages and not ready for production use but
interesting proof of concept
● Doesn’t need to be done in your Spark job (can be done in your scripting
language of choice with whatever job control system you are using)
● Sometimes your rules will misfire and you'll need to manually approve a job - that is ok!
● Remember those property tests? Could be great Validation rules!
Photo by: Paul Schadler
Using a Spark accumulator for validation:
val (ok, bad) = (sc.accumulator(0), sc.accumulator(0))
val records = input.map{ x =>
  if (isValid(x)) ok += 1 else bad += 1
  // Actual parse logic here
}
// An action (e.g. count, save, etc.)
if (bad.value > 0.1 * ok.value) {
  throw new Exception("bad data - do not use results")
  // Optional cleanup
}
// Mark as safe
P.S: If you are interested in this check out spark-validator (still early stages).
Found Animals Foundation
Validating records read matches our expectations:
val vc = new ValidationConf(tempPath, "1", true,
  List[ValidationRule](
    new AbsoluteSparkCounterValidationRule("recordsRead", Some(30), Some(1000)))
)
val sqlCtx = new SQLContext(sc)
val v = Validation(sc, sqlCtx, vc)
// Business logic goes here
assert(v.validate(5) === true)
Photo by Dvortygirl
Common ML Specific Validation
● Number of iterations
○ did I converge super quickly or slowly compared to last time? Could indicate junk data.
● CV model performance versus previous run
● Performance on a “fixed” test set (periodically manually refresh)
● Shadow run model on input stream - % of failures or missing results
Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Spark in Action
High Performance Spark
Learning PySpark
High Performance Spark!
You can buy it today on Amazon.ca.
Not a lot of ML focus but some!
Cats love it*
*Or at least the box it comes in. If buying for a cat, get print
rather than e-book.
What about the code lab?
● https://github.com/holdenk/spark-intro-ml-pipeline-workshop Chocodyno
k thnx bye :)
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
I need to give a testing talk next
month, help a “friend” out.
Will tweet results
“eventually” @holdenkarau
Do you want more realistic
benchmarks? Share your UDFs!
http://bit.ly/pySparkUDF
Pssst: Have feedback on the presentation? Give me a
shout (holden@pigscanfly.ca) if you feel comfortable doing
so :)
Give feedback on this presentation
http://bit.ly/holdenTalkFeedback