ML Pipelines with Apache Spark and Apache Beam - Ottawa Reactive Meetup, August 16 2018
1. ML Pipelines with Apache
Spark & a little Apache Beam
Ottawa Reactive Meetup
2018
Hella-Legit
2. Who am I?
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google focused on OSS Big Data
● Apache Spark PMC (think committer with tenure)
● Contributor to a lot of other projects (including BEAM)
● Previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of High Performance Spark & Learning Spark (+ more)
● Twitter: @holdenkarau
● Slideshare http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Github https://github.com/holdenk
● Related Spark Videos http://bit.ly/holdenSparkVideos
3.
4. Who do I think you all are?
● Nice people*
● Mostly engineers
● Familiar with one of Java, Scala, or Python
● May or may not know Apache Spark
Amanda
5. What is in store for our adventure?
● Why train models on distributed systems
○ Have big data? It’s better than down sampling (more data often wins
over better algorithms)
○ Have small data? Enjoy fast hyperparameter tuning and taking an
early coffee break (but not too long)
● Why Apache Spark and Apache Beam in general
○ Answer questions in minutes instead of days* (*some of the time)
● “Classic” ML pipelines & then some deep learning
Ada Doglace
6. What is out of scope for today:
● The details backing these algorithms
○ If you ask I’ll just say “gradient descent” and run away after throwing a
smoke bomb on the floor.
● Questions about stack traces (j/k)
Ada Doglace
7. So why might you be doing this?
● Maybe you’ve built a system with “hand tuned” weights
● Your static list of [X, Y, Z] no longer cuts it
● Your system is overwhelmed with abuse & your budget
for handling it is less than an intern.
● You want a new job and ML sounds nicer than Perl on
the resume nowadays
Amanda
8. Why did I get into this?
● I built a few search systems
● We spent a lot of time… guessing… I mean tuning
● We hired some smart people from Google
● Added ML magic
● Things went downhill from there (and then uphill at
another local maximum later)
9. What tools are we going to use today?
● Apache Spark - Model Training (plus fits into your ETL)
● emacs/vim - looking at random output
● spark-testing-base - You still need unit tests
● (sort of) spark-validator - Validating your jobs
● csv files - hey at least it's not XML
● XML - ahhh crap
Demos will be in Scala (but I'll try and avoid the odd things
like _s) & you can stop me if it's confusing.
Mohammed Mustafa
10. What is Spark?
● General purpose distributed system
○ Built in Scala with an FP inspired API
● Apache project (one of the most
active)
● Much faster than Hadoop
Map/Reduce
● Good when too big for a single
machine
● Built on top of two abstractions for
distributed data: RDDs & Datasets
12. Why people come to Spark:
Well this MapReduce
job is going to take
16 hours - how long
could it take to learn
Spark?
dougwoods
13. Why people come to Spark:
My DataFrame won’t fit
in memory on my cluster
anymore, let alone my
MacBook Pro :( Maybe
this Spark business will
solve that...
brownpau
15. The different pieces of Spark
[Diagram: Spark components]
Core: Apache Spark, with Scala, Java, Python, & R APIs
On top: SQL, DataFrames & Datasets; Structured Streaming; Streaming; Spark ML; MLLib; GraphX & bagel; GraphFrames
Paul Hudson
16. Required: Word count (in python)
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x + y))
word_count.saveAsTextFile("output")
Photo By: Will
Keightley
17. Companion notebook funtimes:
● Small companion Jupyter notebook to explore with:
○ Python: http://bit.ly/hkMLExample
○ Scala: http://bit.ly/sparkMLScalaNB
● If you want to use it you will need access to Apache Spark
○ Install from http://spark.apache.org
○ Or get access to one of the online notebook environments (Google
Dataproc, DataBricks Cloud, Microsoft Spark HDInsights Cluster
Notebook, etc.)
David DeHetre
18. Transformers, Estimators and Pipelines
● Transformers transform a DataFrame into another
● Estimators can be trained on a DataFrame to produce a
transformer
● Pipelines chain together multiple transformers and
estimators
A.Davey
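A minimal sketch of the estimator/transformer split described above, assuming the df we load on the next slides (column names are illustrative):

import org.apache.spark.ml.feature.StringIndexer

// StringIndexer is an estimator: fit() scans the data to build the string -> index mapping
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("category-index")
val indexerModel = indexer.fit(df)
// The fitted model is a transformer: transform() just adds the new column
val indexed = indexerModel.transform(df)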
19. Let’s start with loading some data
● Genuine big data, doesn’t fit on a floppy disk
○ It’s ok if your inputs do fit on a floppy disk, buuuuut more data
generally works better
● In all seriousness, it's not a bad practice to down-sample
first while you're building your pipeline so you can find
your errors fast (Spark pipelines discard some type
information -- sorry!); minimal sketch below
Jess Johnson
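The minimal sketch mentioned above - just Spark's built-in sample(), with an illustrative fraction and seed; drop it once the pipeline works:

// Develop against ~1% of the data, then remove before the real run
val sampled = df.sample(false, 0.01, 42)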
20. Loading with sparkSQL & spark-csv
sqlContext.read returns a DataFrameReader
We can specify general properties & data specific options
● option("key", "value")
○ spark-csv ones we will use are header & inferSchema
● format("formatName")
○ built in formats include parquet, jdbc, etc. Today we will use
com.databricks.spark.csv
● load("path")
Jess Johnson
21. Loading with sparkSQL & spark-csv
val df = sqlContext.read
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("resources/adult.data")
Jess Johnson
22. Let's explore training a Decision Tree
● Step 1: Data loading (done!)
● Step 2a: Data prep (produce features, remove complete
garbage, etc.)
● Step 2b: Data prep (select features, etc.)
● Step 3: Train
● Step 4: Predict
23. Data prep / cleaning
● We need to predict a double (can be 0.0, 1.0, but type
must be double)
● We need to train with a vector of features**
Huang
Yun
Chung
** There is work to allow images and other things too.
24. Data prep / cleaning continued
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Combines a list of double input features into a vector
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "education-num"))
  .setOutputCol("features")
// String indexer converts a set of strings into doubles
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("category-index")
// Can be used to combine pipeline components together
val pipeline = new Pipeline().setStages(Array(assembler, indexer))
Huang
Yun
Chung
25. So a bit more about that pipeline
● Each of our previous components has a "fit" & "transform"
stage
● Constructing the pipeline this way makes it easier to
work with (only need to call one fit & one transform)
● Can re-use the fitted model on future data
model = pipeline.fit(df)
prepared = model.transform(df)
Andrey
26. What does our pipeline look like so far?
[Diagram: Input Data → Assembler → Input Data + Vectors → StringIndexer → Input Data + Category ID + Vectors]
The StringIndexer: while not an ML learning algorithm, it still needs to be fit.
The Assembler: a regular transformer - no fitting required.
27. Let's train a model on our prepared data:
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier

# Specify model
dt = DecisionTreeClassifier(labelCol="category-index", featuresCol="features")
# Fit it
dt_model = dt.fit(prepared)
# Or as part of the pipeline
pipeline_and_model = Pipeline().setStages([assembler, indexer, dt])
pipeline_model = pipeline_and_model.fit(df)
28. And predict the results on the same data:
pipeline_model.transform(df).select("prediction",
"category-index").take(20)
29. What does our tree look like?
val tree = pipeline_model.stages(2)
  .asInstanceOf[DecisionTreeClassificationModel]
println(tree.toDebugString)
30. What does our tree look like?
ooooh
[info] If (feature 1 <= 12.5)
[info] If (feature 0 <= 33.5)
[info] If (feature 0 <= 26.5)
[info] If (feature 0 <= 23.5)
[info] If (feature 0 <= 21.5)
[info] Predict: 0.0
[info] Else (feature 0 > 21.5)
Win G
31. And predict the results on new data
// Option 1: Add empty/place-holder label data
pipeline_model.transform(df.withColumn("category", lit("dne")))
  .take(20)
32. I guess that looks ok? Lets serve it!
● Waaaaaait - why is evaluate only on a dataframe?
● Ewwww - embedding Spark local mode in our webapp
○ The jar conflicts: they burn! & the performance will burn “later” :p
Options:
● See if your company has a model server, write export
function to match that glorious 90s C/C++ code base
● Write our own serving code & copy n’ paste the predict
code
● Use someone else’s copy n’ paste project
Ambernectar 13
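If you go the "write our own serving code" route, the least copy-paste-y version is often to persist the fitted pipeline and reload it elsewhere. A minimal sketch, assuming the pipeline_model from earlier and a hypothetical path (the loading side still needs a SparkSession, so this dodges the copy n' paste but not the jar conflicts):

import org.apache.spark.ml.PipelineModel

// Persist the fitted pipeline (path is hypothetical)
pipeline_model.write.overwrite().save("/models/income-dt/v1")

// ...in another process: reload and use it for (batch) prediction
val reloaded = PipelineModel.load("/models/income-dt/v1")
val scored = reloaded.transform(newData)  // newData: a DataFrame with the same input columns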
33. But wait Spark has PMML support right?
● Spark 2.4 timeframe for pipelines -- I'm sorry (but the
code's in master)
● Limited support in both models & general data prep
○ No general whole pipeline export yet either
● Serving options: write your own, license something, or
AGPL code
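What does exist today is per-model export for some of the old RDD-based MLlib models (not the whole ml pipelines above). A rough sketch with mllib's KMeans, which implements PMMLExportable:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Old-school mllib: trains on an RDD of Vectors rather than a DataFrame
val vectors = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(5.0, 6.0)))
val kmeansModel = KMeans.train(vectors, 2, 10)
// Limited PMML export is available for a handful of mllib models
kmeansModel.toPMML("/tmp/kmeans.pmml")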
34. The state of serving is generally a mess
● One project which aims to improve this is KubeFlow
○ Goal is unifying training & serving experiences
● Despite the name targeting more than just TensorFlow
● Doesn’t work with Spark yet, but it’s on my TODO list.
35. Pipeline API has many models:
● org.apache.spark.ml.classification
○ BinaryLogisticRegressionClassification, DecisionTreeClassifier,
GBTClassifier, etc.
● org.apache.spark.ml.regression
○ DecisionTreeRegressor, GBTRegressor, IsotonicRegression,
LinearRegression, etc.
● org.apache.spark.ml.recommendation
○ ALS
carterse
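They all follow the same estimator/transformer shape. For example, a minimal ALS sketch, assuming a ratings DataFrame with illustrative column names:

import org.apache.spark.ml.recommendation.ALS

// ratings: DataFrame with integer user & item ids plus a numeric rating
val als = new ALS()
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
val alsModel = als.fit(ratings)
val recs = alsModel.transform(ratings)  // adds a "prediction" column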
36. It’s not always a standalone microservice:
● Linear regression is awesome because I can “serve”* it
inside as an embedding in my elasticsearch / solr query
● Batch prediction is pretty OK too for some things
○ Videos you may be interested in, etc.
● Sometimes hybrid systems
○ Off-line expensive models + on-line inexpensive models
○ At this point you should probably hire a data scientist though
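A rough sketch of the "serve it inside the query" idea: pull the learned weights out of a fitted spark.ml LinearRegressionModel and hand them to the search engine as boosts (lrModel and the feature names here are hypothetical):

import org.apache.spark.ml.regression.LinearRegressionModel

// lrModel: a fitted LinearRegressionModel over per-document ranking features
val weights = lrModel.coefficients.toArray
val featureNames = Array("text_match", "recency", "popularity")  // illustrative
// e.g. feed these into an elasticsearch function_score / boost expression
val boosts = featureNames.zip(weights).toMap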
37. Cross-validation
because saving a test set is effort
● Automagically* fit your model params
○ Things like max tree depth, min info gain, and other regularization.
● Because thinking is effort
● org.apache.spark.ml.tuning has the tools
● If you're going to use this for auto-tuning please please
save a test set (one-line sketch below)
● Otherwise your models will look awesome and perform
like a Ford Pinto
Jonathan Kotta
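The one-line sketch mentioned above - hold data out before any auto-tuning, assuming the df from earlier:

// Hold out 20% up front; only touch `test` once, at the very end
val Array(train, test) = df.randomSplit(Array(0.8, 0.2), seed = 42)
// Fit the CrossValidator (next slides) on `train`, report final numbers on `test`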
41. Cross-validation
because saving a test set is effort
// ParamGridBuilder constructs an Array of parameter combinations.
val paramGrid: Array[ParamMap] = new ParamGridBuilder()
  .addGrid(nb.smoothing, Array(0.1, 0.5, 1.0, 2.0))  // nb: a NaiveBayes stage in the pipeline
  .build()
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEstimatorParamMaps(paramGrid)
  // An evaluator is required before fit(); label column assumed from earlier
  .setEvaluator(new MulticlassClassificationEvaluator().setLabelCol("category-index"))
val cvModel = cv.fit(df)
val bestModel = cvModel.bestModel
Jonathan Kotta
42. False sense of security:
● A/B test please even if CV says many many $s
● Rank based things can have training bias with previous
orders
● Non-displayed options: unlikely to be chosen
● Sometimes can find previous formulaic corrections
● Sometimes we can “experimentally” determine
● Other times we just hope it’s better than nothing
● Try and make sure your ML isn’t evil or re-encoding
human biases but stronger
44. Enter: TF.Transform
● For pre-processing of your data
○ e.g. where you spend 90% of your dev time anyways
● Integrates into serving time :D
● OSS
● Runs on top of Apache Beam, but the current release doesn't yet run outside of GCP
○ On master this can run on Flink, but probably has bugs currently.
○ Please don't use this in production today unless you're on
GCP/Dataflow
Kathryn Yengel
45. DO NOT USE THIS IN PRODUCTION TODAY
● I’m serious, I don’t want to die or cause the next
financial meltdown with software I’m a part of
● By today I mean August 15 2018, but it's probably not
going to be great for at least a "little while"
Vladimir Pustovit
Tambako The Jaguar
46. Ooor from the chicago taxi data...
for key in taxi.DENSE_FLOAT_FEATURE_KEYS:
  # Preserve this feature as a dense float, setting nan's to the mean.
  outputs[key] = transform.scale_to_z_score(inputs[key])
for key in taxi.VOCAB_FEATURE_KEYS:
  # Build a vocabulary for this feature.
  outputs[key] = transform.string_to_int(
      inputs[key], top_k=taxi.VOCAB_SIZE, num_oov_buckets=taxi.OOV_SIZE)
for key in taxi.BUCKET_FEATURE_KEYS:
  # Second argument (bucket count) assumed - this line was truncated on the slide
  outputs[key] = transform.bucketize(inputs[key], taxi.FEATURE_BUCKET_COUNT)
47. Defining a Transform processing function
def preprocessing_fn(inputs):
  x = inputs['x']
  y = inputs['y']
  s = inputs['s']
  x_centered = x - tft.mean(x)
  y_normalized = tft.scale_to_0_1(y)
  s_int = tft.string_to_int(s)
  return {'x_centered': x_centered,
          'y_normalized': y_normalized,
          's_int': s_int}
50. Some common use-cases...
● Scale to ...: tft.scale_to_z_score
● Bag of Words / N-Grams: tft.ngrams, tft.string_to_int, tf.string_split
● Bucketization: tft.apply_buckets, tft.quantiles
● Feature Crosses: tft.string_to_int, tf.string_join
51. BEAM Beyond the JVM: Current release
● Non JVM BEAM doesn’t work outside of Google’s environment yet
● tl;dr : uses grpc / protobuf
○ Similar to the common design but with more efficient representations (often)
● But exciting new plans to unify the runners and ease the support of different
languages (called SDKs)
○ See https://beam.apache.org/contribute/portability/
● If this is exciting, you can come join me on making BEAM work in Python3
○ Yes we still don’t have that :(
○ But we're getting closer & you can come join us on BEAM-2874 :D
Emma
52. BEAM Beyond the JVM: Master w/ experiments
[Diagram: portability framework support across the runners on master, with several pieces marked *ish (experimental)]
Nick
53. So what does that look like?
[Diagram: a driver talking to workers 1 through K, each running in a Docker container and communicating over gRPC]
54. Updating your model
● The real world changes
● Online learning (streaming) is super cool, but hard to
version
● Iterative batches: automatically train on new data,
deploy model, and A/B test
● But A/B testing isn’t enough -- bad data can result in
wrong or even illegal results (ask me after a bud light
lime)
55. So why should you test & validate
Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark
56. Validation
● For now checking file sizes & execution time seem like the most common best
practice (from survey)
● spark-validator is still in early stages and not ready for production use but
interesting proof of concept
● Doesn’t need to be done in your Spark job (can be done in your scripting
language of choice with whatever job control system you are using)
● Sometimes your rules will misfire and you'll need to manually approve a job
- that is ok!
● Remember those property tests? Could be great Validation rules!
Photo by:
Paul Schadler
57. Using a Spark accumulator for validation:
val (ok, bad) = (sc.accumulator(0), sc.accumulator(0))
val records = input.map{ x =>
  if (isValid(x)) ok += 1 else bad += 1
  x // Actual parse logic here (returning the parsed record)
}
// An action (e.g. count, save, etc.) must run before the accumulator values are populated
records.count()
if (bad.value > 0.1 * ok.value) {
  throw new Exception("bad data - do not use results")
  // Optional cleanup
}
// Mark as safe
P.S: If you are interested in this check out spark-validator (still early stages).
Found Animals Foundation
58. Validating records read matches our expectations:
val vc = new ValidationConf(tempPath, "1", true,
  List[ValidationRule](
    new AbsoluteSparkCounterValidationRule("recordsRead", Some(30), Some(1000))))
val sqlCtx = new SQLContext(sc)
val v = Validation(sc, sqlCtx, vc)
// Business logic goes here
assert(v.validate(5) === true)
Photo by Dvortygirl
59. Common ML Specific Validation
● Number of iterations
○ did I converge super quickly or slowly compared to last time? Could indicate junk data.
● CV model performance versus previous run
● Performance on a “fixed” test set (periodically manually refresh)
● Shadow run model on input stream - % of failures or missing results
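A rough sketch of the "fixed test set" check, assuming a fitted pipeline model, a held-out fixedTestSet DataFrame, and a previousAccuracy recorded from an earlier run (all names illustrative):

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("category-index")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(model.transform(fixedTestSet))
// Fail the job if performance regressed noticeably versus the last run
if (accuracy < previousAccuracy - 0.05) {
  throw new Exception(s"Model accuracy regressed: $accuracy - do not deploy")
}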
60. Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Spark in Action
High Performance Spark
Learning PySpark
61. High Performance Spark!
You can buy it today on Amazon.ca.
Not a lot of ML focus but some!
Cats love it*
*Or at least the box it comes in. If buying for a cat, get print
rather than e-book.
62. What about the code lab?
● https://github.com/holdenk/spark-intro-ml-pipeline-workshop Chocodyno
63. k thnx bye :)
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
I need to give a testing talk next
month, help a “friend” out.
Will tweet results
“eventually” @holdenkarau
Do you want more realistic
benchmarks? Share your UDFs!
http://bit.ly/pySparkUDF
Pssst: Have feedback on the presentation? Give me a
shout (holden@pigscanfly.ca) if you feel comfortable doing
so :)
Give feedback on this presentation
http://bit.ly/holdenTalkFeedback