Richard Garris presented on ways to productionize machine learning models built with Apache Spark MLlib. He discussed serializing models using MLlib 2.X to save models for production use without reimplementation. This allows data scientists to build models in Python/R and deploy them directly for scoring. He also reviewed model scoring architectures and highlighted Databricks' private beta solution for deploying serialized Spark MLlib models for low latency scoring outside of Spark.
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2.x with Richard Garris
1. Richard Garris (Principal Solutions Architect)
Apache Spark™ MLlib 2.x:
How to Productionize your Machine
Learning Models
2. Empower anyone to innovate faster with big data.
Founded by the creators of Apache Spark.
Contributes 75% of the open source code,
10x more than any other company.
VISION
WHO WE ARE
A fully-managed data processing platform
for the enterprise, powered by Apache Spark.
PRODUCT
4. About Me
• Richard L Garris
• rlgarris@databricks.com
• Twitter @rlgarris
• Principal Data Solutions Architect @ Databricks
• 12+ years designing Enterprise Data Solutions for everyone from
startups to Global 2000
• Prior Work ExperiencePwC, Google and Skytree – the Machine
Learning Company
• Ohio State Buckeye and Masters from CMU
5. Outline
• Spark Mllib2.X
• Model Serialization
• Model Scoring SystemRequirements
• Model Scoring Architectures
• Databricks Model Scoring
6. About Apache Spark™ MLlib
• Started with Spark 0.8 in the
AMPLab in 2014
• Migration to Spark
DataFrames started with
Spark 1.3 with feature parity
within 2.X
• Contributions by 75+ orgs,
~250 individuals
• Distributed algorithms that
scale linearly with the data
7. MLlib’s Goals
• General purpose machine learning library optimizedfor big data
• Linearly scalable = 2x more machines , runtime theoretically cut in half
• Fault tolerant = resilient to the failure of nodes
• Covers the most common algorithms with distributed implementations
• Built around the concept of a Data Science Pipeline (scikit-learn)
• Written entirelyusing Apache Spark™
• Integrateswell withthe Agile Modeling Process
8. A Model is a MathematicalFunction
• A model is a function: 𝑓 𝑥
• Linear regression 𝑦 = 𝑏0 + 𝑏1 𝑥1 + 𝑏2 𝑥2
12. Problems with ProductionizingModels
Develop Prototype
Model using Python/R
Re-implement model for
production (Java)
- Extra work
- Different code paths
- Data science does not translate to production
- Slow to update models
Data Science Data Engineering
13. MLlib 2.X Model Serialization
Data Science Data Engineering
Develop Prototype
Model using Python/R
Persist model or Pipeline:
model.save(“s3n://...”)
Load Pipeline (Scala/Java)
Model.load(“s3n://…”)
Deploy in production
14. Scala
val lrModel = lrPipeline.fit(dataset)
// Save the Model
lrModel.write.save("/models/lr")
•
MLlib 2.X Model Serialization Snippet
Python
lrModel = lrPipeline.fit(dataset)
# Save the Model
lrModel.write.save("/models/lr")
•
15. Model Serialization Output
Code
// List Contents of the Model Dir
dbutils.fs.ls("/models/lr")
•
Output
Remember this is a pipeline
model and these are the stages!
16. TransformerStage (StringIndexer)
Code
// Cat the contents of the Metadata dir
dbutils.fs.head(”/models/lr/stages/00_strI
dx_bb9728f85745/metadata/part-00000")
// Display the Parquet File in the Data dir
display(spark.read.parquet(”/models/lr/sta
ges/00_strIdx_bb9728f85745/data/"))
Output
{
"class":"org.apache.spark.ml.feature.StringIndexerModel",
"timestamp":1488120411719,
"sparkVersion":"2.1.0",
"uid":"strIdx_bb9728f85745",
"paramMap":{
"outputCol":"workclassIdx",
"inputCol":"workclass",
"handleInvalid":"error"
}
}
Metadata and params
Data (Hashmap)
17. Estimator Stage (LogisticRegression)
Code
// Cat the contents of the Metadata dir
dbutils.fs.head(”/models/lr/stages/18_logr
eg_325fa760f925/metadata/part-00000")
// Display the Parquet File in the Data dir
display(spark.read.parquet("/models/lr/sta
ges/18_logreg_325fa760f925/data/"))
Output
Model params
Intercept + Coefficients
{"class":"org.apache.spark.ml.classification.LogisticRegressionModel",
"timestamp":1488120446324,
"sparkVersion":"2.1.0",
"uid":"logreg_325fa760f925",
"paramMap":{
"predictionCol":"prediction",
"standardization":true,
"probabilityCol":"probability",
"maxIter":100,
"elasticNetParam":0.0,
"family":"auto",
"regParam":0.0,
"threshold":0.5,
"fitIntercept":true,
"labelCol":"label” }}
18. Output
Decision Tree Splits
Estimator Stage (DecisionTree)
Code
// Display the Parquet File in the Data dir
display(spark.read.parquet(”/models/dt/stages/18_dtc_3d614bcb3ff825/data/"))
// Re-save as JSON
spark.read.parquet("/models/dt/stages/18_dtc_3d614bcb3ff825/data/").json((”/models/json/dt").
20. What are the Requirementsfor
a Robust Model Deployment
System?
21. Model ScoringEnvironment Examples
• In Web Applications / EcommercePortals
• Mainframe / Batch ProcessingSystems
• Real-TimeProcessingSystems/ Middleware
• Via API / Microservice
• Embeddedin Devices (Mobile Phones, Medical Devices,Autos)
22. Hidden Technical Debt in ML Systems
“Hidden Technical Debt in Machine
Learning Systems “, Google NIPS 2015
“Hidden Technical Debt in Machine Learning Systems “, Google NIPS 2015
26. Model A/B Testing, Monitoring, Updates
• A/B testing – comparing two versions to see what performs better
• Monitoring is the process of observing the model’s performance, logging it’s
behavior and alerting when the model degrades
• Logging should log exactly the data feed into the model at the time of scoring
• Model update process
• Benchmark (or Shadow Models)
• Phase-In (20% traffic)
• Avoid Big Bang
27. Consider the Scoring Environment
Customer SLAs
•Response time
•Throughput
(predictions per
second)
•Uptime / Reliability
Tech Stack
–C / C++
–Legacy (mainframe)
–Java
28. Batch Real-Time
Scoringin Batch vs Real-Time
• Synchronous
• Could be Seconds:
– Customer is waiting
(human real-time)
• Subsecond:
– High Frequency Trading
– Fraud Detection on the Swipe
• Asynchronous
• Internal Use
• Triggers can be event based on time
based
• Used for Email Campaigns,
Notifications
29. Open Loop – human being involved
Closed Loop – no human involved
• Model Scoring – almost always
closed loop, some open loop e.g.
alert agents or customer service
• Model Training – usually open loop
with a data scientist in the loop to
update the model
Online Learning and Open / Closed Loop
• Online is closed loop, entirely
machine driven but modeling is
risky
• need to have proper model
monitoring and safeguards to
prevent abuse / sensitivity to noise
• MLlib supports online through
streaming models (k-means, logistic
regression support online)
• Alternative – use a more complex
model to better fit new data rather
than using online learning
Open / ClosedLoop Online Learning
30. Model Scoring– Bot Detection
Not All Models Return Boolean – e.g. a Yes / No
Example: Login Bot Detector
Different behavior depending on probability that use is a bot
0.0-0.4 ☞ Allow login
0.4-0.6 ☞ Send Challenge Question
0.6 to 0.75 ☞ Send SMS Code
0.75 to 0.9 ☞ Refer to Agent
0.9 - 1.0 ☞ Block
31. Model Scoring– Recommendations
Output is a ranking of the top n items
API – send user ID + number of items
Returnsorted set of items to recommend
Optional –
pass contextsensitive informationto tailor results
33. Architecture Option A
PrecomputePredictions using Spark and Serve fromDatabase
Train ALS Model
Send Email Offers
to Customers
Save Offers to
NoSQL
Ranked Offers
Display Ranked
Offers in Web /
Mobile
Recurring
Batch
34. Architecture Option B
Spark Streamand Score using an API with CachedPredictions
Web Activity Logs
Kill User’s Login
SessionCompute Features Run Prediction
Streaming
Cache Predictions API Check
35. Architecture Option C
Train with Spark and Score Outside of Spark
Train Model in
Spark
Save Model to S3
/ HDFS
New Data
Copy
Model to
Production
Predictions
Load coefficients and
intercept from file
37. Databricks Model Scoring
• Based on ArchitectureOption C
• Goal: DeployMLlib model outside of Apache Spark and
Databricks.
• Easy to Embed in Existing Environments
• Low Latency and Complexity
• Low Overhead
38. • Train Model in Databricks
– Call Fit on Pipeline
– Save Model as JSON
• Deploy model in external system
– Add dependency on “dbml-local”
package (without Spark)
– Load model from JSON at startup
– Make predictions in real time
Databricks Model Scoring
Code
// Fit and Export the Model in Databricks
val lrModel = lrPipeline.fit(dataset)
ModelExporter.export(lrModel, " /models/db ")
// In Your Application (Scala)
import com.databricks.ml.local.ModelImport
val lrModel= ModelImport.import("s3a:/...")
val jsonInput = ...
val jsonOutput= lrModel.transform(jsonInput)
39. Databricks Model ScoringPrivate Beta
• Private BetaAvailable for Databricks Customers
• Available on Databricks using Apache Spark 2.1
• Only logistic regressionavailable now
• Additional Estimatorsand Transformersin Progress