Tech-Talk at Bay Area Spark Meetup
Apache Spark™ has rapidly become a key tool for data scientists to explore, understand, and transform massive datasets and to build and train advanced machine learning models. The question then becomes: how do I deploy these models to a production environment? How do I embed what I have learned into customer-facing data applications? Like all things in engineering, it depends.
In this meetup, we will discuss best practices from Databricks on how our customers productionize machine learning models, do a deep dive into actual customer case studies, and show live demos of a few example architectures and code in Python and Scala. We will also briefly touch on what is coming in Apache Spark 2.X with model serialization and scoring options.
Apache Spark Model Deployment
1. Apache Spark™ Model Deployment
Bay Area Spark Meetup – June 30, 2016
Richard Garris – Big Data Solutions Architect focused on Advanced Analytics
2. About Me
Richard L Garris
• rlgarris@databricks.com
• @rlgarris [Twitter]
Big Data Solutions Architect @ Databricks
12+ years designing Enterprise Data Solutions for everyone from startups to the Global 2000
Prior work experience: PwC, Google, Skytree
Ohio State Buckeye and CMU alumnus
3. About Apache Spark MLlib
Started at Berkeley AMPLab (Apache Spark 0.8)
Now (Apache Spark 2.0):
• Contributions from 75+ orgs, ~250 individuals
• Development driven by Databricks: roadmap + 50% of PRs
• Growing coverage of distributed algorithms
[Diagram: the Spark core underpins SparkSQL, Streaming, MLlib, and GraphFrames]
4. MLlib Goals
General Machine Learning library for big data
• Scalable & robust
• Coverage of common algorithms
• Leverages Apache Spark
Tools for practical workflows
Integration with existing data science tools
5. Apache Spark MLlib
• spark.mllib
• The original MLlib API (Spark < 1.4)
• A lower-level library built on Spark RDDs
• Uses LabeledPoint, Vectors, and Tuples
• Maintenance mode only as of Spark 2.X
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Load and parse the data
val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

// Build the model
val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

// Evaluate the model on training examples and compute the training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
6. Apache Spark – ML Pipelines
• spark.ml
• Spark ≥ 1.4
• spark.ml pipelines – able to create more complex models
• Integrated with DataFrames
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.regression.LinearRegression

// Let's initialize our linear regression learner
val lr = new LinearRegression()
// Now we set the parameters for the method
lr.setPredictionCol("Predicted_PE")
  .setLabelCol("PE").setMaxIter(100).setRegParam(0.1)
// We will use the new spark.ml pipeline API. If you have worked
// with scikit-learn this will be very familiar.
// (vectorizer and trainingSet come from earlier in the demo.)
val lrPipeline = new Pipeline()
lrPipeline.setStages(Array(vectorizer, lr))
// Let's first train on the entire dataset to see what we get
val lrModel = lrPipeline.fit(trainingSet)
7. The Agile Modeling Process
Set Business Goals ‣ Understand Your Data ‣ Create Hypothesis ‣ Devise Experiment ‣ Prepare Data ‣ Train-Tune-Test Model ‣ Deploy Model ‣ Measure / Evaluate Results ‣ (repeat)
8. The Agile Modeling Process
The same cycle, with the focus of this talk on the Deploy Model and Measure / Evaluate Results stages.
10. But What Really is a Model?
A model is a complex pipeline of components (see the sketch after this list):
• Data Sources
• Joins
• Featurization Logic
• Algorithm(s)
• Transformers
• Estimators
• Tuning Parameters
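To make that concrete, here is a minimal spark.ml sketch of a "model" as a pipeline: featurization logic (transformers), an estimator, and tuning parameters bundled together. The column names ("country", "amount", "label") are hypothetical, chosen only for illustration.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Featurization logic: encode a categorical column as a numeric index
// (hypothetical column names, for illustration only).
val indexer = new StringIndexer()
  .setInputCol("country")
  .setOutputCol("countryIndex")

// Assemble raw and derived columns into a single feature vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("countryIndex", "amount"))
  .setOutputCol("features")

// The algorithm (an estimator) with its tuning parameters.
val logReg = new LogisticRegression()
  .setLabelCol("label")
  .setMaxIter(100)
  .setRegParam(0.01)

// The "model" is the whole pipeline, not just the algorithm.
val pipeline = new Pipeline().setStages(Array(indexer, assembler, logReg))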
12. ML Pipelines
[Diagram: "A real pipeline" – Datasources 1–3 feed feature extraction and feature transforms 1–3, which train models 1 and 2; the two models are combined in an ensemble and evaluated]
14. Why ML persistence?
Without persistence, the prototype and the production system live in different worlds:
• Data Science – prototype in Python/R and create the pipeline: extract raw features, transform features, select key features, fit multiple models, combine results to make a prediction
• Software Engineering – re-implement the pipeline for production (Java) and deploy it, which means extra implementation work, different code paths, and synchronization overhead
Model persistence removes the re-implementation step (see the save/load sketch below).
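As a concrete illustration of what Spark 2.0's ML persistence enables, here is a minimal sketch that saves the fitted pipeline from the earlier spark.ml example and reloads it in a separate production job; the path and the newData DataFrame are hypothetical.

import org.apache.spark.ml.PipelineModel

// Save the fitted pipeline (featurization stages + fitted model) to storage.
// "/models/power-plant" is a hypothetical path.
lrModel.write.overwrite().save("/models/power-plant")

// Later, in a separate production job (Scala or Java), reload the entire
// pipeline and score new data with it -- no re-implementation required.
val sameModel = PipelineModel.load("/models/power-plant")
val scored = sameModel.transform(newData) // newData: a DataFrame of fresh records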
17. What are the Requirements for a Robust Model Deployment System?
18. Your Model Scoring Environment
Customer SLAs
• Response time
• Throughput (predictions per second)
• Uptime / reliability
Tech stack
• C / C++
• Legacy (mainframe)
• Java
• Docker
19. Model Scoring: Offline vs Online
Offline
• Internal use (batch)
• Emails, notifications (batch)
• Schedule-based or event-trigger-based
Online
• Customer waiting on the response (human real-time)
• Super low latency with a fixed response window (transactional fraud, ad bidding)
20. Model Scoring Considerations: Not All Models Return a Yes / No
Example: Login Bot Detector – different behavior depending on the probability score (see the sketch below)
• 0.0–0.4 ☞ Allow login
• 0.4–0.6 ☞ Challenge question
• 0.6–0.75 ☞ Send SMS
• 0.75–0.9 ☞ Refer to agent
• 0.9–1.0 ☞ Block
Example: Item Recommendations – the output is a ranking of the top n items
• API – send a user ID + the number of items
• Returns a sorted set of items to recommend
• Optional – pass context-sensitive information to tailor the results
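A hypothetical sketch of how the bot-detector thresholds above might map a probability score to an action; the action names are invented for illustration.

// Threshold-based actioning for the login-bot example.
sealed trait Action
case object AllowLogin extends Action
case object ChallengeQuestion extends Action
case object SendSms extends Action
case object ReferToAgent extends Action
case object Block extends Action

def actionFor(score: Double): Action = score match {
  case s if s < 0.4  => AllowLogin        // 0.0–0.4
  case s if s < 0.6  => ChallengeQuestion // 0.4–0.6
  case s if s < 0.75 => SendSms           // 0.6–0.75
  case s if s < 0.9  => ReferToAgent      // 0.75–0.9
  case _             => Block             // 0.9–1.0
}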
21. Model Updates and Versioning
• Model update frequency (nightly, weekly, monthly, quarterly)
• Model version tracking
• Model release process: Dev ‣ Test ‣ Staging ‣ Production
• Model update process
  • Benchmark (or shadow models)
  • Phase-in (20% traffic)
  • Big bang
22. Model Governance Considerations
• Models can have both reward and risk to the business
  – Well-designed models prevent fraud, reduce churn, and increase sales
  – Poorly designed models increase fraud, can damage the company's brand, and can cause compliance violations or other risks
• Models should be governed by the company's policies and procedures, laws and regulations, and the organization's management goals
• Models have to be transparent, explainable, traceable, and interpretable for auditors / regulators
• Models may need reason codes for rejections (e.g., if I decline someone credit, why?)
• Models should have an approval and release process
• Models also cannot violate any discrimination laws or use features that could be traced to religion, gender, ethnicity, or other protected attributes
23. Model A/B Testing
(A/B testing lives in the Deploy Model and Measure / Evaluate Results stages of the Agile Modeling cycle shown earlier.)
• A/B testing – comparing two versions to see which performs better
• Historical data works for evaluating models in testing, but production experiments are required to validate the model hypothesis
• The A/B framework should support the model update process:
  • Benchmark (or shadow models)
  • Phase-in (20% traffic; see the sketch below)
  • Big bang
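One common way to implement a phase-in is deterministic bucketing by a stable hash of the user ID, so each user consistently sees the same model; this is a generic sketch, not something specific to Spark or the talk.

// Route ~20% of traffic to the challenger model via a stable hash bucket.
def useNewModel(userId: String, phaseInPct: Int = 20): Boolean = {
  val bucket = math.abs(userId.hashCode % 100) // stable bucket in [0, 99]
  bucket < phaseInPct
}

// Usage (champion/challenger are hypothetical scoring functions):
// val score = if (useNewModel(userId)) challenger(features) else champion(features)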
24. Model Monitoring
• Monitoring is the process of observing the model's performance, logging its behavior, and alerting when the model degrades
• Logging should capture exactly the data fed into the model at the time of scoring (see the sketch below)
• Model alerting is critical to detect unusual or unexpected behaviors
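A minimal sketch of scoring-time logging, assuming the sameModel and newData from the persistence sketch above and a hypothetical Parquet path for the log sink.

import org.apache.spark.sql.functions.current_timestamp

// Score and stamp each record so we know exactly what the model saw and when.
val scoredWithAudit = sameModel.transform(newData)
  .withColumn("scoredAt", current_timestamp())

// Persist the model inputs alongside the outputs for later monitoring/alerting.
scoredWithAudit.select("features", "prediction", "scoredAt")
  .write.mode("append").parquet("/logs/model-scoring")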
25. Open Loop vs Closed Loop
• Open loop – a human being is involved
• Closed loop – no human involved
Model scoring – almost always closed loop; some models alert agents or customer service
Model training – usually open loop, with a data scientist in the loop to update the model
26. Online Learning
• Closed-loop, entirely machine-driven modeling is risky
• You need proper model monitoring and safeguards to prevent abuse / sensitivity to noise
• MLlib supports online learning through streaming models (k-means and logistic regression support online updates; see the sketch below)
• Alternative – use a more complex model to better fit new data rather than using online learning
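For example, MLlib's StreamingKMeans updates cluster centers on each micro-batch; the minimal sketch below assumes an existing StreamingContext ssc and a hypothetical input directory.

import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors

// Training vectors arrive as text like "[x1,x2]" in a hypothetical directory.
val trainingData = ssc.textFileStream("/data/train").map(Vectors.parse)

val skm = new StreamingKMeans()
  .setK(3)                  // number of clusters
  .setDecayFactor(0.5)      // gradually forget old data (a drift safeguard)
  .setRandomCenters(2, 0.0) // 2-dimensional random initial centers

// Each micro-batch updates the cluster centers in place (online learning).
skm.trainOn(trainingData)
ssc.start()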
34. Spark Summit EU
Brussels, October 25–27
The CFP closes at 11:59pm on July 1st.
For more information and to submit: https://spark-summit.org/eu-2016/