Production and Beyond: Deploying and Managing Machine Learning Models
1. Production and Beyond: Deploying and Managing Machine Learning Models
Rajat Arya, Senior Product Manager
Alice Zheng, Director of Data Science
2. What is Production?
• Deployment: easily serve live predictions.
• Evaluation: measuring quality of deployed models.
• Monitoring: tracking model quality & operations.
• Management: choosing between deployed models.
3. Lifecycle of ML in Production
Evaluation
Monitoring
Deployment
Management
5. The Setup
Suppose we are building a website with product
recommendations, trained using Amazon reviews.
• 34.6M reviews
• 2.4M products
• 6.6M users
8. Batch Training: DIY
• Distributed: use the entire cluster efficiently
• Scalable: scale nodes up or down
• Co-located with data: no data transfer
• Easy to schedule, launch, and monitor: operational metrics, dashboard, alarming
[Diagram: Historical Data → Batch training → Model]
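A minimal sketch of that batch-training step, written against the GraphLab Create Python API of the time (the S3 path and column names are placeholders, and exact argument names may differ):

```python
import graphlab as gl

# Load the historical Amazon review data (placeholder path).
reviews = gl.SFrame('s3://my-bucket/amazon-reviews/')

# Train a recommender from (user, product, rating) triples.
model = gl.recommender.create(reviews,
                              user_id='user_id',
                              item_id='product_id',
                              target='rating')

# Persist the trained model so the serving layer can pick it up later.
model.save('s3://my-bucket/models/recommender-nightly')
```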
12. Real-time Predictions: DIY
• Low latency: fast predictions, with caching
• Ease of integration: REST endpoints, language independence
• Scalable: scale up or down
• Fault tolerant: replicated models
• Maintainable: easy to update / deploy models; alarming, metrics
[Diagram: Live Data → Real-time predictions → Predictions]
13. Dato Predictive Services Architecture
[Diagram: clients (website, mobile, browser, etc.) → REST API → Dato Predictive Services (distributed cache + model)]
14. Dato Predictive Services
With one line of code, launch a fault-tolerant, scalable, robust, and maintainable cluster that puts a service-oriented architecture on top of machine learning models.
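A sketch of what that one-line launch can look like with the GraphLab Create deployment API of that era; the service name, AWS region, and S3 state path are placeholders, and exact argument names may differ from the real API:

```python
import graphlab as gl

# Reload the model trained in the nightly batch job (placeholder path).
model = gl.load_model('s3://my-bucket/models/recommender-nightly')

# EC2 environment for the Predictive Service cluster (placeholder region / instance type).
ec2 = gl.deploy.Ec2Config(region='us-west-2', instance_type='m3.xlarge')

# One call launches the fault-tolerant, load-balanced serving cluster.
ps = gl.deploy.predictive_service.create(
    'recommender-service', ec2,
    's3://my-bucket/predictive-service/state', num_hosts=3)

# Expose the trained recommender behind a REST endpoint and publish it.
ps.add('recommend', model)
ps.apply_changes()
```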
27. Updating ML models
Why update?
• Trends and user tastes change over time
• Model performance drops
When to update?
• Track statistics of data over time
• Monitor both offline & online metrics on live data
• Update when offline metric diverges from online metrics
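A small sketch of the "update when offline and online metrics diverge" rule; the metric readings and the 15% tolerance below are made-up illustrations, not values from the talk:

```python
def should_retrain(offline_metric, online_metric, tolerance=0.15):
    """Flag a retrain when the offline evaluation no longer tracks
    the live (online) metric within a relative tolerance."""
    if online_metric == 0:
        return True
    divergence = abs(offline_metric - online_metric) / abs(online_metric)
    return divergence > tolerance

# Hypothetical daily readings: offline CTR estimate vs. observed live CTR.
offline_ctr_estimate = 0.26   # what the held-out evaluation predicts
live_ctr = 0.19               # what the deployed model actually achieves

if should_retrain(offline_ctr_estimate, live_ctr):
    print("Offline and online metrics have diverged - schedule a retrain")
```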
28. Choosing between ML models
Strategy 1 (A/B testing): select the best model and use it all the time.
• Group A: Model 1, 2000 visits, 10% CTR
• Group B: Model 2, 2000 visits, 30% CTR
• Result: everybody gets Model 2
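As a worked check on those A/B numbers, a standard two-proportion z-test on 2000 visits per group at 10% vs. 30% CTR (a generic significance test, not something prescribed by the talk):

```python
import math

def two_proportion_ztest(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for a difference in click-through rates."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value

# Group A: Model 1, 2000 visits, 10% CTR -> 200 clicks
# Group B: Model 2, 2000 visits, 30% CTR -> 600 clicks
z, p = two_proportion_ztest(200, 2000, 600, 2000)
print("z = %.1f, p = %.2g" % (z, p))  # huge z-score: Model 2 is clearly better
```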
29. Choosing between ML models
A statistician walks into a casino…
• Pay-off $1:$1000 machine: play this 85% of the time
• Pay-off $1:$200 machine: play this 10% of the time
• Pay-off $1:$500 machine: play this 5% of the time
Multi-armed bandits
30. Choosing between ML models
A statistician walks into an ML production environment…
• Model 1 (pay-off $1:$1000): use this 85% of the time (exploitation)
• Model 2 (pay-off $1:$200): use this 10% of the time (exploration)
• Model 3 (pay-off $1:$500): use this 5% of the time (exploration)
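A minimal sketch of one common bandit strategy for routing traffic between deployed models (Thompson sampling over click / no-click outcomes); this is a generic illustration, not the talk's implementation, and the simulated CTRs are made up:

```python
import random

class ThompsonSampler:
    """Pick between deployed models by sampling from a Beta posterior
    over each model's click-through rate."""
    def __init__(self, n_models):
        self.successes = [1] * n_models  # Beta(1, 1) uniform priors
        self.failures = [1] * n_models

    def choose(self):
        samples = [random.betavariate(s, f)
                   for s, f in zip(self.successes, self.failures)]
        return samples.index(max(samples))

    def record(self, model, clicked):
        if clicked:
            self.successes[model] += 1
        else:
            self.failures[model] += 1

# Simulate routing traffic to three models with true CTRs of 10%, 30%, 20%.
true_ctr = [0.10, 0.30, 0.20]
bandit = ThompsonSampler(n_models=3)
for _ in range(10000):
    m = bandit.choose()
    bandit.record(m, clicked=random.random() < true_ctr[m])
print(bandit.successes)  # traffic concentrates on the best model over time
```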
31. MAB vs. A/B testing
Why MAB?
• Continuous optimization, “set and forget”
• Maximize overall reward
Why A/B test?
• Simple to understand
• Single winner
• But tricky to do right
32. Other production considerations
• Versioning
• Logging
• Provenance
• Dashboards
• Reports
References:
• D. Sculley et al., “Machine Learning: The High Interest Credit Card of Technical Debt,” Google, 2014.
• Léon Bottou, “Two Big Challenges in Machine Learning,” invited talk, ICML 2015.
For both hyper-parameter tuning and model training, the system we are looking for should be:
• Distributed: do embarrassingly parallel things in parallel; do ML things that distribute well in parallel (see the sketch below).
• Scalable: scale nodes up or down.
• Co-located with the data: execution happens where the data lives.
• Easy to schedule, launch, and monitor: same code as the model; operational metrics, dashboard, alarming.
• Easy to integrate with Dato Predictive Services.
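As a generic illustration of the "embarrassingly parallel" point above, a sketch of running a hyperparameter grid in parallel with Python's standard library; the training function, grid values, and scoring are placeholders rather than the actual tuning code:

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import product

def train_and_score(params):
    """Placeholder: train a recommender with these hyperparameters on one
    worker and return its validation error."""
    num_factors, regularization = params
    # ... train on the data, evaluate on a held-out validation set ...
    return {"params": params, "rmse": 1.0 / num_factors + regularization}

if __name__ == "__main__":
    # Candidate grid (placeholder values).
    grid = list(product([8, 16, 32, 64], [1e-7, 1e-5, 1e-3]))

    # Each grid point is independent, so the search is embarrassingly parallel.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(train_and_score, grid))

    best = min(results, key=lambda r: r["rmse"])
    print("best hyperparameters:", best["params"])
```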
Why does this architecture meet the requirements? It is distributed, scalable, and co-located with the data, and it is easy to schedule, launch, and monitor.
Dato Distributed - one command launches a long-running cluster of machines for parallel / distributed execution of Jobs from GraphLab Create. These clusters can be launched in the cloud on AWS EC2 or on-premise on Hadoop YARN or Spark clusters.
Dato Predictive Services - with one line we deploy a fault-tolerant, scalable, robust, and maintainable cluster that puts a service-oriented architecture on top of machine learning models. We can choose to deploy on AWS EC2, or on-premise in our Hadoop YARN or Spark cluster environments.
We launch a 3-node Predictive Service deployment on AWS EC2 (m3.xlarge instances, each with 4 cores and 15 GB RAM) and deploy our recommender model, which covers roughly 6.6M users and 2.4M products. We measure operational metrics for serving a live set of recommendations for a given user and see average latency under 65 ms. Of course, average round-trip latency alone is insufficient for a production system, so we also measure the 99th-percentile (P99) latency and see it is under 100 ms.
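A sketch of how those latencies might be measured from a client against the deployment's REST endpoint; the URL, authorization header, and request payload below are placeholders, not the actual Predictive Services wire format:

```python
import statistics
import time

import requests

ENDPOINT = "https://recommender-service.example.com/query/recommend"  # placeholder
HEADERS = {"Authorization": "api_key PLACEHOLDER"}                     # placeholder

latencies_ms = []
for user_id in range(1000):
    payload = {"user_id": str(user_id), "num_items": 10}  # placeholder schema
    start = time.perf_counter()
    requests.post(ENDPOINT, json=payload, headers=HEADERS, timeout=1.0)
    latencies_ms.append((time.perf_counter() - start) * 1000.0)

latencies_ms.sort()
p99 = latencies_ms[int(0.99 * len(latencies_ms)) - 1]
print("avg = %.1f ms, P99 = %.1f ms" % (statistics.mean(latencies_ms), p99))
```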
For periodic training and distributed feature engineering we decide to use our on-premise Spark cluster, since our data is already stored on HDFS and we already use Spark DataFrames in our data engineering pipeline. So we schedule a nightly job that takes the historical data and trains our FactorizationRecommender model as a Spark job; as part of that job, the trained model is updated on the Predictive Service deployment we launched earlier. And with Dato Distributed installed on the Spark cluster, our data scientists can now run distributed hyperparameter tuning regularly and find an optimal set of parameters for the FactorizationRecommender model.
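A rough sketch of that nightly job in GraphLab Create terms; the HDFS path, column names, hyperparameter values, and the deployment-update call are placeholders and may differ from the real APIs:

```python
import graphlab as gl

# Nightly batch: retrain the recommender on the latest historical data in HDFS.
reviews = gl.SFrame('hdfs:///data/amazon-reviews/')  # placeholder path

model = gl.factorization_recommender.create(
    reviews,
    user_id='user_id', item_id='product_id', target='rating',
    num_factors=32, regularization=1e-5)  # values chosen via hyperparameter tuning

# Push the fresh model to the running Predictive Service deployment.
ps = gl.deploy.predictive_service.load('s3://my-bucket/predictive-service/state')
ps.update('recommend', model)
ps.apply_changes()
```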
So far so good; we get together in a meeting with the web team to get the frontend to start using the new recommendation system we've developed. In that meeting, one of the data scientists on the team mentions, 'I didn't tune the hyperparameters on that recommender model; there is probably an opportunity to get better results from it.' One of the software engineers asks, 'How regularly will the model be trained?' Uh-oh, we hadn't thought of those things.
The web team puts the frontend REST API work on their next sprint and the data science team goes back to think about how to incorporate hyperparameter tuning and frequent model training into the recommender application.