1. This deck discusses how to evaluate and manage machine learning models before deploying them to production. It emphasizes offline and online evaluation to identify any gaps between the two.
2. A key step described is a "sanity check" on a new model: comparing its predictions with the current model's on a sample of real data. This reveals whether the new model improves precision and recall or worsens them.
3. After deploying a new model, ongoing monitoring is recommended to check that the new and old models still make consistent predictions on the same data, and to analyze any differences or errors. This continuous evaluation helps ensure the quality of models in production.
3. Machine Learning Workflow (CRISP-DM)
● Most Important Steps
○ Business Understanding
○ Evaluation
● Missing pieces for Production
Ref: Kenneth Jensen
5. Content Moderation
1. Item is listed
2. If the prob score is greater than the threshold value, the item is hidden and Customer Support is alerted
3. Customer Support check (see the sketch below):
violation items → Delete
normal items → Unhide
E.g. Content Moderation targets: Fake Brand, Game Account
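A minimal sketch of this flow, assuming the threshold from the later slides; model_score() here is a stand-in for the real ML service, and the item fields are hypothetical:

THRESHOLD = 0.95

def model_score(item) -> float:
    # Stand-in for the real ML service call.
    return item.get("prob", 0.0)

def on_item_listed(item, support_queue):
    # Step 2: above-threshold items are hidden and queued for review.
    if model_score(item) > THRESHOLD:
        item["hidden"] = True
        support_queue.append(item)

def on_support_decision(item, is_violation):
    # Step 3: Customer Support confirms or clears the item.
    if is_violation:
        item["deleted"] = True      # violation item -> Delete
    else:
        item["hidden"] = False      # normal item -> Unhide

queue = []
on_item_listed({"id": 1613431, "prob": 0.98}, queue)
print(queue)  # [{'id': 1613431, 'prob': 0.98, 'hidden': True}]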
6. Assumptions
● ML Service runs on all listed items
● Binary classification
● Precision is more important than recall (see the Fβ sketch below)
● We can simulate online results offline, thanks to a fast Customer Support check system
Ref: Rendezvous Architecture for Data Science in Production
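One way to encode "precision is more important than recall" in offline evaluation is a precision-weighted Fβ score with β < 1. A sketch; scikit-learn is not named in the deck and is an assumption here, and the labels are illustrative:

from sklearn.metrics import fbeta_score, precision_score, recall_score

y_true = [1, 1, 0, 1, 1]          # is_delete labels, as in the tables below
y_pred = [1, 0, 0, 1, 1]          # model decisions at the 0.95 threshold

# beta < 1 weights precision higher than recall.
print(precision_score(y_true, y_pred))        # 1.0
print(recall_score(y_true, y_pred))           # 0.75
print(fbeta_score(y_true, y_pred, beta=0.5))  # 0.9375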
7. Before Deploying a New Model to Production
[Diagram: all listing items (2019/04/11) flow through Cloud Pub/Sub into the Current Model, which emits (1) a prob score and (2) a true/false decision; sketched below]
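A minimal sketch of feeding listing items through Cloud Pub/Sub to the scoring service, assuming the google-cloud-pubsub client library; the project and topic names are hypothetical:

import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "listing-items")  # hypothetical names

def publish_listing(item_id, payload):
    # Serialize the listing item and hand it to Pub/Sub for scoring.
    data = json.dumps({"id": item_id, **payload}).encode("utf-8")
    future = publisher.publish(topic_path, data=data)
    return future.result()  # blocks until the message is accepted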
8. Sad Stories of Machine Learning Systems in Production
● Gap between Offline & Online evaluation
→ "OK! We can't know the online result, let's just deploy!"
● Data Imbalance problem
9. High-Speed Continuous Improvement
1. Easy A/B System (see the sketch below)
2. Online/Offline Sanity Check
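An "easy A/B system" can be as small as deterministic hashing of item IDs into buckets. A sketch; the 50/50 split, experiment name, and variant names are illustrative assumptions, not from the deck:

import hashlib

def ab_variant(item_id: str, experiment: str = "model-beta-rollout") -> str:
    # Hash the (experiment, item) pair so assignment is stable per experiment.
    digest = hashlib.sha256(f"{experiment}:{item_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "new_model" if bucket < 50 else "old_model"

print(ab_variant("1613431"))  # the same ID always lands in the same variant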
10. Sanity Check Before Deploy to Production
Threshold: 0.95
ID       is_delete  Model α  Model β
1613431  True       0.98     0.999
5263832  True       0.97     0.43
7213438  False      0.95     0.45
3213492  True       0.70     0.98
9201420  True       0.01     0.97
11-15. Sanity Check Before Deploy to Production
Threshold: 0.95
ID       is_delete  Model α  Model β  Verdict
1613431  True       0.98     0.999    Success! Cost sensitive (both delete, β more confident)
5263832  True       0.97     0.43     Fail! Worsens recall (α deletes, β misses)
7213438  False      0.95     0.45     Success! Improves precision (α's false positive, β correct)
3213492  True       0.70     0.98     Success! Cost sensitive (near-miss becomes a confident hit)
9201420  True       0.01     0.97     Success! Improves recall (α's total miss, β deletes)
(decision logic sketched below)
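A minimal sketch of this offline sanity check: re-score a labeled sample with both models and compare decisions at the shared threshold. The verdict strings are a simplification of the slide annotations; rows where the decision does not change (like ID 1613431) are where the deck's cost-sensitive "more confident on violations" reading applies:

THRESHOLD = 0.95

def row_verdict(is_delete, prob_alpha, prob_beta):
    alpha_hit = prob_alpha >= THRESHOLD   # would the current model hide it?
    beta_hit = prob_beta >= THRESHOLD     # would the candidate model hide it?
    if alpha_hit == beta_hit:
        return "No decision change"
    if is_delete:                         # a true violation
        return "Improves recall" if beta_hit else "Fail! Worsens recall"
    return "Improves precision" if alpha_hit else "Fail! Worsens precision"

rows = [
    (1613431, True, 0.98, 0.999),
    (5263832, True, 0.97, 0.43),
    (7213438, False, 0.95, 0.45),
    (3213492, True, 0.70, 0.98),
    (9201420, True, 0.01, 0.97),
]
for item_id, is_delete, a, b in rows:
    print(item_id, row_verdict(is_delete, a, b))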
16. Sanity Check Before Deploy to Production
                                  Confidence/Prob: High ↑   Confidence/Prob: Low ↓
Deleted items (violations)        Improve 👍!!               Bad Model ☠
Undeleted items (normal items)    Bad Model ☠                Improve 👍!!
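The same matrix as a tiny function, reading "high confidence" as the new model's score relative to the threshold; the helper name is mine:

def judge(was_deleted: bool, new_prob: float, threshold: float = 0.95) -> str:
    # For items Customer Support already labeled, does the new model's
    # confidence point the right way?
    high_confidence = new_prob >= threshold
    if was_deleted:                     # known violation
        return "Improve!" if high_confidence else "Bad Model"
    return "Bad Model" if high_confidence else "Improve!"

print(judge(True, 0.999))   # Improve!
print(judge(False, 0.999))  # Bad Model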
17. Traditional Server-Side Design
Ref: Rendezvous Architecture for Data Science in Production
[Diagram: client calls a single MODEL, which returns {"name": "Dog", "prob": "92.5"}]
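A minimal single-model endpoint in this traditional design. The deck names no framework; Flask, the route, and the constant response are illustrative assumptions:

from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # A real service would run the model on the request payload.
    return jsonify({"name": "Dog", "prob": 92.5})

if __name__ == "__main__":
    app.run(port=8080)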
18. With Load Balancer
Ref: Rendezvous Architecture for Data Science in Production
[Diagram: client → Load Balancer → Model 1 / Model 2 / Model 3, returning {"name": "Dog", "prob": "92.5"}]
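In the rendezvous style, the balancer can fan a request out to every model, serve the primary's answer, and log the challengers for offline comparison. A sketch with hypothetical model callables:

import logging

def rendezvous_predict(request, predict_fns, primary="model_1"):
    # Every model scores the request; only the primary's answer is served.
    results = {name: fn(request) for name, fn in predict_fns.items()}
    for name, result in results.items():
        if name != primary:
            logging.info("shadow %s -> %r", name, result)  # challengers are logged only
    return results[primary]

models = {
    "model_1": lambda req: {"name": "Dog", "prob": 92.5},
    "model_2": lambda req: {"name": "Dog", "prob": 88.0},
}
print(rendezvous_predict({"image": "..."}, models))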
19. Sanity Check After Deploy to Production
● New Model vs. Old Model (sketched below)
○ How many overlapping items get the same prob score?
○ 👀 grep the top-100 and bottom-100 scored items
○ Error Analysis (False Positive samples)
○ Feed False Positives back as Hard Negative Sampling
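A sketch of these post-deploy checks with pandas, assuming one row per item scored by both models; the column names (item_id, is_delete, prob_old, prob_new) and the score-match tolerance are assumptions:

import pandas as pd

def post_deploy_checks(df: pd.DataFrame, threshold: float = 0.95):
    # 1. How many items do the two models score (nearly) the same?
    overlap = (df.prob_old.sub(df.prob_new).abs() < 0.01).sum()
    print(f"items with matching scores: {overlap}/{len(df)}")

    # 2. Eyeball the extremes of the new model's ranking.
    print(df.nlargest(100, "prob_new"))
    print(df.nsmallest(100, "prob_new"))

    # 3. New model's false positives -> error analysis and a
    #    hard-negative pool for the next training run.
    false_positives = df[(df.prob_new >= threshold) & (~df.is_delete)]
    return false_positives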