When a machine learning model needs to be served for interactive use cases, it is typically either wrapped inside a Flask server or deployed through an external service like SageMaker. Both approaches have flaws. In this talk, you will learn how Ray Serve uses Ray to address the limitations of these approaches and enable scalable model serving.
2. Ray Ecosystem
Ray: a system for building scalable Python (and Java) applications.
[Diagram: the Ray ecosystem — reinforcement learning, hyperparameter tuning and distributed training, serving, data analytics, and general distributed applications, all built on Ray.]
4. Big Picture: Machine Learning Lifecycle
[Diagram: the ML lifecycle — model development (data collection, cleaning & visualization, feature engineering & model design, training & validation), offline training pipelines that turn live data into trained models, and a prediction service that answers end-user queries with inference, logic, and a feedback loop.]
9. The web server approach
+ Simplicity
+ End-to-end control over how the model is served
x One query at a time
x Model loaded once, as a global variable
x No isolation
x No fine-grained replication
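To make these limitations concrete, here is a minimal sketch of the pattern (assuming Flask and a pickled scikit-learn-style model; the file name and route are hypothetical):

import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# The model is loaded once, at import time, as a process-wide global;
# every request served by this process shares this single copy.
with open("model.pkl", "rb") as f:  # hypothetical model artifact
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Flask's built-in server handles one request at a time, so a slow
    # model call blocks every other caller.
    features = request.json["features"]
    return jsonify(prediction=model.predict([features]).tolist())

if __name__ == "__main__":
    app.run()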
10. The web server approach (continued)
x Process-pool based deployment
[Diagram: an initial process forks a pool of worker processes; incoming requests are load-balanced across the workers.]
12. The web server approach (continued)
x Process-pool based deployment -> memory issue: every forked worker ends up holding its own full copy of the model.
[Diagram: the same fork-based worker pool, now with each worker process carrying its own copy of the model in memory.]
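The memory issue follows directly from the pool model. A rough sketch (the 2 GB model size and four workers are hypothetical numbers):

import multiprocessing as mp
import pickle

def serve_forever(conn):
    # Each worker deserializes its own copy of the model; fork-time
    # copy-on-write does not help once Python touches the objects.
    # Four workers x a 2 GB model ~= 8 GB of resident memory.
    with open("model.pkl", "rb") as f:  # hypothetical 2 GB artifact
        model = pickle.load(f)
    while True:
        conn.send(model.predict([conn.recv()]))

if __name__ == "__main__":
    pipes = []
    for _ in range(4):  # one full model copy per worker
        parent_end, child_end = mp.Pipe()
        mp.Process(target=serve_forever, args=(child_end,)).start()
        pipes.append(parent_end)
    # A load balancer would now round-robin requests over `pipes`.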
20. Offload inference to external service
[Diagram: the request path inside the web server — HTTP in, API validation, business logic, input transformation, inference, output transformation, more business logic, HTTP out.]
21. Offload inference to external service
[Diagram: the same pipeline split across two boxes — the web server keeps API validation and business logic, while input transformation, inference, and output transformation move into the external service.]
22. Offload inference to external service
[Diagram: in practice the external service runs only the inference step; input and output transformation are left stranded in the web server.]
23. External services are mostly "tensor-in, tensor-out"
[Diagram: the full pipeline again, annotated to show the complexity that piles up at the tensor-in, tensor-out boundary between the web server and the external service.]
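On the web server side, "tensor-in, tensor-out" means the transformations cannot move; a hedged sketch of what remains there (the URL, payload shape, and helper logic are all hypothetical):

import requests

# Hypothetical tensor-in, tensor-out endpoint on the external service.
INFERENCE_URL = "http://model-service.internal/v1/predict"

def featurize(payload: dict) -> list:
    # Input transformation stays in the web server: raw request fields
    # become the flat feature vector the model expects.
    return [payload["age"], payload["income"]]

def postprocess(scores: list) -> dict:
    # Output transformation stays here too: raw scores are mapped back
    # to a label before business logic sees them.
    return {"label": "approve" if scores[0] > 0.5 else "deny"}

def handle_request(payload: dict) -> dict:
    tensor = featurize(payload)
    resp = requests.post(INFERENCE_URL, json={"inputs": [tensor]})
    resp.raise_for_status()
    return postprocess(resp.json()["outputs"][0])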
24. External services approach
+ Separation of concerns
x Need to scale separately
x Model evaluation logic split from transformation logic
x Hard to learn
x Hard to debug
28. Kubernetes? Service Mesh?
● Serve provides a layer on top of Kubernetes
○ Easy to serve a simple model
○ Easy to serve a complex pipeline
○ API definition and the model live in the same place
○ Built-in service mesh for flexible routing (sketched below)
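A hedged sketch of that routing layer, using the experimental API of this era (exact call signatures shifted between early releases; the backends are hypothetical stand-ins):

from ray.experimental import serve

def model_v1(flask_request):
    return {"label": "cat", "version": "v1"}  # hypothetical stand-in

def model_v2(flask_request):
    return {"label": "cat", "version": "v2"}  # hypothetical stand-in

serve.init()  # connect to (or start) Serve on the Ray cluster
serve.create_endpoint("classifier", "/classify")
serve.create_backend(model_v1, "classifier:v1")
serve.create_backend(model_v2, "classifier:v2")
# Route 90% of traffic to v1 and canary the remaining 10% to v2.
serve.split("classifier", {"classifier:v1": 0.9, "classifier:v2": 0.1})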
29. Serve runs on top of Kubernetes
[Diagram: Ray Serve running across Kubernetes pods, with multiple model replicas spread over each pod.]
30. Comparison
Compared to TFServing, Serve adds:
+ Scale to any number of nodes
+ Support for arbitrary frameworks
Compared to Seldon, Serve adds:
+ Imperative pipelines
+ Flexible queuing policy
Compared to SageMaker, Serve adds:
+ Better batching
+ Deploy anywhere
Compared to "Flask", Serve adds:
+ Fine-grained replication
+ Isolated deployment
31. Try it out today
pip install ray[serve]
from ray.experimental import serve
- Ready for early adopters
- #serve channel in Slack
- Coming soon:
  - Performance benchmark
  - Deployment tutorial
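A minimal end-to-end sketch with the import shown above (signatures reflect the experimental module as of this talk and have changed in later Ray releases; the endpoint and backend names are hypothetical):

from ray.experimental import serve

def echo(flask_request):
    # A backend is just a Python function (or class): the API handling
    # and the model live in the same place.
    return {"echo": flask_request.args.get("text", "")}

serve.init()
serve.create_endpoint("echo_endpoint", "/echo")
serve.create_backend(echo, "echo:v1")
serve.link("echo_endpoint", "echo:v1")
# curl "http://127.0.0.1:8000/echo?text=hi"  ->  {"echo": "hi"}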