3. Agenda
• Motivation
• ML pipeline tools and platforms
• Machine Learning on Kubernetes
• Deep Learning Demo
• Conclusion
6. ML – Helicopter view
How good are your predictions?
• Accuracy
• Optimization
8. ML – The (enterprise) reality
• Wrangle large datasets
• Unify disparate systems
• Composability
• Manage pipeline complexity
• Improve training/serving consistency
• Improve portability
• Improve model quality
• Manage versions
[Diagram: the end-to-end ML workflow spread across many separate systems: data ingestion, data analysis, data transform, data validation, data splitting; ad-hoc training, building a model, model validation; logging, roll-out, serving, monitoring; plus distributed training, training at scale, data versioning, HP tuning, experiment tracking, and a feature store]
11. Agenda
• Motivation
• ML pipeline tools and platforms
• Machine Learning on Kubernetes
• Deep Learning Demo
• Conclusion
12. Quick comparison
• Apache Airflow is a platform to programmatically author, schedule and monitor workflows. (https://airflow.apache.org/)
• The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. (https://www.kubeflow.org/)
• TensorFlow Extended (TFX) is an end-to-end platform for deploying production ML pipelines. (https://www.tensorflow.org/tfx)
• MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility and deployment. (https://mlflow.org/)
13. How to scale to production?
• Composability
• Portability
• Scalability
18. Kubernetes is an API and agents
• The Kubernetes API provides containers with scheduling, configuration, networking, and storage
• The Kubernetes runtime manages the containers
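As a minimal illustration of that declarative API, a single Pod manifest ties scheduling, configuration, networking, and storage together in one object. This is a hedged sketch; the pod name, image, and environment variable are hypothetical, not from the talk:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-job                      # hypothetical name
spec:
  containers:
    - name: trainer
      image: gcr.io/myco/trainer:1.0   # hypothetical image
      ports:
        - containerPort: 8080          # networking
      env:
        - name: EPOCHS                 # configuration
          value: "10"
      resources:
        limits:
          cpu: "2"                     # input to the scheduler
      volumeMounts:
        - name: data
          mountPath: /data             # storage
  volumes:
    - name: data
      emptyDir: {}
```

You describe the desired state; the Kubernetes agents (scheduler, kubelet) make it so.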
19. Agenda
• Motivation
• ML pipeline tools and platforms
• Machine Learning on Kubernetes
• Deep Learning Demo
• Conclusion
20. Machine Learning on Kubernetes
• Kubernetes-native
• Run wherever k8s runs
• Move between local – dev – test – prod – cloud
• Use k8s to manage ML tasks
• CRDs (user-defined types) for distributed training
• Adopt k8s patterns
• Microservices
• Manage infrastructure declaratively
• Support for multiple ML frameworks
• TensorFlow, PyTorch, scikit-learn, XGBoost, etc.
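Each supported framework gets its own CRD. As one hedged example, a distributed PyTorch run can be declared with the Kubeflow training operator's PyTorchJob resource; the job name, image, and replica counts below are illustrative:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: dist-train                 # hypothetical name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: gcr.io/myco/train:1.0   # hypothetical image
    Worker:
      replicas: 3                  # scale out by editing one field
      template:
        spec:
          containers:
            - name: pytorch
              image: gcr.io/myco/train:1.0
```

The operator watches these objects and creates the master/worker pods, applying the same declarative pattern k8s uses everywhere else.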
24. Composability
• Build and deploy re-usable, portable, scalable machine learning workflows based on Docker containers.
• Use the libraries/frameworks of your choice
Example: the Kubeflow "deployer" component lets you deploy a model as a plain TF Serving model server:
https://github.com/kubeflow/pipelines/tree/master/components/kubeflow/deployer
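Under the hood, such a deployer boils down to ordinary Kubernetes objects. A minimal sketch of serving a saved model with the stock TF Serving image (the deployment name, model name, and model path are hypothetical):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-model-server            # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-model-server
  template:
    metadata:
      labels:
        app: my-model-server
    spec:
      containers:
        - name: serving
          image: tensorflow/serving          # stock TF Serving image
          args:
            - --model_name=my_model                  # hypothetical
            - --model_base_path=gs://my-bucket/model # hypothetical
          ports:
            - containerPort: 8501            # TF Serving REST API port
```

Paired with a Service, any other pipeline component can call the model over HTTP, which is what makes the components composable.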
25. Back to our ML enterprise workflow!
[Diagram: the same end-to-end workflow, now annotated with Kubeflow's METADATA and SERVING pieces: data ingestion, data analysis, data transform, data validation, data splitting; ad-hoc training, building a model, model validation; logging, roll-out, serving, monitoring; distributed training, training at scale, data versioning, HP tuning, experiment tracking, feature store]
26. Portability
[Diagram: containers for Deep Learning as a layered stack, bottom to top: Infrastructure; Host OS with NVIDIA drivers; Container runtime; TensorFlow container image. Packages inside the image: Python, TensorFlow, Keras, horovod, numpy, scipy, scikit-learn, pandas, openmpi, others… CPU: mkl; GPU: CUDA toolkit, cudnn, cublas, nccl]
ML environments that are:
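That stack can be captured in a Dockerfile. A hedged sketch of a GPU training image; the base tag, package list, and training script are illustrative, not from the talk:

```dockerfile
# Base image already bundles the CUDA toolkit, cuDNN, cuBLAS and NCCL
FROM tensorflow/tensorflow:2.15.0-gpu

# Userland scientific stack layered on top of TensorFlow
RUN pip install --no-cache-dir \
    keras \
    numpy scipy pandas scikit-learn

# Bake the training code into the image so it runs the same everywhere
COPY train.py /app/train.py
WORKDIR /app
CMD ["python", "train.py"]
```

Because the drivers stay on the host and everything above them ships in the image, the same container moves between laptop, on-prem cluster, and cloud.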
28. Scalability
• Kubernetes - Autoscaling Jobs
• Describe the job, let Kubernetes take care of the rest
• CPU, RAM, Accelerators
• TF Jobs delete themselves when finished; the node pool will auto-scale back down
[Cartoon: Data Scientist to IT Ops: "Model works great! But I need six nodes."]
Credit: @aronchick
29. Scalability
[Build-up of the previous slide: the Data Scientist hands IT Ops a declarative job spec]
apiVersion: "kubeflow.org/v1alpha1"
kind: "TFJob"
spec:
  replicaSpecs:
    replicas: 6
    CPU: 1
    GPU: 1
    containers: gcr.io/myco/myjob:1.0
Credit: @aronchick
30. Scalability
[Build-up: Kubernetes auto-scales the node pool up to six GPU nodes to run the job]
Credit: @aronchick
31. Scalability
[Build-up: "Job's done!" The TF Job deletes itself and the node pool scales back down]
Credit: @aronchick
32. Agenda
• Motivation
• ML pipeline tools and platforms
• Container > Kubernetes > Kubeflow
• Deep Learning Demo
• Conclusion
35. Recap: The “Kube”flow
• Deploy Kubernetes & Kubeflow
• Experiment in Jupyter
• Build Docker Image
• Train at Scale
• Build Model Server
• Deploy Model
• Integrate Model into App
• Operate
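The steps above can be sketched as a command sequence. Everything here is hypothetical glue (image names, manifest files, and the gateway address are placeholders; Seldon and Istio details are omitted):

```
# Build and push the training image
docker build -t gcr.io/myco/train:1.0 -f Dockerfile.train .
docker push gcr.io/myco/train:1.0

# Train at scale via the TFJob CRD
kubectl apply -f tfjob.yaml

# Build, push, and deploy the model server
docker build -t gcr.io/myco/serve:1.0 -f Dockerfile.serve .
docker push gcr.io/myco/serve:1.0
kubectl apply -f serving.yaml

# Call the model through the Istio gateway's REST API
curl -X POST http://GATEWAY_IP/api/v1.0/predictions -d @payload.json
```

Each step maps onto one bullet in the recap, with Kubernetes doing the scheduling and scaling in between.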
[Diagram: Kubernetes worker nodes #1–#3 hosting both Model Training and Model Serving. A Data Scientist works in a Jupyter Notebook; one Dockerfile builds the Training Job, another the Inference Service. Training runs in "Train Model" pods; serving runs Seldon Core Engine pods wrapping the Doppelganger model, fronted by an Istio Gateway (traffic routing) that exposes a REST API called via curl…]
36. Agenda
• Motivation
• ML pipeline tools and platforms
• Machine Learning on Kubernetes
• Deep Learning Demo
• Conclusion