SlideShare a Scribd company logo
1 of 20
Download to read offline
Distributed TensorFlow
on Kubernetes
資訊與通訊研究所 蔣是文 Mac Chiang
Copyright 2017 ITRI 工業技術研究院
Agenda
• Kubernetes Introduction
• Scheduling GPUs on Kubernetes
• Distributed TensorFlow Introduction
• Running Distributed TensorFlow on Kubernetes
• Experience Sharing
• Summary
2
Copyright 2017 ITRI 工業技術研究院
What’s Kubernetes
• “Kubernetes” is Greek for captain or pilot
• Experiences from Google and design by Goolge
• Kubernetes is a production-grade, open-source platform that
orchestrates the placement (scheduling) and execution of
application containers within and across computer clusters.
• Masters manage the cluster and the nodes are used to host the
running applications.
3
Copyright 2017 ITRI 工業技術研究院
Why Kubernetes
4
• Automatic binpacking
• Horizontal scaling
• Automated rollouts and rollback
• Service monitor
• Self-healing
• Service discovery and load balancing
• 100% Open source, written in Go
Copyright 2017 ITRI 工業技術研究院
Scheduling GPUs on Kubernetes
• Nvidia drivers are installed
• Turned on alpha feature gate Accelerators across the
system
▪ --feature-gates="Accelerators=true“
• Nodes must be using docker engine as the container
runtime
5
Copyright 2017 ITRI 工業技術研究院
Scheduling GPUs on Kubernetes (cont.)
6
apiVersion: v1
kind: Pod
metadata:
name: gpu-pod
spec:
containers:
-
name: gpu-container-1
image: gcr.io/google_containers/pause:2.0
resources:
limits:
alpha.kubernetes.io/nvidia-gpu: 2 # requesting 2 GPUs
-
name: gpu-container-2
image: gcr.io/google_containers/pause:2.0
resources:
limits:
alpha.kubernetes.io/nvidia-gpu: 3 # requesting 3 GPUs
Copyright 2017 ITRI 工業技術研究院
Scheduling on Different GPU Versions
7
• Labeling nodes with GPU HW type
• Specify the GPU types via Node Affinity rules
Tesla P100
Node1 Node2
2 * K80 1 * P100
Tesla K80
gpu:k80 gpu:p100
Copyright 2017 ITRI 工業技術研究院
Access to CUDA libraries from Docker
nvidia-driver
libcuda.so
8
Copyright 2017 ITRI 工業技術研究院
TensorFlow
9
• Originally developed by the Google Brain Team
within Google's Machine Intelligence research
organization
• An open source software library for numerical
computation using data flow graphs
• Nodes in the graph represent mathematical
operations, while the graph edges represent the
multidimensional data arrays (tensors)
communicated between them.
• Support one or more CPUs or GPUs in a desktop,
server, or mobile device with a single API
Copyright 2017 ITRI 工業技術研究院
Distributed TensorFlow
10
http://www.inwinstack.com/index.php/zh/2017/04/17/tensorflow-on-kubernetes/
Copyright 2017 ITRI 工業技術研究院
Distributed TensorFlow (cont.)
11
• Replication
▪ In-graph
▪ Between-graph
• Training
▪ Asynchronous
▪ Synchronous
Copyright 2017 ITRI 工業技術研究院
Distributed TensorFlow on K8S
12
• TensorFlow ecosystem
▪ https://github.com/tensorflow/ecosystem
Copyright 2017 ITRI 工業技術研究院
Distributed TensorFlow on K8S (cont.)
13
• Prepare codes for distributed training
▪ Flags for configuring the task
▪ Construct the cluster and start the server
▪ Set the device before graph construction# Construct the cluster and start the server
ps_spec = FLAGS.ps_hosts.split(",")
worker_spec = FLAGS.worker_hosts.split(",")
cluster = tf.train.ClusterSpec({
"ps": ps_spec,
"worker": worker_spec})
server = tf.train.Server(
cluster, job_name=FLAGS.job_name,
task_index=FLAGS.task_index)
if FLAGS.job_name == "ps":
server.join()
with tf.device(tf.train.replica_device_setter(
worker_device="/job:worker/task:%d" % FLAGS.task_index, Cluster=cluster)):
# Construct the TensorFlow graph.
# Run the TensorFlow graph.
# Flags for configuring the task
flags.DEFINE_integer("task_index", None,
"Worker task index, should be >= 0. task_index=0 is "
"the master worker task that performs the variable "
"initialization.")
flags.DEFINE_string("ps_hosts", None,
"Comma-separated list of hostname:port pairs")
flags.DEFINE_string("worker_hosts", None,
"Comma-separated list of hostname:port pairs")
flags.DEFINE_string("job_name", None, "job name: worker or ps")
Copyright 2017 ITRI 工業技術研究院
Distributed TensorFlow on K8S (cont.)
14
• Build docker image
▪ Prepare Dockerfile
▪ Build docker image
docker build -t <image_name>:v1 -f Dockerfile .
docker build -t macchiang/mnist:v7 -f Dockerfile .
docker push <image_name>:v1 Push image to docker hub
docker push macchiang/mnist:v7
FROM tensorflow/tensorflow:latest-gpu
COPY mnist_replicatensorflow/tensorflow.py /
ENTRYPOINT ["python", "/mnist_replica.py"]
Copyright 2017 ITRI 工業技術研究院
Distributed TensorFlow on K8S (cont.)
15
• My revised history
▪ https://hub.docker.com/r/macchiang/mnist/tags/
Copyright 2017 ITRI 工業技術研究院
Distributed TensorFlow on K8S (cont.)
16
• Specify parameters in jinja template file
▪ name, image, worker_replicas, ps_replicas, script, data_dir, and train_dir
▪ You may optionally specify credential_secret_name and
credential_secret_key if you need to read and write to Google Cloud
Storage
• Generate K8S YAML and create services and pods
▪ python render_template.py mnist.yaml.jinja | kubectl create -f -
command:
- "/mnist_replica.py"
args:
- "--data_dir=/data/raw-data"
- "--task_index=0"
- "--job_name=worker“
- "--worker_hosts=mnist-worker-0:5000,mnist-worker-1:5000“
- "--ps_hosts=mnist-ps-0:5000"
Copyright 2017 ITRI 工業技術研究院
Distributed TensorFlow on K8S (cont.)
17
Worker0
Worker1
Service mnist-worker-0
Service mnist-worker-1
:5000
PS0
Service mnist-ps-0
:5000
:5000
Training Data
NFS
Training Result
NFS
Read
Write
Copyright 2017 ITRI 工業技術研究院
Distributed Tensorflow with CPUs
18
Container Orchestration Platform
35 Nodes
ImageNet Data
(145GB)
NFS
Training Result
NFS
2* Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
48GB Memory
• Inception Model
▪ Spent 9.23 days
35 Containers
Rethinking the Inception Architecture for Computer Vision
Copyright 2017 ITRI 工業技術研究院
Summary
• Kubernetes
▪ Production-grade container orchestration platform
▪ GPU resources management
a. Nvidia GPU only now
b. In Kuberntest 1.8, you can use NVIDIA device plugin.
» https://github.com/NVIDIA/k8s-device-plugin
• Kubernetes + Distributed TensorFlow
▪ Easy to build the distributed training cluster
▪ Leverage Kubernetes advantages
a. Restart failed container
b. Monitoring
c. Scheduling
19
Thank you!
macchiang@itri.org.tw
Kubernetes Taiwan User Group

More Related Content

What's hot

Canary Releases on Kubernetes with Spinnaker, Istio, & Prometheus (2020)
Canary Releases on Kubernetes with Spinnaker, Istio, & Prometheus (2020)Canary Releases on Kubernetes with Spinnaker, Istio, & Prometheus (2020)
Canary Releases on Kubernetes with Spinnaker, Istio, & Prometheus (2020)Kublr
 
网易云K8S应用实践 | practices for kubernetes cluster provisioning, management and ap...
网易云K8S应用实践 | practices for kubernetes cluster provisioning, management and ap...网易云K8S应用实践 | practices for kubernetes cluster provisioning, management and ap...
网易云K8S应用实践 | practices for kubernetes cluster provisioning, management and ap...Xiaohui Chen
 
Managing kubernetes deployment with operators
Managing kubernetes deployment with operatorsManaging kubernetes deployment with operators
Managing kubernetes deployment with operatorsCloud Technology Experts
 
Using source code management patterns to configure and secure your Kubernetes...
Using source code management patterns to configure and secure your Kubernetes...Using source code management patterns to configure and secure your Kubernetes...
Using source code management patterns to configure and secure your Kubernetes...Giovanni Galloro
 
Kubernetes stack reliability
Kubernetes stack reliabilityKubernetes stack reliability
Kubernetes stack reliabilityOleg Chunikhin
 
Zero-downtime deployment of Micro-services with Kubernetes
Zero-downtime deployment of Micro-services with KubernetesZero-downtime deployment of Micro-services with Kubernetes
Zero-downtime deployment of Micro-services with KubernetesWojciech Barczyński
 
How to Run Kubernetes in Restrictive Environments
How to Run Kubernetes in Restrictive EnvironmentsHow to Run Kubernetes in Restrictive Environments
How to Run Kubernetes in Restrictive EnvironmentsKublr
 
Gordon's secret session kubernetes on windows
Gordon's secret session   kubernetes on windowsGordon's secret session   kubernetes on windows
Gordon's secret session kubernetes on windowsDocker, Inc.
 
KubeCon EU 2016: ITNW (If This Now What): Orchestrating an Enterprise
KubeCon EU 2016: ITNW (If This Now What): Orchestrating an EnterpriseKubeCon EU 2016: ITNW (If This Now What): Orchestrating an Enterprise
KubeCon EU 2016: ITNW (If This Now What): Orchestrating an EnterpriseKubeAcademy
 
利用K8S實現高可靠應用
利用K8S實現高可靠應用利用K8S實現高可靠應用
利用K8S實現高可靠應用inwin stack
 
DCEU 18: From Legacy Mainframe to the Cloud: The Finnish Railways Evolution w...
DCEU 18: From Legacy Mainframe to the Cloud: The Finnish Railways Evolution w...DCEU 18: From Legacy Mainframe to the Cloud: The Finnish Railways Evolution w...
DCEU 18: From Legacy Mainframe to the Cloud: The Finnish Railways Evolution w...Docker, Inc.
 
GitOps A/B testing with Istio and Helm
GitOps A/B testing with Istio and HelmGitOps A/B testing with Istio and Helm
GitOps A/B testing with Istio and HelmWeaveworks
 
Kubernetes as Infrastructure Abstraction
Kubernetes as Infrastructure AbstractionKubernetes as Infrastructure Abstraction
Kubernetes as Infrastructure AbstractionKublr
 
Kubernetes in Highly Restrictive Environments
Kubernetes in Highly Restrictive EnvironmentsKubernetes in Highly Restrictive Environments
Kubernetes in Highly Restrictive EnvironmentsKublr
 
Introduction to Git for Network Engineers (Lab Guide)
Introduction to Git for Network Engineers (Lab Guide)Introduction to Git for Network Engineers (Lab Guide)
Introduction to Git for Network Engineers (Lab Guide)Joel W. King
 
DCSF19 Kubernetes Security with OPA
DCSF19 Kubernetes Security with OPA DCSF19 Kubernetes Security with OPA
DCSF19 Kubernetes Security with OPA Docker, Inc.
 
KubeCon EU 2016: Killing containers to make weather beautiful
KubeCon EU 2016: Killing containers to make weather beautifulKubeCon EU 2016: Killing containers to make weather beautiful
KubeCon EU 2016: Killing containers to make weather beautifulKubeAcademy
 
Kubecon US 2019: Kubernetes Multitenancy WG Deep Dive
Kubecon US 2019: Kubernetes Multitenancy WG Deep DiveKubecon US 2019: Kubernetes Multitenancy WG Deep Dive
Kubecon US 2019: Kubernetes Multitenancy WG Deep DiveSanjeev Rampal
 
Cloud Native CI/CD with GitOps
Cloud Native CI/CD with GitOpsCloud Native CI/CD with GitOps
Cloud Native CI/CD with GitOpsKasper Nissen
 

What's hot (20)

Canary Releases on Kubernetes with Spinnaker, Istio, & Prometheus (2020)
Canary Releases on Kubernetes with Spinnaker, Istio, & Prometheus (2020)Canary Releases on Kubernetes with Spinnaker, Istio, & Prometheus (2020)
Canary Releases on Kubernetes with Spinnaker, Istio, & Prometheus (2020)
 
网易云K8S应用实践 | practices for kubernetes cluster provisioning, management and ap...
网易云K8S应用实践 | practices for kubernetes cluster provisioning, management and ap...网易云K8S应用实践 | practices for kubernetes cluster provisioning, management and ap...
网易云K8S应用实践 | practices for kubernetes cluster provisioning, management and ap...
 
Managing kubernetes deployment with operators
Managing kubernetes deployment with operatorsManaging kubernetes deployment with operators
Managing kubernetes deployment with operators
 
Using source code management patterns to configure and secure your Kubernetes...
Using source code management patterns to configure and secure your Kubernetes...Using source code management patterns to configure and secure your Kubernetes...
Using source code management patterns to configure and secure your Kubernetes...
 
Kubernetes stack reliability
Kubernetes stack reliabilityKubernetes stack reliability
Kubernetes stack reliability
 
Zero-downtime deployment of Micro-services with Kubernetes
Zero-downtime deployment of Micro-services with KubernetesZero-downtime deployment of Micro-services with Kubernetes
Zero-downtime deployment of Micro-services with Kubernetes
 
How to Run Kubernetes in Restrictive Environments
How to Run Kubernetes in Restrictive EnvironmentsHow to Run Kubernetes in Restrictive Environments
How to Run Kubernetes in Restrictive Environments
 
Red hat cloud platforms
Red hat cloud platformsRed hat cloud platforms
Red hat cloud platforms
 
Gordon's secret session kubernetes on windows
Gordon's secret session   kubernetes on windowsGordon's secret session   kubernetes on windows
Gordon's secret session kubernetes on windows
 
KubeCon EU 2016: ITNW (If This Now What): Orchestrating an Enterprise
KubeCon EU 2016: ITNW (If This Now What): Orchestrating an EnterpriseKubeCon EU 2016: ITNW (If This Now What): Orchestrating an Enterprise
KubeCon EU 2016: ITNW (If This Now What): Orchestrating an Enterprise
 
利用K8S實現高可靠應用
利用K8S實現高可靠應用利用K8S實現高可靠應用
利用K8S實現高可靠應用
 
DCEU 18: From Legacy Mainframe to the Cloud: The Finnish Railways Evolution w...
DCEU 18: From Legacy Mainframe to the Cloud: The Finnish Railways Evolution w...DCEU 18: From Legacy Mainframe to the Cloud: The Finnish Railways Evolution w...
DCEU 18: From Legacy Mainframe to the Cloud: The Finnish Railways Evolution w...
 
GitOps A/B testing with Istio and Helm
GitOps A/B testing with Istio and HelmGitOps A/B testing with Istio and Helm
GitOps A/B testing with Istio and Helm
 
Kubernetes as Infrastructure Abstraction
Kubernetes as Infrastructure AbstractionKubernetes as Infrastructure Abstraction
Kubernetes as Infrastructure Abstraction
 
Kubernetes in Highly Restrictive Environments
Kubernetes in Highly Restrictive EnvironmentsKubernetes in Highly Restrictive Environments
Kubernetes in Highly Restrictive Environments
 
Introduction to Git for Network Engineers (Lab Guide)
Introduction to Git for Network Engineers (Lab Guide)Introduction to Git for Network Engineers (Lab Guide)
Introduction to Git for Network Engineers (Lab Guide)
 
DCSF19 Kubernetes Security with OPA
DCSF19 Kubernetes Security with OPA DCSF19 Kubernetes Security with OPA
DCSF19 Kubernetes Security with OPA
 
KubeCon EU 2016: Killing containers to make weather beautiful
KubeCon EU 2016: Killing containers to make weather beautifulKubeCon EU 2016: Killing containers to make weather beautiful
KubeCon EU 2016: Killing containers to make weather beautiful
 
Kubecon US 2019: Kubernetes Multitenancy WG Deep Dive
Kubecon US 2019: Kubernetes Multitenancy WG Deep DiveKubecon US 2019: Kubernetes Multitenancy WG Deep Dive
Kubecon US 2019: Kubernetes Multitenancy WG Deep Dive
 
Cloud Native CI/CD with GitOps
Cloud Native CI/CD with GitOpsCloud Native CI/CD with GitOps
Cloud Native CI/CD with GitOps
 

Similar to Distributed tensorflow on kubernetes

"Current and Planned Standards for Computer Vision and Machine Learning," a P...
"Current and Planned Standards for Computer Vision and Machine Learning," a P..."Current and Planned Standards for Computer Vision and Machine Learning," a P...
"Current and Planned Standards for Computer Vision and Machine Learning," a P...Edge AI and Vision Alliance
 
Kubernetes Java Operator
Kubernetes Java OperatorKubernetes Java Operator
Kubernetes Java OperatorAnthony Dahanne
 
"APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre...
"APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre..."APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre...
"APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre...Edge AI and Vision Alliance
 
給 RD 的 Kubernetes 初體驗
給 RD 的 Kubernetes 初體驗給 RD 的 Kubernetes 初體驗
給 RD 的 Kubernetes 初體驗William Yeh
 
“Khronos Standard APIs for Accelerating Vision and Inferencing,” a Presentati...
“Khronos Standard APIs for Accelerating Vision and Inferencing,” a Presentati...“Khronos Standard APIs for Accelerating Vision and Inferencing,” a Presentati...
“Khronos Standard APIs for Accelerating Vision and Inferencing,” a Presentati...Edge AI and Vision Alliance
 
Running Distributed TensorFlow with GPUs on Mesos with DC/OS
Running Distributed TensorFlow with GPUs on Mesos with DC/OS Running Distributed TensorFlow with GPUs on Mesos with DC/OS
Running Distributed TensorFlow with GPUs on Mesos with DC/OS Mesosphere Inc.
 
Cloud Native Applications on Kubernetes: a DevOps Approach
Cloud Native Applications on Kubernetes: a DevOps ApproachCloud Native Applications on Kubernetes: a DevOps Approach
Cloud Native Applications on Kubernetes: a DevOps ApproachNicola Ferraro
 
Kubernetes deployment on bare metal with container linux
Kubernetes deployment on bare metal with container linuxKubernetes deployment on bare metal with container linux
Kubernetes deployment on bare metal with container linuxmacchiang
 
Docker adventures in Continuous Delivery - Alex Vranceanu
Docker adventures in Continuous Delivery - Alex VranceanuDocker adventures in Continuous Delivery - Alex Vranceanu
Docker adventures in Continuous Delivery - Alex VranceanuITCamp
 
The path to a serverless-native era with Kubernetes
The path to a serverless-native era with KubernetesThe path to a serverless-native era with Kubernetes
The path to a serverless-native era with Kubernetessparkfabrik
 
Velocity NY 2018 "The Cloud Native Developer Workflow"
Velocity NY 2018 "The Cloud Native Developer Workflow"Velocity NY 2018 "The Cloud Native Developer Workflow"
Velocity NY 2018 "The Cloud Native Developer Workflow"Daniel Bryant
 
RTP NPUG: Ansible Intro and Integration with ACI
RTP NPUG: Ansible Intro and Integration with ACIRTP NPUG: Ansible Intro and Integration with ACI
RTP NPUG: Ansible Intro and Integration with ACIJoel W. King
 
Microservices & Serverless Architecture Principles Applied - Cisco Live Orlan...
Microservices & Serverless Architecture Principles Applied - Cisco Live Orlan...Microservices & Serverless Architecture Principles Applied - Cisco Live Orlan...
Microservices & Serverless Architecture Principles Applied - Cisco Live Orlan...Cisco DevNet
 
The App Developer's Kubernetes Toolbox
The App Developer's Kubernetes ToolboxThe App Developer's Kubernetes Toolbox
The App Developer's Kubernetes ToolboxNebulaworks
 
Docker Orchestration: Welcome to the Jungle! Devoxx & Docker Meetup Tour Nov ...
Docker Orchestration: Welcome to the Jungle! Devoxx & Docker Meetup Tour Nov ...Docker Orchestration: Welcome to the Jungle! Devoxx & Docker Meetup Tour Nov ...
Docker Orchestration: Welcome to the Jungle! Devoxx & Docker Meetup Tour Nov ...Patrick Chanezon
 

Similar to Distributed tensorflow on kubernetes (20)

"Current and Planned Standards for Computer Vision and Machine Learning," a P...
"Current and Planned Standards for Computer Vision and Machine Learning," a P..."Current and Planned Standards for Computer Vision and Machine Learning," a P...
"Current and Planned Standards for Computer Vision and Machine Learning," a P...
 
Kubernetes Java Operator
Kubernetes Java OperatorKubernetes Java Operator
Kubernetes Java Operator
 
NFV features in kubernetes
NFV features in kubernetesNFV features in kubernetes
NFV features in kubernetes
 
"APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre...
"APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre..."APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre...
"APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre...
 
Enabling NFV features in kubernetes
Enabling NFV features in kubernetesEnabling NFV features in kubernetes
Enabling NFV features in kubernetes
 
給 RD 的 Kubernetes 初體驗
給 RD 的 Kubernetes 初體驗給 RD 的 Kubernetes 初體驗
給 RD 的 Kubernetes 初體驗
 
“Khronos Standard APIs for Accelerating Vision and Inferencing,” a Presentati...
“Khronos Standard APIs for Accelerating Vision and Inferencing,” a Presentati...“Khronos Standard APIs for Accelerating Vision and Inferencing,” a Presentati...
“Khronos Standard APIs for Accelerating Vision and Inferencing,” a Presentati...
 
Running Distributed TensorFlow with GPUs on Mesos with DC/OS
Running Distributed TensorFlow with GPUs on Mesos with DC/OS Running Distributed TensorFlow with GPUs on Mesos with DC/OS
Running Distributed TensorFlow with GPUs on Mesos with DC/OS
 
Cloud Native Applications on Kubernetes: a DevOps Approach
Cloud Native Applications on Kubernetes: a DevOps ApproachCloud Native Applications on Kubernetes: a DevOps Approach
Cloud Native Applications on Kubernetes: a DevOps Approach
 
Kubernetes deployment on bare metal with container linux
Kubernetes deployment on bare metal with container linuxKubernetes deployment on bare metal with container linux
Kubernetes deployment on bare metal with container linux
 
Docker adventures in Continuous Delivery - Alex Vranceanu
Docker adventures in Continuous Delivery - Alex VranceanuDocker adventures in Continuous Delivery - Alex Vranceanu
Docker adventures in Continuous Delivery - Alex Vranceanu
 
The path to a serverless-native era with Kubernetes
The path to a serverless-native era with KubernetesThe path to a serverless-native era with Kubernetes
The path to a serverless-native era with Kubernetes
 
Moby KubeCon 2017
Moby KubeCon 2017Moby KubeCon 2017
Moby KubeCon 2017
 
Core os dna_automacon
Core os dna_automaconCore os dna_automacon
Core os dna_automacon
 
Kubernetes basics and hands on exercise
Kubernetes basics and hands on exerciseKubernetes basics and hands on exercise
Kubernetes basics and hands on exercise
 
Velocity NY 2018 "The Cloud Native Developer Workflow"
Velocity NY 2018 "The Cloud Native Developer Workflow"Velocity NY 2018 "The Cloud Native Developer Workflow"
Velocity NY 2018 "The Cloud Native Developer Workflow"
 
RTP NPUG: Ansible Intro and Integration with ACI
RTP NPUG: Ansible Intro and Integration with ACIRTP NPUG: Ansible Intro and Integration with ACI
RTP NPUG: Ansible Intro and Integration with ACI
 
Microservices & Serverless Architecture Principles Applied - Cisco Live Orlan...
Microservices & Serverless Architecture Principles Applied - Cisco Live Orlan...Microservices & Serverless Architecture Principles Applied - Cisco Live Orlan...
Microservices & Serverless Architecture Principles Applied - Cisco Live Orlan...
 
The App Developer's Kubernetes Toolbox
The App Developer's Kubernetes ToolboxThe App Developer's Kubernetes Toolbox
The App Developer's Kubernetes Toolbox
 
Docker Orchestration: Welcome to the Jungle! Devoxx & Docker Meetup Tour Nov ...
Docker Orchestration: Welcome to the Jungle! Devoxx & Docker Meetup Tour Nov ...Docker Orchestration: Welcome to the Jungle! Devoxx & Docker Meetup Tour Nov ...
Docker Orchestration: Welcome to the Jungle! Devoxx & Docker Meetup Tour Nov ...
 

More from inwin stack

Migrating to Cloud Native Solutions
Migrating to Cloud Native SolutionsMigrating to Cloud Native Solutions
Migrating to Cloud Native Solutionsinwin stack
 
Cloud Native 下的應用網路設計
Cloud Native 下的應用網路設計Cloud Native 下的應用網路設計
Cloud Native 下的應用網路設計inwin stack
 
當電子發票遇見 Google Cloud Function
當電子發票遇見 Google Cloud Function當電子發票遇見 Google Cloud Function
當電子發票遇見 Google Cloud Functioninwin stack
 
運用高效、敏捷全新平台極速落實雲原生開發
運用高效、敏捷全新平台極速落實雲原生開發運用高效、敏捷全新平台極速落實雲原生開發
運用高效、敏捷全新平台極速落實雲原生開發inwin stack
 
The last mile of digital transformation AI大眾化:數位轉型的最後一哩
The last mile of digital transformation AI大眾化:數位轉型的最後一哩The last mile of digital transformation AI大眾化:數位轉型的最後一哩
The last mile of digital transformation AI大眾化:數位轉型的最後一哩inwin stack
 
整合Cloud Foundry 和 Kubernetes 技術打造企業級雲應用平台解決方案
整合Cloud Foundry 和 Kubernetes 技術打造企業級雲應用平台解決方案整合Cloud Foundry 和 Kubernetes 技術打造企業級雲應用平台解決方案
整合Cloud Foundry 和 Kubernetes 技術打造企業級雲應用平台解決方案inwin stack
 
An Open, Open source way to enable your Cloud Native Journey
An Open, Open source way to enable your Cloud Native JourneyAn Open, Open source way to enable your Cloud Native Journey
An Open, Open source way to enable your Cloud Native Journeyinwin stack
 
維運Kubernetes的兩三事
維運Kubernetes的兩三事維運Kubernetes的兩三事
維運Kubernetes的兩三事inwin stack
 
Serverless framework on kubernetes
Serverless framework on kubernetesServerless framework on kubernetes
Serverless framework on kubernetesinwin stack
 
Train.IO 【第六期-OpenStack 二三事】
Train.IO 【第六期-OpenStack 二三事】Train.IO 【第六期-OpenStack 二三事】
Train.IO 【第六期-OpenStack 二三事】inwin stack
 
Web後端技術的演變
Web後端技術的演變Web後端技術的演變
Web後端技術的演變inwin stack
 
以 Kubernetes 部屬 Spark 大數據計算環境
以 Kubernetes 部屬 Spark 大數據計算環境以 Kubernetes 部屬 Spark 大數據計算環境
以 Kubernetes 部屬 Spark 大數據計算環境inwin stack
 
Setup Hybrid Clusters Using Kubernetes Federation
Setup Hybrid Clusters Using Kubernetes FederationSetup Hybrid Clusters Using Kubernetes Federation
Setup Hybrid Clusters Using Kubernetes Federationinwin stack
 
基於 K8S 開發的 FaaS 專案 - riff
基於 K8S 開發的 FaaS 專案 - riff基於 K8S 開發的 FaaS 專案 - riff
基於 K8S 開發的 FaaS 專案 - riffinwin stack
 
使用 Prometheus 監控 Kubernetes Cluster
使用 Prometheus 監控 Kubernetes Cluster 使用 Prometheus 監控 Kubernetes Cluster
使用 Prometheus 監控 Kubernetes Cluster inwin stack
 
Extend the Kubernetes API with CRD and Custom API Server
Extend the Kubernetes API with CRD and Custom API ServerExtend the Kubernetes API with CRD and Custom API Server
Extend the Kubernetes API with CRD and Custom API Serverinwin stack
 
Integrate Kubernetes into CORD(Central Office Re-architected as a Datacenter)
Integrate Kubernetes into CORD(Central Office Re-architected as a Datacenter)Integrate Kubernetes into CORD(Central Office Re-architected as a Datacenter)
Integrate Kubernetes into CORD(Central Office Re-architected as a Datacenter)inwin stack
 
Build your own kubernetes apiserver and resource type
Build your own kubernetes apiserver and resource typeBuild your own kubernetes apiserver and resource type
Build your own kubernetes apiserver and resource typeinwin stack
 
Virtualization inside kubernetes
Virtualization inside kubernetesVirtualization inside kubernetes
Virtualization inside kubernetesinwin stack
 
Build the Blockchain as service (BaaS) Using Ethereum on Kubernetes
Build the Blockchain as service (BaaS) Using Ethereum on KubernetesBuild the Blockchain as service (BaaS) Using Ethereum on Kubernetes
Build the Blockchain as service (BaaS) Using Ethereum on Kubernetesinwin stack
 

More from inwin stack (20)

Migrating to Cloud Native Solutions
Migrating to Cloud Native SolutionsMigrating to Cloud Native Solutions
Migrating to Cloud Native Solutions
 
Cloud Native 下的應用網路設計
Cloud Native 下的應用網路設計Cloud Native 下的應用網路設計
Cloud Native 下的應用網路設計
 
當電子發票遇見 Google Cloud Function
當電子發票遇見 Google Cloud Function當電子發票遇見 Google Cloud Function
當電子發票遇見 Google Cloud Function
 
運用高效、敏捷全新平台極速落實雲原生開發
運用高效、敏捷全新平台極速落實雲原生開發運用高效、敏捷全新平台極速落實雲原生開發
運用高效、敏捷全新平台極速落實雲原生開發
 
The last mile of digital transformation AI大眾化:數位轉型的最後一哩
The last mile of digital transformation AI大眾化:數位轉型的最後一哩The last mile of digital transformation AI大眾化:數位轉型的最後一哩
The last mile of digital transformation AI大眾化:數位轉型的最後一哩
 
整合Cloud Foundry 和 Kubernetes 技術打造企業級雲應用平台解決方案
整合Cloud Foundry 和 Kubernetes 技術打造企業級雲應用平台解決方案整合Cloud Foundry 和 Kubernetes 技術打造企業級雲應用平台解決方案
整合Cloud Foundry 和 Kubernetes 技術打造企業級雲應用平台解決方案
 
An Open, Open source way to enable your Cloud Native Journey
An Open, Open source way to enable your Cloud Native JourneyAn Open, Open source way to enable your Cloud Native Journey
An Open, Open source way to enable your Cloud Native Journey
 
維運Kubernetes的兩三事
維運Kubernetes的兩三事維運Kubernetes的兩三事
維運Kubernetes的兩三事
 
Serverless framework on kubernetes
Serverless framework on kubernetesServerless framework on kubernetes
Serverless framework on kubernetes
 
Train.IO 【第六期-OpenStack 二三事】
Train.IO 【第六期-OpenStack 二三事】Train.IO 【第六期-OpenStack 二三事】
Train.IO 【第六期-OpenStack 二三事】
 
Web後端技術的演變
Web後端技術的演變Web後端技術的演變
Web後端技術的演變
 
以 Kubernetes 部屬 Spark 大數據計算環境
以 Kubernetes 部屬 Spark 大數據計算環境以 Kubernetes 部屬 Spark 大數據計算環境
以 Kubernetes 部屬 Spark 大數據計算環境
 
Setup Hybrid Clusters Using Kubernetes Federation
Setup Hybrid Clusters Using Kubernetes FederationSetup Hybrid Clusters Using Kubernetes Federation
Setup Hybrid Clusters Using Kubernetes Federation
 
基於 K8S 開發的 FaaS 專案 - riff
基於 K8S 開發的 FaaS 專案 - riff基於 K8S 開發的 FaaS 專案 - riff
基於 K8S 開發的 FaaS 專案 - riff
 
使用 Prometheus 監控 Kubernetes Cluster
使用 Prometheus 監控 Kubernetes Cluster 使用 Prometheus 監控 Kubernetes Cluster
使用 Prometheus 監控 Kubernetes Cluster
 
Extend the Kubernetes API with CRD and Custom API Server
Extend the Kubernetes API with CRD and Custom API ServerExtend the Kubernetes API with CRD and Custom API Server
Extend the Kubernetes API with CRD and Custom API Server
 
Integrate Kubernetes into CORD(Central Office Re-architected as a Datacenter)
Integrate Kubernetes into CORD(Central Office Re-architected as a Datacenter)Integrate Kubernetes into CORD(Central Office Re-architected as a Datacenter)
Integrate Kubernetes into CORD(Central Office Re-architected as a Datacenter)
 
Build your own kubernetes apiserver and resource type
Build your own kubernetes apiserver and resource typeBuild your own kubernetes apiserver and resource type
Build your own kubernetes apiserver and resource type
 
Virtualization inside kubernetes
Virtualization inside kubernetesVirtualization inside kubernetes
Virtualization inside kubernetes
 
Build the Blockchain as service (BaaS) Using Ethereum on Kubernetes
Build the Blockchain as service (BaaS) Using Ethereum on KubernetesBuild the Blockchain as service (BaaS) Using Ethereum on Kubernetes
Build the Blockchain as service (BaaS) Using Ethereum on Kubernetes
 

Recently uploaded

9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding TeamAdam Moalla
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
GenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation IncGenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation IncObject Automation
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
Things you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceThings you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceMartin Humpolec
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioChristian Posta
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
PicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer ServicePicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer ServiceRenan Moreira de Oliveira
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesDavid Newbury
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
RAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AIRAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AIUdaiappa Ramachandran
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
Introduction to Quantum Computing
Introduction to Quantum ComputingIntroduction to Quantum Computing
Introduction to Quantum ComputingGDSC PJATK
 

Recently uploaded (20)

9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
GenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation IncGenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation Inc
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
Things you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceThings you didn't know you can use in your Salesforce
Things you didn't know you can use in your Salesforce
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
PicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer ServicePicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer Service
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
RAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AIRAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AI
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
Introduction to Quantum Computing
Introduction to Quantum ComputingIntroduction to Quantum Computing
Introduction to Quantum Computing
 

Distributed tensorflow on kubernetes

  • 2. Copyright 2017 ITRI 工業技術研究院 Agenda • Kubernetes Introduction • Scheduling GPUs on Kubernetes • Distributed TensorFlow Introduction • Running Distributed TensorFlow on Kubernetes • Experience Sharing • Summary 2
  • 3. Copyright 2017 ITRI 工業技術研究院 What’s Kubernetes • “Kubernetes” is Greek for captain or pilot • Experiences from Google and design by Goolge • Kubernetes is a production-grade, open-source platform that orchestrates the placement (scheduling) and execution of application containers within and across computer clusters. • Masters manage the cluster and the nodes are used to host the running applications. 3
  • 4. Copyright 2017 ITRI 工業技術研究院 Why Kubernetes 4 • Automatic binpacking • Horizontal scaling • Automated rollouts and rollback • Service monitor • Self-healing • Service discovery and load balancing • 100% Open source, written in Go
  • 5. Copyright 2017 ITRI 工業技術研究院 Scheduling GPUs on Kubernetes • Nvidia drivers are installed • Turned on alpha feature gate Accelerators across the system ▪ --feature-gates="Accelerators=true“ • Nodes must be using docker engine as the container runtime 5
  • 6. Copyright 2017 ITRI 工業技術研究院 Scheduling GPUs on Kubernetes (cont.) 6 apiVersion: v1 kind: Pod metadata: name: gpu-pod spec: containers: - name: gpu-container-1 image: gcr.io/google_containers/pause:2.0 resources: limits: alpha.kubernetes.io/nvidia-gpu: 2 # requesting 2 GPUs - name: gpu-container-2 image: gcr.io/google_containers/pause:2.0 resources: limits: alpha.kubernetes.io/nvidia-gpu: 3 # requesting 3 GPUs
  • 7. Copyright 2017 ITRI 工業技術研究院 Scheduling on Different GPU Versions 7 • Labeling nodes with GPU HW type • Specify the GPU types via Node Affinity rules Tesla P100 Node1 Node2 2 * K80 1 * P100 Tesla K80 gpu:k80 gpu:p100
  • 8. Copyright 2017 ITRI 工業技術研究院 Access to CUDA libraries from Docker nvidia-driver libcuda.so 8
  • 9. Copyright 2017 ITRI 工業技術研究院 TensorFlow 9 • Originally developed by the Google Brain Team within Google's Machine Intelligence research organization • An open source software library for numerical computation using data flow graphs • Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. • Support one or more CPUs or GPUs in a desktop, server, or mobile device with a single API
  • 10. Copyright 2017 ITRI 工業技術研究院 Distributed TensorFlow 10 http://www.inwinstack.com/index.php/zh/2017/04/17/tensorflow-on-kubernetes/
  • 11. Copyright 2017 ITRI 工業技術研究院 Distributed TensorFlow (cont.) 11 • Replication ▪ In-graph ▪ Between-graph • Training ▪ Asynchronous ▪ Synchronous
  • 12. Copyright 2017 ITRI 工業技術研究院 Distributed TensorFlow on K8S 12 • TensorFlow ecosystem ▪ https://github.com/tensorflow/ecosystem
  • 13. Copyright 2017 ITRI 工業技術研究院 Distributed TensorFlow on K8S (cont.) 13 • Prepare codes for distributed training ▪ Flags for configuring the task ▪ Construct the cluster and start the server ▪ Set the device before graph construction# Construct the cluster and start the server ps_spec = FLAGS.ps_hosts.split(",") worker_spec = FLAGS.worker_hosts.split(",") cluster = tf.train.ClusterSpec({ "ps": ps_spec, "worker": worker_spec}) server = tf.train.Server( cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index) if FLAGS.job_name == "ps": server.join() with tf.device(tf.train.replica_device_setter( worker_device="/job:worker/task:%d" % FLAGS.task_index, Cluster=cluster)): # Construct the TensorFlow graph. # Run the TensorFlow graph. # Flags for configuring the task flags.DEFINE_integer("task_index", None, "Worker task index, should be >= 0. task_index=0 is " "the master worker task that performs the variable " "initialization.") flags.DEFINE_string("ps_hosts", None, "Comma-separated list of hostname:port pairs") flags.DEFINE_string("worker_hosts", None, "Comma-separated list of hostname:port pairs") flags.DEFINE_string("job_name", None, "job name: worker or ps")
  • 14. Copyright 2017 ITRI 工業技術研究院 Distributed TensorFlow on K8S (cont.) 14 • Build docker image ▪ Prepare Dockerfile ▪ Build docker image docker build -t <image_name>:v1 -f Dockerfile . docker build -t macchiang/mnist:v7 -f Dockerfile . docker push <image_name>:v1 Push image to docker hub docker push macchiang/mnist:v7 FROM tensorflow/tensorflow:latest-gpu COPY mnist_replicatensorflow/tensorflow.py / ENTRYPOINT ["python", "/mnist_replica.py"]
  • 15. Copyright 2017 ITRI 工業技術研究院 Distributed TensorFlow on K8S (cont.) 15 • My revised history ▪ https://hub.docker.com/r/macchiang/mnist/tags/
  • 16. Copyright 2017 ITRI 工業技術研究院 Distributed TensorFlow on K8S (cont.) 16 • Specify parameters in jinja template file ▪ name, image, worker_replicas, ps_replicas, script, data_dir, and train_dir ▪ You may optionally specify credential_secret_name and credential_secret_key if you need to read and write to Google Cloud Storage • Generate K8S YAML and create services and pods ▪ python render_template.py mnist.yaml.jinja | kubectl create -f - command: - "/mnist_replica.py" args: - "--data_dir=/data/raw-data" - "--task_index=0" - "--job_name=worker“ - "--worker_hosts=mnist-worker-0:5000,mnist-worker-1:5000“ - "--ps_hosts=mnist-ps-0:5000"
  • 17. Copyright 2017 ITRI 工業技術研究院 Distributed TensorFlow on K8S (cont.) 17 Worker0 Worker1 Service mnist-worker-0 Service mnist-worker-1 :5000 PS0 Service mnist-ps-0 :5000 :5000 Training Data NFS Training Result NFS Read Write
  • 18. Copyright 2017 ITRI 工業技術研究院 Distributed Tensorflow with CPUs 18 Container Orchestration Platform 35 Nodes ImageNet Data (145GB) NFS Training Result NFS 2* Intel(R) Xeon(R) CPU E5620 @ 2.40GHz 48GB Memory • Inception Model ▪ Spent 9.23 days 35 Containers Rethinking the Inception Architecture for Computer Vision
  • 19. Copyright 2017 ITRI 工業技術研究院 Summary • Kubernetes ▪ Production-grade container orchestration platform ▪ GPU resources management a. Nvidia GPU only now b. In Kuberntest 1.8, you can use NVIDIA device plugin. » https://github.com/NVIDIA/k8s-device-plugin • Kubernetes + Distributed TensorFlow ▪ Easy to build the distributed training cluster ▪ Leverage Kubernetes advantages a. Restart failed container b. Monitoring c. Scheduling 19

Editor's Notes

  1. https://kubernetes.io/docs/tutorials/kubernetes-basics/cluster-intro/
  2. 去年四⽉中旬 Google 釋出 TensorFlow 0.8,新增加分散式運算能⼒,使 TensorFlow 可在數百台的機器上執⾏ 訓練程序,建立各種機器學習模型,將原本要耗費數天或數個星期的模型訓練過程縮短到數⼩時 而TensorFlow的工作(Job)可拆成多個相同功能的任務(Task),這些工作又分成Parameter server與Worker,兩者功能說明如下: Parameter server:主要根據梯度更新變數,並儲存於tf.Variable,可解釋為僅儲存模型的變數,並存放Variable副本。 Worker:通常稱為計算節點,主要執行密集型的Graph運算資源,並根據變數運算梯度,亦能儲存Graph副本。 • Client:是⽤於建立 TensorFlow 計算 Graph,並建立與叢集進⾏互動的tensorflow::Session ⾏ 程,⼀般由 Python 或 C++ 實作,單⼀客⼾端可以同時連接多個 TF 伺服器連接,同時也能被 多個 TF 伺服器連接. • Master Service:是⼀個 RPC 服務⾏程,⽤來遠端連線⼀系列分散式裝置,主要提供 tensorflow::Session介⾯,並負責透過 Worker Service 與⼯作的任務進⾏溝通. • Worker Service:是⼀個可以使⽤本地裝置(CPU 或 GPU)對部分 Graph 進⾏運算的 RPC 邏 輯,透過 worker_service.proto 介⾯來實作,所有 TensorFlow 伺服器均包含了 Worker Service 邏輯
  3. 在TensorFlow中启动分布式深度学习模型训练任务也有两种模式。一种为In-graph replication。在这种模式下神经网络的参数会都保存在同一个TensorFlow计算图中,只有计算会分配到不同计算服务器。另一种为Between-graph replication,这种模式下所有的计算服务器也会创建参数,但参数会通过统一的方式分配到参数服务器。因为In-graph replication处理海量数据的能力稍弱,所以Between-graph replication是一个更加常用的模式。 In-graph replication. In this approach, the client builds a single tf.Graph that contains one set of parameters (in tf.Variable nodes pinned to /job:ps); and multiple copies of the compute-intensive part of the model, each pinned to a different task in /job:worker. Between-graph replication. In this approach, there is a separate client for each /job:worker task, typically in the same process as the worker task. Each client builds a similar graph containing the parameters (pinned to /job:psas before using tf.train.replica_device_setter to map them deterministically to the same tasks); and a single copy of the compute-intensive part of the model, pinned to the local task in /job:worker. 深度学习模型常用的有两种分布式训练方式。一种是同步更新,另一种是异步更新。如上面的ppt所示,在同步更新模式下,所有服务器都会统一读取参数的取值,计算参数梯度,最后再统一更新。而在异步更新模式下,不同服务器会自己读取参数,计算梯度并更新参数,而不需要与其他服务器同步。同步更新的最大问题在于,不同服务器需要同步完成所有操作,于是快的服务器需要等待慢的服务器,资源利用率会相对低一些。而异步模式可能会使用陈旧的梯度更新参数导致训练的效果受到影响。不同的更新模式各有优缺点,很难统一的说哪一个更好,需要具体问题具体分析 Asynchronous training. In this approach, each replica of the graph has an independent training loop that executes without coordination. It is compatible with both forms of replication above. Synchronous training. In this approach, all of the replicas read the same values for the current parameters, compute gradients in parallel, and then apply them together. It is compatible with in-graph replication (e.g. using gradient averaging as in the CIFAR-10 multi-GPU trainer), and between-graph replication (e.g. using thetf.train.SyncReplicasOptimizer).