INTRODUCTION TO
TENSORFLOW
ARCHITECTURE
MANI SHANKAR GOSWAMI
@Mani_Shankar_G
BEFORE WE START…
• PLEASE UNDERSTAND THAT TensorFlow DIFFERS FROM MOST DATA ENGINES
OUT THERE, AS EXPLAINED BELOW.
• TensorFlow differs from batch dataflow systems in two respects:
• The model supports multiple concurrent executions on overlapping subgraphs of the
overall graph.
• Individual vertices may have mutable state that can be shared between different
executions of the graph.
• Some References (picked from OSDI 16 Conference):
• The principal limitation of a batch dataflow system is that it requires the input data to
be immutable, and all of the sub-computations to be deterministic, so that the
system can re-execute sub-computations when machines in the cluster fail.
• For example, the SparkNet system for training deep neural networks on Spark takes
20 seconds to broadcast weights and collect updates from five workers [55]. As a
result, in these systems, each model update step must process larger batches,
slowing convergence [8]. We show in Subsection 6.3 that TensorFlow can train larger
models on larger clusters with step times as short as 2 seconds.
WHAT IS TENSORFLOW?
Here is the formal definition picked from https://www.tensorflow.org/:
TensorFlow is an open source software library for numerical
computation using data flow graphs. Nodes in the graph represent
mathematical operations, while the graph edges represent the
multidimensional data arrays (tensors) communicated between them.
The flexible architecture allows you to deploy computation to one or
more CPUs or GPUs in a desktop, server, or mobile device with a single
API.
TensorFlow was originally developed by researchers and engineers
working on the Google Brain Team within Google's Machine
Intelligence research organization for the purposes of conducting
machine learning and deep neural networks research.
WHAT IS A DATA FLOW GRAPH?
Consider a typical linear equation: y = W * x + b
where W is the weight, x is an example, and b is the bias.
This linear equation can be represented as an acyclic graph, as below:
[Diagram: Weight, Examples, and Biases feed MatMul → Add → ReLU; Gradients flow back to produce the updated weights and biases.]
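As a rough sketch, the same graph can be built with the TensorFlow 1.x Python API; the shapes, names, and stand-in loss below are illustrative assumptions:

import tensorflow as tf

# Illustrative shapes; a real model would match its input data.
x = tf.placeholder(tf.float32, shape=[None, 784], name="examples")
W = tf.Variable(tf.random_normal([784, 10]), name="weight")
b = tf.Variable(tf.zeros([10]), name="biases")

# MatMul -> Add -> ReLU, the same nodes as in the diagram above.
y = tf.nn.relu(tf.matmul(x, W) + b)

# Gradients with respect to W and b drive the updated weights and biases.
loss = tf.reduce_mean(y)   # stand-in loss, for illustration only
train = tf.train.GradientDescentOptimizer(0.5).minimize(loss)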
GENERALIZING THE DATAFLOW GRAPH
[Diagram: variables and constants (Biases, Learning Rate, …) feed operations (Mul, Add); a gradient computation produces updates that are applied in place (-=) to the variables.]
LAYERED VIEW
[Diagram, top to bottom: Client Layer → API Layer → Libraries (training/inference libs) → Distributed Master / Data Flow Controller → Kernel Execution Layer → Device and Network Layers.]
TENSORFLOW’S DEVICE INTERACTION VIEW
TensorFlow uses CUDA and cuDNN to control GPUs and boost performance.
[Diagram: TensorFlow sits on top of CUDA and cuDNN, which drive the CPU, GPU #0, and GPU #1.]
EXECUTION PHASES
• By deferring the execution until the entire program is available,
TensorFlow optimizes the execution phase by using global
information about the computation
• Example:
• TensorFlow achieves high GPU utilization by using the graph’s dependency
structure to issue a sequence of kernels to the GPU without waiting for
intermediate results
• TensorFlow uses deferred execution via the dataflow graph to
offload larger chunks of work to accelerators.
[Diagram: the client handles the construction phase; the workers handle the execution phase.]
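A minimal sketch of the two phases (the values are arbitrary): nothing runs while the graph is constructed; kernels are issued only once the session executes it.

import tensorflow as tf

# Construction phase: these calls only add nodes to the graph.
a = tf.constant(3.0)
b = tf.constant(4.0)
c = a * b

# Execution phase: the session now sees the whole graph and can
# schedule kernels using global information about the computation.
with tf.Session() as sess:
    print(sess.run(c))   # 12.0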
WORKER’S DEVICE INTERACTIONS
• The worker service in each task:
• handles requests from the master,
• schedules the execution of the kernels for the operations that comprise a local subgraph
• mediates direct communication between tasks.
• It is optimized for running large graphs with low overhead
• It dispatches kernels to local devices and runs kernels in parallel when possible, for example by
using multiple CPU cores or GPU streams.
[Diagram: the client's session connects to the master, which dispatches work to a worker owning GPU #1, GPU #2, and CPU #0.]
WORKER’S SCHEDULING & PLACEMENT
ALGORITHM
• Uses COST Model to determine placement
• contains estimates of the sizes of the input and output tensors for each
graph node
• Uses estimates of the computation time required for each node
• statically estimated based on heuristics associated with different operation
types
• also uses metrics collected for placement decisions for earlier executions
of the graph
• placement algorithm first runs a simulated execution of the graph
• For each node, feasible devices are determined
• When multiple devices are eligible for a node's execution:
• the algorithm uses a greedy heuristic, examining the effects on completion time
using the cost model
• the device where the node's operation would finish soonest is generally
selected
• Applies constraints like colocation requirements
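Placement can also be constrained explicitly from client code, which the placer treats as a hard requirement; a small sketch (the device string is an assumption):

import tensorflow as tf

# Pin these ops to one device; the placement algorithm honors this.
with tf.device("/job:worker/task:0/device:GPU:0"):
    w = tf.Variable(tf.random_normal([100, 100]))
    y = tf.matmul(w, w)   # colocated with w on the chosen GPU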
SINGLE MACHINE VS DISTRIBUTED SYSTEM
STRUCTURE
The client is the one that creates the computation graph during the construction phase.
It creates a session to the master and sends the constructed graph for execution.
Finally, when the client evaluates a node or nodes in the graph, the master starts the execution by distributing subgraphs to workers.
[Diagram: in the single-process case, the client, master, and worker (GPU0…GPUn) live in one process, and session run executes subgraphs directly; in the distributed version, a client process opens a session to a master process, which distributes subgraphs to worker processes 1–3, each owning its own GPU0…GPUn and CPU0.]
KERNEL EXECUTION
• TF manages two types of thread pools on each device to
parallelize operations: the inter-op and intra-op thread pools
• The inter-op pool is a normal thread pool, used when two or more
operations are scheduled on the same device
• When an operation has a multi-threaded kernel, it uses the
intra-op thread pool
[Diagram: independent ops A–F are dispatched across CPU #0 and CPU #1 via the inter-op pool, while a multi-threaded kernel fans its work out over the intra-op pool.]
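Both pool sizes can be configured per session; a sketch with illustrative thread counts:

import tensorflow as tf

config = tf.ConfigProto(
    inter_op_parallelism_threads=4,   # pool for independent ops
    intra_op_parallelism_threads=8)   # pool used inside multi-threaded kernels
with tf.Session(config=config) as sess:
    pass   # ops run in this session use the pools sized above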
SESSION ON A SINGLE PROCESS
[Diagram: a single tf.Session owning CPU:0 and GPU:0.]

init_op = tf.global_variables_initializer()   # initializer for all variables
with tf.Session() as sess:
    sess.run(init_op)             # initialize variables once
    for _ in range(STEPS):        # STEPS and train are defined elsewhere
        sess.run(train)           # run one training step
CROSS-DEVICE COMMUNICATION
s += w * x + b
[Diagram: the variables s, w, and b live on the CPU; MatMul and Add run on GPU #0; the += update brings the result back to the CPU, all within one worker.]
CROSS-DEVICE COMMUNICATION
[Diagram: the same graph after placement, with a Send/Recv node pair inserted on every cross-device edge between the CPU and GPU #0.]
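A sketch of how this placement looks in client code; TensorFlow inserts the Send/Recv pairs on the cross-device edges automatically (the shapes and device strings are assumptions):

import tensorflow as tf

with tf.device("/cpu:0"):
    s = tf.Variable(tf.zeros([2, 2]), name="s")
    w = tf.Variable(tf.random_normal([2, 2]), name="w")
    b = tf.Variable(tf.zeros([2, 2]), name="b")

with tf.device("/device:GPU:0"):
    x = tf.random_normal([2, 2], name="x")
    y = tf.matmul(w, x) + b    # MatMul and Add run on the GPU

update = tf.assign_add(s, y)   # the result is sent back to the CPU variable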
CREATING A CLUSTER
[Diagram: a client tf.Session alongside three tf.train.Server instances, each owning CPU:0 and GPU:0.]

# ps_hosts and worker_hosts are lists of "host:port" strings; the
# addresses here are illustrative assumptions.
ps_hosts = ["ps0:2222"]
worker_hosts = ["worker0:2222", "worker1:2222"]
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
server = tf.train.Server(cluster, job_name="worker", task_index=0)
DISTRIBUTED COMMUNICATION (DATA PARALLELISM
& REPLICATION)
• The master decides a subgraph for each worker; in this case the model parameters are
given to the PS task
• The worker is responsible for deciding and placing the nodes of its subgraph on devices
• Nodes are executed on multiple GPUs/CPU cores simultaneously, subject to dependency
resolution
[Diagram: the parameter variables s, w, and b live on the PS device (CPU); Worker #0 and Worker #1 each run their own MatMul/Add replica on their local GPU #0.]
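In client code this split is usually expressed with tf.train.replica_device_setter, which places variables on PS tasks and leaves compute ops on the local worker; a sketch with assumed host names:

import tensorflow as tf

cluster = tf.train.ClusterSpec({"ps": ["ps0:2222"],
                                "worker": ["worker0:2222", "worker1:2222"]})

# Variables land on /job:ps; the compute ops stay on this worker.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    w = tf.Variable(tf.random_normal([784, 10]))
    b = tf.Variable(tf.zeros([10]))
    x = tf.placeholder(tf.float32, [None, 784])
    logits = tf.matmul(x, w) + b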
DISTRIBUTED COMMUNICATION (DATA
PARALLELISM)
• Transfers between local CPU and GPU devices use the cudaMemcpyAsync() API to overlap computation and
data transfer
• Transfers between two local GPUs use peer-to-peer DMA, to avoid an expensive copy via the host CPU
• Transfers between tasks use RDMA over Converged Ethernet where available, and otherwise gRPC over TCP
[Diagram: the replicated graph with Send/Recv pairs on each cross-task edge; worker-to-PS transfers go over RDMA, and the chief worker (is_chief=true) coordinates.]
REPLICATED TRAINING VIEW
DISTRIBUTED COMMUNICATION (MODEL
PARALLELISM)
• In model parallelism, the graph's operations are distributed across the cluster
[Diagram: the graph itself is split: the parameter variables sit on Device 1 (PS, CPU), while the MatMul and Add ops run on Device 2 (a worker's GPU #0), spanning Worker #0 and Worker #1.]
DISTRIBUTED COMMUNICATION (MODEL PARALLELISM)
• Transfers between local CPU and GPU devices use the cudaMemcpyAsync() API to overlap computation and
data transfer
• Transfers between two local GPUs use peer-to-peer DMA, to avoid an expensive copy via the host CPU
• Transfers between tasks use RDMA over Converged Ethernet where available, and otherwise gRPC over TCP
[Diagram: the partitioned graph with Send/Recv pairs, each annotated with its destination (e.g. Dest: worker#1, GPU #0; Dest: worker#0, CPU #0); cross-task edges use RDMA, and the chief (is_chief = True) coordinates.]
CHIEF WORKER
• The chief is a task that is assigned some additional responsibilities in the cluster.
• Its responsibilities:
• Checkpointing:
• Saves graph state to a configured store such as HDFS
• Runs at a configurable frequency
• Maintaining Summary
• Runs all summary operations
• Saving Models
• Step Counters
• Keeps an eye on total steps taken
• Recovery
• restores the graph from the most recent checkpoint and resumes training
where it stopped
• Initializing all the variables in graph
• Can be monitored through TensorBoard.
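A hedged sketch of how a task takes on (or defers) the chief role with tf.train.MonitoredTrainingSession; server, task_index, train_op, and the checkpoint path are assumed to come from the surrounding training script:

import tensorflow as tf

# The chief (conventionally task 0) initializes variables and writes
# checkpoints and summaries; non-chief tasks wait for it.
with tf.train.MonitoredTrainingSession(
        master=server.target,
        is_chief=(task_index == 0),
        checkpoint_dir="/tmp/train_logs") as sess:
    while not sess.should_stop():
        sess.run(train_op)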
PARAMETER TASKS VS WORKER TASKS
• In TensorFlow, the workload is distributed in the form of PS and worker tasks.
• PS tasks hold:
• Variables
• Update operations
• Worker tasks hold:
• Pre-processing
• Loss calculation
• Backpropagation
• Multiple worker and PS tasks can run simultaneously, but TF ensures that the
PS is sharded so that each variable has exactly one physical copy. Various
algorithms distribute PS tasks, taking load and network into account.
• It also allows partitioning large variables (tens of GBs) into multiple PS tasks, as sketched below
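Such partitioning can be requested with a variable partitioner; a sketch with an assumed shape and shard count:

import tensorflow as tf

# One logical embedding matrix split into 4 shards, which the device
# setter can then spread over multiple PS tasks.
embedding = tf.get_variable(
    "embedding", shape=[10000000, 64],
    partitioner=tf.fixed_size_partitioner(num_shards=4))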
TYPES OF TRAINING REPLICATION
• In-Graph Replication
• A single client connects to a master and requests that the replicated
graph, along with the data, be distributed across all available workers.
• Works well for small workloads but does not scale beyond that.
• Between-Graph Replication (Recommended Approach)
• In this approach multiple clients take part in replication
• Each machine has a client that talks to its local master and gives it the cluster
information, graphs, and data to be executed.
• The master ensures that PS tasks are shared across the cluster and schedules
tasks on the local worker
• The worker handles all communication and synchronization.
• Between-Graph Replication can be of two types:
• Synchronous
• Asynchronous
ASYNCHRONOUS VS SYNCHRONOUS REPLICATION
[Diagram, two panels. SYNCHRONOUS DATA PARALLELISM: each device (1–3) runs the model on its own input; the PS server adds the resulting updates and applies a single Update to the parameters P. ASYNCHRONOUS DATA PARALLELISM: each device independently pushes its own Update to the PS server, so P is modified without coordination.]
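Synchronous between-graph replication is typically obtained by wrapping an ordinary optimizer in tf.train.SyncReplicasOptimizer (omitting the wrapper gives the asynchronous behavior); a sketch in which loss, global_step, num_workers, and is_chief are assumptions:

import tensorflow as tf

opt = tf.train.GradientDescentOptimizer(0.01)
opt = tf.train.SyncReplicasOptimizer(
    opt,
    replicas_to_aggregate=num_workers,   # gradients summed per step
    total_num_replicas=num_workers)
train_op = opt.minimize(loss, global_step=global_step)

# The chief needs this hook to coordinate the synchronization barrier.
sync_hook = opt.make_session_run_hook(is_chief)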
OPTIMIZATIONS
• Common Subexpression Elimination
• Schedules tasks so that the time window for which
intermediate results must be stored is reduced.
• Using ASAP/ALAP calculations, the critical path of the graph is determined
to estimate when to start the Receive nodes. This reduces the chance of
sudden I/O spikes
• Non-blocking Kernels
• Lossy compression of higher-precision internal representations when
sending data between devices
• XLA (Accelerated Linear Algebra) is a domain-specific compiler for
linear algebra that optimizes TensorFlow computations.
• Tensors also enable other optimizations for memory management
and communication, such as RDMA and direct GPU-to-GPU transfer
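XLA JIT compilation can be switched on per session; a sketch using the 1.x session config:

import tensorflow as tf

config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = (
    tf.OptimizerOptions.ON_1)   # enable XLA JIT compilation
sess = tf.Session(config=config)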
FAULT TOLERANCE
• Checkpointing ensures that the latest state is always available
• If a non-supervisor worker gets killed:
• Since workers are stateless, when the cluster manager brings the worker
back up, it simply contacts the PS tasks to get the updated parameters and
resumes
• If a PS task fails:
• The chief/supervisor is responsible for noting the failure
• The supervisor/chief interrupts training on all workers and restores all PS
tasks from the last checkpoint
• If the chief itself fails:
• Training is interrupted; when the chief comes back up, it restores from a
checkpoint
• MonitoredTrainingSession allows automating this recovery
• Another approach is to use ZooKeeper for chief election and pass the
new chief's identity to the rest of the cluster
SERVING THE MODEL
• TensorFlow's recommended way to serve models in production is
TF Serving
• Advantages:
• Supports both online and batching modes
• Supports both a hosted service and a library approach
• Supports multiple models in a single process
• Supports Docker & Kubernetes
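To hand a trained graph to TF Serving, the model is exported as a SavedModel; a minimal sketch, with the export path and version directory as assumptions:

import tensorflow as tf

export_dir = "/tmp/my_model/1"   # Serving loads the highest version number
builder = tf.saved_model.builder.SavedModelBuilder(export_dir)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # A real export would also attach a signature_def_map for the
    # prediction API; omitted here for brevity.
    builder.add_meta_graph_and_variables(
        sess, [tf.saved_model.tag_constants.SERVING])
builder.save()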
BENCHMARKS
Instance type: NVIDIA® DGX-1™
GPU: 8x NVIDIA® Tesla® P100
OS: Ubuntu 16.04 LTS with tests run via Docker
CUDA / cuDNN: 8.0 / 5.1
TensorFlow GitHub hash: b1e174e
Benchmark GitHub hash: 9165a70
Build Command: bazel build -c opt --copt=-march="haswell" --config=cuda
//tensorflow/tools/pip_package:build_pip_package
REFERENCES & FURTHER READING
• Paper on Large-Scale Machine Learning on Heterogeneous
Distributed Systems
• TensorFlow Documentation
• TensorFlow Tutorials
• Hands-On Machine Learning with Scikit-Learn and TensorFlow
by Aurélien Géron
THANK YOU!