http://pipeline.ai
Title
PipelineAI Distributed Spark ML + Tensorflow AI + GPU Workshop
*A GPU-based cloud instance will be provided to each attendee as part of this event
Highlights
We will each build an end-to-end, continuous Tensorflow AI model training and deployment pipeline on our own GPU-based cloud instance.
At the end, we will combine our cloud instances to create the LARGEST Distributed Tensorflow AI Training and Serving Cluster in the WORLD!
Agenda
Spark ML
Tensorflow AI
Storing and Serving Models with HDFS
Trade-offs of CPU vs. *GPU, Scale Up vs. Scale Out
CUDA + cuDNN GPU Development Overview
Tensorflow Model Checkpointing, Saving, Exporting, and Importing
Distributed Tensorflow AI Model Training (Distributed Tensorflow)
Centralized Logging and Visualizing of Distributed Tensorflow Training (Tensorboard)
Distributed Tensorflow AI Model Serving/Predicting (Tensorflow Serving)
Centralized Logging and Metrics Collection (Prometheus, Grafana)
Continuous Tensorflow AI Model Deployment (Tensorflow, Airflow)
Hybrid Cross-Cloud and On-Premise Deployments (Kubernetes)
High-Performance and Fault-Tolerant Microservices using Request Batching and Circuit Breakers (NetflixOSS)
Github Repo
https://github.com/fluxcapacitor/pipeline
High Performance TensorFlow in Production -- Sydney ML / AI Train Workshop @ UAI Conference
1. HIGH PERFORMANCE TENSORFLOW IN
PRODUCTION WITH GPUS!
CHRIS FREGLY,
FOUNDER @ PIPELINE.AI
ML TRAIN, SYDNEY 2017
2. INTRODUCTIONS: ME
§ Chris Fregly, Research Engineer @ PipelineAI
§ Formerly Netflix and Databricks
§ Advanced Spark and TensorFlow Meetup
Please Join Our 20,000+ Members Globally!
* San Francisco
* Chicago
* Washington DC
* London
Please Join!!
3. INTRODUCTIONS: YOU
§ Software Engineer or Data Scientist interested in optimizing
and deploying TensorFlow models to production
§ Assume you have a working knowledge of TensorFlow
4. 100% OPEN SOURCE CODE
§ https://github.com/fluxcapacitor/pipeline/
§ Please Star this Repo!
§ Slides, code, notebooks, Docker images available here:
https://github.com/fluxcapacitor/pipeline/gpu.ml
5. HANDS-ON EXERCISES
§ Combo of Jupyter Notebooks and Command Line
§ Command Line through Jupyter Terminal
§ Some Exercises Based on Experimental Features
Warning: You Will See Errors. You will be OK!!
6. CONTENT NOTES
§ 50% Training Optimizations (GPUs, XLA, JIT)
§ 50% Predicting Optimizations (XLA, AOT, TF Serving)
§ Why Heavy Focus on Predicting?
§ Training: boring batch, O(num_data_scientists)
§ Inference: exciting realtime, O(num_users_of_app)
§ We Use Simple Models to Highlight Optimizations
Warning: This is not intro material. You will be OK!
7. YOU WILL LEARN…
§ Part 1: TensorFlow Model Training
§ TensorFlow and GPUs
§ Inspect and Debug Models
§ Distributed Training Across a Cluster
§ Optimize Training with Queues, Dataset API, and JIT XLA Compiler
§ Part 2: TensorFlow Model Deploying and Serving
§ Optimize Predicting with AOT XLA and Graph Transform Tool (GTT)
§ Deploy Model and Predict
§ Key Components of TensorFlow Serving
§ Optimize TensorFlow Serving Runtime
8. AGENDA
§ GPUs and TensorFlow
§ Train and Debug TensorFlow Model
§ Train with Distributed TensorFlow Cluster
§ Optimize Model Training with XLA JIT Compiler
§ Optimize Model Predicting with XLA AOT and Graph Transforms
§ Deploy Model to TensorFlow Serving Runtime
§ Optimize TensorFlow Serving Runtime
§ Wrap-up and Q&A
10. SETUP ENVIRONMENT
§ Step 1: Browse to the following:
http://allocator.community.pipeline.ai/allocate
§ Step 2: Browse to the following:
http://<ip-address>
§ Step 3: Browse around.
I will provide a username/password in a bit!
Need Help?
Use the Chat!
12. LET'S EXPLORE OUR ENVIRONMENT
§ Navigate to the following notebook:
01_Explore_Environment
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/notebooks/
15. SETTING UP TENSORFLOW WITH GPUS
§ Very Painful!
§ Especially inside Docker
§ Use nvidia-docker
§ Especially on Kubernetes!
§ Use Kubernetes 1.7+
§ http://pipeline.ai for GitHub + DockerHub Links
16. GPU HALF-PRECISION SUPPORT
§ FP16, INT8 are "Half Precision"
§ Supported by Pascal P100 (2016) and Volta V100 (2017)
§ Flexible FP32 GPU Cores Can Fit 2 FP16s for 2x Throughput!
§ Half-Precision is OK for Approximate Deep Learning Use Cases
17. VOLTA V100 RECENTLY ANNOUNCED
§ 84 Streaming Multiprocessors (SMs)
§ 5,376 GPU Cores
§ 672 Tensor Cores (ie. Google TPU)
§ Mixed FP16/FP32 Precision
§ More Shared Memory
§ New L0 Instruction Cache
§ Faster L1 Data Cache
§ V100 vs. P100 Performance
§ 12x TFLOPS @ Peak Training
§ 6x Inference Throughput
18. V100 AND CUDA 9
§ Independent Thread Scheduling - Finally!!
§ Similar to CPU fine-grained thread synchronization semantics
§ Allows GPU to yield execution of any thread
§ Still Optimized for SIMT (Same Instruction Multiple Thread)
§ SIMT units automatically scheduled together
§ Explicit Synchronization
19. GPU CUDA PROGRAMMING
§ Barbaric, But Fun Barbaric!
§ Must Know Underlying Hardware Very Well
§ Many Great Debuggers/Profilers
§ Hardware Changes are Painful!
§ Newer CUDA versions automatically JIT-compile old CUDA code to new NVPTX
§ Not optimal, of course
20. CUDA STREAMS
§ Asynchronous I/O Transfer
§ Overlap Compute and I/O
§ Keeps GPUs Saturated
§ Fundamental to Queue Framework in TensorFlow
21. LET'S SEE WHAT THIS THING CAN DO!
§ Navigate to the following notebook:
01a_Explore_GPU
01b_Explore_Numba
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/notebooks/
22. AGENDA
§ GPUs and TensorFlow
§ Train and Debug TensorFlow Model
§ Train with Distributed TensorFlow Cluster
§ Optimize Model Training with XLA JIT Compiler
§ Optimize Model Predicting with XLA AOT and Graph Transforms
§ Deploy Model to TensorFlow Serving Runtime
§ Optimize TensorFlow Serving Runtime
§ Wrap-up and Q&A
23. TRAINING TERMINOLOGY
§ Tensors: N-Dimensional Arrays
§ ie. Scalar, Vector, Matrix
§ Operations: MatMul, Add, SummaryLog, …
§ Graph: Graph of Operations (DAG)
§ Session: Contains Graph(s)
§ Feeds: Feed inputs into Operation
§ Fetches: Fetch output from Operation
§ Variables: What we learn through training
§ aka "weights", "parameters"
§ Devices: Hardware device on which we train
(Diagram: User feeds Inputs and fetches Outputs; TensorFlow performs Operations, flows Tensors, and trains Variables)
with tf.device("worker:0/device/gpu:0,worker:1/device/gpu:0")
24. TRAINING DEVICES
§ cpu:0
§ By default, all CPUs
§ Requires extra config to target a CPU
§ gpu:0..n
§ Each GPU has a unique id
§ TF usually prefers a single GPU
§ xla_cpu:0, xla_gpu:0..n
§ "JIT Compiler Device"
§ Hints TensorFlow to attempt JIT Compile
with tf.device("/cpu:0"):
with tf.device("/gpu:0"):
with tf.device("/gpu:1"):
25. TRAINING METRICS: TENSORBOARD
§ Summary Ops
§ Event Files
/root/tensorboard/linear/<version>/events…
§ Tags
§ Organize data within Tensorboard UI
loss_summary_op = tf.summary.scalar('loss', loss)
merge_all_summary_op = tf.summary.merge_all()
summary_writer = tf.summary.FileWriter(
    '/root/tensorboard/linear/<version>',
    graph=sess.graph)
26. TRAINING ON EXISTING INFRASTRUCTURE
§ Data Processing
§ HDFS/Hadoop
§ Spark
§ Containers
§ Docker
§ Schedulers
§ Kubernetes
§ Mesos
<dependency>
<groupId>org.tensorflow</groupId>
<artifactId>tensorflow-hadoop</artifactId>
</dependency>
https://github.com/tensorflow/ecosystem
27. TRAINING PIPELINES: QUEUE + DATASET
§ Don't Use feed_dict for Production Workloads!!
§ feed_dict Requires Python <-> C++ Serialization
§ Retrieval is Single-threaded, Synchronous, SLOW!
§ Canât Retrieve Until Current Batch is Complete
§ CPUs/GPUs Not Fully Utilized!
§ Use Queue or Dataset API
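A minimal sketch of the Dataset path, with toy in-memory tensors standing in for TFRecord files (shown with eager iteration from newer TF releases; under the 1.x runtime this deck uses, you would pull batches through an iterator inside a Session):

```python
import tensorflow as tf

# Toy in-memory features standing in for TFRecord files on HDFS/S3.
features = tf.range(10, dtype=tf.float32)

dataset = (tf.data.Dataset.from_tensor_slices(features)
           .shuffle(buffer_size=10)  # shuffle on the CPU side
           .batch(4)                 # 4 examples per batch
           .prefetch(1))             # overlap input prep with compute

batches = [b.numpy() for b in dataset]  # eager iteration (TF 2.x)
```

Because the pipeline runs on CPU threads and prefetches ahead of the consumer, the accelerator never waits on Python, which is exactly the problem with feed_dict.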
28. QUEUES
§ More than Just a Traditional Queue
§ Perform I/O, pre-processing, cropping, shuffling
§ Pulls from HDFS, S3, Google Storage, Kafka, ...
§ Combine many small files into large TFRecord files
§ Typically use CPUs to focus GPUs on compute
§ Uses CUDA Streams
29. DATA MOVEMENT WITH QUEUES
§ GPU Pulls Batch from Queue (CUDA Streams)
§ GPU pulls next batch while processing current batch
GPUs Stay Fully Utilized!
30. QUEUE CAPACITY PLANNING
§ batch_size
§ # examples / batch (ie. 64 jpg)
§ Limited by GPU RAM
§ num_processing_threads
§ CPU threads pull and pre-process batches of data
§ Limited by CPU Cores
§ queue_capacity
§ Limited by CPU RAM (ie. 5 * batch_size)
31. DETECT UNDERUTILIZED CPUS, GPUS
§ Instrument training code to generate "timelines"
§ Analyze with Google Web Tracing Framework (WTF)
§ Monitor CPU with `top`, GPU with `nvidia-smi`
http://google.github.io/tracing-framework/
from tensorflow.python.client import timeline
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.json', 'w') as trace_file:
    trace_file.write(
        trace.generate_chrome_trace_format(show_memory=True))
32. LET'S FEED DATA WITH A QUEUE
§ Navigate to the following notebook:
02_Feed_Queue_HDFS
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/notebooks/
35. TENSORFLOW MODEL
§ MetaGraph
§ Combines GraphDef and Metadata
§ GraphDef
§ Architecture of your model (nodes, edges)
§ Metadata
§ Asset: Accompanying assets to your model
§ SignatureDef: Maps external : internal tensors
§ Variables
§ Stored separately during training (checkpoint)
§ Allows training to continue from any checkpoint
§ Variables are "frozen" into Constants when deployed for inference
(Diagram: GraphDef with inputs x, W, b flowing through mul and add ops; MetaGraph = GraphDef + Metadata: Assets, SignatureDef, Tags, Version; Variables: "W": 0.328, "b": -1.407)
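Once variables are frozen into constants for inference, serving the toy linear model above is just forward math; a plain-Python sketch using the slide's example values (W = 0.328, b = -1.407):

```python
# Frozen Variable values from the MetaGraph example above.
W = 0.328
b = -1.407

def predict(x):
    """Forward pass of the frozen graph: mul (W * x), then add (+ b)."""
    return W * x + b

y = predict(2.0)
```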
37. LET'S TRAIN A MODEL (CPU)
§ Navigate to the following notebook:
03_Train_Model_CPU
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/notebooks/
38. LET'S TRAIN A MODEL (GPU)
§ Navigate to the following notebook:
03a_Train_Model_GPU
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/notebooks/
39. TENSORFLOW DEBUGGER
§ Step through Operations
§ Inspect Inputs and Outputs
§ Wrap Session in Debug Session
from tensorflow.python import debug as tf_debug

sess = tf.Session(config=config)
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
40. LET'S DEBUG A MODEL
§ Navigate to the following notebook:
04_Debug_Model
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/notebooks/
41. BATCH NORMALIZATION
§ Each Mini-Batch May Have Wildly Different Distributions
§ Normalize per batch (and layer)
§ Speeds up Training!!
§ Weights are Learned Quicker
§ Final Model is More Accurate
§ Final mean and variance will be folded into Graph later
-- Always Use Batch Normalization! --
z = tf.matmul(a_prev, W)
a = tf.nn.relu(z)
a_mean, a_var = tf.nn.moments(a, [0])
scale = tf.Variable(tf.ones([depth]))  # depth = number of channels
beta = tf.Variable(tf.zeros([depth]))
bn = tf.nn.batch_normalization(a, a_mean, a_var,
                               beta, scale, 0.001)
42. AGENDA
§ GPU Environment
§ Train and Debug TensorFlow Model
§ Train with Distributed TensorFlow Cluster
§ Optimize Model Training with XLA JIT Compiler
§ Optimize Model Predicting with XLA AOT and Graph Transforms
§ Deploy Model to TensorFlow Serving Runtime
§ Optimize TensorFlow Serving Runtime
§ Wrap-up and Q&A
43. MULTI-GPU TRAINING (SINGLE NODE)
§ Variables stored on CPU (cpu:0)
§ Model graph (aka "replica", "tower") is copied to each GPU (gpu:0, gpu:1, …)
Multi-GPU Training Steps:
1. CPU transfers model to each GPU
2. CPU waits on all GPUs to finish batch
3. CPU copies all gradients back from all GPUs
4. CPU synchronizes + AVG all gradients from GPUs
5. CPU updates GPUs with new variables/weights
6. Repeat Step 1 until reaching stop condition (ie. max_epochs)
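Steps 2 through 5 amount to averaging the per-tower gradients before a single variable update; a numpy sketch with synthetic gradients (names and values are illustrative):

```python
import numpy as np

# Variables live on cpu:0; each tower (GPU) computes gradients on its batch.
weights = np.zeros(3)
tower_gradients = [np.array([0.1, 0.2, 0.3]),  # from gpu:0
                   np.array([0.3, 0.2, 0.1])]  # from gpu:1

# Step 4: CPU synchronizes and averages gradients across all GPUs.
avg_grad = np.mean(tower_gradients, axis=0)

# Step 5: one SGD update, then the new weights are pushed back to every GPU.
learning_rate = 0.5
weights = weights - learning_rate * avg_grad
```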
44. DISTRIBUTED, MULTI-NODE TRAINING
§ TensorFlow Automatically Inserts Send and Receive Ops into Graph
§ Parameter Server Synchronously Aggregates Updates to Variables
§ Nodes with Multiple GPUs will Pre-Aggregate Before Sending to PS
(Diagram: worker nodes with one to four GPUs each (gpu0..gpu3), pre-aggregating gradients locally before sending to the parameter server)
45. SYNCHRONOUS VS. ASYNCHRONOUS
§ Synchronous
§ Nodes compute gradients
§ Nodes update Parameter Server (PS)
§ Nodes sync on PS for latest gradients
§ Asynchronous
§ Some nodes delay in computing gradients
§ Nodes don't update PS
§ Nodes get stale gradients from PS
§ May not converge due to stale reads!
46. DATA PARALLEL VS MODEL PARALLEL
§ Data Parallel ("Between-Graph Replication")
§ Send exact same model to each device
§ Each device operates on its partition of data
§ ie. Spark sends same function to many workers
§ Each worker operates on their partition of data
§ Model Parallel ("In-Graph Replication")
§ Send different partition of model to each device
§ Each device operates on all data
Very Difficult!!
Required for Large Models.
(GPU RAM Limitation)
47. DISTRIBUTED TENSORFLOW CONCEPTS
§ Client
§ Program that builds a TF Graph, constructs a session, interacts with the cluster
§ Written in Python, C++
§ Cluster
§ Set of distributed nodes executing a graph
§ Nodes can play any role
§ Jobs ("Roles")
§ Parameter Server ("ps") stores and updates variables
§ Worker ("worker") performs compute-intensive tasks (stateless)
§ Assigned 0..* tasks
§ Task ("Server Process")
"ps" and "worker" are conventional names
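The cluster is described to every task as a map from job name to task addresses; a sketch of the dict you would hand to tf.train.ClusterSpec (hostnames and ports are made up):

```python
# Job name -> ordered list of task addresses; a task's index is its
# position in the list (worker task 0 is the chief by default).
cluster = {
    "ps":     ["ps0.example.com:2222"],      # stores/updates variables
    "worker": ["worker0.example.com:2222",   # task 0 = chief worker
               "worker1.example.com:2222"],  # stateless compute
}

# Each server process then starts itself with its role, e.g. (TF 1.x):
#   server = tf.train.Server(cluster, job_name="worker", task_index=0)
chief_worker = cluster["worker"][0]
```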
48. CHIEF WORKER
§ Worker Task 0 is Chosen by Default
§ Task 0 is guaranteed to exist
§ Implements Maintenance Tasks
§ Writes checkpoints
§ Initializes parameters at start of training
§ Writes log summaries
§ Parameter Server health checks
49. NODE AND PROCESS FAILURES
§ Checkpoint to Persistent Storage (HDFS, S3)
§ Use MonitoredTrainingSession and Hooks
§ Use a Good Cluster Orchestrator (ie. Kubernetes, Mesos)
§ Understand Failure Modes and Recovery States
Stateless, Not Bad: Training Continues. Stateful, Bad: Training Must Stop. Dios Mio! Long Night Ahead…
50. VALIDATING DISTRIBUTED MODEL
§ Separate Training and Validation Clusters
§ Validate using Saved Checkpoints from Parameter Servers
§ Avoids Resource Contention
(Diagram: separate Training Cluster and Validation Cluster, both connected to the Parameter Server Cluster)
51. LET'S TRAIN WITH DISTRIBUTED CPU
§ Navigate to the following notebook:
05_Train_Model_Distributed_CPU
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/notebooks/
52. LET'S TRAIN WITH DISTRIBUTED GPU
§ Navigate to the following notebook:
05a_Train_Model_Distributed_GPU
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/notebooks/
53. NEW(-ISH): EXPERIMENT + ESTIMATOR
§ Higher-Level APIs Simplify Distributed Training
§ Picks Up Configuration from Environment
§ Supports Custom Models (ie. Keras)
§ Used for Training, Validation, and Prediction
§ API is Changing, but Patterns Remain the Same
§ Works Well with Google Cloud ML (Surprised?!)
56. AGENDA
§ GPUs and TensorFlow
§ Train and Debug TensorFlow Model
§ Train with Distributed TensorFlow Cluster
§ Optimize Model Training with XLA JIT Compiler
§ Optimize Model Predicting with XLA AOT and Graph Transforms
§ Deploy Model to TensorFlow Serving Runtime
§ Optimize TensorFlow Serving Runtime
§ Wrap-up and Q&A
57. XLA FRAMEWORK
§ Accelerated Linear Algebra (XLA)
§ Goals:
§ Reduce reliance on custom operators
§ Improve execution speed
§ Improve memory usage
§ Reduce mobile footprint
§ Improve portability
§ Helps TF Stay Flexible and Performant
58. XLA HIGH LEVEL OPTIMIZER (HLO)
§ Compiler Intermediate Representation (IR)
§ Independent of source and target language
§ Define Graphs using HLO Language
§ XLA Step 1 Emits Target-Independent HLO
§ XLA Step 2 Emits Target-Dependent LLVM
§ LLVM Emits Native Code Specific to Target
§ Supports x86-64, ARM64 (CPU), and NVPTX (GPU)
59. JIT COMPILER
§ Just-In-Time Compiler
§ Built on XLA Framework
§ Goals:
§ Reduce memory movement, especially useful on GPUs
§ Reduce overhead of multiple function calls
§ Similar to operator fusing in Spark 2.0
§ Unroll Loops, Fuse Operators, Fold Constants, âŚ
§ Scope to session, device, or `with jit_scope():`
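Enabling the JIT at session scope looks like this under the 1.x API used throughout this deck (a config fragment, not a complete program; the `tf.contrib` path for op-level scoping moved in later releases):

```python
# Session-scoped JIT: compile all eligible subgraphs (TF 1.x API).
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = (
    tf.OptimizerOptions.ON_1)
sess = tf.Session(config=config)

# Or scope the JIT to individual ops:
# from tensorflow.contrib.compiler import jit
# with jit.experimental_jit_scope():
#     y = tf.matmul(x, W)
```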
60. VISUALIZING JIT COMPILER IN ACTION
(Screenshots: timeline traces before and after JIT compilation)
Google Web Tracing Framework:
http://google.github.io/tracing-framework/
from tensorflow.python.client import timeline
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.json', 'w') as trace_file:
    trace_file.write(
        trace.generate_chrome_trace_format(show_memory=True))
62. LET'S TRAIN WITH XLA CPU
§ Navigate to the following notebook:
06_Train_Model_XLA_CPU
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/notebooks/
63. LET'S TRAIN WITH XLA GPU
§ Navigate to the following notebook:
06a_Train_Model_XLA_GPU
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/notebooks/
64. AGENDA
§ GPUs and TensorFlow
§ Train and Debug TensorFlow Model
§ Train with Distributed TensorFlow Cluster
§ Optimize Model Training with XLA JIT Compiler
§ Optimize Model Predicting with XLA AOT and Graph Transforms
§ Deploy Model to TensorFlow Serving Runtime
§ Optimize TensorFlow Serving Runtime
§ Wrap-up and Q&A
65. AOT COMPILER
§ Standalone, Ahead-Of-Time (AOT) Compiler
§ Built on XLA framework
§ tfcompile
§ Creates executable with minimal TensorFlow Runtime needed
§ Includes only dependencies needed by subgraph computation
§ Creates functions with feeds (inputs) and fetches (outputs)
§ Packaged as cc_library header and object files to link into your app
§ Commonly used for mobile device inference graph
§ Currently, only CPU x86-64 and ARM are supported - no GPU
66. GRAPH TRANSFORM TOOL (GTT)
§ Optimize Trained Models for Inference
§ Remove training-only Ops (checkpoint, drop out, logs)
§ Remove unreachable nodes between given feed -> fetch
§ Fuse adjacent operators to improve memory bandwidth
§ Fold final batch norm mean and variance into variables
§ Round weights/variables improves compression (ie. 70%)
§ Quantize weights and activations simplifies model
§ FP32 down to INT8
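These transforms are applied offline with the Graph Transform Tool's transform_graph CLI, built from the TensorFlow source tree; a sketch with placeholder paths and the feed/fetch names of a toy graph:

```shell
# Build once: bazel build tensorflow/tools/graph_transforms:transform_graph
bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
  --in_graph=unoptimized_model.pb \
  --out_graph=optimized_model.pb \
  --inputs='x_observed' \
  --outputs='add' \
  --transforms='
    strip_unused_nodes
    remove_nodes(op=Identity, op=CheckNumerics)
    fold_constants(ignore_errors=true)
    fold_batch_norms
    quantize_weights'
```

The transforms run in the order listed, which is why the notebooks layer them one at a time in the slides that follow.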
68. AFTER STRIPPING UNUSED NODES
§ Optimizations
§ strip_unused_nodes
§ Results
§ Graph much simpler
§ File size much smaller
69. AFTER REMOVING UNUSED NODES
§ Optimizations
§ strip_unused_nodes
§ remove_nodes
§ Results
§ Pesky nodes removed
§ File size a bit smaller
70. AFTER FOLDING CONSTANTS
§ Optimizations
§ strip_unused_nodes
§ remove_nodes
§ fold_constants
§ Results
§ Placeholders (feeds) -> Variables*
(*Why Variables and not Constants?)
71. AFTER FOLDING BATCH NORMS
§ Optimizations
§ strip_unused_nodes
§ remove_nodes
§ fold_constants
§ fold_batch_norms
§ Results
§ Graph remains the same
§ File size approximately the same
72. WEIGHT QUANTIZATION
§ FP16 and INT8 Are Smaller and Computationally Simpler
§ Weights/Variables are Constants
§ Easy to Linearly Quantize
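A numpy sketch of the linear idea: map the FP32 weight range onto 256 evenly spaced 8-bit buckets and back (illustrative only, not the exact scheme the Graph Transform Tool uses):

```python
import numpy as np

weights = np.array([-1.407, -0.5, 0.0, 0.328, 1.0], dtype=np.float32)

# Affine/linear quantization: FP32 range -> 256 unsigned 8-bit buckets.
w_min, w_max = float(weights.min()), float(weights.max())
scale = (w_max - w_min) / 255.0
quantized = np.round((weights - w_min) / scale).astype(np.uint8)

# Dequantize at load time: small rounding error, ~4x smaller on disk.
dequantized = quantized.astype(np.float32) * scale + w_min
max_error = float(np.abs(weights - dequantized).max())
```

Because the frozen weights are constants, the (min, max, scale) can be computed once at export time.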
73. AFTER QUANTIZING WEIGHTS
§ Optimizations
§ strip_unused_nodes
§ remove_nodes
§ fold_constants
§ fold_batch_norms
§ quantize_weights
§ Results
§ Graph is same, file size is smaller, compute is faster
74. LET'S OPTIMIZE FOR INFERENCE
§ Navigate to the following notebook:
07_Optimize_Model*
(*Why just CPU version? Why not both CPU and GPU?)
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/notebooks/
76. ACTIVATION QUANTIZATION
§ Activations Not Known Ahead of Time
§ Depends on input, not easy to quantize
§ Requires Additional Calibration Step
§ Use a ârepresentativeâ dataset
§ Per Neural Network Layer…
§ Collect histogram of activation values
§ Generate many quantized distributions with different saturation thresholds
§ Choose threshold to minimize…
KL_divergence(ref_distribution, quant_distribution)
§ Not Much Time or Data is Required (Minutes on Commodity Hardware)
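A numpy sketch of that calibration loop: clip to a candidate saturation threshold, quantize to a small number of levels, and keep the threshold whose histogram is closest (in KL divergence) to the reference. Synthetic activations and 32 levels are used purely for illustration; real calibration uses a representative dataset per layer:

```python
import numpy as np

rng = np.random.default_rng(0)
activations = np.abs(rng.normal(0.0, 1.0, 10_000))  # post-ReLU-like values

def kl_divergence(p, q, eps=1e-8):
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    return float(np.sum(p * np.log(p / q)))

bins = 128
full_range = (0.0, float(activations.max()))
ref_hist, _ = np.histogram(activations, bins=bins, range=full_range)

best_threshold, best_kl = None, float("inf")
for threshold in np.linspace(0.5, full_range[1], 20):
    # Saturate above the threshold, then quantize to 32 levels below it.
    clipped = np.clip(activations, 0.0, threshold)
    step = threshold / 31.0
    quant = np.round(clipped / step) * step
    quant_hist, _ = np.histogram(quant, bins=bins, range=full_range)
    kl = kl_divergence(ref_hist.astype(float), quant_hist.astype(float))
    if kl < best_kl:
        best_threshold, best_kl = threshold, kl
```

A small threshold wastes range on saturation; a large one wastes resolution on outliers, which is exactly the trade-off the KL search resolves.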
79. LET'S OPTIMIZE FOR INFERENCE
§ Navigate to the following notebook:
08_Optimize_Model_Activations
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/notebooks/
80. LINEARIZE GRAPH EXECUTION ORDER
§ https://github.com/yaroslavvb/stuff
Linearize to minimize graph memory usage
81. AGENDA
§ GPUs and TensorFlow
§ Train and Debug TensorFlow Model
§ Train with Distributed TensorFlow Cluster
§ Optimize Model Training with XLA JIT Compiler
§ Optimize Model Predicting with XLA AOT and Graph Transforms
§ Deploy Model to TensorFlow Serving Runtime
§ Optimize TensorFlow Serving Runtime
§ Wrap-up and Q&A
82. MODEL SERVING TERMINOLOGY
§ Inference
§ Only Forward Propagation through Network
§ Predict, Classify, Regress, …
§ Bundle
§ GraphDef, Variables, Metadata, …
§ Assets
§ ie. Map of ClassificationID -> String
§ {9283: "penguin", 9284: "bridge"}
§ Version
§ Every Model Has a Version Number (Integer)
§ Version Policy
§ ie. Serve Only Latest (Highest), Serve Both Latest and Previous, …
83. TENSORFLOW SERVING FEATURES
§ Supports Auto-Scaling
§ Custom Loaders beyond File-based
§ Tune for Low-latency or High-throughput
§ Serve Diff Models/Versions in Same Process
§ Customize Models Types beyond HashMap and TensorFlow
§ Customize Version Policies for A/B and Bandit Tests
§ Support Request Draining for Graceful Model Updates
§ Enable Request Batching for Diff Use Cases and HW
§ Supports Optimized Transport with GRPC and Protocol Buffers
84. PREDICTION SERVICE
§ Predict (Original, Generic)
§ Input: List of Tensor
§ Output: List of Tensor
§ Classify
§ Input: List of tf.Example (key, value) pairs
§ Output: List of (class_label: String, score: float)
§ Regress
§ Input: List of tf.Example (key, value) pairs
§ Output: List of (label: String, score: float)
86. MULTI-HEADED INFERENCE
§ Multiple "Heads" of Model
§ Return class and scores to be fed into another model
§ Inputs Propagated Forward Only Once
§ Optimizes Bandwidth, CPU, Latency, Memory, Coolness
87. BUILD YOUR OWN MODEL SERVER (?!)
§ Adapt gRPC (Google) <-> HTTP (REST of the World)
§ Perform Batch Inference vs. Request/Response
§ Handle Requests Asynchronously
§ Support Mobile, Embedded Inference
§ Customize Request Batching
§ Add Circuit Breakers, Fallbacks
§ Control Latency Requirements
§ Reduce Number of Moving Parts
#include "tensorflow_serving/model_servers/server_core.h"

class MyTensorFlowModelServer {
 public:
  void Start() {
    ServerCore::Options options;
    // set options (model name, path, etc)
    std::unique_ptr<ServerCore> core;
    TF_CHECK_OK(
        ServerCore::Create(std::move(options), &core));
  }
};

Compile and link against libtensorflow.so
89. LET'S DEPLOY OPTIMIZED MODEL
§ Navigate to the following notebook:
09_Deploy_Optimized_Model
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/notebooks/
90. AGENDA
§ GPUs and TensorFlow
§ Train and Debug TensorFlow Model
§ Train with Distributed TensorFlow Cluster
§ Optimize Model Training with XLA JIT Compiler
§ Optimize Model Predicting with XLA AOT and Graph Transforms
§ Deploy Model to TensorFlow Serving Runtime
§ Optimize TensorFlow Serving Runtime
§ Wrap-up and Q&A
91. REQUEST BATCH TUNING
§ max_batch_size
§ Enables throughput/latency tradeoff
§ Bounded by RAM
§ batch_timeout_micros
§ Defines batch time window, latency upper-bound
§ Bounded by RAM
§ num_batch_threads
§ Defines parallelism
§ Bounded by CPU cores
§ max_enqueued_batches
§ Defines queue upper bound, throttling
§ Bounded by RAM
Reaching either threshold
will trigger a batch
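The four knobs above map directly onto a TensorFlow Serving batching parameters file, passed at startup alongside --enable_batching via --batching_parameters_file=... (the values below are illustrative starting points to tune, not recommendations):

```proto
max_batch_size { value: 64 }          # throughput/latency trade-off, RAM-bound
batch_timeout_micros { value: 1000 }  # latency upper bound for filling a batch
num_batch_threads { value: 8 }        # parallelism, CPU-core-bound
max_enqueued_batches { value: 16 }    # queue bound; overflow is throttled
```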
92. BATCH SCHEDULER STRATEGIES
§ BasicBatchScheduler
§ Best for homogeneous request types (ie. always classify or always regress)
§ Async callback upon max_batch_size or batch_timeout_micros
§ BatchTask encapsulates unit of work to be batched
§ SharedBatchScheduler
§ Best for heterogeneous request types, multi-step inference, ensembles, …
§ Groups BatchTasks into separate queues to form homogeneous batches
§ Processes batches fairly through interleaving
§ StreamingBatchScheduler
§ Mixed CPU/GPU/IO-bound workloads
§ Provides fine-grained control for complex, multi-phase inference logic
You Must Experiment to Find
the Best Strategy for You!!
Co-locate and Isolate Homogeneous Workloads
93. LET'S DEPLOY OPTIMIZED MODEL
§ Navigate to the following notebook:
10_Optimize_Model_Server
§ https://github.com/fluxcapacitor/pipeline/gpu.ml/notebooks/
94. AGENDA
§ GPUs and TensorFlow
§ Train and Debug TensorFlow Model
§ Train with Distributed TensorFlow Cluster
§ Optimize Model Training with XLA JIT Compiler
§ Optimize Model Predicting with XLA AOT and Graph Transforms
§ Deploy Model to TensorFlow Serving Runtime
§ Optimize TensorFlow Serving Runtime
§ Wrap-up and Q&A
95. YOU HAVE JUST LEARNED…
§ Part 1: TensorFlow Model Training
§ TensorFlow and GPUs
§ Inspect and Debug Models
§ Distributed Training Across a Cluster
§ Optimize Training with Queues, Dataset API, and JIT XLA Compiler
§ Part 2: TensorFlow Model Deploying and Serving
§ Optimize Predicting with AOT XLA and Graph Transform Tool (GTT)
§ Deploy Model and Predict
§ Key Components of TensorFlow Serving
§ Optimize TensorFlow Serving Runtime