TensorRT Survey
issue.hsu@gmail.com
2017
TensorRT
• NVIDIA TensorRT™ is a high-performance deep learning inference optimizer and runtime that
delivers low latency, high-throughput inference for deep learning applications.
• NVIDIA released TensorRT in 2016 with the goal of accelerating deep learning inference for
production deployment.
Deploying a model with TensorRT
• Import: the trained model is converted to UFF (Universal Framework Format), TensorRT's internal format used to represent the network graph before running optimizations.
• Optimize: TensorRT performs optimizations for specified parameters such as batch size, precision, and workspace memory for the target deployment GPU.
• Serialize: the output of the TensorRT optimization is a runtime inference engine that can be serialized to disk as a plan file. A plan file includes not only the weights, but also the schedule for the kernels to execute the network.
• Deploy: at runtime, load and deserialize a saved plan file to create a TensorRT engine object.
TensorRT supported layers
• Natively supported layers:
• Convolution
• LSTM and GRU
• Activation: ReLU, tanh, sigmoid
• Pooling: max and average
• Scaling
• Element-wise operations
• LRN
• Fully-connected
• SoftMax
• Deconvolution
• TensorRT provides a Custom Layer API to enable you
to define your own custom layers that aren’t natively
supported
• These custom layers are defined using C++ to make it easy
to leverage highly optimized CUDA libraries like cuDNN
and cuBLAS
TensorRT Optimizations
• Key optimizations:
• Layer and tensor fusion and elimination of unused layers
• FP16 and INT8 reduced precision calibration
• Target-specific autotuning
• Efficient memory reuse
• Multi-Stream Execution
• TensorRT performs these optimizations automatically under the hood for you.
• All you need to specify is the UFF inference graph to optimize, the inference batch size, the
amount of workspace GPU memory (used for CUDA kernel scratch space), and the target
inference precision, as the following code shows.
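The code image from the original slide is not preserved in this transcript; the following is a minimal sketch reconstructed from the referenced "TensorRT 3: Faster TensorFlow Inference and Volta Support" post. File names, node names, and shapes are placeholders, and exact signatures vary across TensorRT versions:

import tensorrt as trt
import uff
from tensorrt.parsers import uffparser

# Convert a frozen TensorFlow graph to UFF (file/node names are placeholders).
uff_model = uff.from_tensorflow_frozen_model("frozen_graph.pb", ["out/Softmax"])

# Describe the network inputs and outputs for the UFF parser (CHW shape).
parser = uffparser.create_uff_parser()
parser.register_input("input", (3, 224, 224), 0)
parser.register_output("out/Softmax")

# Build the engine: logger, model, parser, max batch size,
# workspace memory (CUDA kernel scratch space), and target precision.
G_LOGGER = trt.infer.ConsoleLogger(trt.infer.LogSeverity.INFO)
engine = trt.utils.uff_to_trt_engine(G_LOGGER, uff_model, parser,
                                     1,        # max batch size
                                     1 << 20,  # max workspace size (bytes)
                                     trt.infer.DataType.FLOAT)

# Serialize the optimized engine to a plan file for deployment.
trt.utils.write_engine_to_file("model.engine", engine.serialize())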
Optimization 1: Layer & Tensor Fusion
• TensorRT parses the network computational graph and looks for opportunities to
perform graph optimizations.
• These graph optimizations do not change the underlying computation in the
graph: instead, they look to restructure the graph to perform the operations
much faster and more efficiently.
• TensorRT can also eliminate the concatenation layers in “concat” by preallocating output
buffers and writing into them in a strided fashion.
Optimization 2: FP16 and INT8 Precision
Calibration
• Most deep learning frameworks train neural networks in full 32-bit precision (FP32).
• Once the model is fully trained, inference computations can use half precision FP16 or even INT8 tensor operations, since
gradient backpropagation is not required for inference.
• Using lower precision results in smaller model size, lower memory utilization and latency, and higher throughput.
• TensorRT can deploy models in FP32, FP16 and INT8
• To quantize full-precision information into INT8 while minimizing accuracy loss, TensorRT must perform a
process called calibration to determine how best to represent the weights and activations as 8-bit integers.
• The calibration step requires you to provide TensorRT with a representative sample of the input training data.
• No additional fine tuning or retraining of the model is necessary, and you don’t need to have access to the entire training
dataset.
• Calibration is a completely automated and parameter-free method for converting FP32 to INT8.
Optimization 3: Kernel Auto-tuning
• During the optimization phase TensorRT also chooses from hundreds
of specialized kernels, many of them hand-tuned and optimized for a
range of parameters and target platforms.
• As an example, there are several different algorithms to do convolutions.
• TensorRT will pick the implementation from a library of kernels that delivers
the best performance for the target GPU, input data size, filter size, tensor
layout, batch size and other parameters.
• This ensures that the deployed model is performance tuned for the
specific deployment platform as well as for the specific neural
network being deployed.
Optimization 4: Dynamic Tensor Memory
• TensorRT reduces memory footprint and improves memory reuse by
allocating memory for each tensor only for the duration of its usage,
avoiding memory allocation overhead for fast and efficient execution.
Optimization 5: Multi-Stream Execution
• Scales to multiple input streams, by processing them in parallel using
the same model and weights
TensorRT Run-Time Inference
• You’re now ready to deploy your application with TensorRT
• You’ve so far imported a trained TensorFlow model into TensorRT, and performed a number of
optimizations to generate a runtime engine.
• And you’ve serialized this engine to disk as an engine plan file.
• You performed all these steps offline, and only once prior to deployment.
• The next step is to load serialized models into your runtime environment and
perform inference on new data.
• TensorRT Lite API is a highly abstracted
interface that handles standard tasks like
creating the logger, deserializing the engine
from a plan file to create a runtime, and
allocating GPU memory for the engine.
• During inference, it also manages data
transfer to and from GPU automatically, so
you can just create an engine and start
processing data.
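As a sketch of how little code this takes: the names below follow the TensorRT 3 Lite API from the referenced blog post (the plan file name and input shape are placeholders; the Lite API was specific to that release line):

import numpy as np
from tensorrt.lite import Engine

# Deserialize a saved plan; the Lite API creates the logger, the runtime,
# and the GPU memory allocations for the engine internally.
engine = Engine(PLAN="model.engine")

# Host-to-device and device-to-host transfers are handled automatically.
image = np.random.rand(3, 224, 224).astype(np.float32)  # placeholder input
result = engine.infer(image)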
More about INT8
Quantization
• It’s always a tradeoff between range and precision of the INT8
representation.
• Minimize information loss, since FP32 → INT8 is just re-encoding information
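Concretely, the scheme discussed in the referenced "8-bit Inference with TensorRT" talk is saturated symmetric quantization: pick a threshold T, map [-T, T] linearly onto [-127, 127], and saturate everything outside. A small NumPy illustration (not TensorRT's actual code):

import numpy as np

def quantize_int8(x, threshold):
    """Saturated symmetric quantization: |x| > threshold saturates to +/-127."""
    scale = threshold / 127.0
    return np.clip(np.rint(x / scale), -127, 127).astype(np.int8)

def dequantize_fp32(q, threshold):
    """Re-encode INT8 values back to approximate FP32."""
    return q.astype(np.float32) * (threshold / 127.0)

A small T preserves precision near zero but clips the tails; a large T covers the full range at coarser resolution. Choosing T well is exactly the threshold-selection problem addressed next.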
How to optimize threshold selection?
• “Relative Entropy” of two encodings
• INT8 model encodes the same information as the original FP32 model.
• We want to minimize loss of information.
• Loss of information is measured by Kullback-Leibler divergence (AKA relative
entropy or information divergence).
• P, Q - two discrete probability distributions.
• KL_divergence(P, Q) := SUM_i( P[i] * log(P[i] / Q[i]) )
• Intuition: KL divergence measures the amount of information lost when
approximating a given encoding.
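For a quick sanity check, SciPy computes the same quantity: scipy.stats.entropy(p, q) returns SUM_i( p[i] * log(p[i] / q[i]) ) when given both arguments:

import numpy as np
from scipy.stats import entropy

p = np.array([0.1, 0.4, 0.5])  # reference distribution P
q = np.array([0.2, 0.3, 0.5])  # candidate (approximating) distribution Q
print(entropy(p, q))           # KL(P || Q), natural log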
Solution: Calibration
• Calibration Dataset
• Representative.
• Diverse.
• Ideally a subset of validation dataset.
• 1000s of samples
• Calibration
• Run FP32 inference on Calibration Dataset.
• For each Layer:
• collect histograms of activations.
• generate many quantized distributions with different saturation thresholds.
• pick threshold which minimizes KL_divergence(ref_distr, quant_distr).
• Entire process takes a few minutes on a typical desktop workstation.
INT8 workflow in TensorRT
• You will need:
• Model trained in FP32.
• Calibration dataset.
• TensorRT will:
• Run inference in FP32 on calibration dataset.
• Collect required statistics.
• Run calibration algorithm → optimal scaling factors.
• Quantize FP32 weights → INT8.
• Generate “CalibrationTable” and INT8 execution engine.
Entropy Calibration - pseudocode
Input: FP32 histogram H with 2048 bins: bin[0], …, bin[2047]
For i in range(128, 2048):
    P = [ bin[0], ..., bin[i-1] ]                         // reference_distribution
    outliers_count = sum( bin[i], bin[i+1], …, bin[2047] )
    P[i-1] += outliers_count
    P /= sum(P)                                           // normalize distribution P
    Q = quantize [ bin[0], …, bin[i-1] ] into 128 levels  // candidate_distribution
    expand Q to 'i' bins
    Q /= sum(Q)                                           // normalize distribution Q
    divergence[i] = KL_divergence(P, Q)
End For
Find index 'm' for which divergence[m] is minimal
threshold = (m + 0.5) * (width of a bin)
Candidate distribution Q
• KL_divergence(P, Q) requires that len(P) == len(Q)
• Candidate distribution Q is generated by merging the 'i' bins from bin[0] to bin[i-1] into 128 bins
• Afterwards Q has to be 'expanded' again into 'i' bins
• A simple example, with a reference distribution P of 8 bins that we want to quantize into 2 bins:
  P = [1, 0, 2, 3, 5, 3, 1, 7]
  We merge into 2 bins (8 / 2 = 4 consecutive bins are merged into one bin):
  [1 + 0 + 2 + 3, 5 + 3 + 1 + 7] = [6, 16]
  Then we proportionally expand back to 8 bins, preserving the empty bins from the original distribution P:
  Q = [6/3, 0, 6/3, 6/3, 16/4, 16/4, 16/4, 16/4] = [2, 0, 2, 2, 4, 4, 4, 4]
  Now we normalize both distributions, after which we can compute the KL divergence:
  P /= sum(P)   Q /= sum(Q)
  result = KL_divergence(P, Q)
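Combining the pseudocode and the merge/expand example above, here is a runnable NumPy sketch of the calibration loop. It is an illustration only, not TensorRT's actual implementation, which differs in smoothing and edge-case handling:

import numpy as np

def entropy_calibrate(hist, num_levels=128):
    """Return the index m minimizing KL(P || Q); threshold = (m + 0.5) * bin_width."""
    n = len(hist)                          # e.g. 2048 bins
    divergence = np.full(n, np.inf)
    for i in range(num_levels, n):
        # Reference distribution P: first i bins, outliers folded into bin i-1.
        p = hist[:i].astype(np.float64).copy()
        p[-1] += hist[i:].sum()
        p /= p.sum()

        # Candidate Q: merge the first i bins into num_levels groups, then
        # expand back to i bins, spreading each group's mass over its
        # originally non-empty bins (empty bins stay empty, as in the
        # worked example above).
        q = np.zeros(i, dtype=np.float64)
        for idx in np.array_split(np.arange(i), num_levels):
            nonzero = hist[idx] > 0
            if nonzero.any():
                q[idx[nonzero]] = hist[idx].sum() / nonzero.sum()
        q /= q.sum()

        # KL(P || Q) over bins where P > 0; epsilon guards against Q == 0 there.
        mask = p > 0
        divergence[i] = np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], 1e-12)))

    return int(np.argmin(divergence))

For a 2048-bin activation histogram, the returned index m gives the saturation threshold (m + 0.5) * bin_width, which becomes that tensor's scaling factor in the CalibrationTable.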
INT8 conv kernel - pseudocode
// I8 input tensors: I8_input, I8_weights, I8 output tensors: I8_output
// F32 bias (original bias from the F32 model)
// F32 scaling factors: input_scale, output_scale, weights_scale[K]
I32_gemm_out = I8_input * I8_weights // Compute INT8 GEMM (DP4A)
F32_gemm_out = (float)I32_gemm_out // Cast I32 GEMM output to F32 float
// At this point we have F32_gemm_out which is scaled by ( input_scale * weights_scale[K] ),
// but to store the final result in int8 we need to have scale equal to "output_scale", so we have to rescale:
// (this multiplication is done in F32, *_gemm_out arrays are in NCHW format)
for i in 0, ... K-1:
rescaled_F32_gemm_out[ :, i, :, :] = F32_gemm_out[ :, i, :, :] * [ output_scale / (input_scale * weights_scale[ i ] ) ]
// Add bias; to perform the addition we have to rescale the original F32 bias so that it's scaled with "output_scale"
rescaled_F32_gemm_out_with_bias = rescaled_F32_gemm_out + output_scale * bias
// Perform ReLU (in F32)
F32_result = ReLU(rescaled_F32_gemm_out_with_bias)
// Convert to INT8 and save to global
I8_output = Saturate( Round_to_nearest_integer( F32_result ) )
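The same arithmetic in runnable NumPy form, with the convolution lowered to a plain GEMM (shapes and the per-channel broadcast are illustrative; the real kernel uses DP4A and NCHW tensors):

import numpy as np

def int8_gemm_relu(i8_input, i8_weights, f32_bias,
                   input_scale, weights_scale, output_scale):
    """i8_input: (M, C) int8, i8_weights: (C, K) int8,
    f32_bias and weights_scale: (K,); returns (M, K) int8."""
    i32_gemm_out = i8_input.astype(np.int32) @ i8_weights.astype(np.int32)
    f32_gemm_out = i32_gemm_out.astype(np.float32)
    # The output is currently scaled by input_scale * weights_scale[k];
    # rescale each output channel so it is scaled by output_scale instead.
    rescaled = f32_gemm_out * (output_scale / (input_scale * weights_scale))
    # The bias is rescaled to output_scale units before the addition.
    with_bias = rescaled + output_scale * f32_bias
    f32_result = np.maximum(with_bias, 0.0)  # ReLU in F32
    # Round to nearest and saturate to INT8.
    return np.clip(np.rint(f32_result), -127, 127).astype(np.int8)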
Results - Accuracy and Performance
• All optimizations enabled.
• ILSVRC2012 validation dataset, batch = 25 images.
• Accuracy was measured on 500 batches which were not used for the calibration.
Open challenges / improvements
• Unsigned INT8 for activations after ReLU.
• Fine-tuning of saturation thresholds.
• Might TensorFlow-style asymmetric quantization be a better solution for the two items above?
• RNNs → open research problem.
• Dynamic compute graphs.
• Expose an API for accepting custom, user-provided scale factors.
Reference
• TensorRT 3: Faster TensorFlow Inference and Volta Support
• 8-bit Inference with TensorRT
• Using TensorRT to Optimize Caffe Models in Python
• How to Quantize Neural Networks with TensorFlow
Summary of NN Compiler

Provider | Framework          | Graph opt.     | Backend opt. | INT8 support                                             | Runtime inference       | Format            | Open source | Target
Nvidia   | Caffe / Tensorflow | TensorRT       | TensorRT     | TensorRT Precision Calibration                           | TensorRT runtime engine | NCHW              | No          | GPU/NVDLA
Google   | Tensorflow         | TF lite (toco) | NNAPI ???    | Proper quantized training is necessary before conversion | TF lite interpreter     | NHWC              | Yes         | CPU
Amazon   | MxNet              | NNVM           | TVM          | mxnet.ndarray.contrib.quantize                           | TVM runtime             | Depends on target | Yes         | CPU/GPU/…
• Generally, NHWC is the default layout for most frameworks (like TensorFlow), while NCHW is the optimal format when training on NVIDIA GPUs using cuDNN; converting between the two is a simple transpose, as the sketch below shows.
• TF lite quantized conversion expects the model to be annotated with "fake quantization" nodes that record the dynamic range of the tensors, which means that proper quantized training is necessary before conversion.
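For reference, a layout conversion in NumPy (shapes are illustrative):

import numpy as np

nhwc = np.random.rand(8, 224, 224, 3).astype(np.float32)  # TensorFlow default
nchw = np.transpose(nhwc, (0, 3, 1, 2))                   # cuDNN-preferred layout
assert nchw.shape == (8, 3, 224, 224)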
END
Backup
Model conversion
• https://github.com/ysh329/deep-learning-model-convertor
• https://github.com/Microsoft/MMdnn
• https://github.com/hahnyuan/nn_tools
Graph and Target Optimizations
• NNVM
• https://github.com/dmlc/nnvm/tree/master/src/pass
• https://github.com/dmlc/nnvm/tree/master/src/compiler
• TF lite
• https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/toco
• https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/tools/optimize_for_inference.py
• https://www.tensorflow.org/performance/
• TensorRT
• Layer & Tensor Fusion
• FP16 and INT8 Precision Calibration
• Kernel Auto-tuning
• Dynamic Tensor Memory
• Multi-Stream Execution
• All are offline optimized
Quantization reference code
• Solver::Solve()
• net_->Forward(&loss);
• Net::ForwardFromTo()
• Solver::Test()
• test_net->Forward(&iter_loss);
• Net::ForwardFromTo()
Net::ForwardFromTo() {
StartQuantization(); // Add and set QuantizationParams in each layer
for (int i = start; i <= end; ++i) {
float layer_loss = layers_[i]->Forward(bottom_vecs_[i], top_vecs_[i]);
loss += layer_loss;
}
FinishQuantization(); // UpdateQuantizationRangeInLayers
return loss;
}
NV TensorRT container
• https://ngc.nvidia.com/registry/nvidia-tensorrt
• https://github.com/NVIDIA/nvidia-docker
• https://devblogs.nvidia.com/parallelforall/nvidia-docker-gpu-server-application-deployment-made-easy/
• https://devblogs.nvidia.com/parallelforall/tensorrt-container/
Editor's Notes
1. The Python API interface writes only one plan at a time: trt.utils.write_engine_to_file("./data/mnist/new_mnist.engine", engine.serialize()). You build the plans yourself through the utility helpers. Q: So to pick a plan at runtime, does every plan still have to be run once to know which is best? A: Yes, picking is done in a naive way.
2. Q: Regarding "no INT8 for Winograd": is there data showing that INT8 Winograd gives no gain? A: Good INT8 versions are not yet supported for all forms of Winograd, sometimes due to GPU clocks and memory bandwidth.
3. I first ran 500 images up front and dumped the feature map at every node for one-by-one analysis. The images were randomly selected. My impression is that the final accuracy is not very sensitive to the threshold; small fluctuations do little harm.
4. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/toco/graph_transformations/quantize.cc
   https://github.com/apache/incubator-mxnet/blob/master/src/operator/contrib/quantize-inl.h
   https://github.com/tidsp/caffe-jacinto/blob/caffe-0.16/src/caffe/net.cpp