2. TensorRT
• NVIDIA TensorRT™ is a high-performance deep learning inference optimizer and runtime that
delivers low latency, high-throughput inference for deep learning applications.
• NVIDIA released TensorRT last year with the goal of accelerating deep learning inference for
production deployment.
3. Deploying a model with TensorRT
• UFF stands for Universal Framework Format, which is TensorRT’s internal format used to represent the network graph before running optimizations.
• TensorRT performs optimizations for specified parameters such as batch size, precision, and workspace memory for the target deployment GPU.
• The output of the TensorRT optimization is a runtime inference engine that can be serialized to disk.
• At deployment time, you load and deserialize the saved plan file to create a TensorRT engine object.
• A plan file includes not only the weights, but also the schedule for the kernels to execute the network.
4. TensorRT supported layers
• Convolution
• LSTM and GRU
• Activation: ReLU, tanh, sigmoid
• Pooling: max and average
• Scaling
• Element wise operations
• LRN
• Fully-connected
• SoftMax
• Deconvolution
• TensorRT provides a Custom Layer API to enable you
to define your own custom layers that aren’t natively
supported
• These custom layers are defined using C++ to make it easy
to leverage highly optimized CUDA libraries like cuDNN
and cuBLAS
5. TensorRT Optimizations
• Layer and tensor fusion and elimination of unused layers
• FP16 and INT8 reduced precision calibration
• Target-specific autotuning
• Efficient memory reuse
• Multi-Stream Execution
• TensorRT performs these optimizations automatically under the hood for you.
• All you need to specify is the UFF inference graph to optimize, the inference batch size, the
amount of workspace GPU memory (used for CUDA kernel scratch space), and the target
inference precision, as the following code shows.
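The code the slide refers to is not included; below is a rough sketch, written as Python-style pseudocode against the legacy TensorRT 3 API, of what those inputs look like (module paths, function names, and arguments are approximate and version-dependent, so treat this as illustrative only):

```
# Pseudocode sketch only; modeled on the legacy TensorRT 3 Python API, names approximate.
import tensorrt as trt
import uff

G_LOGGER = trt.infer.ConsoleLogger(trt.infer.LogSeverity.INFO)

# Convert a frozen TensorFlow graph to the UFF inference graph
uff_model = uff.from_tensorflow_frozen_model("model.pb", ["output_node"])

# Describe network inputs/outputs for the UFF parser
parser = trt.parsers.uffparser.create_uff_parser()
parser.register_input("input_node", (3, 224, 224), 0)
parser.register_output("output_node")

# Build the engine: batch size, workspace memory, and target precision
engine = trt.utils.uff_to_trt_engine(G_LOGGER, uff_model, parser,
                                     1,                        # inference batch size
                                     1 << 30,                  # workspace GPU memory (bytes)
                                     trt.infer.DataType.FLOAT) # target precision
```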
6. Optimization 1: Layer & Tensor Fusion
• TensorRT parses the network computational graph and looks for opportunities to
perform graph optimizations.
• These graph optimizations do not change the underlying computation in the
graph: instead, they look to restructure the graph to perform the operations
much faster and more efficiently.
• TensorRT can also eliminate concatenation (“concat”) layers by preallocating output
buffers and writing into them in a strided fashion.
7. Optimization 2: FP16 and INT8 Precision Calibration
• Most deep learning frameworks train neural networks in full 32-bit precision (FP32).
• Once the model is fully trained, inference computations can use half precision FP16 or even INT8 tensor operations, since
gradient backpropagation is not required for inference.
• Using lower precision results in smaller model size, lower memory utilization and latency, and higher throughput.
• TensorRT can deploy models in FP32, FP16, and INT8.
• To quantize full-precision information into INT8 while minimizing accuracy loss, TensorRT must perform a
process called calibration to determine how best to represent the weights and activations as 8-bit integers.
• The calibration step requires you to provide TensorRT with a representative sample of the input training data.
• No additional fine tuning or retraining of the model is necessary, and you don’t need to have access to the entire training
dataset.
• Calibration is a completely automated and parameter-free method for converting FP32 to INT8.
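To make the reduced-precision idea concrete, here is a small plain-Python sketch (not TensorRT's implementation) of symmetric INT8 quantization with a saturation threshold; that threshold is exactly what the calibration step described later tries to choose well:

```python
# Illustrative sketch (not TensorRT code): symmetric INT8 quantization.

def quantize_int8(values, threshold):
    """Map FP32 values to INT8: q = round(x * 127 / threshold), saturated to [-127, 127]."""
    scale = 127.0 / threshold
    return [max(-127, min(127, int(round(x * scale)))) for x in values]

def dequantize_int8(qvalues, threshold):
    """Recover approximate FP32 values from INT8 codes."""
    return [q * threshold / 127.0 for q in qvalues]

acts = [0.0, 0.5, 1.0, 2.0, 10.0]          # sample activations; 10.0 is an outlier
print(quantize_int8(acts, threshold=2.0))  # -> [0, 32, 64, 127, 127] (outlier saturates)
```

Picking the threshold trades range against precision: a large threshold keeps outliers but wastes codes, a small one clips them.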
8. Optimization 3: Kernel Auto-tuning
• During the optimization phase TensorRT also chooses from hundreds
of specialized kernels, many of them hand-tuned and optimized for a
range of parameters and target platforms.
• As an example, there are several different algorithms to do convolutions.
• TensorRT will pick the implementation from a library of kernels that delivers
the best performance for the target GPU, input data size, filter size, tensor
layout, batch size and other parameters.
• This ensures that the deployed model is performance tuned for the
specific deployment platform as well as for the specific neural
network being deployed.
9. Optimization 4: Dynamic Tensor Memory
• TensorRT reduces memory footprint and improves memory reuse by
allocating memory for each tensor only for the duration of its usage,
avoiding memory allocation overhead for fast and efficient execution.
10. Optimization 5: Multi-Stream Execution
• Scales to multiple input streams by processing them in parallel using
the same model and weights
11. TensorRT Run-Time Inference
• You’re now ready to deploy your application with TensorRT
• You’ve so far imported a trained TensorFlow model into TensorRT, and performed a number of
optimizations to generate a runtime engine.
• And you’ve serialized this engine to disk as an engine plan file.
• You performed all these steps offline, and only once prior to deployment.
• The next step is to load serialized models into your runtime environment and
perform inference on new data.
12. TensorRT Lite API
• TensorRT Lite API is a highly abstracted
interface that handles standard tasks like
creating the logger, deserializing the engine
from a plan file to create a runtime, and
allocating GPU memory for the engine.
• During inference, it also manages data
transfer to and from GPU automatically, so
you can just create an engine and start
processing data.
13. Quantization
• It’s always a tradeoff between range and precision of the INT8
representation.
• Minimize information loss, since FP32 → INT8 is just re-encoding information
14. How to optimize threshold selection?
• “Relative Entropy” of two encodings
• INT8 model encodes the same information as the original FP32 model.
• We want to minimize loss of information.
• Loss of information is measured by Kullback-Leibler divergence (AKA relative
entropy or information divergence).
• P, Q: two discrete probability distributions.
• KL_divergence(P, Q) := sum_i P[i] * log(P[i] / Q[i])
• Intuition: KL divergence measures the amount of information lost when
approximating a given encoding.
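The formula above can be transcribed directly into plain Python (no TensorRT needed); bins with P[i] = 0 are conventionally skipped, since p·log p → 0 as p → 0:

```python
# Direct transcription of the KL-divergence formula above.
import math

def kl_divergence(p, q):
    """KL(P || Q) = sum_i P[i] * log(P[i] / Q[i]); bins with P[i] == 0 contribute 0."""
    assert len(p) == len(q)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))      # -> 0.0 (identical encodings lose nothing)
print(kl_divergence([0.5, 0.5], [0.9, 0.1]) > 0)  # -> True (information is lost)
```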
15. Solution: Calibration
• Calibration Dataset
• Representative.
• Diverse.
• Ideally a subset of validation dataset.
• Thousands of samples.
• Calibration
• Run FP32 inference on Calibration Dataset.
• For each Layer:
• collect histograms of activations.
• generate many quantized distributions with different saturation thresholds.
• pick threshold which minimizes KL_divergence(ref_distr, quant_distr).
• Entire process takes a few minutes on a typical desktop workstation.
16. INT8 workflow in TensorRT
• You will need:
• Model trained in FP32.
• Calibration dataset.
• TensorRT will:
• Run inference in FP32 on calibration dataset.
• Collect required statistics.
• Run calibration algorithm → optimal scaling factors.
• Quantize FP32 weights → INT8.
• Generate “CalibrationTable” and INT8 execution engine.
17. Entropy Calibration - pseudocode
Input: FP32 histogram H with 2048 bins: bin[ 0 ], …, bin[ 2047 ]
For i in range( 128 , 2048 ):
P = [ bin[ 0 ] , ..., bin[ i-1 ] ] // reference_distribution
outliers_count = sum( bin[ i ] , bin[ i+1 ] , … , bin[ 2047 ] )
P[ i-1 ] += outliers_count
P /= sum(P) // normalize distribution P
Q = quantize [ bin[ 0 ], …, bin[ i-1 ] ] into 128 levels // candidate_distribution
expand Q to ‘ i ’ bins
Q /= sum(Q) // normalize distribution Q
divergence[ i ] = KL_divergence( P, Q)
End For
Find index ‘m’ for which divergence[ m ] is minimal
threshold = ( m + 0.5 ) * ( width of a bin )
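The pseudocode above can be turned into runnable plain Python. One simplification, flagged in the comments: "quantize into 128 levels" is modeled with the bin merge/expand scheme this deck uses on slide 18, not with a real INT8 kernel.

```python
# Runnable plain-Python sketch of the entropy-calibration pseudocode above.
import math

def kl_divergence(p, q):
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return float("inf")  # Q cannot represent mass that P has
            total += pi * math.log(pi / qi)
    return total

def quantize_and_expand(bins, levels):
    """Merge `bins` into `levels` groups, then expand back, preserving empty bins."""
    n = len(bins)
    q = [0.0] * n
    for g in range(levels):
        start, end = g * n // levels, (g + 1) * n // levels
        total = sum(bins[start:end])
        nonzero = sum(1 for b in bins[start:end] if b > 0)
        if nonzero:
            for j in range(start, end):
                if bins[j] > 0:
                    q[j] = total / nonzero
    return q

def entropy_calibrate(hist, levels=128):
    """Return the bin index m minimizing KL divergence; threshold = (m + 0.5) * bin_width."""
    best_i, best_div = None, float("inf")
    for i in range(levels, len(hist)):
        p = list(hist[:i])
        p[i - 1] += sum(hist[i:])        # clamp outliers into the last kept bin
        p_sum = sum(p)
        p = [x / p_sum for x in p]
        q = quantize_and_expand(hist[:i], levels)
        q_sum = sum(q)
        if q_sum == 0:
            continue
        q = [x / q_sum for x in q]
        d = kl_divergence(p, q)
        if d < best_div:
            best_i, best_div = i, d
    return best_i
```

With a real 2048-bin activation histogram this runs in well under a second per layer, consistent with the "few minutes on a desktop workstation" claim for a whole network.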
18. Candidate distribution Q
• KL_divergence(P, Q) requires that len(P) == len(Q)
• Candidate distribution Q is generated by merging the i bins from bin[0] to bin[i-1] into 128 bins
• Afterwards Q has to be expanded again into i bins
• Here is a simple example: a reference distribution P of 8 bins that we want to quantize into 2 bins:
P = [1, 0, 2, 3, 5, 3, 1, 7]
We merge into 2 bins (8 / 2 = 4 consecutive bins are merged into one bin):
[1 + 0 + 2 + 3, 5 + 3 + 1 + 7] = [6, 16]
Then we proportionally expand back to 8 bins, preserving the empty bins from the original distribution P:
Q = [6/3, 0, 6/3, 6/3, 16/4, 16/4, 16/4, 16/4] = [2, 0, 2, 2, 4, 4, 4, 4]
Now we normalize both distributions, after which we can compute the KL divergence:
P /= sum(P)
Q /= sum(Q)
result = KL_divergence(P, Q)
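The worked example above runs as-is in plain Python; this helper implements the merge-then-expand step (empty bins of P stay empty, and the remaining bins in each group share the group total evenly):

```python
# The worked example above, executed in plain Python.

def expand_to_bins(p, num_levels):
    n = len(p)
    size = n // num_levels            # assumes num_levels divides n evenly
    q = [0.0] * n
    for g in range(num_levels):
        group = p[g * size:(g + 1) * size]
        total = sum(group)
        nonzero = sum(1 for x in group if x > 0)
        for j in range(g * size, (g + 1) * size):
            if p[j] > 0:
                q[j] = total / nonzero
    return q

P = [1, 0, 2, 3, 5, 3, 1, 7]
print(expand_to_bins(P, 2))  # -> [2.0, 0.0, 2.0, 2.0, 4.0, 4.0, 4.0, 4.0]
```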
19. INT8 conv kernel - pseudocode
// I8 input tensors: I8_input, I8_weights, I8 output tensors: I8_output
// F32 bias (original bias from the F32 model)
// F32 scaling factors: input_scale, output_scale, weights_scale[K]
I32_gemm_out = I8_input * I8_weights // Compute INT8 GEMM (DP4A)
F32_gemm_out = (float)I32_gemm_out // Cast I32 GEMM output to F32 float
// At this point we have F32_gemm_out which is scaled by ( input_scale * weights_scale[K] ),
// but to store the final result in int8 we need to have scale equal to "output_scale", so we have to rescale:
// (this multiplication is done in F32, *_gemm_out arrays are in NCHW format)
for i in 0, ... K-1:
rescaled_F32_gemm_out[ :, i, :, :] = F32_gemm_out[ :, i, :, :] * [ output_scale / (input_scale * weights_scale[ i ] ) ]
// Add bias, to perform addition we have to rescale original F32 bias so that it's scaled with "output_scale"
rescaled_F32_gemm_out _with_bias = rescaled_F32_gemm_out + output_scale * bias
// Perform ReLU (in F32)
F32_result = ReLU(rescaled_F32_gemm_out _with_bias)
// Convert to INT8 and save to global
I8_output = Saturate( Round_to_nearest_integer( F32_result ) )
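The epilogue of the pseudocode above (rescale, add bias, ReLU, saturate) can be sketched per-channel in plain Python. All scale values below are invented for illustration; they are not from any real calibration:

```python
# Toy per-channel epilogue in the spirit of the pseudocode above
# (one output value per channel; made-up scales).

def saturate_int8(x):
    """Round to nearest integer and clamp into the signed 8-bit range."""
    return max(-128, min(127, int(round(x))))

def int8_conv_epilogue(i32_gemm_out, input_scale, weights_scale, output_scale, bias):
    """Rescale per-channel INT32 GEMM results, add bias, apply ReLU, saturate to INT8."""
    out = []
    for k, acc in enumerate(i32_gemm_out):
        f32 = float(acc)                                          # cast INT32 -> FP32
        rescaled = f32 * (output_scale / (input_scale * weights_scale[k]))
        with_bias = rescaled + output_scale * bias[k]             # bias rescaled to output_scale
        out.append(saturate_int8(max(0.0, with_bias)))            # ReLU, then saturate
    return out

acc = [1000, -500, 40000]                  # per-channel INT32 accumulators
print(int8_conv_epilogue(acc, 0.02, [0.5, 0.5, 0.25], 0.001, [0.0, 0.0, 0.0]))
# -> [100, 0, 127]
```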
20. Results: Accuracy and Performance
• All optimizations enabled.
• ILSVRC2012 validation dataset, batch = 25 images.
• Accuracy was measured on 500 batches which were not used for the calibration.
21. Open challenges / improvements
• Unsigned INT8 for activations after ReLU.
• Fine-tuning of saturation thresholds.
• A better solution in TensorFlow with asymmetric quantization for the above two?
• RNNs → open research problem.
• Dynamic compute graphs.
• Expose an API for accepting custom, user-provided scale factors.
22. Reference
• TensorRT 3: Faster TensorFlow Inference and Volta Support
• 8-bit Inference with TensorRT
• Using TensorRT to Optimize Caffe Models in Python
• How to Quantize Neural Networks with TensorFlow
23. Summary of NN Compiler

| Provider | Framework | Graph opt. | Backend opt. | INT8 support | Runtime inference | Format | Open source | Target |
|---|---|---|---|---|---|---|---|---|
| Nvidia | Caffe / Tensorflow | TensorRT | TensorRT | TensorRT Precision Calibration | TensorRT runtime engine | NCHW | No | GPU/NVDLA |
| Google | Tensorflow | TF lite (toco) | NNAPI ??? | Proper quantized training is necessary before conversion | TF lite interpreter | NHWC | Yes | CPU |
| Amazon | MxNet | NNVM | TVM | mxnet.ndarray.contrib.quantize | TVM runtime | Depends on Target | Yes | CPU/GPU/… |
• Generally, NHWC is the default for most frameworks (like TensorFlow), while NCHW is the optimal format to use when training on NVIDIA GPUs with cuDNN.
• TF lite quantized conversion expects the models to be annotated with "fake quantization" nodes that record the dynamic range of the tensors, which means that proper quantized training is necessary before conversion.
• With the Python API interface, only one plan can be written at a time: trt.utils.write_engine_to_file("./data/mnist/new_mnist.engine", engine.serialize())
• You build it on your own through the utility helpers; it needs the API, and you do it yourself.
• Q: So to pick a plan at runtime, do we still need to run every plan once to know which plan is best?
• A: Yes, picking is done in a naive way.
• Q: About "no INT8 for Winograd": is there data showing that INT8 Winograd yields no gain?
• A: Good INT8 support does not yet exist for all forms of Winograd; sometimes this is due to GPU clocks and memory bandwidth.