High Performance Deep Learning on Edge Devices With Apache MXNet:
Deep-network-based models are marked by an asymmetry between the large amount of compute needed to train a model and the relatively small amount needed to deploy the trained model for inference. This is particularly true in computer vision tasks such as object detection and image classification, where millions of labeled images and large numbers of GPUs are needed to produce an accurate model, which may then run inference on a low-powered device with a single CPU. The challenge when deploying vision models on such devices is getting inference to run efficiently enough for near-real-time processing of a video stream. Fortunately, Apache MXNet provides the tools to solve this problem: users can build highly performant models with techniques like separable convolutions, quantized weights, and sparsity exploitation, while custom hardware kernels ensure inference calculations are accelerated as far as the target hardware allows. We demonstrate this with a state-of-the-art MXNet-based vision network running in near real time on a low-powered Raspberry Pi. Finally, we discuss how running inference at the edge, combined with MXNet's efficient modeling tools, can massively drive down the compute costs of deploying deep networks in a production system at scale.
22. Amazon AI: Artificial Intelligence in the Hands of Every Developer
Services: Lex (Chat), Polly (Speech), Rekognition (Vision)
Platforms: Amazon ML, Spark & EMR, Kinesis, Batch, ECS
Engines: MXNet, TensorFlow, Caffe, Theano, PyTorch, CNTK
Infrastructure: GPU, CPU, IoT, Mobile
23. Amazon AI: Artificial Intelligence in the Hands of Every Developer
Engines: MXNet, TensorFlow, Caffe, Theano, PyTorch, CNTK
Infrastructure: GPU, CPU, IoT, Mobile
24. Overview
- Motivating Problems in DL at the Edge
- Why Apache MXNet
- From the Metal to the Models with MXNet
- DL at the Edge with AWS
35. Apache MXNet | Community
[Chart: cumulative contributions, 0 to 40,000, with Apache MXNet outpacing Torch, Theano, and CNTK; Amazon at ~35% of contributions as of 3/30/17]
Diverse contributor community: Yutian Li (Stanford), Nan Zhu (MSFT), Liang Depeng (Sun Yat-sen U.), Xingjian Shi (HKUST), Tianjun Xiao (Tesla), Chiyuan Zhang (MIT), Yao Wang (AWS), Jian Guo (TuSimple), Yizhi Liu (Mediav), Sandeep K. (AWS), Sergey Kolychev (Whitehat), Eric Xie (AWS), Tianqi Chen (UW), Mu Li (AWS), Bing Su (Apple), and many others from Apple, Tesla, Microsoft, NYU, MIT, and Stanford.
41. The Metal: Heterogeneity
In the Cloud
• X86_64
• CUDA GPU
At the Edge
• X86_64, X86_32, ARM, AArch64, Android, iOS
• OpenCL GPU, CUDA GPU, Metal GPU
• NEON DSP, Hexagon DSP
• Custom Accelerators, FPGA
42. The Metal: Performance Gap
Low End: Raspberry Pi 3
- 32-bit ARMv7
- ARM NEON
- 1 GB RAM
High End: NVIDIA Jetson
- ARM AArch64
- 128 CUDA cores
- 8 GB RAM
48. Cheaper Convolutions: Separable Convolutions
Good for devices that can't run many multiplications in parallel.
Convolve separately over each depth channel of the input, then apply 1x1 convolutions to merge the channels.
49. Depth Separable Convolutions in MXNet
>>> import mxnet as mx
>>> num_group, num_filter = 4, 8
>>> kernel, stride, pad = (3, 3), (1, 1), (1, 1)
>>> x = mx.sym.Variable('x')
>>> w = mx.sym.Variable('w')
>>> b = mx.sym.Variable('b')
>>> # Grouped convolution by hand: slice input, weights, and bias along
>>> # the channel axis, convolve each group, then concatenate the outputs
>>> xslice = mx.sym.SliceChannel(data=x, num_outputs=num_group, axis=1)
>>> wslice = mx.sym.SliceChannel(data=w, num_outputs=num_group, axis=0)
>>> bslice = mx.sym.SliceChannel(data=b, num_outputs=num_group, axis=0)
>>> y_sep = mx.sym.Concat(*[mx.sym.Convolution(data=xslice[i],
...     weight=wslice[i], bias=bslice[i], num_filter=num_filter//num_group,
...     kernel=kernel, stride=stride, pad=pad) for i in range(num_group)])
>>> # Equivalent built-in form: pass num_group directly to Convolution
>>> y = mx.sym.Convolution(data=x, weight=w, bias=b, num_filter=num_filter,
...     num_group=num_group, kernel=kernel, stride=stride, pad=pad)
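A quick back-of-the-envelope check (plain Python, with hypothetical layer sizes chosen for illustration) shows why the separable form is so much cheaper than a standard convolution:

```python
# Parameter counts: standard vs. depthwise separable convolution.
# Hypothetical layer: 3x3 kernel, 32 input channels, 64 output channels.
k, c_in, c_out = 3, 32, 64

# Standard convolution: one k x k filter per (input, output) channel pair.
standard = k * k * c_in * c_out          # 18,432 parameters

# Depthwise separable: one k x k filter per input channel (depthwise),
# then 1x1 convolutions to mix channels (pointwise).
depthwise = k * k * c_in                 # 288 parameters
pointwise = 1 * 1 * c_in * c_out         # 2,048 parameters
separable = depthwise + pointwise        # 2,336 parameters

print(standard / separable)              # roughly 7.9x fewer parameters
```

The savings grow with the channel count, which is why the trick matters most in the wide later layers of a vision network.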
50. Fewer Parameters: Quantization
Good for devices with hardware to accelerate low precision operations
Map activations into lower bit-width buckets and multiply with quantized weights
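As a rough sketch of that idea (NumPy, symmetric linear quantization to 8-bit integers; the scaling scheme here is illustrative, not MXNet's actual quantization pipeline):

```python
import numpy as np

def quantize(x, bits=8):
    """Map float values into signed integer buckets of the given bit width."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    q = np.round(x / scale).astype(np.int32)
    return q, scale

np.random.seed(0)
w = np.random.randn(4, 4).astype(np.float32)   # weights
a = np.random.randn(4, 4).astype(np.float32)   # activations

# Quantize both operands, multiply in integer arithmetic,
# then rescale the result back to float.
qw, sw = quantize(w)
qa, sa = quantize(a)
approx = (qw @ qa) * (sw * sa)
exact = w @ a
print(np.abs(approx - exact).max())            # small quantization error
```

The integer matrix multiply is the part that low-precision hardware accelerates; the float scales are applied only once per output.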
55. Fewer Parameters: Efficient Architectures
SqueezeNet: AlexNet-level accuracy with 50x fewer parameters
Good for devices with low RAM that can't hold all the weights of larger models in memory concurrently.
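SqueezeNet's core trick is the "fire" module: a 1x1 "squeeze" layer shrinks the channel count before the expensive 3x3 "expand" filters see it. A rough parameter count (plain Python, with hypothetical channel sizes) illustrates the savings:

```python
# Fire module: squeeze with 1x1 filters, then expand with a mix of
# 1x1 and 3x3 filters. Channel sizes below are illustrative.
c_in = 128          # input channels
s1 = 16             # squeeze 1x1 filters
e1, e3 = 64, 64     # expand 1x1 and 3x3 filters

fire = (1 * 1 * c_in * s1          # squeeze layer
        + 1 * 1 * s1 * e1          # expand 1x1 branch
        + 3 * 3 * s1 * e3)         # expand 3x3 branch

# A plain 3x3 convolution with the same input/output channel counts:
plain = 3 * 3 * c_in * (e1 + e3)

print(plain / fire)                # 12x fewer parameters
```

Because the 3x3 filters only ever see the squeezed channels, most of the network's capacity comes cheap, which is how the whole model fits in a low-RAM device's memory.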
59. Edge Model Optimization Benefits the Cloud
- Models with fewer parameters often generalize better
- Tricks from the edge can be applied in the cloud
- Pre-processing with edge models decreases compute load in the cloud
60. Overview
- Motivating Problems in DL at the Edge
- Why Apache MXNet
- From the Metal to the Models with MXNet
- DL at the Edge with AWS
61. The Challenge for Artificial Intelligence: SCALE
Data: PBs of existing data, new data created on AWS, aggressive migration
Training: tons of GPUs, elastic capacity, pre-built images
Prediction: tons of GPUs and CPUs, serverless, at the edge, on IoT devices
62. AWS Tools for Deep Learning
- P2 instances: up to 40k CUDA cores
- Deep Learning AMI: pre-configured for deep learning
- CFN Template: launch a deep learning cluster
63. AWS Deep Learning AMI: One-Click Deep Learning
- Kepler, Volta & Skylake hardware
- Apache MXNet (and other frameworks)
- Python 2/3 notebooks & examples
67. Manage and Monitor Models on the Fly
[Diagram: edge devices upload captured and tagged data to AWS; requests escalate to an AI service or to a custom model on P2 instances; updated models are deployed and managed back at the edge]