High Performance Deep Learning on Edge Devices With Apache MXNet:
Deep-network-based models are marked by an asymmetry between the large amount of compute needed to train a model and the relatively small amount needed to deploy the trained model for inference. This is particularly true in computer vision tasks such as object detection and image classification, where millions of labeled images and large numbers of GPUs are needed to produce an accurate model, which may then run inference on a low-powered device with a single CPU. The challenge when deploying vision models on such devices is getting inference to run efficiently enough for near-real-time processing of a video stream. Fortunately, Apache MXNet provides the tools to solve this problem: users can build highly performant models with techniques like separable convolutions, quantized weights, and sparsity exploitation, while custom hardware kernels ensure inference calculations are accelerated as far as the target hardware allows. We demonstrate this with a state-of-the-art MXNet-based vision network running in near real time on a low-powered Raspberry Pi. Finally, we discuss how running inference at the edge, combined with MXNet's efficient modeling tools, can massively drive down the compute costs of deploying deep networks in a production system at scale.
22. Amazon AI: Artificial Intelligence in the Hands of Every Developer
Services: Lex (Chat), Polly (Speech), Rekognition (Vision)
Platforms: Amazon ML, Spark & EMR, Kinesis, Batch, ECS
Engines: MXNet, TensorFlow, Caffe, Theano, PyTorch, CNTK
Infrastructure: GPU, CPU, IoT, Mobile
23. Amazon AI: Artificial Intelligence in the Hands of Every Developer
Engines: MXNet, TensorFlow, Caffe, Theano, PyTorch, CNTK
Infrastructure: GPU, CPU, IoT, Mobile
24. Overview
- Motivating Problems in DL at the Edge
- Why Apache MXNet
- From the Metal to the Models with MXNet
- DL at the Edge with AWS
35. Apache MXNet | Community
[Chart: cumulative contributions, 0 to 40,000, with Apache MXNet outpacing Torch, Theano, and CNTK; Amazon at ~35% of contributions as of 3/30/17]
Diverse contributor community: Yutian Li (Stanford), Nan Zhu (MSFT), Liang Depeng (Sun Yat-sen U.), Xingjian Shi (HKUST), Tianjun Xiao (Tesla), Chiyuan Zhang (MIT), Yao Wang (AWS), Jian Guo (TuSimple), Yizhi Liu (Mediav), Sandeep K. (AWS), Sergey Kolychev (Whitehat), Eric Xie (AWS), Tianqi Chen (UW), Mu Li (AWS), Bing Su (Apple), and many others from Apple, Tesla, Microsoft, NYU, MIT, and Stanford.
41. The Metal: Heterogeneity
In the Cloud
• X86_64
• CUDA GPU
At the Edge
• X86_64, X86_32, ARM, AArch64, Android, iOS
• OpenCL GPU, CUDA GPU, Metal GPU
• NEON DSP, Hexagon DSP
• Custom Accelerators, FPGA
42. The Metal: Performance Gap
Low End: Raspberry Pi 3
- 32-bit ARMv7
- ARM NEON
- 1 GB RAM
High End: NVIDIA Jetson
- ARM AArch64
- 128 CUDA cores
- 8 GB RAM
48. Cheaper Convolutions: Separable Convolutions
Good for devices that can't run many multiplications in parallel.
Convolve separately over each depth channel of the input, then apply 1x1 convolutions to merge the channels.
49. Depth Separable Convolutions in MXNet
>>> import mxnet as mx
>>> num_group, num_filter = 4, 8
>>> kernel, stride, pad = (3, 3), (1, 1), (1, 1)
>>> x = mx.sym.Variable('x')
>>> w = mx.sym.Variable('w')
>>> b = mx.sym.Variable('b')
>>> # Grouped convolution by hand: slice input, weights, and bias along
>>> # the channel axis, convolve each group, then concatenate the outputs
>>> xslice = mx.sym.SliceChannel(data=x, num_outputs=num_group, axis=1)
>>> wslice = mx.sym.SliceChannel(data=w, num_outputs=num_group, axis=0)
>>> bslice = mx.sym.SliceChannel(data=b, num_outputs=num_group, axis=0)
>>> y_sep = mx.sym.Concat(*[mx.sym.Convolution(data=xslice[i],
...     weight=wslice[i], bias=bslice[i], num_filter=num_filter//num_group,
...     kernel=kernel, stride=stride, pad=pad) for i in range(num_group)])
>>> # Equivalent built-in form: pass num_group directly to Convolution
>>> y = mx.sym.Convolution(data=x, weight=w, bias=b, num_filter=num_filter,
...     num_group=num_group, kernel=kernel, stride=stride, pad=pad)
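A quick back-of-the-envelope check (plain Python, with hypothetical layer sizes chosen for illustration) shows why the separable form is so much cheaper than a standard convolution:

```python
# Parameter counts: standard vs. depthwise separable convolution.
# Hypothetical layer: 3x3 kernel, 32 input channels, 64 output channels.
k, c_in, c_out = 3, 32, 64

# Standard convolution: one k x k filter per (input, output) channel pair.
standard = k * k * c_in * c_out          # 18,432 parameters

# Depthwise separable: one k x k filter per input channel (depthwise),
# then 1x1 convolutions to mix channels (pointwise).
depthwise = k * k * c_in                 # 288 parameters
pointwise = 1 * 1 * c_in * c_out         # 2,048 parameters
separable = depthwise + pointwise        # 2,336 parameters

print(standard / separable)              # roughly 7.9x fewer parameters
```

The savings grow with the channel count, which is why the trick matters most in the wide later layers of a vision network.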
50. Fewer Parameters: Quantization
Good for devices with hardware to accelerate low precision operations
Map activations into lower bit-width buckets and multiply with quantized weights
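As a rough sketch of that idea (NumPy, symmetric linear quantization to 8-bit integers; the scaling scheme here is illustrative, not MXNet's actual quantization pipeline):

```python
import numpy as np

def quantize(x, bits=8):
    """Map float values into signed integer buckets of the given bit width."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    q = np.round(x / scale).astype(np.int32)
    return q, scale

np.random.seed(0)
w = np.random.randn(4, 4).astype(np.float32)   # weights
a = np.random.randn(4, 4).astype(np.float32)   # activations

# Quantize both operands, multiply in integer arithmetic,
# then rescale the result back to float.
qw, sw = quantize(w)
qa, sa = quantize(a)
approx = (qw @ qa) * (sw * sa)
exact = w @ a
print(np.abs(approx - exact).max())            # small quantization error
```

The integer matrix multiply is the part that low-precision hardware accelerates; the float scales are applied only once per output.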
55. Fewer Parameters: Efficient Architectures
SqueezeNet: AlexNet-level accuracy with 50x fewer parameters
Good for devices with low RAM that can't hold all the weights of larger models in memory concurrently.
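SqueezeNet's core trick is the "fire" module: a 1x1 "squeeze" layer shrinks the channel count before the expensive 3x3 "expand" filters see it. A rough parameter count (plain Python, with hypothetical channel sizes) illustrates the savings:

```python
# Fire module: squeeze with 1x1 filters, then expand with a mix of
# 1x1 and 3x3 filters. Channel sizes below are illustrative.
c_in = 128          # input channels
s1 = 16             # squeeze 1x1 filters
e1, e3 = 64, 64     # expand 1x1 and 3x3 filters

fire = (1 * 1 * c_in * s1          # squeeze layer
        + 1 * 1 * s1 * e1          # expand 1x1 branch
        + 3 * 3 * s1 * e3)         # expand 3x3 branch

# A plain 3x3 convolution with the same input/output channel counts:
plain = 3 * 3 * c_in * (e1 + e3)

print(plain / fire)                # 12x fewer parameters
```

Because the 3x3 filters only ever see the squeezed channels, most of the network's capacity comes cheap, which is how the whole model fits in a low-RAM device's memory.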
59. Edge Model Optimization Benefits the Cloud
- Models with fewer parameters often generalize better
- Tricks from the edge can be applied in the cloud
- Pre-processing with edge models decreases compute load in the cloud
60. Overview
- Motivating Problems in DL at the Edge
- Why Apache MXNet
- From the Metal to the Models with MXNet
- DL at the Edge with AWS
61. The Challenge for Artificial Intelligence: SCALE
Data: PBs of existing data, new data created on AWS, aggressive migration
Training: tons of GPUs, elastic capacity, pre-built images
Prediction: tons of GPUs and CPUs, serverless, at the edge, on IoT devices
62. AWS Tools for Deep Learning
- P2 instances: up to 40k CUDA cores
- Deep Learning AMI: pre-configured for deep learning
- CFN Template: launch a deep learning cluster
63. AWS Deep Learning AMI: One-Click Deep Learning
- Kepler, Volta & Skylake hardware
- Apache MXNet (and other frameworks)
- Python 2/3 notebooks & examples
67. Manage and Monitor Models on the Fly
[Diagram: edge devices upload captured and tagged data to AWS; requests escalate to an AI service or to a custom model on P2 instances; updated models are deployed and managed back at the edge]