Squeezing Deep Learning Into Mobile Phones

A Practitioner's Guide
Anirudh Koul

Anirudh Koul, @anirudhkoul, http://koul.ai
Project Lead, Seeing AI (SeeingAI.com)
Applied Researcher, Microsoft AI & Research
Akoul at Microsoft dot com
Currently working on applying artificial intelligence for
Hololens, autonomous robots and accessibility
Along with Eugene Seleznev, Saqib Shaikh, Meher Kasam

Why Deep Learning On Mobile?
Latency Privacy

Response Time Limits – Powers of 10
0.1 second : Reacting instantly
1.0 seconds : User’s flow of thought
10 seconds : Keeping the user’s attention
[Miller 1968; Card et al. 1991; Jakob Nielsen 1993]

Mobile Deep Learning Recipe
Mobile Inference Engine (Efficient) + Pretrained Model (Efficient) = DL App

Use Cloud APIs
Microsoft Cognitive Services
Clarifai
Google Cloud Vision
IBM Watson Services
Amazon Rekognition
Tip : Resize to 224x224 at under 50% compression with bilinear interpolation
before network transmission
But don’t resize for Text / OCR projects

Microsoft Cognitive Services
Models won the 2015 ImageNet Large Scale Visual Recognition Challenge
Vision, Face, Emotion, Video and 21 other topics

Custom Vision Service (customvision.ai) – Drag and drop training
Tip : Upload 30 photos per class to make a prototype model
Upload 200 photos per class for a more robust production model
The more distinct the shape/type of the object, the fewer images are required.

Custom Vision Service (customvision.ai) – Drag and drop training
Tip : Use Fatkun Browser Extension to download images from Search Engine,
or use Bing Image Search API to programmatically download photos with
proper rights

http://deeplearningkit.org/2015/12/28/deeplearningkit-deep-learning-for-ios-tested-on-iphone-6s-tvos-and-os-x-developed-in-metal-and-swift/
Energy to train a Convolutional Neural Network vs. energy to use a Convolutional Neural Network

Base PreTrained Model
ImageNet – 1000 Object Categorizer
Inception
Resnet

Running pre-trained models on mobile
Core ML
Tensorflow
Caffe2
Snapdragon Neural Processing Engine
MXNet
CNNDroid
DeepLearningKit
Torch

Core ML
From Apple, for iOS 11
Convert Caffe/Tensorflow model to CoreML model in 3 lines:
import coremltools
coreml_model = coremltools.converters.caffe.convert('my_caffe_model.caffemodel')
coreml_model.save('my_model.mlmodel')
Add model to iOS project and call for prediction.
Direct support for Keras, Caffe, scikit-learn, XGBoost, LibSVM
Builds on top of low-level primitives
Accelerate, BNNS, Metal Performance Shaders (MPS)
Noticeable speedup of the CoreML implementation (iOS 11) over MPS (iOS 10)
(same model, same hardware)
Automatically minimizes memory footprint and power consumption

CoreML Benchmark - Pick a DNN for your mobile architecture
Model         Top-1 Accuracy (%)  Size of Model (MB)  iPhone 6 Execution Time (ms)  iPhone 6S Execution Time (ms)  iPhone 7 Execution Time (ms)
VGG 16        71                  553                 4556                          254                            208
Inception v3  78                  95                  637                           98                             90
Resnet 50     75                  103                 557                           72                             64
MobileNet     71                  17                  109                           52                             32
SqueezeNet    57                  5                   78                            29                             24
Release year                                          2014                          2015                           2016
Huge improvement in hardware in 2015

Putting out more frames than an art gallery

Tensorflow
Easy pipeline to bring Tensorflow models to mobile
Excellent documentation
Optimizations to bring model to mobile
Upcoming : XLA (Accelerated Linear Algebra)
compiler to optimize for hardware

Caffe2
From Facebook
Under 1 MB of binary size
Built for Speed :
For ARM CPU : Uses NEON Kernels, NNPack
For iPhone GPU : Uses Metal Performance Shaders and Metal
For Android GPU : Uses Qualcomm Snapdragon NPE (4-5x speedup)
ONNX format support to import models from CNTK/PyTorch

Snapdragon Neural Processing Engine (NPE) SDK
Like CoreML for Qualcomm Snapdragon chips
Published speedup of 4-5x
On about half the Android phones
Identifies best target core for inference - GPU, DSP or CPU
Customizable to choose between battery power and performance
Supports importing models from Caffe, Caffe2, Tensorflow

MXNet
Amalgamation : Pack all the code in a single source file
Pro:
• Cross Platform (iOS, Android), Easy porting
• Usable in any programming language
Con:
• CPU only, Slow
https://github.com/Leliana/WhatsThis

CNNdroid
GPU accelerated CNNs for Android
Supports Caffe, Torch and Theano models
~30-40x Speedup using mobile GPU vs CPU (AlexNet)
Internally, CNNdroid expresses data parallelism explicitly for the different layers,
instead of leaving it to the GPU’s hardware scheduler

DeepLearningKit
Platform : iOS, OS X and tvOS (Apple TV)
DNN Type : CNN models trained in Caffe
Runs on mobile GPU, uses Metal
Pro : Fast, directly ingests Caffe models
Con : Unmaintained

Running pre-trained models on mobile
Mobile Library   Platform     GPU  DNN Architecture Supported  Trained Models Supported
CoreML           iOS          Yes  CNN, RNN, SciKit            Keras, Tensorflow, MXNet
Tensorflow       iOS/Android  Yes  CNN, RNN, LSTM, etc         Tensorflow
Caffe2           iOS/Android  Yes  CNN                         Caffe2, CNTK, PyTorch
Snapdragon NPE   Android      Yes  CNN, RNN, LSTM              Caffe, Caffe2, Tensorflow
CNNDroid         Android      Yes  CNN                         Caffe, Torch, Theano
DeepLearningKit  iOS          Yes  CNN                         Caffe
MXNet            iOS/Android  No   CNN, RNN, LSTM, etc         MXNet
Torch            iOS/Android  No   CNN, RNN, LSTM, etc         Torch

Possible Long Term Route for fastest speed on each phone
Train a model using your favorite DNN library
Import it into Tensorflow / Keras with Tensorflow
For iOS :
  Use Keras and CoreML
For Android :
  For Qualcomm chips (~50% of Android phones) :
    Use Snapdragon NPE
  For remaining phones :
    Use Tensorflow Mobile
Diagram: Model (Tensorflow format) → iOS: Keras + CoreML; Android (Qualcomm chips): Snapdragon NPE; Android (remaining): Tensorflow Mobile
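
The routing on this slide can be sketched as a small lookup. This is only an illustration: the function name, its arguments and the returned labels are hypothetical, and it assumes the app already knows the platform and whether the phone has a Qualcomm chip.

```python
def pick_inference_runtime(platform: str, has_qualcomm_chip: bool = False) -> str:
    """Return the runtime the slide suggests for a given phone."""
    platform = platform.lower()
    if platform == "ios":
        return "Keras + CoreML"
    if platform == "android":
        # Snapdragon NPE only targets Qualcomm Snapdragon chips;
        # every other Android phone falls back to Tensorflow Mobile.
        if has_qualcomm_chip:
            return "Snapdragon NPE"
        return "Tensorflow Mobile"
    raise ValueError(f"unsupported platform: {platform!r}")
```

In every branch the model starts out in Tensorflow format, so one training pipeline can feed all three runtimes.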

Learning to play an accordion
From scratch : 3 months
Already knows piano : Fine tune skills, 1 week

I got a dataset, Now What?
Step 1 : Find a pre-trained model
Step 2 : Fine tune a pre-trained model
Step 3 : Run using existing frameworks
“Don’t Be A Hero”
- Andrej Karpathy

How to find pretrained models for my task?
Search “Model Zoo”
Microsoft Cognitive Toolkit (previously called CNTK) – 50 Models
Caffe Model Zoo
Keras
Tensorflow
MXNet

AlexNet, 2012 (simplified)
[Krizhevsky, Sutskever, Hinton ’12]
Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng, “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”, ’11
n-dimensional feature representation

Deciding how to fine tune
Size of New Dataset  Similarity to Original Dataset  What to do?
Large                High                            Fine tune.
Small                High                            Don’t fine tune, it will overfit. Train a linear classifier on CNN features.
Small                Low                             Train a classifier from activations in lower layers. Higher layers are specific to the original dataset.
Large                Low                             Train CNN from scratch.
http://blog.revolutionanalytics.com/2016/08/deep-learning-part-2.html
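
The four cases in this table can be written down as a lookup. This is a minimal sketch: the function name and the "large"/"small" and "high"/"low" labels are illustrative, and deciding what counts as large or similar is left to the practitioner.

```python
# (size of new dataset, similarity to original dataset) -> what to do
STRATEGY = {
    ("large", "high"): "Fine tune",
    ("small", "high"): "Train a linear classifier on CNN features",
    ("small", "low"): "Train a classifier on activations from lower layers",
    ("large", "low"): "Train CNN from scratch",
}


def fine_tuning_strategy(dataset_size: str, similarity: str) -> str:
    """Look up the recommendation for one row of the table."""
    return STRATEGY[(dataset_size.lower(), similarity.lower())]
```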

CoreML exporter from customvision.ai – Drag and drop training
5 minute shortcut to training, fine-tuning and getting a model ready in CoreML format
Drag and drop interface

Building a DL Website in 1 week

Less Data + Smaller Networks = Faster browser training

Several JavaScript Libraries
Run large CNNs
• Tensorfire
• WebDNN
• Keras-JS
• MXNetJS
• CaffeJS
Train and Run CNNs
• DeepLearn.js
• ConvNetJS
Train and Run LSTMs
• Brain.js
• Synaptic.js
Train and Run NNs
• Mind.js
• DN2A

ConvNetJS – Train + Infer on CPU
Both Train and Test NNs in browser
Train CNNs in browser

DeepLearn.js – Train + Infer on GPU
Uses WebGL to perform computation on GPU (including backprop)
Immediate execution model for inference (like Numpy)
Delayed execution model for training (like TensorFlow)
Upcoming tools to export weights from Tensorflow checkpoints

Tensorfire – Infer on GPU
Import models from Keras/Tensorflow
Any GPU works (including AMD); runs faster in the browser than TensorFlow on a MacBook Pro
Supports low-precision math
Transforms NN weights into WebGL textures for speedup
Similar library : WebDNN.js

Keras.js
Run Keras models in browser, with GPU support.

Brain.JS
Train and run NNs in browser
Supports Feedforward, RNN, LSTM, GRU
No CNNs
Demo : http://brainjs.com/
Trained NN to recognize color contrast

MXNetJS
On Firefox and Microsoft Edge, performance is 8x faster than on Chrome.
The difference comes from how each browser optimizes ASM.js.

Building a Crowdsourced Data Collector
in 1 month

Barcode recognition from Seeing AI
Aim : Help blind users identify products using barcode
Issue : Blind users don’t know where the barcode is
Live : Guide user in finding a barcode with audio cues
With Server : Decode barcode to identify product
Tech : MPSCNN running on mobile GPU + barcode library
Metrics : 40 FPS (~25 ms) on iPhone 7

Currency recognition from Seeing AI
Aim : Identify currency
Live : Identify denomination of paper currency instantly
With Server : -
Tech : Task specific CNN running on mobile GPU
Metrics : 40 FPS (~25 ms) on iPhone 7

Training Data Collection App
Request volunteers to take photos of objects
in non-obvious settings
Sends photos to cloud, trains model nightly
Newsletter shows the best photos from volunteers
Let them compete for fame

Daily challenge - Collected by volunteers

Challenge: Can you fool a Deep Neural Network?
Challenge users to find flaws in DNN
Helps train a robust classifier with far fewer photos

What you want : $200,000
What you can afford : $2000
https://www.flickr.com/photos/kenjonbro/9075514760/ and http://www.newcars.com/land-rover/range-rover-sport/2016

Revolution of Depth
AlexNet, 8 layers (ILSVRC 2012)
11x11 conv, 96, /4, pool/2
5x5 conv, 256, pool/2
3x3 conv, 384
3x3 conv, 384
fc, 4096
fc, 4096
fc, 1000
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015

Revolution of Depth
Figure: layer-by-layer diagrams of AlexNet, 8 layers (ILSVRC 2012), VGG, 19 layers (ILSVRC 2014) and GoogLeNet, 22 layers (ILSVRC 2014)

Revolution of Depth
Figure: layer-by-layer diagrams of AlexNet, 8 layers (ILSVRC 2012), VGG, 19 layers (ILSVRC 2014) and ResNet, 152 layers (ILSVRC 2015), labeled "Ultra deep"

Revolution of Depth
Figure: close-up of the ResNet, 152 layers diagram (7x7 conv, 64, /2, pool/2, followed by repeated 1x1 / 3x3 / 1x1 conv blocks)

Revolution of Depth vs Classification Accuracy
ImageNet Classification top-5 error (%)
Challenge            Top-5 error (%)  Depth
ILSVRC'10            28.2             shallow
ILSVRC'11            25.8             shallow
ILSVRC'12 AlexNet    16.4             8 layers
ILSVRC'13            11.7
ILSVRC'14 VGG        7.3              19 layers
ILSVRC'14 GoogLeNet  6.7              22 layers
ILSVRC'15 ResNet     3.6              152 layers
ILSVRC'16 Ensemble   2.9              Ensemble of Resnet, Inception Resnet, Inception and Wide Residual Network

Your Budget - Smartphone Floating Point Operations Per Second (2015)
http://pages.experts-exchange.com/processing-power-compared/

iPhone X is more powerful than a Macbook Pro
https://thenextweb.com/apple/2017/09/12/apples-new-iphone-x-already-destroying-android-devices-g/

Accuracy vs Operations Per Image Inference
Figure: bubble chart, bubble size is proportional to the number of parameters (labeled bubbles: 552 MB, 240 MB; annotation: "What we want")
Alfredo Canziani, Adam Paszke, Eugenio Culurciello, “An Analysis of Deep Neural Network Models for Practical Applications” 2016

Accuracy Per Parameter
Alfredo Canziani, Adam Paszke, Eugenio Culurciello, “An Analysis of Deep Neural Network Models for Practical Applications” 2016

Pick your DNN Architecture for your mobile architecture
Resnet Family
Under 64 ms on iPhone 7 using Metal GPU
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, "Deep Residual Learning for Image Recognition”, 2015

CoreML Benchmark - Pick your DNN for your mobile architecture
Model         Top-1 Accuracy (%)  Size of Model (MB)  Million Multi-Adds  iPhone 6 Execution Time (ms)  iPhone 6S Execution Time (ms)  iPhone 7 Execution Time (ms)
VGG 16        71                  553                 15300               4556                          254                            208
Inception v3  78                  95                  5000                637                           98                             90
Resnet 50     75                  103                 3900                557                           72                             64
MobileNet     71                  17                  569                 109                           52                             32
SqueezeNet    57                  5                   1700                78                            29                             24
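
One way to read this benchmark is to fix a latency budget per phone and keep the most accurate model that fits. The sketch below copies the accuracy and timing columns from the table. The function name and the tie-breaking rule (prefer the faster model when accuracy is equal) are assumptions added for illustration.

```python
# Rows copied from the benchmark table:
# model -> (top-1 accuracy in %, {phone: execution time in ms})
BENCHMARK = {
    "VGG 16": (71, {"iPhone 6": 4556, "iPhone 6S": 254, "iPhone 7": 208}),
    "Inception v3": (78, {"iPhone 6": 637, "iPhone 6S": 98, "iPhone 7": 90}),
    "Resnet 50": (75, {"iPhone 6": 557, "iPhone 6S": 72, "iPhone 7": 64}),
    "MobileNet": (71, {"iPhone 6": 109, "iPhone 6S": 52, "iPhone 7": 32}),
    "SqueezeNet": (57, {"iPhone 6": 78, "iPhone 6S": 29, "iPhone 7": 24}),
}


def most_accurate_within_budget(phone, budget_ms):
    """Return the most accurate model that runs within budget_ms, or None."""
    candidates = [
        (accuracy, -times[phone], name)  # negative time: faster wins ties
        for name, (accuracy, times) in BENCHMARK.items()
        if times[phone] <= budget_ms
    ]
    return max(candidates)[2] if candidates else None
```

With the 0.1 second "reacting instantly" limit as the budget, this picks Inception v3 on an iPhone 7 but only SqueezeNet on an iPhone 6.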

MobileNet family
Splits the convolution into a 3x3 depthwise conv and a 1x1 pointwise conv
Tune with two parameters – Width Multiplier and Resolution Multiplier
Andrew G. Howard et al, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications”, 2017
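
The saving from the depthwise + pointwise split can be checked with the multiply-add counts given in the MobileNets paper. In this sketch the layer shape (3x3 kernel, 64 input channels, 128 output channels, 56x56 output map) is a hypothetical example, not a layer taken from the slides.

```python
def standard_conv_mult_adds(kernel, in_ch, out_ch, out_size):
    """Multiply-adds of a regular kernel x kernel convolution."""
    return kernel * kernel * in_ch * out_ch * out_size * out_size


def separable_conv_mult_adds(kernel, in_ch, out_ch, out_size):
    """Multiply-adds of a depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = kernel * kernel * in_ch * out_size * out_size
    pointwise = in_ch * out_ch * out_size * out_size
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 64 -> 128 channels, 56x56 output map.
standard = standard_conv_mult_adds(3, 64, 128, 56)
separable = separable_conv_mult_adds(3, 64, 128, 56)
ratio = separable / standard  # equals 1/out_ch + 1/kernel**2
```

For 3x3 kernels the ratio is about 1/9 plus a small 1/out_ch term, so the split costs roughly 8 to 9 times fewer multiply-adds.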

Comparison for DNN architectures for Object Detection
Jonathan Huang et al, "Speed/accuracy trade-offs for modern convolutional object detectors”, 2017

Strategies to make DNNs even more efficient
Shallow networks
Compressing pre-trained networks
Designing compact layers
Quantizing parameters
Network binarization

Pruning
Aim : Remove all connections with absolute weights below a threshold
Song Han, Jeff Pool, John Tran, William J. Dally, "Learning both Weights and Connections for Efficient Neural Networks", 2015
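
A minimal sketch of the thresholding step on a plain list-of-lists weight matrix. The function names and the example threshold are illustrative. The cited paper also retrains the network after pruning, which is not shown here.

```python
def prune_by_magnitude(weights, threshold):
    """Zero every weight whose absolute value is below the threshold."""
    return [
        [w if abs(w) >= threshold else 0.0 for w in row]
        for row in weights
    ]


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    flat = [w for row in weights for w in row]
    return sum(1 for w in flat if w == 0.0) / len(flat)


layer = [[0.8, -0.05, 0.3],
         [0.01, -0.6, 0.09]]
pruned = prune_by_magnitude(layer, threshold=0.1)
```

The zeroed connections only save space and time once the matrix is stored and multiplied in a sparse format.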

Observation : Most parameters in Fully Connected Layers
AlexNet (240 MB) : 96% of all parameters
VGG-16 (552 MB) : 90% of all parameters

Pruning gives the quickest model compression without accuracy loss
AlexNet 240 MB VGG-16 552 MB
The first layer, which directly interacts with the image, is sensitive
and cannot be pruned too much without hurting accuracy

Weight Sharing
Idea : Cluster weights with similar values together, and store in a dictionary.
Codebook
Huffman coding
HashedNets
Simplest implementation:
• Round all weights into 256 levels
• Tensorflow export script reduces inception zip file from 87 MB to 26 MB with 1% drop in precision
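
Rounding weights into 256 levels can be sketched as below. This assumes evenly spaced levels between the smallest and largest weight, which is one simple choice and not necessarily what the Tensorflow export script does. The function names are illustrative.

```python
def quantize_to_levels(weights, levels=256):
    """Map each weight to the index of the nearest of `levels` evenly spaced values."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        # All weights are identical: a one-entry codebook is enough.
        return [0] * len(weights), [lo]
    step = (hi - lo) / (levels - 1)
    codebook = [lo + i * step for i in range(levels)]
    indices = [min(levels - 1, round((w - lo) / step)) for w in weights]
    return indices, codebook


def dequantize(indices, codebook):
    """Rebuild approximate weights from the stored indices."""
    return [codebook[i] for i in indices]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
indices, codebook = quantize_to_levels(weights)
restored = dequantize(indices, codebook)
```

Each index fits in one byte instead of a 32-bit float, and the reconstruction error is at most half a step.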

Selective training to keep networks shallow
Idea : Limit data augmentation to how your network will actually be used
Example : If making a selfie app, there is no benefit in rotating training images
beyond ±45 degrees. Your phone will rotate the image anyway.
Followed by WordLens / Google Translate
Example : Add blur if analyzing mobile phone frames

Design consideration for custom architectures – Small Filters
Three layers of 3x3 convolutions >> One layer of 7x7 convolution
Replace large 5x5, 7x7 convolutions with stacks of 3x3 convolutions
Replace NxN convolutions with stack of 1xN and Nx1
Fewer parameters
Less compute
More non-linearity
Better, Faster, Stronger
Andrej Karpathy, CS-231n Notes, Lecture 11
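
The parameter savings can be checked with simple arithmetic. The sketch assumes every layer has the same number of input and output channels and ignores biases. The channel count of 64 is a hypothetical example.

```python
def conv_params(kernel_h, kernel_w, channels):
    """Weights in one conv layer with `channels` inputs and outputs (no biases)."""
    return kernel_h * kernel_w * channels * channels


C = 64  # hypothetical channel count

# Both options cover a 7x7 receptive field.
one_7x7 = conv_params(7, 7, C)        # 49 * C^2
three_3x3 = 3 * conv_params(3, 3, C)  # 27 * C^2

# Replacing an NxN convolution with a 1xN and an Nx1 convolution (N = 3).
one_3x3 = conv_params(3, 3, C)                            # 9 * C^2
factorized = conv_params(1, 3, C) + conv_params(3, 1, C)  # 6 * C^2
```

The stack of three 3x3 layers also applies three non-linearities where the single 7x7 layer applies one.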

SqueezeNet - AlexNet-level accuracy in 0.5 MB
SqueezeNet base 4.8 MB
SqueezeNet compressed 0.5 MB
80.3% top-5 Accuracy on ImageNet
0.72 GFLOPS/image
Fire Block
Forrest N. Iandola, Song Han et al, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size"

Reduced precision
Reduce precision from 32 bits to 16 bits or fewer
Use stochastic rounding for best results
In Practice:
• Ristretto + Caffe
  • Automatic network quantization
  • Finds balance between compression rate and accuracy
• Apple Metal Performance Shaders automatically quantize to 16 bits
• Tensorflow has 8 bit quantization support
• Gemmlowp – Low precision matrix multiplication library
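
Stochastic rounding rounds a value up or down at random, with the probability of rounding up equal to how far the value sits above the lower grid point, so the result is unbiased on average. A minimal sketch, assuming a uniform grid with spacing `step`; the function name and signature are illustrative.

```python
import math
import random


def stochastic_round(x, step, rng=random):
    """Round x to a multiple of step; round up with probability equal to the remainder."""
    scaled = x / step
    lower = math.floor(scaled)
    fraction = scaled - lower  # in [0, 1)
    round_up = rng.random() < fraction
    return (lower + (1 if round_up else 0)) * step


# 0.3 on a grid of spacing 1.0 becomes 1.0 about 30% of the time, else 0.0.
rng = random.Random(0)
samples = [stochastic_round(0.3, 1.0, rng) for _ in range(10000)]
mean = sum(samples) / len(samples)
```

Round-to-nearest would always map 0.3 to 0.0, so many small values would vanish. With stochastic rounding their average is preserved.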

Binary weighted Networks
Idea : Reduce the weights to -1,+1
Speedup : Convolution operation can be approximated by only summation and subtraction
Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
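
With weights restricted to -1 and +1, each multiply in a dot product becomes an add or a subtract. A minimal sketch on plain lists, with illustrative function names. The per-filter scaling factor that the cited paper applies on top of the signs is omitted here.

```python
def binarize(weights):
    """Keep only the sign of each weight (+1 for non-negative, -1 otherwise)."""
    return [1 if w >= 0 else -1 for w in weights]


def binary_weight_dot(inputs, signs):
    """Dot product with -1/+1 weights, using only additions and subtractions."""
    total = 0.0
    for x, s in zip(inputs, signs):
        if s > 0:
            total += x
        else:
            total -= x
    return total


signs = binarize([0.4, -0.2, 0.7])
result = binary_weight_dot([1.0, 2.0, 3.0], signs)  # 1.0 - 2.0 + 3.0
```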

XNOR-Net
Idea : Reduce both weights + inputs to -1,+1
Speedup : Convolution operation can be approximated by XNOR and Bitcount operations
Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
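
When both vectors contain only -1 and +1, they can be packed into bits (+1 as 1, -1 as 0). The dot product is then the number of matching positions minus the number of mismatches, which is 2 * bitcount(XNOR) - n. A minimal sketch using Python integers as bit vectors; the helper names are illustrative.

```python
def pack_signs(signs):
    """Pack a list of -1/+1 values into an int, one bit per value (+1 -> 1, -1 -> 0)."""
    bits = 0
    for i, s in enumerate(signs):
        if s > 0:
            bits |= 1 << i
    return bits


def xnor_dot(a_bits, b_bits, n):
    """Dot product of two packed -1/+1 vectors of length n."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask  # XNOR: 1 where the signs match
    matches = bin(agree).count("1")    # bitcount
    return 2 * matches - n             # matches minus mismatches


a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
dot = xnor_dot(pack_signs(a), pack_signs(b), len(a))
```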

Challenges
Off the shelf CNNs not robust for video
Solutions:
• Collective confidence over several frames
• CortexNet
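
Collective confidence over several frames can be sketched as a rolling average of per-frame class probabilities. The class name, the window size and the dictionary format of the predictions are assumptions made for illustration.

```python
from collections import deque


class FrameSmoother:
    """Average class probabilities over the most recent frames."""

    def __init__(self, window=5):
        self.recent = deque(maxlen=window)  # old frames drop out automatically

    def update(self, probabilities):
        """Add one frame's {label: probability} dict; return (label, mean confidence)."""
        self.recent.append(probabilities)
        labels = set().union(*self.recent)
        means = {
            label: sum(frame.get(label, 0.0) for frame in self.recent) / len(self.recent)
            for label in labels
        }
        best = max(means, key=means.get)
        return best, means[best]


smoother = FrameSmoother(window=3)
smoother.update({"cat": 0.9, "dog": 0.1})
smoother.update({"cat": 0.2, "dog": 0.8})  # one noisy frame
label, confidence = smoother.update({"cat": 0.7, "dog": 0.3})
```

A single noisy frame no longer flips the displayed label, at the cost of a short delay when the scene really changes.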

Building a DL App and getting
$10 million in funding
(or a PhD)

DeepX Toolkit
Nicholas D. Lane et al, “DXTK : Enabling Resource-efficient Deep Learning on Mobile and Embedded Devices with the DeepX Toolkit",2016

EIE : Efficient Inference Engine on Compressed DNNs
Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark Horowitz, William Dally, "EIE: Efficient Inference Engine on Compressed Deep Neural Network", 2016
189x faster on CPU
13x faster on GPU

How to access the slides in 1 second
Link posted here -> @anirudhkoul
