Jameson Toole
Creating smaller, faster, production-worthy
mobile machine learning models
O’Reilly AI London, 2019
“We showcase this approach by training an 8.3 billion parameter
transformer language model with 8-way model parallelism and 64-way
data parallelism on 512 GPUs, making it the largest transformer based
language model ever trained at 24x the size of BERT and 5.6x the size of
GPT-2.” - MegatronLM, 2019
Are we going in the right direction?
https://www.technologyreview.com/s/613630/training-a-single-ai-model-can-emit-as-much-carbon-as-five-cars-in-their-lifetimes/
Training MegatronLM from scratch: 0.3 kW x 220 hours x 512 GPUs ≈
33,800 kWh
~3X the yearly energy consumption of the average American
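A quick back-of-the-envelope check of that figure in Python, assuming roughly 0.3 kW per data-center GPU:

    # Back-of-the-envelope check of the training-energy figure above.
    gpu_power_kw = 0.3   # approximate draw of one data-center GPU
    hours = 220          # reported training time
    num_gpus = 512

    energy_kwh = gpu_power_kw * hours * num_gpus
    print(energy_kwh)    # 33792.0 with these rounded inputs -- roughly
                         # 3x the average American's yearly energy use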
Does my model enable the largest number of
people to iterate as fast as possible, using the
fewest resources, on the most devices?
How do you teach a microwave its name?
Edge intelligence: small,
efficient neural networks
that run directly
on-device.
How do you teach a _____ to _____?
Edge Intelligence is necessary and inevitable.
Latency: too much data, arriving too fast, to round-trip to the cloud
Power: radios use too much energy
Connectivity: internet access isn’t guaranteed
Cost: compute and bandwidth aren’t free
Privacy: some data should stay in the hands of users
Most intelligence will be at the edge.
● Servers: <100 million
● Phones: 3 billion
● IoT devices: 12 billion
● Embedded devices: 150 billion
The Edge Intelligence lifecycle.
Model selection
● 75 MB: average size of a Top-100 app
● 348 KB: SRAM on the SparkFun Edge development board
Model selection: macro-architecture
Design Principles
● Keep activation maps large by downsampling later in the network or by
using atrous (dilated) convolutions (see the sketch below)
● Use more channels, but fewer layers
● Spend more time optimizing expensive input and output blocks; they
are usually 15-25% of your computation cost
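A minimal PyTorch sketch of the first principle (the channel counts and input size are hypothetical): a dilated convolution grows the receptive field without shrinking the activation map the way strided downsampling does.

    import torch
    import torch.nn as nn

    x = torch.randn(1, 32, 56, 56)  # hypothetical activation map

    # Strided downsampling halves the activation map:
    down = nn.Conv2d(32, 64, 3, stride=2, padding=1)
    print(down(x).shape)    # torch.Size([1, 64, 28, 28])

    # An atrous (dilated) convolution grows the receptive field
    # while keeping the activation map large:
    atrous = nn.Conv2d(32, 64, 3, dilation=2, padding=2)
    print(atrous(x).shape)  # torch.Size([1, 64, 56, 56])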
Model selection: macro-architecture
Backbones
● MobileNet (~20 MB)
● SqueezeNet (~5 MB)
Layers
● Depthwise Separable
Convolutions
● Bilinear upsampling
8-9X reduction in
computation cost
https://arxiv.org/abs/1704.04861
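A minimal sketch of where that 8-9X reduction comes from, comparing parameter counts for a standard convolution and a depthwise separable one (channel counts are hypothetical; parameter count is a proxy for multiply-adds here):

    import torch.nn as nn

    cin, cout, k = 128, 128, 3

    standard = nn.Conv2d(cin, cout, k, padding=1)

    separable = nn.Sequential(
        nn.Conv2d(cin, cin, k, padding=1, groups=cin),  # depthwise: one filter per channel
        nn.Conv2d(cin, cout, 1),                        # pointwise: 1x1 channel mixing
    )

    count = lambda m: sum(p.numel() for p in m.parameters())
    print(count(standard) / count(separable))  # ~8.3x fewer parameters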
Model selection: micro-architecture
Design Principles
● Add a width multiplier w to control the number of parameters with a
single hyperparameter: per-layer parameters scale as kernel x kernel x
channels x w (see the sketch below)
● Use 1x1 convolutions instead of 3x3 convolutions where possible
● Arrange layers so they can be fused before inference (e.g. bias + batch
norm)
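A sketch of the width-multiplier principle, using a hypothetical conv block whose channel counts all scale with a single hyperparameter w:

    import torch.nn as nn

    def conv_block(cin, cout, w=1.0):
        """A conv block whose channel counts scale with width multiplier w."""
        cin, cout = max(1, int(cin * w)), max(1, int(cout * w))
        return nn.Sequential(
            nn.Conv2d(cin, cout, 3, padding=1, bias=False),  # bias folds into BN
            nn.BatchNorm2d(cout),  # BN can be fused into the conv before inference
            nn.ReLU(),
        )

    count = lambda m: sum(p.numel() for p in m.parameters())
    print(count(conv_block(32, 64, w=1.0)))   # 18,560 parameters
    print(count(conv_block(32, 64, w=0.25)))  # 1,184 -- ~16x fewer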
Training small, fast models
Most neural networks are massively
over-parameterized.
Training small, fast models: distillation
Knowledge distillation: a smaller “student”
network learns from a larger “teacher” network.
Results:
1. ResNet on CIFAR-10:
a. 46X smaller
b. 10% less accurate
2. ResNet on ImageNet:
a. 2X smaller
b. 2% less accurate
3. TinyBERT on SQuAD:
a. 7.5X smaller
b. 3% less accurate
https://nervanasystems.github.io/distiller/knowledge_distillation.html
https://arxiv.org/abs/1802.05668v1
https://arxiv.org/abs/1909.10351v2
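A minimal sketch of a standard distillation loss (soft targets in the style of Hinton et al.; the temperature T and mixing weight alpha are hypothetical choices):

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
        # Soft targets: match the teacher's softened output distribution.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=1),
            F.softmax(teacher_logits / T, dim=1),
            reduction="batchmean",
        ) * (T * T)  # rescale gradients, as in Hinton et al.
        # Hard targets: ordinary cross-entropy against the true labels.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard

    # Hypothetical usage: logits from a large teacher and a small student.
    student, teacher = torch.randn(8, 10), torch.randn(8, 10)
    labels = torch.randint(0, 10, (8,))
    loss = distillation_loss(student, teacher, labels)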
Training small, fast models: pruning
Iterative pruning: periodically removing
unimportant weights and/or filters during
training.
Results:
1. AlexNet and VGG on ImageNet:
a. Weight Level: 9-11X smaller
b. Filter Level: 2-3X smaller
c. No accuracy loss
2. No clear consensus on whether pruning
is required vs training smaller networks
from scratch.
https://arxiv.org/abs/1506.02626
https://arxiv.org/abs/1608.08710
https://arxiv.org/abs/1810.05270v2
https://arxiv.org/abs/1510.00149v5
[Figure: a weight matrix before and after weight-level pruning; the
smallest-magnitude entries (the 1s) are zeroed out.]
Weight-level pruning: smallest models, but not always faster
Filter-level pruning: smaller and faster
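A minimal sketch of one weight-level magnitude-pruning step, matching the figure above (the pruning fraction is a hypothetical choice):

    import torch

    def magnitude_prune(weight, fraction=0.5):
        """Zero out the smallest `fraction` of weights by absolute value."""
        k = int(weight.numel() * fraction)
        if k == 0:
            return weight
        threshold = weight.abs().flatten().kthvalue(k).values
        return weight * (weight.abs() > threshold).float()

    # During iterative pruning this runs periodically between training
    # epochs, giving the network time to recover any lost accuracy.
    w = torch.randn(64, 64)
    print(magnitude_prune(w).eq(0).float().mean())  # ~0.5 of weights zeroed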
Compressing models via quantization
Quantizing 32-bit floating-point weights to low-precision
integers decreases size and (sometimes) increases
speed.
https://medium.com/@kaustavtamuly/compressing-and-accelerating-high-dimensional-neural-networks-6b501983c0c8
Compressing models via quantization
Post-training quantization: train networks normally, then quantize once after training.
Training-aware quantization: simulate quantization during training so the network learns to
compensate for the reduced precision.
Weights and activations: quantize both weights and activations to increase speed
Results:
1. Post-training 8-bit quantization: 4X smaller with <2% accuracy loss
2. Training-aware quantization: 8-16X smaller with minimal accuracy loss
3. Quantizing weights and activations can result in a 2-3X speed increase on CPUs
https://arxiv.org/abs/1806.08342
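As an illustration, post-training quantization is nearly a one-liner with the TensorFlow Lite converter; a minimal sketch (the stand-in model is hypothetical):

    import tensorflow as tf

    # Stand-in model; in practice this is your trained network.
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(32,))])

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]  # post-training quantization
    tflite_model = converter.convert()  # weights stored ~4x smaller

    with open("model_quant.tflite", "wb") as f:
        f.write(tflite_model)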
Deployment: embracing combinatorics
Design Principles
● Train multiple models targeting different devices: OS x device (see
the sketch after this list)
● Use native formats and frameworks
● Leverage available DSPs
● Monitor performance across devices
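In practice, the first principle often reduces to a lookup from platform to model artifact; a hypothetical sketch of such a deployment matrix (all names here are illustrative):

    # Hypothetical deployment matrix: each (OS, device tier) pair maps to
    # a model exported in that platform's native format.
    DEPLOYMENT_MATRIX = {
        ("ios", "high_end"): "model_large.mlmodel",         # Core ML
        ("ios", "low_end"): "model_small.mlmodel",
        ("android", "high_end"): "model_large.tflite",      # TensorFlow Lite
        ("android", "low_end"): "model_small_quant.tflite",
    }

    def pick_model(os_name: str, tier: str) -> str:
        return DEPLOYMENT_MATRIX[(os_name, tier)]

    print(pick_model("android", "low_end"))  # model_small_quant.tflite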
Putting it all together
Edge Intelligence Lifecycle
● Model selection: use efficient layers, parameterize model size
● Training: distill / prune for 2-10X smaller models, little accuracy loss
● Quantization: 8-bit models 4X smaller, 2-3X faster, no accuracy loss
● Deployment: use native formats that leverage available DSPs
● Improvement: put the right model on the right device at the right time
Putting it all together
Before: 1.6 million parameters, 6,327 KB, 7 fps on iPhone X
After: 6,300 parameters, 28 KB, 50+ fps on iPhone X
~225x smaller
Putting it all together
“TinyBERT is empirically effective and achieves comparable results with
BERT in GLUE datasets, while being 7.5x smaller and 9.4x faster on
inference.” - Jiao et al
“Our method reduced the size of VGG-16 by 49x from 552MB to 11.3MB,
again with no loss of accuracy.” - Han et al
“The model itself takes up less than 20KB of Flash storage space … and it
only needs 30KB of RAM to operate.” - Peter Warden at TensorFlow Dev
Summit 2019
Open questions and future work
Need better support for quantized operations.
Need more rigorous study of model optimization vs task complexity.
Will platform-aware architecture search be helpful?
Can MLIR solve the combinatorics problem?
Complete Platform for Edge Intelligence
Benefits of using Fritz
Mobile Developers
● Prepared + Pretrained
● Simple APIs
● Fast, Secure, On-device
Machine Learning Engineers
● Iterate on Mobile
● Benchmark + Optimize
● Analytics
Try yourself:
Fritz AI Studio
App Store
Google Play
Working at the edge?
Questions?
@jamesonthecrow
jameson@fritz.ai
https://www.fritz.ai
Join the community!
heartbeat.fritz.ai