Intelligence at scale through AI model efficiency
April 6, 2021 @QCOMResearch
Qualcomm Technologies, Inc.
2
Agenda
• Why efficient machine learning is necessary for AI to proliferate
• Our latest research to make AI models more efficient
• Our open-source projects to scale efficient AI
3
AI is being used all around us
Increasing productivity, enhancing collaboration, and transforming industries: smartphones, smart homes, video monitoring, video conferencing, extended reality, smart cities, smart factories, autonomous vehicles
AI video analysis is on the rise
Trend toward more cameras, higher resolution, and increased frame rate across devices
4
Deep neural networks are energy hungry and growing fast
AI is being powered by the explosive growth of deep neural networks
Will we have reached the capacity of the human brain? The energy efficiency of a brain is 100x better than current hardware
Chart: weight parameter count (log scale, 10^0 to 10^14) vs. year (1940-2030). 1943: first NN (N ≈ 10); 1988: NetTalk (N ≈ 20K); 2009: Hinton's Deep Belief Net (N ≈ 10M); 2013: Google/Y! (N ≈ 1B); 2017: very large neural networks (N = 137B); 2021: extremely large neural networks (N = 1.6T); 2025 projection: N = 100T = 10^14
Source: Welling
5
Power and thermal efficiency are essential for on-device AI
The challenge of AI workloads: very compute intensive; large, complicated neural network models; complex concurrencies; always-on; real-time
Constrained mobile environment: must be thermally efficient for sleek, ultra-light designs; storage/memory bandwidth limitations; requires long battery life for all-day use
6
Holistic model efficiency research
Multiple axes to shrink AI models and efficiently run them on hardware:
Quantization: learning to reduce bit-precision while keeping desired accuracy
Compression: learning to prune the model while keeping desired accuracy
Compilation: learning to compile AI models for efficient hardware execution
Neural architecture search: learning to design smaller neural networks that are on par with or outperform hand-designed architectures on real hardware
7
Leading research to efficiently quantize AI models
Promising results show that low-precision integer inference can become widespread
Automated reduction in the precision of weights and activations while maintaining accuracy
Virtually the same accuracy between an FP32 and a quantized AI model through:
• Automated, data-free, post-training methods
• An automated training-based mixed-precision method
Significant performance per watt improvements through quantization
Models trained at high precision: 32-bit floating point (3452.3194)
Inference at lower precision:
• 16-bit integer (3452): up to 4X increase in performance per watt from savings in memory and compute1
• 8-bit integer (255): up to 16X increase in performance per watt from savings in memory and compute1
• 4-bit integer (15): up to 64X increase in performance per watt from savings in memory and compute1
1: FP32 model compared to quantized model
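To make the precision ladder concrete, here is a minimal sketch of symmetric uniform quantization in numpy: values are mapped onto an integer grid via a single per-tensor scale, and the reconstruction error grows as the grid coarsens from INT16 to INT4. The tensor, the per-tensor symmetric scheme, and the function names are illustrative, not a specific AIMET implementation.

```python
import numpy as np

def quantize(x, num_bits=8):
    """Symmetric uniform quantization: FP32 -> signed integer grid."""
    qmax = 2 ** (num_bits - 1) - 1          # e.g. 127 for INT8, 7 for INT4
    scale = np.abs(x).max() / qmax          # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

weights = np.random.randn(64, 64).astype(np.float32)
for bits in (16, 8, 4):
    q, s = quantize(weights, bits)
    err = np.abs(dequantize(q, s) - weights).mean()
    print(f"INT{bits}: mean abs error {err:.5f}, {32 // bits}x smaller than FP32")
```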
8
Pushing the limits of what's possible with quantization
Data-free quantization
How can we make quantization as simple as possible?
Created an automated method that addresses bias and imbalance in weight ranges: no training, data free
SOTA 8-bit results: making 8-bit weight quantization ubiquitous
<1% accuracy drop for MobileNet V2 against the FP32 model
Data-Free Quantization Through Weight Equalization and Bias Correction (Nagel, van Baalen, et al., ICCV 2019)
AdaRound
Is rounding to the nearest value the best approach for quantization?
Created an automated method for finding the best rounding choice: no training, minimal unlabeled data
SOTA 4-bit weight results: making 4-bit weight quantization ubiquitous
<2.5% accuracy drop for MobileNet V2 against the FP32 model
Up or Down? Adaptive Rounding for Post-Training Quantization (Nagel, Amjad, et al., ICML 2020)
Bayesian bits
Can we quantize layers to different bit widths based on precision sensitivity?
Created a novel method to learn mixed-precision quantization: training required, training data required, jointly learns bit-width precision and pruning
SOTA mixed-precision results: automating mixed-precision quantization and enabling the tradeoff between accuracy and kernel bit-width
<1% accuracy drop for MobileNet V2 against the FP32 model, for a mixed-precision model with computational complexity equivalent to a 4-bit weight model
Bayesian Bits: Unifying Quantization and Pruning (van Baalen, Louizos, et al., NeurIPS 2020)
SOTA: state-of-the-art
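AdaRound's core question can be answered empirically in a few lines: with a fixed quantization grid, choosing per weight whether to round up or down so that the layer's output error is minimized can beat round-to-nearest. The brute-force enumeration below is purely illustrative; AdaRound itself learns the rounding choice via a continuous relaxation rather than enumerating it.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=4)            # one tiny "layer" with 4 weights
x = rng.normal(size=(256, 4))     # calibration activations
scale = np.abs(w).max() / 7       # 4-bit symmetric grid

nearest = np.round(w / scale) * scale
best_err = np.inf
# Enumerate every up/down rounding choice (2^4 options) and keep the one
# that minimizes the layer-output MSE -- the quantity AdaRound optimizes.
for choice in itertools.product([np.floor, np.ceil], repeat=len(w)):
    wq = np.array([f(v / scale) for f, v in zip(choice, w)]) * scale
    err = np.mean((x @ w - x @ wq) ** 2)
    best_err = min(best_err, err)

print("output MSE, round-to-nearest:   ", np.mean((x @ w - x @ nearest) ** 2))
print("output MSE, best up/down choice:", best_err)
```

Round-to-nearest is one of the enumerated choices, so the optimized rounding is never worse and is usually strictly better on the layer output.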
9
Optimizing and deploying state-of-the-art AI models for diverse scenarios at scale is challenging
Neural network complexity: many state-of-the-art neural network solutions are large, complex, and do not run efficiently on target hardware
Neural network diversity: for different tasks and use cases, many different neural networks are required
Device diversity: neural networks must be deployed to many different devices with different configurations and changing software
Cost: compute and engineering resources for training plus evaluation are too costly and time consuming
10
NAS: Neural Architecture Search
An automated way to learn a network topology that can achieve the best performance on a certain task
Search space: the set of operations and how they can be connected to form valid network architectures
Search algorithm: a method for sampling a population of good network architecture candidates
Evaluation strategy: a method to estimate the performance of sampled network architectures
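These three ingredients map onto a generic NAS loop. The sketch below uses random search and a dummy scoring function; the names `sample` and `estimate_accuracy` and the toy search space are illustrative, and real systems differ mainly in how those two pieces are implemented.

```python
import random

# Search space: per-block choices of operation, kernel size, and depth.
SEARCH_SPACE = {
    "op": ["conv", "dw_conv", "grouped_conv"],
    "kernel": [3, 5, 7],
    "depth": [1, 2, 3, 4],
}
NUM_BLOCKS = 5

def sample():
    """Search algorithm (here: uniform random sampling of candidates)."""
    return [{k: random.choice(v) for k, v in SEARCH_SPACE.items()}
            for _ in range(NUM_BLOCKS)]

def estimate_accuracy(arch):
    """Evaluation strategy: stand-in for training or a learned predictor."""
    return sum(b["kernel"] * b["depth"] for b in arch) / 100.0  # dummy proxy

best = max((sample() for _ in range(100)), key=estimate_accuracy)
print("best architecture found:", best)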
11
Existing NAS solutions do not address all the challenges
High cost: brute-force search is expensive, >40,000 epochs per platform
Lack of diverse search: hard to search in diverse spaces with different block types, attention, and activations; repeated training phase for every new scenario
Do not scale: repeated training phase for every new device, >40,000 epochs per platform
Unreliable hardware models: require differentiable cost functions; repeated training phase for every new device
12
Introducing new AI research: DONNA (Distilling Optimal Neural Network Architectures)
Efficient NAS with hardware-aware optimization: a scalable method that finds Pareto-optimal network architectures in terms of accuracy and latency for any hardware platform at low cost
DONNA starts from an oversized pretrained reference architecture and produces a set of Pareto-optimal network architectures
Low cost: low start-up cost of 1000-4000 epochs, equivalent to training 2-10 networks from scratch
Diverse search to find the best models: supports diverse spaces with different cell types, attention, and activation functions (ReLU, Swish, etc.)
Scalable: scales to many hardware devices at minimal cost
Reliable hardware measurements: uses direct hardware measurements instead of a potentially inaccurate hardware model
Distilling Optimal Neural Networks: Rapid Search in Diverse Spaces (Moons et al., arXiv 2020)
13
DONNA 4-step process. Objective: build an accuracy model of the search space once, then deploy to many scenarios
Step A: define reference and search space once
Define the backbone (blocks 1-5): fixed channels, head, and stem
Varying parameters: kernel size, expansion factors, network depth, network width, attention/activation, different efficient layer types
14
Define reference architecture and search space once
A diverse search space is essential for finding optimal architectures with higher accuracy
Select reference architecture: the largest model in the search space
Chop the NN into blocks: fix the STEM, HEAD, # blocks, strides, and # channels at block edges
Choose a diverse search space: a factorized hierarchical search space, including variable kernel size, expansion rate, depth, # channels, cell type, activation, and attention
Reference backbone: STEM (Conv 3x3 s2 + DW Conv, ch=32) -> blocks 1-5 (strides 2, 2, 2, 1, 2; block-edge channels 32, 64, 96, 128, 196, 256) -> HEAD (Avg -> Conv 1x1, ch=1536 -> FC)
Per-block search options: kernel 3, 5, 7; expand 2, 3, 4, 6; depth 1, 2, 3, 4; attention SE / no SE; activation ReLU/Swish; cell type grouped, DW, ...; width scale 0.5x, 1.0x
Ch: channel; SE: Squeeze-and-Excitation
15
DONNA 4-step process. Objective: build an accuracy model of the search space once, then deploy to many scenarios
Step B: build an accuracy model via knowledge distillation (KD) once
Approximate ideal projections of the reference model through KD: each block 1-5 in the search space is trained to mimic the corresponding reference block and scored by its MSE against that block's output
Use the quality of the blockwise approximations to build the accuracy model
16
Build the accuracy predictor via BKD once (BKD: blockwise knowledge distillation)
A low-cost, hardware-agnostic training phase
Block library: pretrain all blocks in the search space through blockwise knowledge distillation. Fast block training, trivially parallelized, over a broad search space; yields block pretrained weights and block quality metrics
Architecture library: quickly finetune a representative set of architectures. Fast network training; only 20-30 finetuned networks are required
Accuracy predictor: fit a linear regression model (regularized ridge regression) for accurate predictions
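A sketch of the predictor-fitting step: suppose each finetuned architecture is described by its per-block distillation MSEs (its block quality metrics) and labeled with its measured top-1 accuracy; a regularized ridge regression then scores any unseen candidate from its block MSEs alone. The synthetic data and exact feature choice are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Features: one distillation MSE per block for each of ~25 finetuned
# architectures; target: their measured top-1 accuracies.
block_mse = rng.uniform(0.0, 0.2, size=(25, 5))
top1 = 0.78 - block_mse.sum(axis=1) + rng.normal(0, 0.002, size=25)

predictor = Ridge(alpha=1.0).fit(block_mse, top1)

# Any candidate in the search space can now be scored from its block
# MSEs alone -- no full-network training needed.
candidate = rng.uniform(0.0, 0.2, size=(1, 5))
print("predicted top-1:", predictor.predict(candidate)[0])
```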
17
DONNA 4-step process. Objective: build an accuracy model of the search space once, then deploy to many scenarios
Step C: evolutionary search in 24h
Scenario-specific search over HW latency vs. predicted accuracy, covering scenarios such as different compiler versions and different image sizes
Qualcomm Snapdragon is a product of Qualcomm Technologies, Inc. and/or its subsidiaries.
18
Evolutionary search with real hardware measurements
Scenario-specific search allows users to select optimal architectures for real-life deployments
An NSGA-II sampling algorithm proposes end-to-end models; the task accuracy predictor supplies predicted task accuracy, and latency is measured on the target HW device
Quick turnaround time: results in about 1 day using one measurement device
Accurate scenario-specific search: captures all intricacies of the hardware platform and software, e.g. run-time version or devices
NSGA: Non-dominated Sorting Genetic Algorithm
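A simplified stand-in for this loop: candidates are scored on two objectives (predicted accuracy from the step-B model, latency as measured on the device), and only non-dominated candidates seed the next generation. DONNA uses NSGA-II; the plain Pareto filter, stub scoring functions, and real-valued genomes below are illustrative simplifications.

```python
import random

def predicted_accuracy(arch):        # stand-in for the BKD accuracy predictor
    return sum(arch) / len(arch)

def measured_latency(arch):          # stand-in for an on-device measurement
    return sum(a * a for a in arch)

def dominates(p, q):
    """p dominates q: no worse on both objectives, strictly better on one."""
    return p[0] >= q[0] and p[1] <= q[1] and (p[0] > q[0] or p[1] < q[1])

def pareto_front(pop):
    scored = [((predicted_accuracy(a), measured_latency(a)), a) for a in pop]
    return [a for s, a in scored
            if not any(dominates(t, s) for t, _ in scored)]

def mutate(arch):
    a = list(arch)
    a[random.randrange(len(a))] = random.uniform(0.0, 1.0)
    return a

population = [[random.uniform(0.0, 1.0) for _ in range(5)] for _ in range(20)]
for _ in range(50):                  # evolutionary loop
    front = pareto_front(population)
    population = front + [mutate(random.choice(front))
                          for _ in range(20 - len(front))]

print(len(pareto_front(population)), "architectures on the accuracy/latency front")
```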
19
DONNA 4-step process. Objective: build an accuracy model of the search space once, then deploy to many scenarios
Step D: sample and finetune
Use the KD-initialized blocks from step B to finetune any network in the search space in 15-50 epochs instead of 450
20
Quickly finetune predicted Pareto-optimal architectures
Finetune to reach full accuracy and complete hardware-aware optimization for on-device AI deployments
Sampled networks start from the block pretrained weights and are finetuned with soft distillation on the teacher logits of the BKD-reference network (soft cross-entropy) combined with cross-entropy on ground-truth labels
This turns the predicted accuracy vs. HW latency front into a confirmed accuracy vs. HW latency front
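The finetuning loss sketched below combines hard cross-entropy on ground-truth labels with soft cross-entropy against the teacher's logits, the standard knowledge-distillation recipe this step describes; the temperature and mixing weight are assumed values, not numbers from the deck.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hard CE on labels + soft CE on temperature-scaled teacher logits."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients match the hard-loss scale
    return alpha * hard + (1.0 - alpha) * soft

# Example with random tensors standing in for a training batch.
s = torch.randn(8, 10)           # student logits
t = torch.randn(8, 10)           # teacher (BKD-reference) logits
y = torch.randint(0, 10, (8,))   # ground-truth labels
print(distillation_loss(s, t, y))
```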
21
DONNA finds state-of-the-art networks for on-device scenarios
Quickly optimize and make tradeoffs in model accuracy with respect to the deployment conditions that matter
Plots: top-1 validation accuracy [%] vs. # of parameters [M], FLOPS, desktop GPU latency [FPS], mobile SoC latency1 [FPS] (Adreno 660 GPU), and mobile SoC latency2 [FPS] (Hexagon 780 processor), at 224x224 and 672x672 image sizes. DONNA models are roughly 20% faster at similar accuracy
1: Qualcomm® Adreno™ 660 GPU in the Snapdragon 888 running on the Samsung Galaxy S21. 2: Qualcomm® Hexagon™ 780 Processor in the Snapdragon 888 running on the Samsung Galaxy S21.
Qualcomm Adreno and Qualcomm Hexagon are products of Qualcomm Technologies, Inc. and/or its subsidiaries.
22
DONNA provides MnasNet-level diversity at 100x lower cost
DONNA efficiently finds optimal models over diverse scenarios; the cost of training is a handful of architectures*

Method  | Granularity | Macro-diversity | Search cost, 1 scenario [epochs] | Cost/scenario, 4 scenarios [epochs] | Cost/scenario, ∞ scenarios [epochs]
OFA     | Layer-level | Fixed           | 1200+10×[25-75]                  | 550-1050                            | 250-750
DNA     | Layer-level | Fixed           | 770+10×450                       | 4700                                | 4500
MnasNet | Block-level | Variable        | 40000+10×450                     | 44500                               | 44500
DONNA   | Block-level | Variable        | 4000+10×50                       | 1500                                | 500

*Training 1 model from scratch = 450 epochs
23
DONNA finds state-of-the-art networks for on-device scenarios
Quickly optimize and make tradeoffs in model accuracy with respect to the deployment conditions that matter
DONNA applies directly to downstream tasks and non-CNN neural architectures without conceptual code changes
Plots: predicted top-1 accuracy vs. # multiply-accumulate operations [FLOPS] for vision transformers (ViT-B, DeiT-B) alongside ResNet-50 and mobile models, and val mAP (%) for object detection
24
User perspective for DONNA
Build accuracy model of search space once, then deploy to many scenarios
IN: oversized pretrained reference architecture. OUT: set of Pareto-optimal network architectures
A: define a search space of smaller, faster network architectures
B: create an accuracy predictor for all network architectures (roughly equal to training 2-10 nets from scratch)
C: execute predictor-driven evolutionary search on-device (1 day per use case)
D: finetune the best searched models, 9-30x faster vs. regular training (50 GPU-hrs per net)
Deploy best models @ 3ms, 6ms, 9ms, ... on different chips
25
DONNA conclusions
DONNA shrinks big networks in a hardware-efficient way
DONNA can be rerun for any new device or setting within a day
DONNA works on many different tasks out of the box
DONNA enables scalability and allows models to be easily updated after small changes rather than starting from scratch
26
Leading AI research and fast commercialization
Driving the industry towards integer inference and power-efficient AI
Quantization research: Relaxed Quantization (ICLR 2019), Data-Free Quantization (ICCV 2019), AdaRound (ICML 2020), Bayesian Bits (NeurIPS 2020)
Quantization open-sourcing: AI Model Efficiency Toolkit (AIMET), AIMET Model Zoo
AIMET and AIMET Model Zoo are products of Qualcomm Innovation Center, Inc.
27
AIMET & AIMET Model Zoo
Open-source projects to scale model-efficient AI to the masses
28
AIMET makes AI models small
Open-sourced GitHub project that includes state-of-the-art quantization and compression techniques from Qualcomm AI Research
Pipeline: trained AI model (TensorFlow or PyTorch) -> AI Model Efficiency Toolkit (AIMET: compression + quantization) -> optimized AI model -> deployed AI model
Features: state-of-the-art network compression tools; state-of-the-art quantization tools; support for both TensorFlow and PyTorch; benchmarks and tests for many models; developed by professional software developers
If interested, please join the AIMET GitHub project: https://github.com/quic/aimet
29
AIMET: providing advanced model efficiency features and benefits
Benefits: lower memory bandwidth, lower power, lower storage, higher performance, maintains model accuracy, simple ease of use
Features:
Quantization: state-of-the-art INT8 and INT4 performance; post-training quantization methods, including Data-Free Quantization and Adaptive Rounding (AdaRound, coming soon); quantization-aware training; quantization simulation
Compression: efficient tensor decomposition and removal of redundant channels in convolution layers; spatial singular value decomposition (SVD); channel pruning
Visualization: analysis tools for drawing insights for quantization and compression; weight ranges; per-layer compression sensitivity
30
AIMET features and APIs are easy to use
Designed to fit naturally into the AI model development workflow for researchers, developers, and ISVs
APIs are invoked directly from the pipeline; supports TensorFlow and PyTorch
Architecture: user-friendly framework-specific APIs and direct algorithm APIs sit on top of a model optimization library (techniques to compress & quantize models), with extensions for other frameworks
Example calls:
  compress_model(model, eval_callback=obj_det_eval, compress_scheme=Scheme.spatial_svd, ...)
  equalize_model(model, ...)
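A sketch fleshing out the calls above into a PyTorch workflow. `equalize_model` and `QuantizationSimModel` are AIMET entry points, but their exact module paths and signatures vary across AIMET versions, and the model, input shape, and calibration data here are placeholders; treat this as illustrative rather than a verified recipe.

```python
import torch
from torchvision.models import mobilenet_v2

# AIMET imports (github.com/quic/aimet); paths and signatures are
# version-dependent -- check the project docs for your release.
from aimet_torch.cross_layer_equalization import equalize_model
from aimet_torch.quantsim import QuantizationSimModel

model = mobilenet_v2(pretrained=True).eval()

# Data-Free Quantization step: equalize weight ranges across layers.
equalize_model(model, input_shapes=(1, 3, 224, 224))

# Simulate quantized inference to estimate on-target accuracy.
sim = QuantizationSimModel(model, dummy_input=torch.randn(1, 3, 224, 224))

def calibrate(sim_model, _):
    # Stand-in for a loop over real calibration data.
    sim_model(torch.randn(16, 3, 224, 224))

sim.compute_encodings(forward_pass_callback=calibrate,
                      forward_pass_callback_args=None)
# sim.model can now be evaluated like the original model, with
# quantization effects included.
```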
31
Data-Free Quantization (DFQ) results in AIMET
Post-training technique enabling INT8 inference with very minimal loss in accuracy
Pipeline: cross-layer equalization (equalize weight ranges, then bias absorption to avoid high activation ranges) -> weight quantization -> bias correction (measure and correct the shift in layer outputs) -> activation range estimation (estimate activation ranges for quantization)
DFQ example results, <1% reduction in accuracy between FP32 and INT8: MobileNet-v2 (top-1 accuracy), ResNet-50 (top-1 accuracy), DeepLabv3 (mean intersection over union)
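The "equalize weight ranges" step has a closed form for two layers joined by a ReLU: since ReLU is positive-homogeneous, output channel i of the first layer can be scaled by 1/s_i and the matching input channel of the second layer by s_i without changing the network output, and s_i = sqrt(r1_i / r2_i) makes the two per-channel weight ranges equal. A minimal numpy check (layer sizes and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two layers joined by ReLU: y = relu(W1 @ x + b1); z = W2 @ y.
w1 = rng.normal(size=(8, 16)) * rng.uniform(0.1, 10.0, size=(8, 1))  # imbalanced
b1 = rng.normal(size=8)
w2 = rng.normal(size=(4, 8))

def forward(w1, b1, w2, x):
    return w2 @ np.maximum(w1 @ x + b1, 0.0)

r1 = np.abs(w1).max(axis=1)   # per-output-channel range of layer 1
r2 = np.abs(w2).max(axis=0)   # per-input-channel range of layer 2
s = np.sqrt(r1 / r2)          # equalizing scale per channel

w1_eq, b1_eq, w2_eq = w1 / s[:, None], b1 / s, w2 * s[None, :]

x = rng.normal(size=16)
print("output unchanged:", np.allclose(forward(w1, b1, w2, x),
                                       forward(w1_eq, b1_eq, w2_eq, x)))
print("equalized ranges match:",
      np.allclose(np.abs(w1_eq).max(axis=1), np.abs(w2_eq).max(axis=0)))
```

Equalized weight ranges mean a single quantization grid fits every channel well, which is why this step recovers most of the INT8 accuracy without any training data.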
32
AdaRound is coming soon to AIMET
A post-training technique that makes INT8 quantization more accurate and INT4 quantization possible
Example, mean AP (AP: Average Precision): FP32: 82.20; INT8 baseline quantization: 49.85; INT8 AdaRound quantization: 81.21
<1% reduction in accuracy between FP32 and INT8 with AdaRound quantization
34
AIMET Model Zoo
Accurate pre-trained 8-bit quantized models: image classification, semantic segmentation, pose estimation, speech recognition, super resolution, object detection
35
AIMET Model Zoo includes popular quantized AI models
Accuracy is maintained for INT8 models: less than 1% loss*

Model                | Metric          | FP32   | INT8
ResNet-50 (v1)       | Top-1 accuracy* | 75.21% | 74.96%
MobileNet-v2-1.4     | Top-1 accuracy* | 75%    | 74.21%
EfficientNet Lite    | Top-1 accuracy* | 74.93% | 74.99%
SSD MobileNet-v2     | mAP*            | 0.2469 | 0.2456
RetinaNet            | mAP*            | 0.35   | 0.349
Pose estimation      | mAP*            | 0.383  | 0.379
SRGAN                | PSNR*           | 25.45  | 24.78
MobileNetV2          | Top-1 accuracy* | 71.67% | 71.14%
EfficientNet-lite0   | Top-1 accuracy* | 75.42% | 74.44%
DeepLabV3+           | mIoU*           | 72.62% | 72.22%
MobileNetV2-SSD-Lite | mAP*            | 68.7%  | 68.6%
Pose estimation      | mAP*            | 0.364  | 0.359
SRGAN                | PSNR*           | 25.51  | 25.5
DeepSpeech2          | WER*            | 9.92%  | 10.22%

*: Comparison between FP32 model and INT8 model quantized with AIMET. For further details, check out: https://github.com/quic/aimet-model-zoo/
36
AIMET Model Zoo models preserve accuracy
The visual difference in model accuracy between AIMET and baseline quantization methods is telling
For DeepLabv3+ semantic segmentation, AIMET quantization maintains accuracy (accurate segmentation), while the baseline quantization method is inaccurate (inaccurate segmentation)
Baseline quantization: post-training quantization using a min-max based quantization grid. AIMET quantization: model fine-tuned using quantization-aware training in AIMET
37
Join our open-source projects
AIMET: state-of-the-art quantization and compression techniques (github.com/quic/aimet)
AIMET Model Zoo: accurate pre-trained 8-bit quantized models (github.com/quic/aimet-model-zoo)
38
AI model efficiency is crucial for making AI ubiquitous, leading to smarter devices and enhanced lives
We are conducting leading research and development in AI model efficiency while maintaining accuracy
Our open-source projects, based on this leading research, are making it possible for the industry to adopt efficient AI models at scale
39
Questions? Connect with us
Follow us on: @QCOMResearch, www.qualcomm.com/news/onq, https://www.youtube.com/qualcomm, http://www.slideshare.net/qualcommwirelessevolution
For more information, visit us at: www.qualcomm.com & www.qualcomm.com/blog, and www.qualcomm.com/ai
Thank you
Nothing in these materials is an offer to sell any of the components or devices referenced herein.
©2018-2021 Qualcomm Technologies, Inc. and/or its affiliated companies. All Rights Reserved.
Qualcomm, Snapdragon, Adreno, and Hexagon are trademarks or registered trademarks of Qualcomm Incorporated. Other products and brand names may be trademarks or registered trademarks of their respective owners.
References in this presentation to "Qualcomm" may mean Qualcomm Incorporated, Qualcomm Technologies, Inc., and/or other subsidiaries or business units within the Qualcomm corporate structure, as applicable. Qualcomm Incorporated includes our licensing business, QTL, and the vast majority of our patent portfolio. Qualcomm Technologies, Inc., a subsidiary of Qualcomm Incorporated, operates, along with its subsidiaries, substantially all of our engineering, research and development functions, and substantially all of our products and services businesses, including our QCT semiconductor business.