Status Quo of
TensorFlow Lite on Edge
Devices
Koan-Sin Tan

freedom@computer.org

Aug 17th, 2019

COSCUP, Taipei, Taiwan
1
• disclaimer: Opinions Are My Own

• feel free to interrupt me if you have any questions
2
who i am
• Used open source before the term “open source” was coined

• A software guy, learned to use Unix and open
source software on VAX-11/780 running 4.3BSD

• Used to be a programming language junkie

• Worked on various system software, e.g., CPU
scheduling and power management of non-
CPU components

• Recently, on NN performance on edge devices
related stuff

• Contributed from time to time to TensorFlow
Lite

• started a command line label_image for
TFLite
https://github.com/tensorflow/tensorflow/releases/tag/v2.0.0-alpha0
http://gunkies.org/w/images/c/c1/DEC-VAX-11-780.jpg
3
Outline
• overview: or, say, why TFLite

• new features

• delegates: including new NNAPI delegate, GPU delegate,
and flex delegate,

• optimized kernels for ARM CPUs,

• various APIs: including Python, C, Objective-C, and Swift
ones, and

• misc, e.g., graph writer and Edge TPU.
4
Why TFLite?
• TensorFlow Lite

• TensorFlow is the most popular machine learning framework

• TFLite: a lightweight runtime for edge devices

• could be accelerated by GPU, DSP, or ASIC accelerators 

• PyTorch is catching up, but its acceleration support still lags far behind TFLite's

• Yes, there are other open source NN frameworks. None is as comprehensive as TF Lite, as far as I can tell
5
https://www.youtube.com/watch?v=Jjm7MT6W0Dc
Comprehensive?
6
Why NN on edge device,
esp. cell phones?
• Offline usages

• Latency

• Bandwidth

• Privacy

• Sensors
7
Offline usage
• we heard words such as “always-on” and “always-connected” back in the 3G days 🤔, but wireless communication is still unreliable
8
latency
• “There is an old network saying: Bandwidth problems
can be cured with money.  Latency problems are harder
because the speed of light is fixed — you can't bribe
God.” -- David D. Clark, MIT
9
https://en.wikipedia.org/wiki/David_D._Clark
Bandwidth
• Well, the bandwidth of wireless networks is not an easy problem either

• suppose you have an NN-based “portrait mode” (or, say, Bokeh effect) on an iPhone Xs Max (12 + 12 MP cameras)

• sending raw frames: (12+12)×10^6 pixels × (3×8) bits/pixel = 576 Mbits per frame

• at 30 fps, 576 × 30 ≈ 17.3 Gbits per second

• you know this is not feasible for now
10
Privacy
• you know you need privacy for
both your physical body and
your mobile device(s)
11
NN-based ML is already in
cell phones
• Google I/O 2017: Mobile First —> AI First

• TensorFlow Lite, Android Neural Network API

• Lots of stuff from Google blogs and papers, e.g., Google Lens, federated learning in Gboard

• Pixel Visual Core in Pixel 2/3, 2/3 XL: although it seems there is no way for developers to
use it as a general NN accelerator

• Apple announced CoreML, a machine learning framework, at WWDC 2017 (June 2017)

• Apple’s machine learning journal (https://machinelearning.apple.com/): how Apple uses CNNs and other machine learning techniques in the iPhone

• Neural Engine in A11/A11X/A12/A12X, available to developers via Core ML on A12
devices

• Computer Architecture: A Quantitative Approach, 6th Ed. (Nov, 2017) has a whole new chapter
on Domain Specific Architecture, actually NN accelerators.
12
actually there are many NNAPI-
enabled phones already
http://ai-benchmark.com/ranking_processors.html
mid June, 2019
13
fiercely competitive market
14
http://ai-benchmark.com/ranking_processors.html
Aug 16th, 2019
https://www.anandtech.com/show/13392/the-iphone-xs-xs-max-review-unveiling-the-silicon-secrets/5
• AnandTech is one of my favorite tech sites. Usually, it provides good analyses

• E.g., Apple’s CPUs

• cache sizes

• execution units

• latencies of various instructions

• Not good enough for NN accelerators on mobile phones

• floating-point VGG16, Inception V3, and ResNet34?

• come on, are you still in the Neolithic era?
Evolving fast: the slide I prepared in Nov 2018
15
TF Lite in Android Pie
• There are ‘libtflite.so’s in /system/lib and /system/lib64

• https://source.android.com/devices/tech/display/textclassifier
16
More TFLite use cases
17
Some TFLite clients
presented by TFLite guys
18
ML Kit
• https://developers.google.com/ml-kit/, part of Firebase

• Originally, only custom models were TFLite-based

• Now, as far as I can tell, vision
parts are using TFLite also
https://developers.google.com/ml-kit/ 19
• see appendix for Google Translate, Google Lens, Gboard,
and others
20
Some Progress That Makes NN on Edge Devices Really Viable
• “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size” [1]. A keynote at
ESWEEK 2017, “Keynote: Small Neural Nets Are Beautiful: Enabling Embedded Systems with Small Deep-
Neural-Network Architectures” [2]

• MobileNet V1 [3] and V2 [4]: Depthwise separable convolution [5] and inverted residuals and linear
bottlenecks [4]

• AutoML, e.g., 

• NASNet Mobile [6] and Mnasnet [7]

• MobileNet V3 [10] and EfficientNet [11]

• Quantization [8][9]

• How about pruning / compression stuff? As far as I know, not widely used yet

[1] https://arxiv.org/abs/1602.07360

[2] https://arxiv.org/abs/1710.02759

[3] https://arxiv.org/abs/1704.04861

[4] https://arxiv.org/abs/1801.04381

[5] https://www.di.ens.fr/data/publications/papers/phd_sifre.pdf

[6] https://arxiv.org/abs/1707.07012

[7] https://ai.googleblog.com/2018/08/mnasnet-towards-automating-design-of.html, https://arxiv.org/abs/1807.11626

[8] https://arxiv.org/abs/1712.05877

[9] https://arxiv.org/abs/1806.08342

[10] https://arxiv.org/abs/1905.02244

[11] https://arxiv.org/abs/1905.11946
21
• Michael Jordan published an
article on Medium named
“Artificial Intelligence — The
Revolution Hasn’t
Happened Yet” [1]

• Yes, but current deep learning driven stuff should be enough for the next few years

[1] https://medium.com/@mijordan3/artificial-intelligence-the-revolution-hasnt-happened-yet-5e1d5812e1e7
22
Why I Started Learning TF
Lite
• We heard about Android NN and TensorFlow Lite back in Google I/O 2017

• My COSCUP 2017 slide deck “TensorFlow on Android”

• https://www.slideshare.net/kstan2/tensorflow-on-android

• People knew a bit about Android NN API before it was
announced and released

• No information about TensorFlow Lite, at least to me,
before it was released in Nov, 2017
23
Quantization and
Accelerators
• Quantization

• Quantization is not new; people knew there was lots of redundancy in NN models back in pre-DNN days. Many quantization and compression/pruning techniques have been presented over the years. TFLite and its underlying gemmlowp (and NNAPI) delivered the first production-quality system supporting quantized uint8.

• accelerators (thru NNAPI in the beginning, and directly later)

• the CPU is not always the best processor for running NN models

• GPU, DSP, and other accelerators
24
TFLite and Android NN in
Google I/O 2017
• New TensorFlow runtime
• Optimized for mobile and
embedded apps

• Runs TensorFlow models on
device

• Leverage Android NN API

• Soon to be open sourced
from Google I/O 2017 video
25
Actual Android NN API
• Announced/published with Android 8.1
Preview 1

• Available to developers in the NDK

• yes, NDK

• The Android Neural Networks API (NNAPI)
is an Android C API designed for running
computationally intensive operations for
machine learning on mobile devices

• NNAPI is designed to provide a base layer
of functionality for higher-level machine
learning frameworks (such as TensorFlow
Lite, Caffe2, or others) that build and train
neural networks

• The API is available on all devices running
Android 8.1 (API level 27) or higher.
https://developer.android.com/ndk/images/nnapi/nnapi_architecture.png
26
Android NN on Pixel 2
• Only the CPU fallback was available on Oreo MR1

• Actually, you could see Android NN API-related code in AOSP right after the Oreo MR1 (8.1) release already

• user level code, see https://android.googlesource.com/platform/frameworks/ml/+/oreo-mr1-release

• HAL, see https://android.googlesource.com/platform/hardware/interfaces/+/oreo-mr1-release/
neuralnetworks/

• There is NN API 1.1 on Android Pie

• https://developer.android.com/about/versions/pie/android-9.0#nnapi

• adding support for nine new ops — Pad, BatchToSpaceND, SpaceToBatchND, Transpose, Strided
Slice, Mean, Div, Sub, and Squeeze

• In the Android P DP1/2 (https://developer.android.com/preview/download.html), there was a HVX
NN API 1.0 (yes, 1.0) driver. Gone after DP2. Not in recent Pie release. (See https://
android.googlesource.com/platform/hardware/qcom/neuralnetworks/hvxservice/ for source code)

• NN API 1.2, which supports 90+ ops, is in AOSP and will be in forthcoming Android Q (version 10)
27
So NNAPI accelerators
don’t work?
• Yes, I don’t know what happened to earlier Pixel phones

• I don’t have Pixel 3 to try

• Q beta 4 for Pixel 3a comes with an HVX accelerator driver that works. It’s an NNAPI 1.1 one, though.

• And remember what I showed in pp. 13 and 14, there are
many NNAPI-enabled phones already
28
Original TFLite APIs
• Java API: A convenience
wrapper around the C++ API
on Android

• C++ API: loads the
TensorFlow Lite model file and
invokes the Interpreter. The
same library is available on
both Android and iOS
https://www.tensorflow.org/mobile/tflite/
29
Other bindings
• Python and C APIs

• Python: introduced in TF 1.8.0, built into pip package in 1.9.0

• my label_image.py for tflite merged on Aug 9, 2018

• https://github.com/tensorflow/tensorflow/blob/master/tensorflow/
lite/examples/python/label_image.py

• https://github.com/tensorflow/tensorflow/tree/master/tensorflow/
lite/examples/python

• C API: introduced for Unity

• https://github.com/tensorflow/tensorflow/tree/master/tensorflow/
contrib/lite/experimental/c
30
How to Use it
31
• TFLite guys work hard
• documentation getting better and better
over < 2 yrs
• yes, sometimes you still have to “use the
source”
https://www.tensorflow.org/lite
TFLite Converter
https://www.tensorflow.org/lite/images/convert/workflow.svg
32
Basic Usage
• model: .tflite model

• resolver: if no custom ops, builtin op
resolver is enough

• interpreter: we need it to compute
the graph

• interpreter->AllocateTensors(): allocates stuff for you, e.g., input tensor(s)

• fill the input

• interpreter->Invoke(): run the graph

• process the output
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

std::unique_ptr<tflite::FlatBufferModel> model =
    tflite::FlatBufferModel::BuildFromFile(path_to_model);
tflite::ops::builtin::BuiltinOpResolver resolver;
std::unique_ptr<tflite::Interpreter> interpreter;
tflite::InterpreterBuilder(*model, resolver)(&interpreter);
// Resize input tensors, if desired.
interpreter->AllocateTensors();
float* input = interpreter->typed_input_tensor<float>(0);
// Fill `input`.
interpreter->Invoke();
// note: typed_output_tensor (the slide originally had a typo here)
float* output = interpreter->typed_output_tensor<float>(0);
33
more source code
• Check my COSCUP 2018 slide deck, which was for a talk
in a source code reading track, for more details

• https://www.slideshare.net/kstan2/open-source-nn-
frameworks-on-cellphones

• And I’ll have a more code-oriented talk on TFLite
delegates tomorrow
34
Interpreter
35
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/core/subgraph.cc#L734-L797
• A TFLite compute graph is a directed acyclic graph (DAG), so the interpreter traverses the topologically sorted graph node by node
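A simplified sketch of that loop, paraphrasing the Subgraph::Invoke() code linked above (member names follow subgraph.cc; error handling and many details omitted, so this is not the verbatim TensorFlow code):

TfLiteStatus Subgraph::Invoke() {
  // Walk the execution plan (a topological order of the DAG) and call
  // each node's registered invoke() function; a delegated partition
  // shows up here as a single node.
  for (int i = 0; i < static_cast<int>(execution_plan_.size()); ++i) {
    const int node_index = execution_plan_[i];
    TfLiteNode& node = nodes_and_registration_[node_index].first;
    const TfLiteRegistration& registration =
        nodes_and_registration_[node_index].second;
    if (registration.invoke(&context_, &node) != kTfLiteOk) {
      return kTfLiteError;
    }
  }
  return kTfLiteOk;
}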
[graph visualization: a MobileNet V1 (input 1×224×224×3, output 1×1001) where the whole graph has been replaced by a single TfLiteNnapiDelegate node; the list of 57 weight/bias tensors is elided]
NNAPI Delegate
• Previously, when a graph was delegated to NNAPI, it was kinda invisible to TFLite

• With the recent NNAPI delegate rewrite, it’s an op in TFLite now

• subgraph

• all-or-nothing —> per op
[graph visualization: the same MobileNet V1 as the full TFLite graph, i.e., alternating Conv2D / DepthwiseConv2D nodes with their weights and biases, ending with AveragePool2D, Conv2D, Squeeze, and Softmax; input 1×224×224×3, output 1×1001]
36
http://localhost:8080/, http://localhost:8090/
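For reference, a minimal sketch of applying the rewritten NNAPI delegate from C++, based on the nnapi_delegate.h header as of mid-2019 (assume `interpreter` was built as on the Basic Usage slide):

#include "tensorflow/lite/delegates/nnapi/nnapi_delegate.h"

// The rewritten, "stateful" NNAPI delegate; ops it cannot handle stay
// on the CPU (per-op partitioning instead of all-or-nothing).
tflite::StatefulNnApiDelegate nnapi_delegate;
if (interpreter->ModifyGraphWithDelegate(&nnapi_delegate) != kTfLiteOk) {
  // Handle the error; CPU kernels are still available.
}
interpreter->AllocateTensors();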
More Delegates
• Flex Delegate

• The set of ops supported by TFLite is relatively limited; TensorFlow Lite models can now use a subset of TensorFlow ops when the TFLite builtin ops are not sufficient

• GPU backend: no, not NNAPI

• OpenGL ES 3.1 Compute Shaders on Android devices

• Metal Compute Shaders on iOS device

• “in general the new GPU backend performs 2–7x faster than the floating point CPU
implementation for a wide range of diverse deep neural network models.”

https://www.tensorflow.org/lite/using_select_tf_ops

https://medium.com/tensorflow/tensorflow-lite-now-faster-with-mobile-gpus-developer-preview-e15797e6dee7

https://www.tensorflow.org/lite/performance/gpu

https://www.tensorflow.org/lite/performance/gpu_advanced
37
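A minimal sketch of using the GPU backend from C++, assuming the OpenGL-based delegate C API (gl_delegate.h) as of mid-2019; passing nullptr selects the default options:

#include "tensorflow/lite/delegates/gpu/gl_delegate.h"

// Create the delegate with default options and hand the graph over to it.
TfLiteDelegate* gpu_delegate = TfLiteGpuDelegateCreate(/*options=*/nullptr);
if (interpreter->ModifyGraphWithDelegate(gpu_delegate) != kTfLiteOk) {
  // Handle the error; CPU kernels are still available.
}
// ... AllocateTensors(), fill input, Invoke(), read output ...
TfLiteGpuDelegateDelete(gpu_delegate);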
Why a non-NNAPI delegate?
https://developer.android.com/about/dashboards
NNAPI-enabled devices ~7.5% around the end of Oct, 2018
38
NNAPI-enabled devices ~ 25.8% around May 7, 2019
https://developer.android.com/about/dashboards
39
40
GL ES compute shader capable devices ~ 50%
https://developer.android.com/about/dashboards
GPU Delegate Performance
• my quick and dirty benchmarks

• Android: https://github.com/freedomtan/
glDelegateBench

• iOS: https://github.com/freedomtan/
glDelegateBenchmark/
• at first, the GPU delegate was a binary-only release (aar for Android; pod for iOS)
• after the release of the GPU delegate source code, benchmark_model and label_image are able to use the GPU delegate
41
GPU delegate kernels
• Recently, the TFLite GPU delegate guys published a paper on how they designed it. It covers some interesting details:

• GPU backends require initialization involving shader compilation and optimization by the driver before inference

• PHWC4: P stands for plane

• Reshape is expensive on GPU

• RGBA is better than RGB on GPU

• a tensor of shape [B,H,W,5], for instance, is twice as expensive as [B,H,W,4], but about the same as [B,H,W,8]; so an architect can tune around those 4-channel boundaries rather than trying to optimize on other boundaries (see the layout sketch below)
https://arxiv.org/pdf/1907.01989.pdf
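To make the 4-channel-boundary point concrete, here is my reading of the PHWC4 layout (a hypothetical helper, not code from the paper or from TFLite): channels are padded up to a multiple of 4 and stored as ⌈C/4⌉ planes of H×W RGBA-like 4-vectors, so C=5 costs two planes, the same as C=8:

// Hypothetical index helper for PHWC4: element (h, w, c) of a [1,H,W,C]
// tensor stored as [ceil(C/4)][H][W][4].
inline int Phwc4Offset(int h, int w, int c, int H, int W) {
  const int plane = c / 4;  // which group of 4 channels
  const int lane = c % 4;   // slot inside the 4-vector
  return ((plane * H + h) * W + w) * 4 + lane;
}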
Faster ARM CPU kernels
• It’s available now; enabled by default for Android ARM64 since early June

• https://github.com/tensorflow/tensorflow/commit/
8924e67e034909bea0343631b9f9024c5a6da5c4

• ruy:

• four tuned fixed-point kernels: big/LITTLE (out-of-order/in-order), with or without dot-product instructions

• two tuned floating-point kernels
43
More on ruy
• matrix multiplication in AArch64 NEON
• sdot based kernels for either out-of-order CPUs, e.g., CA76, or in-order CPUs, e.g., CA55r1
• non sdot based kernels for either out-of-order CPUs, e.g., CA73, or in-order CPUs, e.g., CA53
• how the kernel is chosen: detection at run time instead of a hard-coded list (e.g., PyTorch’s cpuinfo)
• sdot or not: see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/
experimental/ruy/detect_dotprod.cc#L129-L157
• in-order or out-of-order: see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/
experimental/ruy/tune.cc, esp., https://github.com/tensorflow/tensorflow/blob/master/tensorflow/
lite/experimental/ruy/tune.cc#L102-L124
• doesn't need to list all possibilities, probably can handle future cores. Still cannot deal with
big.LITTLE cores
• thread pool: it seems to scale better than the one currently in use, so multi-threaded floating-point performance is much better

• before ruy, floating point: eigen thread pool; fixed-point: TFLite’s thread pool
44
Python API
• TensorFlow Lite Optimizing Converter (TOCO) —> tflite_convert, mainly Python-wrapped C++ code

• Python Interpreter: https://www.tensorflow.org/lite/convert/python_api#tensorflow_lite_python_interpreter_

• https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/g3doc/
convert/python_api.md

• https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/lite

• I sent label_image.py (merged, https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/examples/python) and mobilenet_ssd. Tried others such as DeepLab V3 on an RPi 3 B+.

• Quick to test, and you can use OpenCV to do the preprocessing and post-processing
45
C API
• https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/
experimental/c/c_api.h

• Started as a base for Unity, https://github.com/tensorflow/tensorflow/tree/
master/tensorflow/lite/experimental/examples/unity/TensorFlowLitePlugin

• FFI via C is much easier than C++

• Who uses it? Objective-C and Swift APIs

• my quick-and-dirty hacks for Pharo Smalltalk, https://github.com/
freedomtan/libtensorflow-pharo-bindings/blob/libtensorflowlite_c_hacks/
LibTensorFlow-Core/TensorFlowLiteCAPI.class.st
46
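A minimal sketch against the experimental c_api.h above (inference path only; `input_data`/`output_data` are caller-provided buffers, and error checks are omitted):

#include "tensorflow/lite/experimental/c/c_api.h"

TfLiteModel* model = TfLiteModelCreateFromFile("mobilenet_v1_1.0_224.tflite");
TfLiteInterpreterOptions* options = TfLiteInterpreterOptionsCreate();
TfLiteInterpreterOptionsSetNumThreads(options, 2);
TfLiteInterpreter* interpreter = TfLiteInterpreterCreate(model, options);

TfLiteInterpreterAllocateTensors(interpreter);
TfLiteTensor* input = TfLiteInterpreterGetInputTensor(interpreter, 0);
TfLiteTensorCopyFromBuffer(input, input_data, input_size_in_bytes);
TfLiteInterpreterInvoke(interpreter);
const TfLiteTensor* output = TfLiteInterpreterGetOutputTensor(interpreter, 0);
TfLiteTensorCopyToBuffer(output, output_data, output_size_in_bytes);

TfLiteInterpreterDelete(interpreter);
TfLiteInterpreterOptionsDelete(options);
TfLiteModelDelete(model);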
Yes, Smalltalk Is Alive
• Smalltalk is an object-oriented, dynamically typed, reflective programming language started in the 1970s

• Alan Kay, the creator of Smalltalk, coined the term Object-Oriented Programming (OOP).

• MVC, IDE, live programming http://pharo.org/web/files/teaser50.png
47
Smalltalk using TFLite C
API
48
There are more new things
• For example, microcontrollers

• See https://github.com/tensorflow/tensorflow/tree/
master/tensorflow/lite/experimental

• TFLite Micro and uTensor

• https://os.mbed.com/blog/entry/uTensor-and-Tensor-
Flow-Announcement/

• Yes, RNN-based models, including LSTM, are not doing
well (yet)
49
Google I/O 2019 updates
• new MLIR-based TF —> TFLite converter

• improved CPU backend: ruy

• on-device training: not ready yet?

• control flow support

• see more at https://www.youtube.com/watch?
v=Jjm7MT6W0Dc
50
why MLIR
51
https://medium.com/tensorflow/mlir-a-new-intermediate-
representation-and-compiler-framework-beba999ed18d
MLIR: Multi-Level Intermediate Representation for Compiler Infrastructure
52
MLIR for TFLite Converter
MLIR: Multi-Level Intermediate Representation for Compiler Infrastructure
53
TF graphdef .pb -> TFLite flatbuffer .tflite
• Build TensorFlow MLIR-related binaries:
bazel build --config opt tensorflow/compiler/mlir/...
• Get your model, e.g.,
wget http://download.tensorflow.org/models/mobilenet_v1_2018_02_22/mobilenet_v1_1.0_224.tgz
• Convert it:
./bazel-bin/tensorflow/compiler/mlir/lite/tf_tfl_translate -tf-input-shapes=1,224,224,3 -tf-input-data-types=DT_FLOAT -tf-output-arrays=MobilenetV1/Predictions/Reshape_1 /tmp/mobilenet_v1_1.0_224_frozen.pb --tf-input-arrays=input -o /tmp/foo.tflite
• Yes, it works like a charm. But not for quantized models: neither
./bazel-bin/tensorflow/compiler/mlir/lite/tf_tfl_translate -tf-input-shapes=1,224,224,3 -tf-input-data-types=DT_QUINT8 -tf-output-arrays=MobilenetV1/Predictions/Reshape_1 /tmp/mobilenet_v1_1.0_224_quant_frozen.pb --tf-input-arrays=input -o /tmp/bar.tflite
nor
./bazel-bin/tensorflow/compiler/mlir/lite/tf_tfl_translate -tf-input-shapes=1,224,224,3 -tf-input-data-types=DT_FLOAT -tf-output-arrays=MobilenetV1/Predictions/Reshape_1 /tmp/mobilenet_v1_1.0_224_quant_frozen.pb --tf-input-arrays=input -o /tmp/bar.tflite --tf-inference-type=TF_QUINT8
works
54
Google Edge TPU
• Announced back in Google
Next 2018 (July, 2018)

• Available to general developers
right before TensorFlow Dev
Summit 2019 (Mar, 2019)

• USB: Coral Accelerator

• Dev Board: Coral Dev Board

• More are coming, e.g., PCI-E
Accelerator and SOM

• Supported framework: TFLite
https://coral.withgoogle.com/products/
55
Edge TPU Software
• Updates released on April 11th, 2019

• Compiler: removed the restriction to specific architectures

• New TensorFlow Lite C++ API

• Updated Python API, mainly for multiple Edge TPUs

• Updated Mendel OS and Mendel Management Tool (MDT)

• Environmental Sensor Board, https://coral.withgoogle.com/products/environmental/

• May updates, May 29th, 2019

• Offline compiler

• MDT update

https://developers.googleblog.com/2019/04/updates-from-coral-new-compiler-and.html 

https://coral.withgoogle.com/news/updates-04-2019/

https://coral.withgoogle.com/news/updates-05-2019/
56
Edge TPU Software
• July updates, July 24th, 2019

• Updated Edge TPU Compiler and runtime: support for
models built using post-training quantization

• Updated Edge TPU Python library

• New on-device backpropagation API

• Updated weight imprinting API

• New TensorFlow Lite delegate for Edge TPU

https://coral.withgoogle.com/news/updates-07-2019/
57
Edge TPU’s canned model
• all ops that can be offloaded are packed into one op
The compiler creates a single custom op for all Edge TPU
compatible ops; anything else stays the same
https://coral.withgoogle.com/docs/edgetpu/models-intro/
58
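For example, with the offline compiler mentioned on the previous slides (file names illustrative; the compiler appends _edgetpu to the output):

edgetpu_compiler mobilenet_v1_1.0_224_quant.tflite
# produces mobilenet_v1_1.0_224_quant_edgetpu.tflite, with the supported
# subgraph fused into one edgetpu-custom-op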
[graph visualizations: MobileNet V1 compiled for Edge TPU: input 1×224×224×3 → edgetpu-custom-op → Softmax → 1×1001; SSD MobileNet V1: normalized_input_image_tensor 1×300×300×3 → edgetpu-custom-op → TFLite_Detection_PostProcess with four outputs (1×10×4 boxes, 1×10 classes, 1×10 scores, 1 count)]
EdgeTPU Delegate
• A dynamic delegate plugin interface was added recently. Currently, only the Edge TPU delegate uses it.
https://coral.withgoogle.com/news/updates-07-2019/
There still are many trivial bugs in
TensorFlow
• There are many typos in comments of TensorFlow code
• Many things are not well-documented
• There are many, many warnings when building TensorFlow from source code
• a trivial fix in May, 2019 by me
60
https://github.com/tensorflow/tensorflow/pull/28618
Concluding Remarks
• Deep learning on devices is here to stay. You can see some applications nowadays. More to come.

• TensorFlow, including Lite, is under active development.
Documentation is improving. Opportunities to contribute
are still there
61
The End
62
Appendix / backup
slides
63
Transistor–Transistor Logic (TTL)
https://en.wikipedia.org/wiki/Transistor–transistor_logic
https://upload.wikimedia.org/wikipedia/commons/thumb/3/3f/68k_ttl.jpg/600px-68k_ttl.jpg
64
[1] https://www.slideshare.net/kstan2/tensorflow-on-android

[2] https://www.slideshare.net/kstan2/introduction-to-tensorflow-lite

[3] https://www.slideshare.net/kstan2/caffe2-on-android

[4] https://www.slideshare.net/kstan2/open-source-nn-frameworks-
on-cellphones

[5] https://www.slideshare.net/kstan2/why-you-cannot-use-neural-
engine-to-run-your-nn-models-on-a11-devices

[6] https://www.slideshare.net/kstan2/a-peek-into-googles-edge-tpu
65
https://www.amazon.com/
Computational-Aspects-Principles-
Computer-Science/dp/0914894951
Google Translate
66
Google Lens
67
68
Google Lens
⼦曰:「⼩⼦何莫學夫詩︖詩,可以興,可以觀,可以群,可以怨。邇之事⽗,遠之事君︔多識於⿃獸草⽊之名。」(Analects 17.9; in English: “The Master said: ‘My children, why do you not study the Odes? The Odes can stimulate the mind, enable observation, foster fellowship, and give expression to grievances. Near at hand, one serves one’s father; farther off, one’s ruler; and from them one learns many names of birds, beasts, plants, and trees.’”)
69
Your phone personalizes the model locally, based on your usage (A).
Many users' updates are aggregated (B) to form a consensus change
(C) to the shared model, after which the procedure is repeated.
https://research.googleblog.com/2017/04/federated-learning-collaborative.html
70
tflite in gboard
Data:
•nwp
next-word-predictor/
next-word-predictor/tflite-nwp-20180920
next-word-predictor/tflite-nwp-20180920/nwp.uint8.tflite
next-word-predictor/tflite-nwp-20180920/nwp.syms
next-word-predictor/pie-nwp-20180807
next-word-predictor/pie-nwp-20180807/nwp.syms
next-word-predictor/pie-nwp-20180807/nwp.uint8.data
•Emoji
./emoji-predictor
./emoji-predictor/tflite-emoji-pred-
a69f4f3dd1a865206f8a5f8cdcd9f6d6
./emoji-predictor/tflite-emoji-pred-
a69f4f3dd1a865206f8a5f8cdcd9f6d6/emoji_pred.scale.csv
./emoji-predictor/tflite-emoji-pred-
a69f4f3dd1a865206f8a5f8cdcd9f6d6/emoji_pred.emoji.syms
./emoji-predictor/tflite-emoji-pred-
a69f4f3dd1a865206f8a5f8cdcd9f6d6/emoji_pred.uint8.tflite
./emoji-predictor/tflite-emoji-pred-
a69f4f3dd1a865206f8a5f8cdcd9f6d6/emoji_pred.token.syms
• Next-word-predictor
and emoji predictor
seem to be TFLite
based and using uint8
model
• However, .tflite here is
not real flatbuffer .tflite
• Seems to be from this
paper [1]
[1] https://arxiv.org/abs/1811.03604
71
Gboard: Chinese input methods
seem to be HMM-based
• As the name suggests, it could be HMM (Hidden Markov Model) and n-gram based
• Does HMM and n-gram work with federated learning?
72
• All-neural on-device
Recognizer [1]

• Live caption [2], announced in
Google I/O 2019

• [1] https://ai.googleblog.com/
2019/03/an-all-neural-on-
device-speech.html

• [2] https://www.youtube.com/
watch?v=hPv1PkjJ-J0
73
label_image for TFLite
• https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/examples/label_image/

• https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/examples/label_image/label_image.md

• Runs a TFLite single-input, single-output classifier model, e.g., MobileNet V1, so that we can verify whether the classifier works

• What does it do

• read an image: unlike TF, there is no image decoder in TF Lite, so I wrote a simple .bmp decoder

• resize the input image to a specific size, e.g., 224x224 or 299x299

• convert the image tensor to floating point if necessary

• load the classifier

• prepare tensors

• run the model

• process the output

• top-k labels
74
Speed of Quantized Models
• It seems it's much better than naive quantization as we saw before (in TensorFlow before TFLite)

• On Nexus 9 (MobileNet 1.0/224)

• Quantized

• ./label_image -t 2: ~ 160 ms

• ./label_image -t 2 -c 100: ~ 60 ms

• Floating point

• ./label_image -t 2 -m ./mobilenet_v1_1.0_224.tflite: ~ 300 ms

• ./label_image -t 2 -c 100 -m ./mobilenet_v1_1.0_224.tflite: ~ 82 ms

• Pixel 2 Quantized

• CPU 

• single thread: as is: ~ 90 ms, controlled env: ~ 70 ms

• 4 threads: ~ 30 ms

• HVX: ~ 12 ms
75
Fake Quantization in Early
Dec, 2017
• How hard can it be? How much time is needed?

• Several pre-tested models are available

• https://github.com/tensorflow/tensorflow/blob/master/
tensorflow/contrib/lite/g3doc/models.md

• but only one of them (https://storage.googleapis.com/download.tensorflow.org/models/tflite/mobilenet_v1_224_android_quant_2017_11_08.zip) is a quantized one

• as we can guess from related docs, retraining is kinda required to get accuracy back
76
Fake Quantization in early
Nov, 2018
• Documents

• a paper at Arxiv: https://arxiv.org/abs/1712.05877

• white paper: https://arxiv.org/abs/1806.08342

• Code, e.g.,

• TF fake quant

• SLIM (https://github.com/tensorflow/models/blob/master/research/slim/train_image_classifier.py#L519-
L521), object-detection (e.g., https://github.com/tensorflow/models/blob/master/research/
object_detection/samples/configs/ssd_mobilenet_v2_quantized_300x300_coco.config#L196-L201), etc.

• models: many quantized models

• classifiers: all MobileNet V1, some MobileNet V2 and others (https://www.tensorflow.org/lite/models)

• others, e.g.,

• Object-detection: e.g., MobileNet-SSD

• Semantic segmentation: DeepLab V3
77
TfLiteQuantizationParams
typedef struct {
float scale;
int32_t zero_point;
} TfLiteQuantizationParams;
https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/context.h#L165-L171
r = S(q − Z)
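A hedged numeric example of that mapping (values made up for illustration): with scale S = 0.0235 and zero_point Z = 128, a quantized q = 200 (uint8) dequantizes to r = 0.0235 × (200 − 128) = 1.692, and q = Z = 128 maps exactly to r = 0.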
78
Note that the biases are not quantized because they are
represented as 32-bit integers in the inference process, with
a much higher range and precision compared to the 8 bit
weights and activations. Furthermore, quantization param-
eters used for biases are inferred from the quantization pa-
rameters of the weights and activations. See section 2.4.
Typical TensorFlow code illustrating use of [19] follows:
from tf.contrib.quantize import quantize_graph as qg

g = tf.Graph()
with g.as_default():
  output = ...
  total_loss = ...
  optimizer = ...
  train_tensor = ...
if is_training:
  quantized_graph = qg.create_training_graph(g)
else:
  quantized_graph = qg.create_eval_graph(g)
# Train or evaluate quantized_graph.
3.2. Batch normalization folding
For models that use batch normalization (see [17]), there is additional complexity: the training graph contains batch normalization as a separate block of operations, whereas the inference graph has batch normalization parameters “folded” into the convolutional or fully connected layer’s […]
how to use fake quant
[Figure 1.1 diagrams from the paper: (a) integer-arithmetic-only inference of a convolution layer, with uint8 weights/activations and uint32 biases/accumulators feeding a ReLU6; (b) training with simulated quantization, with “wt quant” and “act quant” nodes injected; (c) ImageNet latency (ms) vs. top-1 accuracy tradeoff, float vs. 8-bit]
Figure 1.1: Integer-arithmetic-only quantization. a) Integer-arithmetic-only inference of a convolution layer. The input and output
are represented as 8-bit integers according to equation 1. The convolution involves 8-bit integer operands and a 32-bit integer accumulator.
The bias addition involves only 32-bit integers (section 2.4). The ReLU6 nonlinearity only involves 8-bit integer arithmetic. b) Training
with simulated quantization of the convolution layer. All variables and computations are carried out using 32-bit floating-point arithmetic.
Weight quantization (“wt quant”) and activation quantization (“act quant”) nodes are injected into the computation graph to simulate the
effects of quantization of the variables (section 3). The resultant graph approximates the integer-arithmetic-only computation graph in panel
a), while being trainable using conventional optimization algorithms for floating point models. c) Our quantization scheme benefits from
the fast integer-arithmetic circuits in common CPUs to deliver an improved latency-vs-accuracy tradeoff (section 4). The figure compares
integer quantized MobileNets [10] against floating point baselines on ImageNet [3] using Qualcomm Snapdragon 835 LITTLE cores.
[1] https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/qu
README.md
[2] https://arxiv.org/abs/1712.05877
[3] https://arxiv.org/abs/1806.08342
79
example of depthwise
convolution with fake quant
80
Real computation
• BLAS part: Eigen (http://eigen.tuxfamily.org/) and gemmlowp
(https://github.com/google/gemmlowp)

• Some Caveats

• convolutions are multithreaded

• uint8/gemm: 1

• float32/Eigen: 4

• depthwise convolutions are single threaded

• problems: big.LITTLE, number of cores, scheduling
81
knowing more to squeeze
performance
• Memory management: to get reasonably good performance when running highly parallel workloads on mobile devices, you need a good enough mechanism

• Profiling: there is a simple profiling mechanism in TF Lite since Apr, 2018

• time profiling only now. how about memory stuff?

• static buffer size: https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/
contrib/lite/profiling/profiler.h#L80

• https://github.com/tensorflow/tensorflow/tree/r1.10/tensorflow/contrib/lite/profiling

• Computation of quantized uint8

• when you want to do some operations on tensors, the scale and zero point could change. How to do that efficiently?

• Post-training quantization: https://www.tensorflow.org/lite/performance/
post_training_quantization
82
Quick Intro to Caffe 2
• Caffe 2

• 2nd generation of Caffe, which was the most popular deep learning framework
(before TensorFlow) from Berkeley

• merged into PyTorch
• What's the difference? Caffe2 improves Caffe 1.0 in a series of directions:

• first-class support for large-scale distributed training

• mobile deployment
• new hardware support (in addition to CPU and CUDA)

• flexibility for future directions such as quantized computation

• stress tested by the vast scale of Facebook applications
83
https://caffe2.ai/docs/caffe-migration.html
Caffe2 backends for
Android I know
• ARM CPU:

• NNPACK, Eigen: quite mature

• QNNPACK: looks good, (https://code.fb.com/ml-applications/qnnpack/)

• OpenGL ES:

• OpenGL: not actively maintained (?)

• ARM Compute Library (GL ES part): stalled? 18.01

• NEON, and OpenCL

• NNAPI: stalled? NNAPI 1.0 (Oreo 8.1 API 27), not fully integrated yet

• iOS: MPS (Metal Performance Shaders) backend
84
More open source frameworks
• Yes, there are other frameworks, e.g.,

• MACE from XiaoMi: https://github.com/XiaoMi/mace,

• ncnn from Tencent: https://github.com/Tencent/ncnn,

• ONNX runtime from Microsoft, https://github.com/
microsoft/onnxruntime, 

• TVM stack, https://tvm.ai

• So far, the TF/TFLite ecosystem is the largest one
85
Beyond Open Source
• Apple CoreML

• https://developer.apple.com/
documentation/coreml

• Google ML Kit

• https://developers.google.com/ml-kit/

• image labeling, OCR, face detection, bar
code scanning, landmark detection, etc.

• Custom models in TF Lite

• Qualcomm Snapdragon Neural Processing
Engine (SNPE)

• https://developer.qualcomm.com/software/
snapdragon-neural-processing-engine-ai

• Huawei HiAi DDK
86
https://aiyprojects.withgoogle.com/edge-tpu
https://www.anandtech.com/show/
13393/techinsights-publishes-
apple-a12-die-shot-our-take
Figure 7.38 Floor plan of the 8-core Pixel Visual Core chip. A53 is an ARMv7 core. LPDDR4 is a DRAM controller.
PCIE and MIPI are I/O buses.
87
Figure 7.13 Example of systolic array in action, from top to bottom on the page. In this example, the six weights
are already inside the multiply-accumulate units, as is the norm for the TPU. The three inputs are staggered in time to
get the desired effect, and in this example are shown coming in from the top. (In the TPU, the data actually comes in
from the left.) The array passes the data down to the next element and the result of the computation to the right to the
next element. At the end of the process, the sum of products is found to the right. Drawings courtesy of Yaz Sato.
It seems Edge TPU is not TPU-like?
Figure 7.14 Systolic data flow of the Matrix Multiply Unit.
https://www.elsevier.com/books-and-journals/book-companion/9780128119051
88
Edge TPU and NCS 2
89
Inference times (ms):

device | MobileNet V1 1.0/224 | MobileNet V2 1.0/224 | Inception V3 | ResNet 50 | SqueezeNet 1.1 | MobileNet V1 0.25/128 | SSD MobileNet V1 COCO | SSD MobileNet V2 COCO
Coral: Edge TPU | 2.74 | 2.87 | 43.27 | 42.41 | 1.90 | 1.11 | 10.05 | 12.48
NCS 2 (fp16) | 12.11 | 14.87 | 52.25 | 33.1 | 3.99 | 4.08 | 23.53 | 39.11
iPhone Xs Max (Neural Engine accelerated, fp16) | 1.74 | 2.15 | 8.65 | 6.91 | 1.75 | 1.16 | - | -
Mobilenet V1/V2 and SSD Mobilenet V1/V2 are quite good
• Edge TPU: my scripts, https://github.com/freedomtan/edge_tpu_python_scripts
• NCS 2: ./benchmark_app-d MYRIAD -niter 50 -nireq 10 ..
• iPhone Xs Max: my CoreML benchmark, https://github.com/freedomtan/coremlbenchmark
[bar chart, time (ms), 0 to 14: “Mobilenet V1: Edge TPU and NCS2”, comparing NCS2 and Coral for mobilenet_v1 depth multipliers 0.25/0.5/0.75/1.0; the numbers are in the table that follows]
Mobilenet V1 on EdgeTPU
and NCS2
90
inference time (ms) | size=128x128 | size=160x160 | size=192x192 | size=224x224
ncs2 mobilenet_v1_0.25 | 3.83 | 3.95 | 4.06 | 4.4
ncs2 mobilenet_v1_0.5 | 4.98 | 4.86 | 5.51 | 6.51
ncs2 mobilenet_v1_0.75 | 6.04 | 6.67 | 7.93 | 9.4
ncs2 mobilenet_v1_1.0 | 7.43 | 8.68 | 10.13 | 12.2
coral mobilenet_v1_0.25 | 1.07 | 1.24 | 1.30 | 1.47
coral mobilenet_v1_0.5 | 1.16 | 1.40 | 1.53 | 1.95
coral mobilenet_v1_0.75 | 1.29 | 1.70 | 1.80 | 2.16
coral mobilenet_v1_1.0 | 1.50 | 1.95 | 2.15 | 2.85
[graph visualization: imprinting model head: input 1×224×224×3 → edgetpu-custom-op → 1×1×1×1024 embedding → L2Normalization → Conv2D (weights 5×1×1×1024, bias 5) → Reshape → Softmax → 1×5 output]
Imprinting Engine
• Yes, let’s check what it is

• The Imprinting Engine implements a low-shot learning technique
called ‘Imprinted Weights’ [1][2]

• Can be used to retrain classifiers on-device (on either the USB Accelerator or the Dev Board); no back-propagation of gradients involved.

• Why?

• Transfer learning happens on-device, at near-realtime speed.

• You don't need to recompile the model.

• Limitations

• Training data size is limited to a max of 200 images per class.

• It is most suitable for datasets with small intra-class variation.

• The last fully-connected layer runs on the CPU, not the Edge TPU, so it will be slightly less efficient than running a fully precompiled model on the Edge TPU.

• if you are interested in it, check the paper and aiy::learn::imprinting::ImprintingEngine::Train(unsigned char const*, int, int)
91
[1] https://coral.withgoogle.com/docs/edgetpu/retrain-classification-ondevice/

[2] https://arxiv.org/abs/1712.07136
[graph visualization: embedding extractor: input 1×224×224×3 → edgetpu-custom-op → AvgPool → 1×1×1×1024]
EfficientNet
• EfficientNet-B0:

• much lower FLOPs than MobileNet V1, and much higher accuracy

• vs. MobileNet V2: a bit more FLOPs, but much higher accuracy
http://ai.googleblog.com/2019/05/efficientnet-improving-accuracy-and.html
92
EfficientNet-B0 floating point
93
EfficientNet-B0 fixed point
94
Depthwise Separable Convolution
• CNNs with depthwise separable convolution such as Mobilenet [1]
changed almost everything

• Depthwise separable convolution “factorizes” a standard convolution into a depthwise convolution and a 1 × 1 convolution called a pointwise convolution, thus greatly reducing computational complexity.

• Depthwise separable convolution is not that new [2], but pure depthwise separable convolution-based networks such as Xception and MobileNet demonstrated its power

[1] https://arxiv.org/abs/1704.04861

[2] L. Sifre. “Rigid-motion scattering for image classification”, PhD thesis, 2014
95
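For reference, the cost arithmetic from [1] behind that claim, where DK is the kernel size, DF the feature-map size, and M/N the input/output channel counts:

standard convolution: DK · DK · M · N · DF · DF multiply-adds
depthwise separable: DK · DK · M · DF · DF + M · N · DF · DF
ratio: 1/N + 1/DK^2

For 3×3 kernels, this is roughly 8x to 9x less computation.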
[figure from https://arxiv.org/abs/1704.04861: N standard convolution filters of size DK×DK×M factorized into M depthwise convolution filters of size DK×DK×1 plus N 1×1×M pointwise convolution filters]
Depthwise Separable Convolution
96
97

More Related Content

More from Koan-Sin Tan

Why You Cannot Use Neural Engine to Run Your NN Models on A11 Devices?
Why You Cannot Use Neural Engine to Run Your NN Models on A11 Devices?Why You Cannot Use Neural Engine to Run Your NN Models on A11 Devices?
Why You Cannot Use Neural Engine to Run Your NN Models on A11 Devices?Koan-Sin Tan
 
open source nn frameworks on cellphones
open source nn frameworks on cellphonesopen source nn frameworks on cellphones
open source nn frameworks on cellphonesKoan-Sin Tan
 
Introduction to TensorFlow Lite
Introduction to TensorFlow Lite Introduction to TensorFlow Lite
Introduction to TensorFlow Lite Koan-Sin Tan
 
Tensorflow on Android
Tensorflow on AndroidTensorflow on Android
Tensorflow on AndroidKoan-Sin Tan
 
SoC Idling for unconf COSCUP 2016
SoC Idling for unconf COSCUP 2016SoC Idling for unconf COSCUP 2016
SoC Idling for unconf COSCUP 2016Koan-Sin Tan
 
A peek into Python's Metaclass and Bytecode from a Smalltalk User
A peek into Python's Metaclass and Bytecode from a Smalltalk UserA peek into Python's Metaclass and Bytecode from a Smalltalk User
A peek into Python's Metaclass and Bytecode from a Smalltalk UserKoan-Sin Tan
 
Android Wear and the Future of Smartwatch
Android Wear and the Future of SmartwatchAndroid Wear and the Future of Smartwatch
Android Wear and the Future of SmartwatchKoan-Sin Tan
 
Understanding Android Benchmarks
Understanding Android BenchmarksUnderstanding Android Benchmarks
Understanding Android BenchmarksKoan-Sin Tan
 
Dark Silicon, Mobile Devices, and Possible Open-Source Solutions
Dark Silicon, Mobile Devices, and Possible Open-Source SolutionsDark Silicon, Mobile Devices, and Possible Open-Source Solutions
Dark Silicon, Mobile Devices, and Possible Open-Source SolutionsKoan-Sin Tan
 
Smalltalk and ruby - 2012-12-08
Smalltalk and ruby  - 2012-12-08Smalltalk and ruby  - 2012-12-08
Smalltalk and ruby - 2012-12-08Koan-Sin Tan
 

More from Koan-Sin Tan (11)

Why You Cannot Use Neural Engine to Run Your NN Models on A11 Devices?
Why You Cannot Use Neural Engine to Run Your NN Models on A11 Devices?Why You Cannot Use Neural Engine to Run Your NN Models on A11 Devices?
Why You Cannot Use Neural Engine to Run Your NN Models on A11 Devices?
 
open source nn frameworks on cellphones
open source nn frameworks on cellphonesopen source nn frameworks on cellphones
open source nn frameworks on cellphones
 
Caffe2 on Android
Caffe2 on AndroidCaffe2 on Android
Caffe2 on Android
 
Introduction to TensorFlow Lite
Introduction to TensorFlow Lite Introduction to TensorFlow Lite
Introduction to TensorFlow Lite
 
Tensorflow on Android
Tensorflow on AndroidTensorflow on Android
Tensorflow on Android
 
SoC Idling for unconf COSCUP 2016
SoC Idling for unconf COSCUP 2016SoC Idling for unconf COSCUP 2016
SoC Idling for unconf COSCUP 2016
 
A peek into Python's Metaclass and Bytecode from a Smalltalk User
A peek into Python's Metaclass and Bytecode from a Smalltalk UserA peek into Python's Metaclass and Bytecode from a Smalltalk User
A peek into Python's Metaclass and Bytecode from a Smalltalk User
 
Android Wear and the Future of Smartwatch
Android Wear and the Future of SmartwatchAndroid Wear and the Future of Smartwatch
Android Wear and the Future of Smartwatch
 
Understanding Android Benchmarks
Understanding Android BenchmarksUnderstanding Android Benchmarks
Understanding Android Benchmarks
 
Dark Silicon, Mobile Devices, and Possible Open-Source Solutions
Dark Silicon, Mobile Devices, and Possible Open-Source SolutionsDark Silicon, Mobile Devices, and Possible Open-Source Solutions
Dark Silicon, Mobile Devices, and Possible Open-Source Solutions
 
Smalltalk and ruby - 2012-12-08
Smalltalk and ruby  - 2012-12-08Smalltalk and ruby  - 2012-12-08
Smalltalk and ruby - 2012-12-08
 

Recently uploaded

Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 

Recently uploaded (20)

Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 

Status quo of tensor flow lite on edge devices coscup 2019

  • 1. Status Quo of TensorFlow Lite on Edge Devices Koan-Sin Tan freedom@computer.org Aug 17th, 2019 COSCUP, Taipei, Taiwan 1
  • 2. • disclaimer: Opinions Are My Own • feel free to interrupt me if you have any questions 2
  • 3. who i am • Used open source before the term “open source” is used • A software guy, learned to use Unix and open source software on VAX-11/780 running 4.3BSD • Used to be a programming language junkie • Worked on various system software, e.g., CPU scheduling and power management of non- CPU components • Recently, on NN performance on edge devices related stuff • Contributed from time to time to TensorFlow Lite • started a command line label_image for TFLite https://github.com/tensorflow/tensorflow/releases/tag/v2.0.0-alpha0 http://gunkies.org/w/images/c/c1/DEC-VAX-11-780.jpg 3
  • 4. Outline • overview: or, say, why TFLite • new features • delegates: including new NNAPI delegate, GPU delegate, and flex delegate, • optimized kernels for ARM CPUs, • various APIs: including Python, C, Objective-C, and Swift ones, and • misc, e.g., graph writer and Edge TPU. 4
  • 5. Why TFLite? • TensorFlow Lite • TensorFlow is the most popular machine learning frameworks • TFLite: a lightweight runtime for edge devices • could be accelerated by GPU, DSP, or ASIC accelerators • PyTorch is catching up, but acceleration part is still lagging far behind TFLite • Yes, there are other open source NN frameworks. No one is as comprehensive as TF Lite, as far as I can tell 5
  • 7. Why NN on edge device, esp. cell phones? • Offline usages • Latency • Bandwidth • Privacy • Sensors 7
  • 8. Offline usage • we heard words such as “always-on” and “always- connected” back to 3G days 🤔, but wireless communications is so unreliable 8
  • 9. latency • “There is an old network saying: Bandwidth problems can be cured with money.  Latency problems are harder because the speed of light is fixed — you can't bribe God.” -- David D. Clark, MIT 9 https://en.wikipedia.org/wiki/David_D._Clark
  • 10. Bandwidth • Well, bandwidth of wireless network is not easy problem either • consider you have NN-based “portrait model” (or say Bokeh effect) on iPhone Xs Max (12 + 12 MP) • if we send raw image (12+12)*10^6*(3*8) = 576 M bits • 576 * 30 ~= 17.3 G bits • you know this is not feasible for now 10
  • 11. Privacy • you know you need privacy for both your physical body and your mobile device(s) 11
  • 12. NN-based ML is already in cell phones • Google I/O 2017: Mobile First —> AI First • TensorFlow Lite, Android Neural Network API • Lots of stuff from Google blogs and papers, e.g., Google Lens, federated learning in Gboard • Pixel Visual Core in Pixel 2/3, 2/3 XL: although it seems there is no way for developers to use it as a general NN accelerator • Apple announced CoreML, a machine framework, at WWDC 2017 (June 2017) • Apple’s machine learning journal (https://machinelearning.apple.com/): how Apple uses CNN and other machine techniques in iPhone • Neural Engine in A11/A11X/A12/A12X, available to developers via Core ML on A12 devices • Computer Architecture: A Quantitative Approach, 6th Ed. (Nov, 2017) has a whole new chapter on Domain Specific Architecture, actually NN accelerators. 12
  • 13. actually there are many NNAPI- enabled phones already http://ai-benchmark.com/ranking_processors.html mid June, 2019 13
  • 15. https://www.anandtech.com/show/13392/the-iphone-xs-xs-max-review-unveiling-the- silicon-secrets/5 • AnandTech is one the my favorite tech sites. Usually, it provides good analysis • E.g., Apple’s CPUs • cache sizes • execution units • various instruction latency • Not good enough for NN accelerators on mobile phones • floating-point VGG16, Inception V3, and ResNet34? • come on, are you still in Neolithic era? Evolving fast: the slide I prepared Nov, 2018 15
  • 16. TF Lite in Android Pie • There are ‘libtflite.so’s in /system/lib and /system/lib64 • https://source.android.com/devices/tech/display/textclassifier 16
  • 17. More TFLite use cases 17
  • 18. Some TFLite clients presented by TFLite guys 18
  • 19. ML Kit • https://developers.google.com/ml-kit/, part of Firebase • Originally, only custom models were TFLite • Now, as far as I can tell, the vision parts are using TFLite too https://developers.google.com/ml-kit/ 19
  • 20. • see appendix for Google Translate, Google Lens, Gboard, and others 20
  • 21. Some Progress That Made NN on Edge Devices Really Viable • "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size" [1]. A keynote at ESWEEK 2017, "Keynote: Small Neural Nets Are Beautiful: Enabling Embedded Systems with Small Deep-Neural-Network Architectures" [2] • MobileNet V1 [3] and V2 [4]: depthwise separable convolution [5] and inverted residuals and linear bottlenecks [4] • AutoML, e.g., • NASNet Mobile [6] and MnasNet [7] • MobileNet V3 [10] and EfficientNet [11] • Quantization [8][9] • How about pruning / compression stuff? As far as I know, not widely used yet [1] https://arxiv.org/abs/1602.07360 [2] https://arxiv.org/abs/1710.02759 [3] https://arxiv.org/abs/1704.04861 [4] https://arxiv.org/abs/1801.04381 [5] https://www.di.ens.fr/data/publications/papers/phd_sifre.pdf [6] https://arxiv.org/abs/1707.07012 [7] https://ai.googleblog.com/2018/08/mnasnet-towards-automating-design-of.html, https://arxiv.org/abs/1807.11626 [8] https://arxiv.org/abs/1712.05877 [9] https://arxiv.org/abs/1806.08342 [10] https://arxiv.org/abs/1905.02244 [11] https://arxiv.org/abs/1905.11946 21
  • 22. • Michael Jordan published an article on Medium named "Artificial Intelligence — The Revolution Hasn't Happened Yet" [1] • Yes, but current deep-learning-driven stuff should be enough for the next few years [1] https://medium.com/@mijordan3/artificial-intelligence-the-revolution-hasnt-happened-yet-5e1d5812e1e7 22
  • 23. Why I Started Learning TF Lite • We heard about Android NN and TensorFlow Lite back in Google I/O 2017 • My COSCUP 2017 slide deck "TensorFlow on Android" • https://www.slideshare.net/kstan2/tensorflow-on-android • People knew a bit about the Android NN API before it was announced and released • No information about TensorFlow Lite, at least to me, before it was released in Nov, 2017 23
  • 24. Quantization and Accelerators • Quantization • Quantization is not new; people have known there is lots of redundancy in NN models since pre-DNN days. Many quantization and compression/pruning techniques have been presented over the years. TFLite and its underlying gemmlowp (and NNAPI) made the first production-quality system that supports quantized uint8. • accelerators (thru NNAPI in the beginning, and directly later) • CPU is not always the best one to run NN models • GPU, DSP, and other accelerators 24
  • 25. TFLite and Android NN in Google I/O 2017 • New TensorFlow runtime • Optimized for mobile and embedded apps • Runs TensorFlow models on device • Leverages Android NN API • Soon to be open sourced (from the Google I/O 2017 video) 25
  • 26. Actual Android NN API • Announced/published with Android 8.1 Preview 1 • Available to developers in the NDK • yes, the NDK • The Android Neural Networks API (NNAPI) is an Android C API designed for running computationally intensive operations for machine learning on mobile devices • NNAPI is designed to provide a base layer of functionality for higher-level machine learning frameworks (such as TensorFlow Lite, Caffe2, or others) that build and train neural networks • The API is available on all devices running Android 8.1 (API level 27) or higher. https://developer.android.com/ndk/images/nnapi/nnapi_architecture.png 26
  • 27. Android NN on Pixel 2 • Only the CPU fallback was available on Oreo MR1 • Actually, you can see Android NN API related code in AOSP after the Oreo MR1 (8.1) release already • user level code, see https://android.googlesource.com/platform/frameworks/ml/+/oreo-mr1-release • HAL, see https://android.googlesource.com/platform/hardware/interfaces/+/oreo-mr1-release/neuralnetworks/ • There is NN API 1.1 on Android Pie • https://developer.android.com/about/versions/pie/android-9.0#nnapi • adding support for nine new ops — Pad, BatchToSpaceND, SpaceToBatchND, Transpose, Strided Slice, Mean, Div, Sub, and Squeeze • In the Android P DP1/2 (https://developer.android.com/preview/download.html), there was an HVX NN API 1.0 (yes, 1.0) driver. Gone after DP2. Not in recent Pie releases. (See https://android.googlesource.com/platform/hardware/qcom/neuralnetworks/hvxservice/ for source code) • NN API 1.2, which supports 90+ ops, is in AOSP and will be in the forthcoming Android Q (version 10) 27
  • 28. So NNAPI accelerators don't work? • Yes, I don't know what happened to earlier Pixel phones • I don't have a Pixel 3 to try • Q beta 4 for Pixel 3a comes with a working HVX accelerator driver. It's an NNAPI 1.1 one though. • And remember what I showed in pp. 13 and 14: there are many NNAPI-enabled phones already 28
  • 29. Original TFLite APIs • Java API: A convenience wrapper around the C++ API on Android • C++ API: loads the TensorFlow Lite model file and invokes the Interpreter. The same library is available on both Android and iOS https://www.tensorflow.org/mobile/tflite/ 29
  • 30. Other bindings • Python and C APIs • Python: introduced in TF 1.8.0, built into pip package in 1.9.0 • my label_image.py for tflite merged on Aug 9, 2018 • https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/examples/python/label_image.py • https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/examples/python • C API: introduced for Unity • https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/experimental/c 30
  • 31. How to Use It • TFLite guys work hard • documentation getting better and better over < 2 yrs • yes, sometimes you still have to "use the source" https://www.tensorflow.org/lite 31
  • 33. Basic Usage • model: .tflite model • resolver: if no custom ops, the builtin op resolver is enough • interpreter: we need it to compute the graph • interpreter->AllocateTensors(): allocates stuff for you, e.g., input tensor(s) • fill the input • interpreter->Invoke(): run the graph • process the output

  // slightly cleaned up from the slide: the model is built with
  // FlatBufferModel::BuildFromFile(), and the output accessor is
  // typed_output_tensor, not type_output_tensor
  auto model = tflite::FlatBufferModel::BuildFromFile(path_to_model);
  tflite::ops::builtin::BuiltinOpResolver resolver;
  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder(*model, resolver)(&interpreter);
  // Resize input tensors, if desired.
  interpreter->AllocateTensors();
  float* input = interpreter->typed_input_tensor<float>(0);
  // Fill `input`.
  interpreter->Invoke();
  float* output = interpreter->typed_output_tensor<float>(0);

  33
  • 34. more source code • Check my COSCUP 2018 slide deck, which was for a talk in a source code reading track, for more details • https://www.slideshare.net/kstan2/open-source-nn- frameworks-on-cellphones • And I’ll have a more code-oriented talk on TFLite delegates tomorrow 34
  • 36. NNAPI Delegate • Previously, when a graph was delegated to NNAPI, it was kinda invisible to TFLite • With the recent NNAPI delegate rewrite, it's an op in TFLite now • subgraph • all-or-nothing —> per op • [figure: two visualizations of MobileNet V1 (1×224×224×3 input, 1×1001 output): on one side the whole model collapsed into a single TfLiteNnapiDelegate op, on the other the original per-op graph of Conv2D/DepthwiseConv2D/AveragePool2D/Squeeze/Softmax nodes] 36 http://localhost:8080/, http://localhost:8090/ (graph visualizations served locally)
  • 37. More Delegates • Flex Delegate • The set of ops supported by TFLite builtins is relatively limited; TensorFlow Lite models can now use a subset of TensorFlow ops when TFLite builtin ops are not sufficient • GPU backend: no, not NNAPI • OpenGL ES 3.1 Compute Shaders on Android devices • Metal Compute Shaders on iOS devices • "in general the new GPU backend performs 2–7x faster than the floating point CPU implementation for a wide range of diverse deep neural network models." https://www.tensorflow.org/lite/using_select_tf_ops https://medium.com/tensorflow/tensorflow-lite-now-faster-with-mobile-gpus-developer-preview-e15797e6dee7 https://www.tensorflow.org/lite/performance/gpu https://www.tensorflow.org/lite/performance/gpu_advanced 37
  • 38. Why a non-NNAPI delegate? https://developer.android.com/about/dashboards NNAPI-enabled devices ~7.5% around the end of Oct, 2018 38
  • 39. NNAPI-enabled devices ~ 25.8% around May 7, 2019 https://developer.android.com/about/dashboards 39
  • 40. GL ES compute shader capable devices ~ 50% https://developer.android.com/about/dashboards 40
  • 41. GPU Delegate Performance • my quick and dirty benchmarks • Android: https://github.com/freedomtan/glDelegateBench • iOS: https://github.com/freedomtan/glDelegateBenchmark/ • at first, the GPU delegate was binary release only (aar for Android; pod for iOS) • after the release of GPU delegate source code, benchmark_model and label_image are able to use the GPU delegate 41
  • 42. GPU delegate kernels • Recently, the TFLite GPU delegate guys published a paper on how they designed it, covering some interesting details • GPU backends require initialization involving shader compilation and optimization by the driver before inference • PHWC4: P stands for plane • Reshape is expensive on GPU • RGBA is better than RGB on GPU • a tensor of shape [B,H,W,5], for instance, is twice as expensive as [B,H,W,4], but about the same as [B,H,W,8]; a network architect can thus tune around those 4-channel boundaries rather than trying to optimize on other boundaries • https://arxiv.org/pdf/1907.01989.pdf
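  To make the 4-channel alignment point concrete, here is a tiny sketch in plain Python; phwc4_channels is a made-up helper name for illustration, not a TFLite API:

    import math

    # channels get padded up to a multiple of 4 (RGBA-style storage),
    # so compute cost tracks the padded channel count, not the nominal one
    def phwc4_channels(c):
        return 4 * math.ceil(c / 4)

    for c in (3, 4, 5, 8):
        print(c, "->", phwc4_channels(c))
    # prints 3 -> 4, 4 -> 4, 5 -> 8, 8 -> 8:
    # a [B,H,W,5] tensor costs about as much as [B,H,W,8]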
  • 43. Faster ARM CPU kernels • It's available now. Enabled by default for Android ARM64 since early June • https://github.com/tensorflow/tensorflow/commit/8924e67e034909bea0343631b9f9024c5a6da5c4 • ruy: • four tuned fixed-point kernels: big/LITTLE (out-of-order/in-order), w/ or w/o dot-product instructions • two tuned floating-point kernels 43
  • 44. More on ruy • matrix multiplication in AArch64 NEON • sdot based kernels for either out-of-order CPUs, e.g., CA76, or in-order CPUs, e.g., CA55r1 • non-sdot based kernels for either out-of-order CPUs, e.g., CA73, or in-order CPUs, e.g., CA53 • how the kernel is chosen: detection at run time instead of a hard-coded list (e.g., PyTorch cpuinfo) • sdot or not: see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/experimental/ruy/detect_dotprod.cc#L129-L157 • in-order or out-of-order: see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/experimental/ruy/tune.cc, esp., https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/experimental/ruy/tune.cc#L102-L124 • doesn't need to list all possibilities, so it can probably handle future cores. Still cannot deal with big.LITTLE cores • thread pool: it seems to scale better than the one previously in use, so multi-threaded floating-point performance is much better • before ruy, floating point used the Eigen thread pool; fixed-point used TFLite's thread pool 44
  • 45. Python API • TensorFlow Lite Optimizing Converter (TOCO) —> tflite_convert, mainly Python-wrapped C++ code • Python Interpreter: https://www.tensorflow.org/lite/convert/python_api#tensorflow_lite_python_interpreter_ • https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/g3doc/convert/python_api.md • https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/lite • I sent label_image.py (merged, https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/examples/python) and mobilenet_ssd. Tried others such as DeepLab V3 on RPI 3 B+. • Quick to test, and you can use OpenCV to do preprocessing and post-processing 45
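  A minimal sketch of the Python Interpreter API; the .tflite filename is a placeholder for a real local model file:

    import numpy as np
    import tensorflow as tf

    interpreter = tf.lite.Interpreter(model_path="mobilenet_v1_1.0_224.tflite")
    interpreter.allocate_tensors()

    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # feed a dummy tensor of the right shape and dtype
    data = np.zeros(input_details[0]['shape'], dtype=input_details[0]['dtype'])
    interpreter.set_tensor(input_details[0]['index'], data)
    interpreter.invoke()
    scores = interpreter.get_tensor(output_details[0]['index'])
    print(scores.shape)  # (1, 1001) for MobileNet V1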
  • 46. C API • https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/experimental/c/c_api.h • Started as a base for Unity, https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/experimental/examples/unity/TensorFlowLitePlugin • FFI via C is much easier than C++ • Who uses it? Objective-C and Swift APIs • my quick-and-dirty hacks for Pharo Smalltalk, https://github.com/freedomtan/libtensorflow-pharo-bindings/blob/libtensorflowlite_c_hacks/LibTensorFlow-Core/TensorFlowLiteCAPI.class.st 46
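  Since easy FFI is the point, here is a rough ctypes sketch against c_api.h; the shared-library name and model path are assumptions, while the TfLite* symbols are the ones in the experimental C API:

    import ctypes

    lib = ctypes.CDLL("libtensorflowlite_c.so")  # assumed library name
    lib.TfLiteModelCreateFromFile.restype = ctypes.c_void_p
    lib.TfLiteInterpreterCreate.restype = ctypes.c_void_p

    model = lib.TfLiteModelCreateFromFile(b"mobilenet_v1_1.0_224.tflite")
    # second argument is TfLiteInterpreterOptions*; NULL means defaults
    interp = lib.TfLiteInterpreterCreate(ctypes.c_void_p(model), None)
    lib.TfLiteInterpreterAllocateTensors(ctypes.c_void_p(interp))
    # ... copy input with TfLiteTensorCopyFromBuffer(), then:
    lib.TfLiteInterpreterInvoke(ctypes.c_void_p(interp))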
  • 47. Yes, Smalltalk Is Alive • Smalltalk is an object-oriented, dynamically typed, reflective programming language started in the 1970s • Alan Kay, the creator of Smalltalk, coined the term Object-Oriented Programming (OOP) • MVC, IDE, live programming http://pharo.org/web/files/teaser50.png 47
  • 49. There are more new things • For example, uP (microcontroller) support • See https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/experimental • TFLite Micro and uTensor • https://os.mbed.com/blog/entry/uTensor-and-Tensor-Flow-Announcement/ • Yes, RNN-based models, including LSTM, are not doing well (yet) 49
  • 50. Google I/O 2019 updates • new MLIR-based TF —> TFLite converter • improved CPU backend: ruy • on-device training: not ready yet? • control flow support • see more at https://www.youtube.com/watch?v=Jjm7MT6W0Dc 50
  • 52. MLIR: Multi-Level Intermediate Representation for Compiler Infrastructure 52 MLIR for TFLite Converter
  • 53. MLIR: Multi-Level Intermediate Representation for Compiler Infrastructure 53
  • 54. TF graphdef .pb -> TFLite flatbuffer .tflite • Build the TensorFlow MLIR related binaries: bazel build --config opt tensorflow/compiler/mlir/... • Get your model, e.g., wget http://download.tensorflow.org/models/mobilenet_v1_2018_02_22/mobilenet_v1_1.0_224.tgz • Convert it: ./bazel-bin/tensorflow/compiler/mlir/lite/tf_tfl_translate -tf-input-shapes=1,224,224,3 -tf-input-data-types=DT_FLOAT -tf-output-arrays=MobilenetV1/Predictions/Reshape_1 /tmp/mobilenet_v1_1.0_224_frozen.pb --tf-input-arrays=input -o /tmp/foo.tflite • Yes, it works like a charm. But not for quantized models: neither ./bazel-bin/tensorflow/compiler/mlir/lite/tf_tfl_translate -tf-input-shapes=1,224,224,3 -tf-input-data-types=DT_QUINT8 -tf-output-arrays=MobilenetV1/Predictions/Reshape_1 /tmp/mobilenet_v1_1.0_224_quant_frozen.pb --tf-input-arrays=input -o /tmp/bar.tflite nor ./bazel-bin/tensorflow/compiler/mlir/lite/tf_tfl_translate -tf-input-shapes=1,224,224,3 -tf-input-data-types=DT_FLOAT -tf-output-arrays=MobilenetV1/Predictions/Reshape_1 /tmp/mobilenet_v1_1.0_224_quant_frozen.pb --tf-input-arrays=input -o /tmp/bar.tflite --tf-inference-type=TF_QUINT8 works 54
  • 55. Google Edge TPU • Announced back in Google Next 2018 (July, 2018) • Available to general developers right before TensorFlow Dev Summit 2019 (Mar, 2019) • USB: Coral Accelerator • Dev Board: Coral Dev Board • More are coming, e.g., PCI-E Accelerator and SOM • Supported framework: TFLite https://coral.withgoogle.com/products/ 55
  • 56. Edge TPU Software • Updates released on April 11th, 2019 • Compiler: removed the restriction for specific architectures • New TensorFlow Lite C++ API • Updated Python API, mainly for multiple Edge TPUs • Updated Mendel OS and Mendel Development Tool (MDT) • Environmental Sensor Board, https://coral.withgoogle.com/products/environmental/ • May updates, May 29th, 2019 • Offline compiler • MDT update https://developers.googleblog.com/2019/04/updates-from-coral-new-compiler-and.html https://coral.withgoogle.com/news/updates-04-2019/ https://coral.withgoogle.com/news/updates-05-2019/ 56
  • 57. Edge TPU Software • July updates, July 24th, 2019 • Updated Edge TPU Compiler and runtime: support for models built using post-training quantization • Updated Edge TPU Python library • New on-device backpropagation API • Updated weight imprinting API • New TensorFlow Lite delegate for Edge TPU https://coral.withgoogle.com/news/updates-07-2019/ 57
  • 58. Edge TPU's canned models • all ops that could be offloaded are packed into one op • "The compiler creates a single custom op for all Edge TPU compatible ops; anything else stays the same" https://coral.withgoogle.com/docs/edgetpu/models-intro/ 58 • [figure: MobileNet V1 compiled to input -> edgetpu-custom-op -> Softmax; SSD MobileNet V1 compiled to edgetpu-custom-op -> TFLite_Detection_PostProcess with four output tensors]
  • 59. EdgeTPU Delegate • There is a dynamic delegate plugin interface now. Currently it's only used by the Edge TPU delegate https://coral.withgoogle.com/news/updates-07-2019/
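  With the plugin interface, a delegate .so can be loaded from Python; a sketch following Coral's docs, where 'libedgetpu.so.1' is the Linux name of the Edge TPU runtime and the model file is an assumed local Edge TPU-compiled model:

    import tensorflow as tf

    delegate = tf.lite.experimental.load_delegate('libedgetpu.so.1')
    interpreter = tf.lite.Interpreter(
        model_path='mobilenet_v1_1.0_224_quant_edgetpu.tflite',
        experimental_delegates=[delegate])
    interpreter.allocate_tensors()
    # invoke() now runs the edgetpu-custom-op on the Edge TPU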
  • 60. There still are many trivial bugs in TensorFlow • There are many typos in comments of TensorFlow code • Many things are not well-documented • There are many many warnings when building TensorFlow from source code • a trivial fix in May, 2019 by me 60 https://github.com/tensorflow/tensorflow/pull/28618
  • 61. Concluding Remarks • Deep learning on devices is here to stay. You can see some applications nowadays. More to come. • TensorFlow, including Lite, is under active development. Documentation is improving. Opportunities to contribute are still there 61
  • 65. [1] https://www.slideshare.net/kstan2/tensorflow-on-android [2] https://www.slideshare.net/kstan2/introduction-to-tensorflow-lite [3] https://www.slideshare.net/kstan2/caffe2-on-android [4] https://www.slideshare.net/kstan2/open-source-nn-frameworks- on-cellphones [5] https://www.slideshare.net/kstan2/why-you-cannot-use-neural- engine-to-run-your-nn-models-on-a11-devices [6] https://www.slideshare.net/kstan2/a-peek-into-googles-edge-tpu 65
  • 70. Your phone personalizes the model locally, based on your usage (A). Many users' updates are aggregated (B) to form a consensus change (C) to the shared model, after which the procedure is repeated. https://research.googleblog.com/2017/04/federated-learning-collaborative.html 70
  • 71. tflite in gboard Data: • nwp next-word-predictor/ next-word-predictor/tflite-nwp-20180920 next-word-predictor/tflite-nwp-20180920/nwp.uint8.tflite next-word-predictor/tflite-nwp-20180920/nwp.syms next-word-predictor/pie-nwp-20180807 next-word-predictor/pie-nwp-20180807/nwp.syms next-word-predictor/pie-nwp-20180807/nwp.uint8.data • Emoji ./emoji-predictor ./emoji-predictor/tflite-emoji-pred-a69f4f3dd1a865206f8a5f8cdcd9f6d6 ./emoji-predictor/tflite-emoji-pred-a69f4f3dd1a865206f8a5f8cdcd9f6d6/emoji_pred.scale.csv ./emoji-predictor/tflite-emoji-pred-a69f4f3dd1a865206f8a5f8cdcd9f6d6/emoji_pred.emoji.syms ./emoji-predictor/tflite-emoji-pred-a69f4f3dd1a865206f8a5f8cdcd9f6d6/emoji_pred.uint8.tflite ./emoji-predictor/tflite-emoji-pred-a69f4f3dd1a865206f8a5f8cdcd9f6d6/emoji_pred.token.syms • Next-word-predictor and emoji predictor seem to be TFLite based and using uint8 models • However, .tflite here is not a real flatbuffer .tflite • Seems to be from this paper [1] [1] https://arxiv.org/abs/1811.03604 71
  • 72. Gboard: Chinese input methods seem to be HMM-based • As the name suggests, it could be HMM (Hidden Markov Model) and n-gram based • Do HMM and n-gram work with federated learning? 72
  • 73. • All-neural on-device Recognizer [1] • Live Caption [2], announced in Google I/O 2019 • [1] https://ai.googleblog.com/2019/03/an-all-neural-on-device-speech.html • [2] https://www.youtube.com/watch?v=hPv1PkjJ-J0 73
  • 74. label_image for TFLite • https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/examples/label_image/ • https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/examples/label_image/label_image.md • Run a TFLite single-input, single-output classifier model, e.g., MobileNet V1, so that we can verify whether the classifier works • What does it do • read an image: unlike TF, there is no image decoder in TFLite, so I wrote a simple .bmp decoder • resize the input image to a specific size, e.g., 224x224 or 299x299 • convert the image tensor to floating point if necessary • load the classifier • prepare tensors • run the model • process the output • top-k labels 74
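  Roughly the same steps look like this in Python (a sketch: PIL replaces the .bmp decoder, the file names are placeholders, and the MobileNet-style input scaling is assumed):

    import numpy as np
    import tensorflow as tf
    from PIL import Image

    interpreter = tf.lite.Interpreter(model_path="mobilenet_v1_1.0_224.tflite")
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    _, height, width, _ = inp['shape']

    img = Image.open("grace_hopper.bmp").resize((width, height))
    data = np.expand_dims(np.asarray(img, dtype=np.float32), axis=0)
    data = (data - 127.5) / 127.5  # MobileNet-style scaling to [-1, 1]

    interpreter.set_tensor(inp['index'], data)
    interpreter.invoke()
    out = interpreter.get_output_details()[0]
    scores = np.squeeze(interpreter.get_tensor(out['index']))
    print(scores.argsort()[-5:][::-1])  # indices of the top-5 labels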
  • 75. Speed of Quantized Models • It seems it's much better than naive quantization as we saw before (in TensorFlow before TFLite) • On Nexus 9 (MobileNet 1.0/224) • Quantized • ./label_image -t 2: ~ 160 ms • ./label_image -t 2 -c 100: ~ 60 ms • Floating point • ./label_image -t 2 -m ./mobilenet_v1_1.0_224.tflite: ~ 300 ms • ./label_image -t 2 -c 100 -m ./mobilenet_v1_1.0_224.tflite: ~ 82 ms • Pixel 2 Quantized • CPU • single thread: as is: ~ 90 ms, controlled env: ~ 70 ms • 4 threads: ~ 30 ms • HVX: ~ 12 ms 75
  • 76. Fake Quantization in Early Dec, 2017 • How hard can it be? How much time is needed? • Several pre-tested models are available • https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/g3doc/models.md • but only one of them (https://storage.googleapis.com/download.tensorflow.org/models/tflite/mobilenet_v1_224_android_quant_2017_11_08.zip) is a quantized one • as we can guess from related docs, retraining is kinda required to get accuracy back 76
  • 77. Fake Quantization in early Nov, 2018 • Documents • a paper at arXiv: https://arxiv.org/abs/1712.05877 • white paper: https://arxiv.org/abs/1806.08342 • Code, e.g., • TF fake quant • SLIM (https://github.com/tensorflow/models/blob/master/research/slim/train_image_classifier.py#L519-L521), object-detection (e.g., https://github.com/tensorflow/models/blob/master/research/object_detection/samples/configs/ssd_mobilenet_v2_quantized_300x300_coco.config#L196-L201), etc. • Models: many quantized models • classifiers: all MobileNet V1, some MobileNet V2 and others (https://www.tensorflow.org/lite/models) • others, e.g., • object detection: e.g., MobileNet-SSD • semantic segmentation: DeepLab V3 77
  • 78. TfLiteQuantizationParams typedef struct { float scale; int32_t zero_point; } TfLiteQuantizationParams; https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/context.h#L165-L171 r = S(q − Z) 78
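  A worked example of r = S(q − Z); the scale and zero point below are made-up values mapping uint8 to roughly [-1, 1):

    import numpy as np

    scale, zero_point = 0.0078125, 128  # assumed S and Z

    def dequantize(q):           # r = S * (q - Z)
        return scale * (q.astype(np.int32) - zero_point)

    def quantize(r):             # q = round(r / S) + Z, clamped to uint8
        return np.clip(np.round(r / scale) + zero_point, 0, 255).astype(np.uint8)

    q = np.array([0, 128, 255], dtype=np.uint8)
    print(dequantize(q))             # [-1.0, 0.0, 0.9921875]
    print(quantize(dequantize(q)))   # round-trips back to [0, 128, 255]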
  • 79. From the quantization paper: "Note that the biases are not quantized because they are represented as 32-bit integers in the inference process, with a much higher range and precision compared to the 8-bit weights and activations. Furthermore, quantization parameters used for biases are inferred from the quantization parameters of the weights and activations." Typical TensorFlow code illustrating fake quantization:

    from tf.contrib.quantize import quantize_graph as qg

    g = tf.Graph()
    with g.as_default():
        output = ...
        total_loss = ...
        optimizer = ...
        train_tensor = ...
    if is_training:
        quantized_graph = qg.create_training_graph(g)
    else:
        quantized_graph = qg.create_eval_graph(g)
    # Train or evaluate quantized_graph.

  For models that use batch normalization there is additional complexity: the training graph contains batch normalization as a separate block of operations, whereas the inference graph has the batch normalization parameters "folded" into the convolutional or fully connected layer's weights. [figure 1.1 of the paper: (a) integer-arithmetic-only inference of a convolution layer, with uint8 weights and activations, uint32 biases and accumulator, and 8-bit ReLU6; (b) training with simulated quantization, where "wt quant" and "act quant" nodes are injected into the float graph; (c) ImageNet latency-vs-accuracy tradeoff of float vs. 8-bit MobileNets on Snapdragon 835 LITTLE cores] [1] https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/quantize/README.md [2] https://arxiv.org/abs/1712.05877 [3] https://arxiv.org/abs/1806.08342 79
  • 80. example of depthwise convolution with fake quant 80
  • 81. Real computation • BLAS part: Eigen (http://eigen.tuxfamily.org/) and gemmlowp (https://github.com/google/gemmlowp) • Some Caveats • convolutions are multithreaded • uint8/gemm: 1 • float32/Eigen: 4 • depthwise convolutions are single threaded • problems: big.LITTLE, number of cores, scheduling 81
  • 82. knowing more to squeeze performance • Memory management: to get reasonably good performance when running highly parallel workloads on mobile devices, you need a good enough mechanism • Profiling: there is a simple profiling mechanism in TF Lite since Apr, 2018 • time profiling only for now; how about memory stuff? • static buffer size: https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/profiling/profiler.h#L80 • https://github.com/tensorflow/tensorflow/tree/r1.10/tensorflow/contrib/lite/profiling • Computation of quantized uint8 • when you do some operations on tensors, scale and zero point could change; how to do it efficiently? • Post-training quantization: https://www.tensorflow.org/lite/performance/post_training_quantization 82
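  Post-training quantization is one converter flag away; a sketch with the TF 1.14/2.0-era API, where the SavedModel directory is a placeholder:

    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_saved_model("/tmp/mobilenet_saved_model")
    converter.optimizations = [tf.lite.Optimize.DEFAULT]  # quantize weights
    tflite_model = converter.convert()
    with open("/tmp/mobilenet_quant.tflite", "wb") as f:
        f.write(tflite_model)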
  • 83. Quick Intro to Caffe 2 • Caffe 2 • 2nd generation of Caffe, which was the most popular deep learning framework (before TensorFlow) from Berkeley • merged into PyTorch • What's the difference? Caffe2 improves Caffe 1.0 in a series of directions: • first-class support for large-scale distributed training • mobile deployment • new hardware support (in addition to CPU and CUDA) • flexibility for future directions such as quantized computation • stress tested by the vast scale of Facebook applications 83 https://caffe2.ai/docs/caffe-migration.html
  • 84. Caffe2 backends for Android I know • ARM CPU: • NNPACK, Eigen: quite mature • QNNPACK: looks good (https://code.fb.com/ml-applications/qnnpack/) • OpenGL ES: • OpenGL: not actively maintained (?) • ARM Compute Library (GL ES part): stalled at 18.01? • NEON and OpenCL • NNAPI: stalled? NNAPI 1.0 (Oreo 8.1, API 27), not fully integrated yet • iOS: MPS backend 84
  • 85. More open source frameworks • Yes, there are other frameworks, e.g., • MACE from XiaoMi: https://github.com/XiaoMi/mace, • ncnn from Tencent: https://github.com/Tencent/ncnn, • ONNX Runtime from Microsoft: https://github.com/microsoft/onnxruntime, • TVM stack: https://tvm.ai • So far, the TF/TFLite ecosystem is the largest one 85
  • 86. Beyond Open Source • Apple Core ML • https://developer.apple.com/documentation/coreml • Google ML Kit • https://developers.google.com/ml-kit/ • image labeling, OCR, face detection, bar code scanning, landmark detection, etc. • Custom models in TF Lite • Qualcomm Snapdragon Neural Processing Engine (SNPE) • https://developer.qualcomm.com/software/snapdragon-neural-processing-engine-ai • Huawei HiAI DDK 86
  • 87. https://aiyprojects.withgoogle.com/edge-tpu https://www.anandtech.com/show/13393/techinsights-publishes-apple-a12-die-shot-our-take • Figure 7.38: Floor plan of the 8-core Pixel Visual Core chip. A53 is an ARMv7 core. LPDDR4 is a DRAM controller. PCIE and MIPI are I/O buses. 87
  • 88. Figure 7.13 Example of systolic array in action, from top to bottom on the page. In this example, the six weights are already inside the multiply-accumulate units, as is the norm for the TPU. The three inputs are staggered in time to get the desired effect, and in this example are shown coming in from the top. (In the TPU, the data actually comes in from the left.) The array passes the data down to the next element and the result of the computation to the right to the next element. At the end of the process, the sum of products is found to the right. Drawings courtesy of Yaz Sato. It seems Edge TPU is not TPU-like? Figure 7.14 Systolic data flow of the Matrix Multiply Unit. https://www.elsevier.com/books-and-journals/book-companion/9780128119051 88
  • 89. Edge TPU and NCS 2 • inference time (ms):

    device | MobileNet V1 1.0/224 | MobileNet V2 1.0/224 | Inception V3 | ResNet 50 | SqueezeNet 1.1 | MobileNet V1 0.25/128 | SSD MobileNet V1 COCO | SSD MobileNet V2 COCO
    Coral: Edge TPU | 2.74 | 2.87 | 43.27 | 42.41 | 1.90 | 1.11 | 10.05 | 12.48
    NCS 2 (fp16) | 12.11 | 14.87 | 52.25 | 33.1 | 3.99 | 4.08 | 23.53 | 39.11
    iPhone Xs Max (Neural Engine accelerated, fp16) | 1.74 | 2.15 | 8.65 | 6.91 | 1.75 | 1.16 | n/a | n/a

  • Mobilenet V1/V2 and SSD Mobilenet V1/V2 are quite good • Edge TPU: my scripts, https://github.com/freedomtan/edge_tpu_python_scripts • NCS 2: ./benchmark_app -d MYRIAD -niter 50 -nireq 10 .. • iPhone Xs Max: my CoreML benchmark, https://github.com/freedomtan/coremlbenchmark 89
  • 90. Mobilenet V1 on EdgeTPU and NCS2 • inference time (ms) by input size:

    model | 128x128 | 160x160 | 192x192 | 224x224
    ncs2 mobilenet_v1_0.25 | 3.83 | 3.95 | 4.06 | 4.4
    ncs2 mobilenet_v1_0.5 | 4.98 | 4.86 | 5.51 | 6.51
    ncs2 mobilenet_v1_0.75 | 6.04 | 6.67 | 7.93 | 9.4
    ncs2 mobilenet_v1_1.0 | 7.43 | 8.68 | 10.13 | 12.2
    coral mobilenet_v1_0.25 | 1.07 | 1.24 | 1.30 | 1.47
    coral mobilenet_v1_0.5 | 1.16 | 1.40 | 1.53 | 1.95
    coral mobilenet_v1_0.75 | 1.29 | 1.70 | 1.80 | 2.16
    coral mobilenet_v1_1.0 | 1.50 | 1.95 | 2.15 | 2.85

  [chart: the same numbers plotted as time (ms) vs. input size] 90
  • 91. Imprinting Engine • Yes, let's check what it is • The Imprinting Engine implements a low-shot learning technique called "Imprinted Weights" [1][2] • Can be used to retrain classifiers on-device (either on USB Accelerator or Dev Board); no back-propagation gradient involved • Why? • Transfer learning happens on-device, at near-realtime speed • You don't need to recompile the model • Limitations • Training data size is limited to a max of 200 images per class • It is most suitable only for datasets that have a small inner-class variation • The last fully-connected layer runs on the CPU, not the Edge TPU, so it will be slightly less efficient than running a pre-compiled model on the Edge TPU • If you are interested in it, check the paper and aiy::learn::imprinting::ImprintingEngine::Train(unsigned char const*, int, int) • [figure: the base graph (input -> edgetpu-custom-op -> AvgPool, yielding a 1×1×1×1024 embedding) extended with L2Normalization, a 5-class 1×1 Conv2D (weights 5×1×1×1024, bias 5), Reshape, and Softmax] [1] https://coral.withgoogle.com/docs/edgetpu/retrain-classification-ondevice/ [2] https://arxiv.org/abs/1712.07136 91
  • 92. EfficientNet • EfficientNet-B0: • compared with MobileNet V1: much smaller FLOPS, much higher accuracy • compared with MobileNet V2: a bit larger FLOPS, but much higher accuracy http://ai.googleblog.com/2019/05/efficientnet-improving-accuracy-and.html 92
  • 95. Depthwise Separable Convolution • CNNs with depthwise separable convolution such as MobileNet [1] changed almost everything • Depthwise separable convolution "factorizes" a standard convolution into a depthwise convolution and a 1×1 convolution called a pointwise convolution, thus greatly reducing computation complexity (see the cost sketch after the next figure) • Depthwise separable convolution is not that new [2], but pure depthwise separable convolution-based networks such as Xception and MobileNet demonstrated its power [1] https://arxiv.org/abs/1704.04861 [2] L. Sifre. "Rigid-motion scattering for image classification", PhD thesis, 2014 95
  • 96. [figure: a standard convolution's N filters of shape DK×DK×M factorized into M depthwise filters of shape DK×DK×1 plus N pointwise (1×1) filters of shape 1×1×M] https://arxiv.org/abs/1704.04861 Depthwise Separable Convolution 96
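  The multiply-add counts from the MobileNet V1 paper make the saving easy to check; a quick sketch for one of its 14×14×512 layers:

    # standard DKxDK conv: DK*DK*M*N*DF*DF multiply-adds
    # depthwise separable: DK*DK*M*DF*DF (depthwise) + M*N*DF*DF (pointwise)
    DK, M, N, DF = 3, 512, 512, 14

    standard = DK * DK * M * N * DF * DF
    separable = DK * DK * M * DF * DF + M * N * DF * DF
    print(standard / separable)  # ~8.8x fewer multiply-adds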