TFLite NNAPI and
GPU Delegates
Koan-Sin Tan

freedom@computer.org

Aug 18th, 2019

COSCUP 2019, Taipei, Taiwan
• disclaimer: Opinions Are My Own

• feel free to interrupt me if you have any questions

• questions in English, Taiwanese, and Mandarin are fine

• note that I am going to skip memory-related code in this talk
because of time constraints. Memory management,
including locality and zero-copy, is always a crucial part of
high-performance computing
who i am
• Used open source before the term "open
source" was coined

• A software guy; learned to use Unix and open
source software on a VAX-11/780 running 4.3BSD

• Used to be a programming language junkie

• Worked on various system software, e.g., CPU
scheduling and power management of non-
CPU components

• Recently, working on NN performance on edge
devices

• Contributed from time to time to TensorFlow
Lite

• started the command-line label_image tool for
TFLite
https://github.com/tensorflow/tensorflow/releases/tag/v2.0.0-alpha0
http://gunkies.org/w/images/c/c1/DEC-VAX-11-780.jpg
Delegation
• Delegation: one of the long-established
mechanisms described in the GoF book

• presumably, you know this well
already

• if not, dictionary definitions of
"delegate" will do

figure from GoF, https://learning.oreilly.com/library/view/design-patterns-elements/0201633612/ch01.html#ch01lev3sec4
So, what is a TFLite
delegate?
• “A TensorFlow Lite delegate is a way to delegate part or all of graph execution to another
executor.”

• Why delegates?

• running computation-intensive NN models on mobile devices is demanding for
mobile CPUs; processing power and energy consumption can be problems

• and matrix multiplication, which is at the core of convolution and fully connected ops, is
highly parallel

• Thus, some devices have hardware accelerators, such as GPUs or DSPs, that provide better
performance and higher energy efficiency through Android NNAPI

• To use NNAPI, TFLite has an NNAPI delegate

• Why I want to share what I know

• used TFLite, contributed some code, e.g., label_image for TFLite

• wrote quick-and-dirty TFLite GPU delegate benchmarks
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/g3doc/performance/delegates.md
What is TFLite
• A lightweight inference engine

• originally for Android and
similar platforms. Extended to
micro-controllers (e.g., ARM
Cortex-M series)

• Interpreter-based (what other
choices do they have?)

• ops are organized as a
directed acyclic graph (DAG)

• execute / interpret ops one by
one if no delegates are involved
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/core/subgraph.cc#L734-L798
TfLiteContext
• TfLiteContext: reporting
facilities and access to global
objects, including all the
tensors

• TfLiteNode: a single node or
operation

• TfLiteRegistration: the
implementation of a
conceptual operation
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L411-L485
ResizeTensor()
ReportError()
AddTensors()
GetNodeAndRegistration()
ReplaceNodeSubsetsWithDelegateKernels()
GetExternalContext()
SetExternalContext()
…
tensors_size
tensors
impl_
recommended_num_threads
allow_fp32_relax_to_fp16
profiler
…
TfLiteContext
TfLiteNode
• TfLiteContext: reporting
facilities and access to global
objects, including all the
tensors

• TfLiteNode: a single node or
operation

• TfLiteRegistration: the
implementation of a
conceptual operation
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L377-L409
inputs
outputs
intermediates
temporaries
user_data
builtin_data
custom_initial_data
custom_initial_data_size
delegate
…
TfLiteNode
TfLiteRegistration
• TfLiteContext: reporting
facilities and access to global
objects, including all the
tensors

• TfLiteNode: a single node or
operation

• TfLiteRegistration: the
implementation of a
conceptual operation
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L487-L544
init()
free()
prepare()
invoke()
profiling_string()
…
builtin_code
custom_name
version
…
TfLiteRegistration
To know more
• Read [1][2] and create a custom op will help
understanding TfLiteRegistration, TfLiteNode, and
TfLiteContext deeper

[1] https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/g3doc/guide/
inference.md#write-a-custom-operator

[2] https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/g3doc/guide/
ops_custom.md
TfLiteDelegate: the
interface
• In case you didn't notice it
yet, TFLite is mainly written in
C++

• C API for FFI from other
high level languages

• I hacked a Smalltalk one

• many types are structs with no
member functions so that they
can be used from the C API easily
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L563-L602
Prepare()
CopyFromBufferHandle()
CopyToBufferHandle()
FreeBufferHandle()
…
data_
flags
…
TfLiteDelegate
How do TFLite delegates
work?
• Let's say we have a simple model graph such as the following:

• Let's assume that there is a delegate "MyDelegate," which has a faster
implementation for Conv2D and Mean operations. The resulting main graph
will be updated to look like below.
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/g3doc/performance/delegates.md
[Figure: after delegation, the whole MobileNet graph collapses into a single
TfLiteNnapiDelegate op — input 1×224×224×3, output 1×1001, with all the
weight and bias tensors (32×3×3×3, 1×3×3×32, …, 1001×1×1×1024) feeding
that one node]
What does a real model
look like?
• With the NNAPI delegate
rewrite that landed in Nov
2018, a subgraph delegated to
an "accelerator" is an op
(named Delegate) in TFLite
now

• subgraph

• all-or-nothing —> per op
[Figure: MobileNet V1 without delegation — input 1×224×224×3 flows through
alternating Conv2D and DepthwiseConv2D ops (each with its weights and bias),
then AveragePool2D, a final 1001-filter Conv2D, Squeeze, and Softmax to
produce the 1×1001 output]
http://localhost:8080/, http://localhost:8090/
delegates in TFLite
• NNAPI delegate

• mainly for Android

• GPU delegate: NNAPI, which was introduced in Android O MR1 (late 2017), is not
popular (yet)

• GL ES Compute shader on Android

• Metal shader on iOS

• FlexDelegate: eager mode to run some ops

• useful when not all ops are supported by TFLite or accelerators (thru something
like NNAPI or GPU delegate)

• not in TensorFlow repo: EdgeTPU delegate
NNAPI-enabled devices ~ 25.8% around May 7, 2019
https://developer.android.com/about/dashboards
GL ES compute shader capable devices ~ 50%
https://developer.android.com/about/dashboards
Android NN API
• Announced/published with Android 8.1
Preview 1

• Available to developers in the NDK

• yes, NDK

• The Android Neural Networks API (NNAPI)
is an Android C API designed for running
computationally intensive operations for
machine learning on mobile devices

• NNAPI is designed to provide a base layer
of functionality for higher-level machine
learning frameworks (such as TensorFlow
Lite, Caffe2, or others) that build and train
neural networks

• The API is available on all devices running
Android 8.1 (API level 27) or higher
https://developer.android.com/ndk/images/nnapi/nnapi_architecture.png
So, what is a delegate
supposed to implement?
• Understanding how to
add a delegate helps

• define a kernel node,
which means to
implement
TfLiteRegistration

• create an instance of
TfLiteDelegate, then
register the kernel node in
Prepare()
typedef struct TfLiteDelegate {
  void* data_;
  TfLiteStatus (*Prepare)(TfLiteContext* context,
                          struct TfLiteDelegate* delegate);
  TfLiteStatus (*CopyFromBufferHandle)(TfLiteContext* context,
                                       struct TfLiteDelegate* delegate,
                                       TfLiteBufferHandle buffer_handle,
                                       TfLiteTensor* tensor);
  TfLiteStatus (*CopyToBufferHandle)(TfLiteContext* context,
                                     struct TfLiteDelegate* delegate,
                                     TfLiteBufferHandle buffer_handle,
                                     TfLiteTensor* tensor);
  void (*FreeBufferHandle)(TfLiteContext* context,
                           struct TfLiteDelegate* delegate,
                           TfLiteBufferHandle* handle);
  int64_t flags;
} TfLiteDelegate;
typedef struct _TfLiteRegistration {
  void* (*init)(TfLiteContext* context, const char* buffer, size_t length);
  void (*free)(TfLiteContext* context, void* buffer);
  TfLiteStatus (*prepare)(TfLiteContext* context, TfLiteNode* node);
  TfLiteStatus (*invoke)(TfLiteContext* context, TfLiteNode* node);
  const char* (*profiling_string)(const TfLiteContext* context,
                                  const TfLiteNode* node);
  int32_t builtin_code;
  const char* custom_name;
  int version;
} TfLiteRegistration;
NNAPI delegate
• C++ code, instead of the C-style
one

• derived from TfLiteDelegate

• Some private data
structures

• extra member functions
corresponding to private
data structures
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/
nnapi_delegate.h#L29-L161
Prepare()
CopyFromBufferHandle()
CopyToBufferHandle()
FreeBufferHandle()
…
data_
flags
…
TfLiteDelegate
Prepare()
CopyFromBufferHandle()
CopyToBufferHandle()
FreeBufferHandle()
GetOptions()
RegisterNnapiMemory()
GetTensorMemoryMap()
…
data_
flags
acceleration_name
(options)
(memory_registration)
…
StatefulNnApiDelegate
data
• execution_preference

• power/perf tradeoff: not
widely supported as far as I
can tell

• accelerator_name: e.g.,
“fallback” and “hvx”

• cache_dir

• model_token

• tensor_memory_map:
MemoryRegistration
struct Data {
// Preferred Power/perf trade-off.
Options::ExecutionPreference execution_preference;
// Selected NNAPI accelerator name.
std::string accelerator_name;
// The cache dir for NNAPI model.
std::string cache_dir;
// The unique token string for NNAPI model.
std::string model_token;
// Tensor to ANeuralNetworksMemory mapping.
std::vector<MemoryRegistration> tensor_memory_map;
};
// Encapsulates all fields related to memory registration for internal
// bookkeeping only.
struct MemoryRegistration {
ANeuralNetworksMemory* memory;
CopyToHostTensorFnPtr callback;
void* callback_context;
};
TfLiteRegistration for
nnapi_delegate_kernel
• init()

• free()

• prepare()

• invoke()

• no profiling_string()

• builtin_code = …

• custom_name
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.cc#L3575-L3607
init()
free()
prepare()
invoke()
profiling_string()
…
builtin_code
custom_name
version
…
TfLiteRegistration
Init() of NNAPI Delegate
Kernel
• mainly for NNAPI initialization

ANeuralNetworksCompilation_*()
• and build graph

• if NNAPI >= 1.2, it checks that
there is a "real" NNAPI device

• one interesting conversion is
INT8 -> UINT8
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.cc#L2571-L2672
INT8 —> UINT8 conversion
• Originally, TFLite and NNAPI used asymmetric UINT8 quantization

• asymmetric quantization provides more flexibility, but symmetric INT8 is usually more
hardware friendly

• more and more INT8 code in TFLite

• NNAPI doesn't change as fast as TFLite, so conversion is needed

• See the quantization paper for TFLite [1] and MLIR’s quantization doc [2]

[1] Jacob, B et al., ”Quantization and Training of Neural Networks for Efficient Integer-
Arithmetic-Only Inference”, https://arxiv.org/abs/1712.05877

[2] https://github.com/tensorflow/mlir/blob/master/g3doc/Quantization.md
Invoke() of NNAPI Delegate
Kernel
• mainly memory management
and 

ANeuralNetworksExecution*()
• To dig deeper we have to go
through more TFLite and NNAPI
data structures

• asking NNAPI to work for you
is quite trivial when everything
is well-prepared
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.cc#L2683-L2872
DoPrepare
• for NNAPI >= 1.2 (Android Q and
later), if there are no real accelerators,
i.e., only the NNAPI CPU fallback,
computation is not
offloaded

• Check for every node to see if it is
supported

• NN API Delegate Registration:
previous pages

• Requests TFLite to partition the
graph and create a new
nnapi_delegate_kernel for each
independent node subset
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.cc#L3353-L3457
partition graph
• at the end of DoPrepare(),
ReplaceNodeSubsetsWithDelegateKernels()
is called

• DoPrepare() ->
Subgraph::ReplaceNodeSubsetsWithDelegateKernels() ->
tflite::PartitionGraphIntoIndependentNodeSubsets() ->
tflite::Partition()
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/core/
subgraph.cc#L298-L363
tflite::Partition() does most of
the partitioning job
• part of Partition()
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/graph_info.cc#L67-L118
GPU GL Delegate
TfLiteRegistration
• TfLiteRegistration in
DelegatePrepare()

• init()

• no free()

• prepare() is quite simple

• invoke(): simply calls
node->Invoke()

• context ->
ReplaceNodeSubsetsWithDelegateKernels()
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/gl_delegate.cc#L392-L431
GPU GL Delegate
• TfLiteDelegate

• Prepare

• CopyFromBufferHandle

• CopyToBufferHandle

• class Delegate

• TfLiteGpuDelegateCreate()
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/gl_delegate.cc#L75-L457
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/gl_delegate.cc#L464-L470
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/gl_delegate.cc#L75-L457
GPU Metal Delegate
TfLiteRegistration
• TfLiteRegistration in
DelegatePrepare()

• init()

• no free()

• prepare() is quite simple

• invoke(): simply calls
node->Invoke()

• context ->
ReplaceNodeSubsetsWithDelegateKernels()
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/gl_delegate.cc#L392-L431
GPU Metal Delegate
• TfLiteDelegate

• Prepare: yup, just Prepare()

• class Delegate, which is quite
large

• NewGpuDelegate()
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/metal_delegate.mm#L525-L532
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/metal_delegate.mm#L620-L624
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/metal_delegate.mm#L163-L613
GPU delegate kernels
• GPU backends require initialization
involving shader compilation and
optimization by the driver before
inference

• PHWC4: P stands for plane

• Reshape is expensive on GPU

• RGBA is better than RGB on GPU

• a tensor of shape [B,H,W,5], for
instance, is twice as expensive as [B, H,
W, 4], but about the same as [B, H, W,
8], so the architect can tune around
those 4-channel boundaries rather than
trying to optimize along other boundaries
https://arxiv.org/pdf/1907.01989.pdf
Flex Delegate
• Another delegate is the
one that provides a
selected set of ops in
eager mode

• It’s much easier to check
what it does
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/flex/delegate.cc#L143-L148
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/flex/kernel.cc#L561-L573
Edge TPU’s canned model
• supported ops are packed into a
single op for the Edge TPU
The compiler creates a single custom op for all Edge TPU
compatible ops; anything else stays the same
https://coral.withgoogle.com/docs/edgetpu/models-intro/
[Figure: MobileNet V1 — input 1×224×224×3 feeds a single edgetpu-custom-op,
followed by Softmax; output 1×1001]

[Figure: SSD MobileNet V1 — normalized_input_image_tensor 1×300×300×3 feeds
an edgetpu-custom-op, followed by TFLite_Detection_PostProcess with outputs
1×10×4, 1×10, 1×10, and 1]
Edge TPU C++ API
https://coral.withgoogle.com/docs/edgetpu/api-intro/
EdgeTPU Delegate
• There is a dynamic delegate plugin interface. Currently it's
only used by the EdgeTPU delegate
https://coral.withgoogle.com/docs/edgetpu/api-intro/
There still are many trivial bugs in
TensorFlow
• There are many typos in comments of TensorFlow code
• Many things are not well-documented
• There are many, many warnings when building TensorFlow from source
code
• a trivial fix of mine from May 2019
https://github.com/tensorflow/tensorflow/pull/28618

Boost PC performance: How more available memory can improve productivity
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 

TFLite NNAPI and GPU Delegates

  • 1. TFLite NNAPI and GPU Delegates Koan-Sin Tan freedom@computer.org Aug 18th, 2019 COSCUP 2019, Taipei, Taiwan
  • 2. • disclaimer: Opinions Are My Own • feel free to interrupt me if you have any questions • questions in English, Taiwanese, and Mandarin are fine • note that i am gonna skip memory related code in the talk because of time constraint. Memory management, including locality and zero-copy, is always a crucial part of high-performance computing 2
  • 3. who i am • Used open source before the term “open source” is used • A software guy, learned to use Unix and open source software on VAX-11/780 running 4.3BSD • Used to be a programming language junkie • Worked on various system software, e.g., CPU scheduling and power management of non- CPU components • Recently, on NN performance on edge devices related stuff • Contributed from time to time to TensorFlow Lite • started a command line label_image for TFLite https://github.com/tensorflow/tensorflow/releases/tag/v2.0.0-alpha0 http://gunkies.org/w/images/c/c1/DEC-VAX-11-780.jpg 3
  • 4. Delegation • Delegation: one of the commonly used old mechanisms mentioned in the GoF book • presumably, you know this well already • in case no, delegate definitions from dictionaries work figure from GoF, https://learning.oreilly.com/library/view/design-patterns-elements/0201633612/ch01.html#ch01lev3sec4
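The GoF pattern on this slide can be sketched in a few lines of C++ (a textbook-style toy, not TFLite code): a Window forwards its Area() request to an embedded Rectangle instead of inheriting the behavior.

```cpp
#include <iostream>

// Delegate: the object that actually does the work.
class Rectangle {
 public:
  Rectangle(double w, double h) : w_(w), h_(h) {}
  double Area() const { return w_ * h_; }
 private:
  double w_, h_;
};

// Delegator: forwards the request to its delegate (composition, not inheritance).
class Window {
 public:
  Window(double w, double h) : rect_(w, h) {}
  double Area() const { return rect_.Area(); }  // delegation happens here
 private:
  Rectangle rect_;
};
```

The same idea scales up to TFLite: the interpreter keeps its interface but forwards the execution of some nodes to another object.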
  • 5. So, what is a TFLite delegate? • “A TensorFlow Lite delegate is a way to delegate part or all of graph execution to another executor.” • Why delegates? • running computation-intensive NN models on mobile devices is resource-demanding for mobile CPUs; processing power and energy consumption could be problems • and matrix multiplication, which is at the core of convolution and fully-connected ops, is highly parallel • Thus, some devices have hardware accelerators, such as GPU or DSP, that provide better performance and higher energy efficiency thru Android NNAPI • To use NNAPI, TFLite has an NNAPI delegate • Why I want to share what I know • used TFLite, contributed some code, e.g., label_image for TFLite • wrote quick-and-dirty TFLite GPU delegate benchmarks https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/g3doc/performance/delegates.md
  • 6. What is TFLite • A lightweight inference engine • originally for Android and similar platforms. Extended to micro-controllers (e.g., ARM Cortex-M series) • Interpreter-based (what other choices do they have?) • ops are organized as a directed acyclic graph (DAG) • execute / interpret ops one by one if no delegates are involved https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/core/subgraph.cc#L734-L798
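The "interpret ops one by one" loop can be sketched with a toy interpreter; all names and types below are made up for illustration, while the real loop lives in tflite::Subgraph::Invoke().

```cpp
#include <functional>
#include <vector>

// Hypothetical mini-interpreter: tensors are plain floats, each node reads
// its inputs and writes one output, and Invoke() walks the (topologically
// sorted) node list one by one -- the same shape as the TFLite interpreter
// loop when no delegate is involved.
struct Node {
  std::vector<int> inputs;
  int output;
  std::function<float(const std::vector<float>&)> op;
};

struct MiniInterpreter {
  std::vector<float> tensors;
  std::vector<Node> nodes;  // assumed already in execution order (a DAG ordering)
  void Invoke() {
    for (const Node& n : nodes) {  // execute / interpret ops one by one
      std::vector<float> args;
      for (int i : n.inputs) args.push_back(tensors[i]);
      tensors[n.output] = n.op(args);
    }
  }
};

float RunExample() {
  MiniInterpreter interp;
  interp.tensors = {2.0f, 3.0f, 0.0f, 0.0f};
  interp.nodes = {
      {{0, 1}, 2, [](const std::vector<float>& a) { return a[0] * a[1]; }},  // mul
      {{2}, 3, [](const std::vector<float>& a) { return a[0] + 1.0f; }},     // add 1
  };
  interp.Invoke();
  return interp.tensors[3];  // (2 * 3) + 1
}
```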
  • 7. TfLiteContext • TfLiteContext: reporting facilities and access to global objects, including all the tensors • TfLiteNode: a single node or operation • TfLiteRegistration: the implementation of a conceptual operation https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L411-L485 ResizeTensor() ReportError() AddTensors() GetNodeAndRegistration() ReplaceNodeSubsetsWithDelegateKernels GetExternalContext() SetExternalContext() … tensors_size tensors impl_ recommended_num_threads allow_fp32_relax_to_fp16 profiler … TfLiteContext
  • 8. TfLiteNode • TfLiteContext: reporting facilities and access to global objects, including all the tensors • TfLiteNode: a single node or operation • TfLiteRegistration: the implementation of a conceptual operation https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L377-L409 inputs outputs intermediates temporaries user_data builtin_data custom_initial_data custom_initial_data_size delegate … TfLiteNode
  • 9. TfLiteRegistration • TfLiteContext: reporting facilities and access to global objects, including all the tensors • TfLiteNode: a single node or operation • TfLiteRegistration: the implementation of a conceptual operation https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L487-L544 init() free() prepare() invoke() profiling_string() … builtin_code custom_name version … TfLiteRegistration
  • 10. To know more • Reading [1][2] and creating a custom op will help you understand TfLiteRegistration, TfLiteNode, and TfLiteContext more deeply [1] https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/g3doc/guide/inference.md#write-a-custom-operator [2] https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/g3doc/guide/ops_custom.md
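As a rough sketch of what a custom op plugs into, here is a mock TfLiteRegistration-like struct bundling the op's lifecycle callbacks, with a toy "negate" op behind it. The field names mirror tensorflow/lite/c/c_api_internal.h, but MockContext, MockNode, and the builtin_code value are stand-ins, not the real TFLite types.

```cpp
#include <cstddef>

// Simplified stand-ins for the real TFLite types -- just enough to show how
// a registration bundles init/free/prepare/invoke for one conceptual op.
struct MockContext { float* tensors; };
struct MockNode { int input; int output; void* user_data; };

struct MockRegistration {
  void* (*init)(MockContext*, const char* buffer, size_t length);
  void (*free_fn)(MockContext*, void* buffer);
  int (*prepare)(MockContext*, MockNode*);  // shape/alloc checks
  int (*invoke)(MockContext*, MockNode*);   // the actual computation
  int builtin_code;
  const char* custom_name;
  int version;
};

// A "negate" custom op implemented against that interface.
static int NegPrepare(MockContext*, MockNode*) { return 0; }  // nothing to resize
static int NegInvoke(MockContext* ctx, MockNode* node) {
  ctx->tensors[node->output] = -ctx->tensors[node->input];
  return 0;
}

float RunNeg(float x) {
  float tensors[2] = {x, 0.0f};
  MockContext ctx{tensors};
  MockNode node{0, 1, nullptr};
  // builtin_code 32 is an arbitrary illustrative value.
  MockRegistration reg{nullptr, nullptr, NegPrepare, NegInvoke, 32, "Neg", 1};
  reg.prepare(&ctx, &node);
  reg.invoke(&ctx, &node);
  return tensors[1];
}
```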
  • 11. TfLiteDelegate: the interface • In case you didn’t notice it yet, TFLite is mainly written in C++ • C API for FFI from other high-level languages • I hacked a Smalltalk one • many types are plain structs with no member functions so they can be used easily from the C API https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L563-L602 Prepare() CopyFromBufferHandle() CopyToBufferHandle() FreeBufferHandle() … data_ flags … TfLiteDelegate
  • 12. How TFLite delegates work? • Let's say we have a simple model graph such as the following: • Let's assume that there is a delegate "MyDelegate," which has a faster implementation for Conv2D and Mean operations. The resulting main graph will be updated to look like below. https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/g3doc/performance/delegates.md
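The rewrite described above can be mimicked with a toy function: contiguous runs of delegate-supported ops collapse into a single "MyDelegate" node, while everything else stays on the CPU path. Op names here are illustrative only.

```cpp
#include <set>
#include <string>
#include <vector>

// Toy version of the graph rewrite in the delegates doc: each maximal run of
// ops the delegate supports becomes one "MyDelegate" node in the main graph.
std::vector<std::string> RewriteForDelegate(
    const std::vector<std::string>& graph,
    const std::set<std::string>& supported) {
  std::vector<std::string> out;
  bool in_run = false;
  for (const auto& op : graph) {
    if (supported.count(op)) {
      if (!in_run) out.push_back("MyDelegate");  // one node per supported run
      in_run = true;
    } else {
      out.push_back(op);  // unsupported op stays as-is and breaks the run
      in_run = false;
    }
  }
  return out;
}
```

With a delegate that handles Conv2D and Mean, a graph like Conv2D, Conv2D, Pad, Mean becomes MyDelegate, Pad, MyDelegate -- two delegate partitions because the unsupported Pad sits in between.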
  • 13. What does a real model look like? • With the NNAPI delegate rewrite that landed in Nov 2018, a subgraph delegated to an “accelerator” is an op (named Delegate) in TFLite now • subgraph • all-or-nothing -> per op [figures: MobileNet V1 graph, from input 1×224×224×3 through alternating Conv2D / DepthwiseConv2D layers, AveragePool2D, a final Conv2D, Squeeze, and Softmax to a 1×1001 output; with the NNAPI delegate, the whole chain collapses into a single TfLiteNnapiDelegate node] http://localhost:8080/, http://localhost:8090/
  • 14. delegates in TFLite • NNAPI delegate • mainly for Android • GPU delegate: NNAPI, which was introduced in Android O MR1 (late 2017), is not popular (yet) • GL ES Compute shader on Android • Metal shader on iOS • FlexDelegate: eager mode to run some ops • useful when not all ops are supported by TFLite or accelerators (thru something like NNAPI or GPU delegate) • not in TensorFlow repo: EdgeTPU delegate
  • 15. NNAPI-enabled devices ~ 25.8% around May 7, 2019 https://developer.android.com/about/dashboards 15
  • 16. GL ES compute shader capable devices ~ 50% https://developer.android.com/about/dashboards 16
  • 17. Android NN API • Announced/published with Android 8.1 Preview 1 • Available to developers in the NDK • yes, NDK • The Android Neural Networks API (NNAPI) is an Android C API designed for running computationally intensive operations for machine learning on mobile devices • NNAPI is designed to provide a base layer of functionality for higher-level machine learning frameworks (such as TensorFlow Lite, Caffe2, or others) that build and train neural networks • The API is available on all devices running Android 8.1 (API level 27) or higher https://developer.android.com/ndk/images/nnapi/nnapi_architecture.png 17
  • 18. So, what a delegate is supposed to implement • Understanding how to add a delegate helps • define a kernel node, which means to implement TfLiteRegistration • create an instance of TfLiteDelegate, then register the kernel node in Prepare() typedef struct TfLiteDelegate { void* data_; TfLiteStatus (*Prepare)(TfLiteContext* context, struct TfLiteDelegate* delegate); TfLiteStatus (*CopyFromBufferHandle)(TfLiteContext* context, struct TfLiteDelegate* delegate, TfLiteBufferHandle buffer_handle, TfLiteTensor* tensor); TfLiteStatus (*CopyToBufferHandle)(TfLiteContext* context, struct TfLiteDelegate* delegate, TfLiteBufferHandle buffer_handle, TfLiteTensor* tensor); void (*FreeBufferHandle)(TfLiteContext* context, struct TfLiteDelegate* delegate, TfLiteBufferHandle* handle); int64_t flags; } TfLiteDelegate; typedef struct _TfLiteRegistration { void* (*init)(TfLiteContext* context, const char* buffer, size_t length); void (*free)(TfLiteContext* context, void* buffer); TfLiteStatus (*prepare)(TfLiteContext* context, TfLiteNode* node); TfLiteStatus (*invoke)(TfLiteContext* context, TfLiteNode* node); const char* (*profiling_string)(const TfLiteContext* context, const TfLiteNode* node); int32_t builtin_code; const char* custom_name; int version; } TfLiteRegistration;
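A minimal mock of that two-step contract, with stand-in types rather than the real TFLite structs: the delegate's Prepare() scans the graph via the context for nodes it supports, then asks the context to replace them with its kernel (ReplaceNodeSubsetsWithDelegateKernels in the real API).

```cpp
#include <vector>

// Stand-in types, just enough to show the contract from this slide.
struct MockDelegate;

struct MockContext {
  std::vector<int> node_ops;         // op code of each node, in execution order
  std::vector<int> delegated_nodes;  // filled in by the "replace" call
  void ReplaceWithDelegateKernel(const std::vector<int>& nodes) {
    delegated_nodes = nodes;  // the real call would rewire the graph here
  }
};

struct MockDelegate {
  int supported_op;  // the one op code this toy delegate accelerates
  int (*Prepare)(MockContext*, MockDelegate*);
};

// Mirrors the role of TfLiteDelegate::Prepare: collect supported nodes,
// then register the delegate kernel over them.
static int MyPrepare(MockContext* context, MockDelegate* self) {
  std::vector<int> mine;
  for (int i = 0; i < static_cast<int>(context->node_ops.size()); ++i) {
    if (context->node_ops[i] == self->supported_op) mine.push_back(i);
  }
  context->ReplaceWithDelegateKernel(mine);
  return 0;  // kTfLiteOk in the real API
}
```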
  • 19. NNAPI delegate • C++ code, instead of C-style • derived from TfLiteDelegate • Some private data structures • extra member functions corresponding to private data structures https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/ nnapi_delegate.h#L29-L161 Prepare() CopyFromBufferHandle() CopyToBufferHandle() FreeBufferHandle() … data_ flags … TfLiteDelegate Prepare() CopyFromBufferHandle() CopyToBufferHandle() FreeBufferHandle() GetOptions() RegisterNnapiMemory() GetTensorMemoryMap() … data_ flags accelerator_name (options) (memory_registration) … StatefulNnApiDelegate
  • 20. data • execution_preference • power/perf tradeoff: not widely supported as far as I can tell • accelerator_name: e.g., “fallback” and “hvx” • cache_dir • model_token • tensor_memory_map: MemoryRegistration struct Data { // Preferred Power/perf trade-off. Options::ExecutionPreference execution_preference; // Selected NNAPI accelerator name. std::string accelerator_name; // The cache dir for NNAPI model. std::string cache_dir; // The unique token string for NNAPI model. std::string model_token; // Tensor to ANeuralNetworksMemory mapping. std::vector<MemoryRegistration> tensor_memory_map; }; // Encapsulates all fields related to memory registration for internal // bookkeeping only. struct MemoryRegistration { ANeuralNetworksMemory* memory; CopyToHostTensorFnPtr callback; void* callback_context; };
  • 21. TfLiteRegistration for nnapi_delegate_kernel • init() • free() • prepare() • invoke() • no profiling_string() • builtin_code = … • custom_name https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.cc#L3575-L3607 init() free() prepare() invoke() profiling_string() … builtin_code custom_name version … TfLiteRegistration
  • 22. Init() of NNAPI Delegate Kernel • mainly NNAPI initialization, ANeuralNetworksCompilation_*() • and building the graph • if NNAPI >= 1.2, it checks that there is a “real” NNAPI device • one interesting conversion is INT8 -> UINT8 https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.cc#L2571-L2672
  • 23. INT8 -> UINT8 conversion • Original TFLite and NNAPI use asymmetric UINT8 quantization • the asymmetric scheme provides more flexibility, but symmetric INT8 is usually more hardware-friendly • more and more INT8 code for TFLite • NNAPI doesn’t change as fast as TFLite, so conversion is needed • See the quantization paper for TFLite [1] and MLIR’s quantization doc [2] [1] Jacob, B et al., ”Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference”, https://arxiv.org/abs/1712.05877 [2] https://github.com/tensorflow/mlir/blob/master/g3doc/Quantization.md
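The conversion itself is just a zero-point shift. With the asymmetric scheme from the paper, real = scale * (q - zero_point), so adding 128 to both an INT8 value and its zero point yields a UINT8 encoding of the exact same real number. This is a sketch of the arithmetic, not the delegate's actual conversion code.

```cpp
#include <cstdint>

// Asymmetric quantization: real = scale * (q - zero_point).
float Dequant(float scale, int32_t zero_point, int32_t q) {
  return scale * static_cast<float>(q - zero_point);
}

// INT8 -> UINT8: shift the stored value by +128 ...
uint8_t Int8ToUint8(int8_t q) { return static_cast<uint8_t>(q + 128); }

// ... and shift the zero point by +128 as well; the real value is unchanged.
bool SameRealValue(float scale, int32_t zp_i8, int8_t q) {
  float as_i8 = Dequant(scale, zp_i8, q);
  float as_u8 = Dequant(scale, zp_i8 + 128, Int8ToUint8(q));
  return as_i8 == as_u8;
}
```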
  • 24. Invoke() of NNAPI Delegate Kernel • mainly memory management and ANeuralNetworksExecution*() • To dig deeper we have to go thru more TFLite and NNAPI data structures • asking NNAPI to work for you is quite trivial when everything is well-prepared https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.cc#L2683-L2872
  • 25. DoPrepare • for NNAPI >= 1.2 (Android Q and later), if no real accelerator is there, i.e., only the NNAPI CPU fallback, computation is not offloaded • Check every node to see if it is supported • NNAPI delegate registration: previous pages • Request TFLite to partition the graph and create a new nnapi_delegate_kernel for each independent node subset https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.cc#L3353-L3457
  • 26. partition graph • at the end of DoPrepare(), ReplaceNodeSubsetsWithDelegateKernels() is called • DoPrepare() -> Subgraph::ReplaceNodeSubsetsWithDelegateKernels() -> tflite::PartitionGraphIntoIndependentNodeSubsets() -> tflite::Partition() https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/core/subgraph.cc#L298-L363
  • 27. tflite::Partition() does most of the partitioning work • part of Partition() https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/graph_info.cc#L67-L118
  • 28. GPU GL Delegate TfLiteRegistration • TfLiteRegistration in DelegatePrepare() • init() • no free() • prepare() is quite simple • invoke(): simply calls node->Invoke() • context->ReplaceNodeSubsetsWithDelegateKernels() https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/gl_delegate.cc#L392-L431
  • 29. GPU GL Delegate • TfLiteDelegate • Prepare • CopyFromBufferHandle • CopyToBufferHandle • class Delegate • TFLiteGpuDelegateCreate() https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/gl_delegate.cc#L75-L457 https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/gl_delegate.cc#L464-L470 https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/gl_delegate.cc#L75-L457
  • 30. GPU Metal Delegate TfLiteRegistration • TfLiteRegistration in DelegatePrepare() • init() • no free() • prepare() is quite simple • invoke(): simply calls node->Invoke() • context->ReplaceNodeSubsetsWithDelegateKernels() https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/gl_delegate.cc#L392-L431
  • 31. GPU Metal Delegate • TfLiteDelegate • Prepare: yup, just Prepare() • class Delegate, which is quite large • NewGpuDelegate() https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/metal_delegate.mm#L525-L532 https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/metal_delegate.mm#L620-L624 https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/metal_delegate.mm#L163-L613
  • 32. GPU delegate kernels • GPU backends require initialization involving shader compilation and optimization by the driver before inference • PHWC4: P stands for plane • Reshape is expensive on GPU • RGBA is better than RGB on GPU • a tensor of shape [B,H,W,5], for instance, is twice as expensive as [B, H, W, 4], but about the same as [B, H, W, 8], so the network architect can tune around those 4-channel boundaries rather than trying to optimize on other boundaries. • https://arxiv.org/pdf/1907.01989.pdf
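The 4-channel-boundary argument is easy to check numerically: with channels padded up to a multiple of 4, the storage cost of [B,H,W,C] scales with ceil(C/4)*4. A back-of-the-envelope sketch, not the delegate's actual memory planner:

```cpp
// PHWC4 pads the channel dimension up to the next multiple of 4.
int PaddedChannels(int c) { return ((c + 3) / 4) * 4; }

// Element count of a [B,H,W,C] tensor once laid out in PHWC4.
long PHWC4Elements(int b, int h, int w, int c) {
  return static_cast<long>(b) * h * w * PaddedChannels(c);
}
```

So [B,H,W,5] pads to 8 channels: twice the cost of [B,H,W,4] and the same cost as [B,H,W,8], which is exactly the tuning point the slide describes.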
  • 33. Flex Delegate • Another delegate is the one that provides a selected set of TensorFlow ops in eager mode • It’s much easier to check what it does https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/flex/delegate.cc#L143-L148 https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/flex/kernel.cc#L561-L573
  • 34. Edge TPU’s canned model • supported ops are packed into a single op for Edge TPU The compiler creates a single custom op for all Edge TPU compatible ops; anything else stays the same https://coral.withgoogle.com/docs/edgetpu/models-intro/ 34 [figures: MobileNet V1, input 1×224×224×3 -> edgetpu-custom-op -> Softmax -> 1×1001; SSD MobileNet V1, normalized_input_image_tensor 1×300×300×3 -> edgetpu-custom-op -> TFLite_Detection_PostProcess with outputs TFLite_Detection_PostProcess, :1, :2, :3]
  • 35. Edge TPU C++ API https://coral.withgoogle.com/docs/edgetpu/api-intro/
  • 36. EdgeTPU Delegate • There is a dynamic delegate plugin interface. Currently it’s only used by the EdgeTPU delegate https://coral.withgoogle.com/docs/edgetpu/api-intro/
  • 37. There still are many trivial bugs in TensorFlow • There are many typos in comments of TensorFlow code • Many things are not well-documented • There are many many warnings when building TensorFlow from source code • a trivial fix in May, 2019 by me 37 https://github.com/tensorflow/tensorflow/pull/28618