2. • disclaimer: opinions are my own
• feel free to interrupt me if you have questions during the presentation
• questions can be in Taiwanese, English, or Mandarin
3. • Used open source before the term “open source” was coined
• A software guy; learned to use Unix and open source software on a VAX-11/780 running 4.3BSD
• Used to be a programming language junkie
• Worked on various system software, e.g., CPU scheduling and power management of non-CPU components
• Recently, working on NN performance on edge devices
• Contributed from time to time to TensorFlow Lite
• started a command-line label_image for TFLite
Who I am
https://gunkies.org/w/images/c/c1/DEC-VAX-11-780.jpg
4. The VAX-11/780 CPU consists of TTL ICs
https://en.wikipedia.org/wiki/Transistor%E2%80%93transistor_logic https://en.wikipedia.org/wiki/7400-series_integrated_circuits
5. Why TFLite?
• TensorFlow Lite
• TensorFlow is one of the most popular machine learning frameworks
• TFLite: a lightweight runtime for edge devices
• originally mobile devices —> mobile and IoT/embedded devices
• could be accelerated by GPU, DSP, or ASIC accelerator
• How about PyTorch?
• yes, it is popular, but not yet on mobile devices
• Yes, there are other open source NN frameworks, but none is as comprehensive as TFLite, as far as I can tell
• See my talk slide deck at COSCUP 2019 for more discussion, https://www.slideshare.net/kstan2/status-quo-of-tensor-flow-lite-on-edge-devices-coscup-2019
6. Outline
• Overview of TFLite on Android and iOS devices,
• TFLite metadata and TFLite Android code generator,
• Some new features: CoreML delegate and XNNPACK delegate
7. What is TensorFlow Lite
• TensorFlow Lite is a cross-platform framework for deploying ML on mobile
devices and embedded systems
• Mobile devices -> mobile and IoT/embedded devices
• TFLite for Android and iOS
• TFLu: TFLite micro for micro-controllers
8. Why ML on Edge devices
• Low latency & close-knit interactions
• “There is an old network saying: Bandwidth problems can be cured with money. Latency problems are harder because the speed of light is fixed — you can't bribe God.” -- David D. Clark
• network connectivity
• you probably heard “always-on” back in the 3G days; you know that’s still not true in the 5G era
• privacy preserving
• sensors
9. from TF Dev Summit 2020, https://youtu.be/27Zx-4GOQA8
10. • We’ll talk about
• TFLite metadata and codegen, which are in the TFLite Support library
• two delegates which enable using hardware capabilities
• Other things you may want to dig into
• quantization: fixed-point, integer arithmetic
• the ARM dot product instructions, Apple A13 matrix operations in CPUs (yes, CPUs)
• the GPU delegate started supporting quantized models a couple of months ago
• GPUs usually support fp16 first
• new MLIR-based runtimes, such as TFRT and IREE
• I’ll talk a little bit about TFRT tomorrow
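As a taste of the quantization topic in the list above, here is a minimal NumPy sketch of the affine (uniform) scheme TFLite's quantized models use, real_value = scale * (quantized_value - zero_point); the particular scale and zero_point values below are illustrative, not from any real model:

```python
import numpy as np

def quantize(x, scale, zero_point):
    # Affine-quantize float32 values to uint8: q = round(x / scale) + zero_point
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 255).astype(np.uint8)

def dequantize(q, scale, zero_point):
    # Recover approximate float32 values: x ~= scale * (q - zero_point)
    return scale * (q.astype(np.float32) - zero_point)

x = np.array([-1.0, 0.0, 0.5, 1.0], dtype=np.float32)
scale, zero_point = 2.0 / 255.0, 128  # maps roughly [-1, 1] onto [0, 255]
q = quantize(x, scale, zero_point)
x_hat = dequantize(q, scale, zero_point)
```

The round trip loses at most about one quantization step of precision, which is why the choice of scale/zero_point per tensor matters so much for quantized model accuracy.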
11. So how to start using TFLite
• TFLite actually has two main parts
• interpreter: loads and runs a model on various hardware
• converter: converts TF models to a TFLite-specific format to be used by the interpreter
• see https://www.tensorflow.org/lite/guide for more introduction materials
• There is a good guide on how to load a model and do inference on devices
using TFLite interpreter, in Java, Swift, Objective-C, C++, and Python
• https://www.tensorflow.org/lite/guide/inference
12. load and run a model in C++
other APIs are wrappers around C++ code
https://www.tensorflow.org/lite/guide/inference
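Since the other language APIs wrap the same flow, a Python sketch of load-and-run may help. The toy model (y = 2x + 1) and its shapes are made up so the snippet is self-contained; normally you would pass model_path= pointing at a .tflite file instead of converting in memory:

```python
import numpy as np
import tensorflow as tf

# Build a toy model and convert it to the TFLite flatbuffer format in memory,
# just so this example is self-contained.
class Toy(tf.Module):
    @tf.function(input_signature=[tf.TensorSpec([1, 4], tf.float32)])
    def __call__(self, x):
        return 2.0 * x + 1.0

toy = Toy()
converter = tf.lite.TFLiteConverter.from_concrete_functions(
    [toy.__call__.get_concrete_function()], toy)
tflite_model = converter.convert()

# The load-and-run flow: create an interpreter, allocate tensors,
# set the input, invoke, read the output.
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
interpreter.set_tensor(inp["index"], np.ones((1, 4), dtype=np.float32))
interpreter.invoke()
result = interpreter.get_tensor(out["index"])
```

The Java, Swift, and Objective-C APIs follow the same allocate / set input / invoke / get output sequence around the C++ core.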
14. TFLite Metadata
• before TFLite Metadata was introduced, when we loaded and ran a model
• it was the user’s/developer’s responsibility to figure out what the input and output tensors are. E.g.,
• we know an image classifier usually expects preprocessed (resized, cropped, padded, etc.) and normalized ([0, 1] or [-1, 1]) data
• the label file is not included in the model
• in TFLite metadata, there are three parts in the schema:
• Model information - Overall description of the model as well as items such as licence terms.
See ModelMetadata.
• Input information - Description of the inputs and pre-processing required such as normalization.
See SubGraphMetadata.input_tensor_metadata.
• Output information - Description of the output and post-processing required such as mapping to labels.
See SubGraphMetadata.output_tensor_metadata.
https://www.tensorflow.org/lite/convert/metadata
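The normalization parameters described by the input metadata are the classic mean/std pair, normalized = (pixel - mean) / std. A minimal NumPy sketch of what a model consumer does with them; the mean/std values below are the two common conventions from the slide above, not from any particular model:

```python
import numpy as np

def normalize(image, mean, std):
    # Normalize uint8 pixels the way NormalizationOptions describes:
    # normalized = (pixel - mean) / std
    return (image.astype(np.float32) - mean) / std

# Stand-in for a resized camera frame: an all-white 224x224 RGB image.
rgb = np.full((224, 224, 3), 255, dtype=np.uint8)

# mean=0, std=255 maps [0, 255] onto [0, 1]
x01 = normalize(rgb, mean=0.0, std=255.0)
# mean=127.5, std=127.5 maps [0, 255] onto [-1, 1]
x11 = normalize(rgb, mean=127.5, std=127.5)
```

With the parameters packed in metadata, generated wrapper code can do this for you instead of each developer hard-coding the convention.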
15. • Supported Input / Output types
• Feature - Numbers which are unsigned integers or float32.
• Image - Metadata currently supports RGB and greyscale images.
• Bounding box - Rectangular shape bounding boxes. The schema supports a
variety of numbering schemes.
• Pack the associated files, e.g.,
• label file(s)
• Normalization and quantization parameters
16. • With the example at https://www.tensorflow.org/lite/convert/metadata, we can create an image classifier with
• image input, and
• label output
https://www.tensorflow.org/lite/convert/metadata
19. My exercise using Android CameraX and TFLite codegen in Kotlin
• To test TFLite metadata and codegen, I need an Android app that can
• grab camera inputs and
• convert them into Android Bitmap to feed into the generated model
wrapper.
• Since I knew nothing about the Android Camera API and Kotlin, I started this from the CameraX tutorial. It turned out to be quite easy.
• https://github.com/freedomtan/CameraxTFLite
24. What is a TFLite delegate?
• “A TensorFlow Lite delegate is a way to delegate part or all of graph execution to another executor.”
• Why delegates?
• running computation-intensive NN models on mobile CPUs is resource demanding; processing power and energy consumption can be problems
• and matrix multiplication, which is the core of convolution and fully connected ops, is highly parallel
• Thus, some devices have hardware accelerators, such as GPUs or DSPs, that provide better performance and higher energy efficiency through the Android NNAPI
• To use NNAPI, TFLite has had an NNAPI delegate from the very beginning. Then there are GPU delegates (GL ES, OpenCL, and Metal for now; a Vulkan one is coming) and others.
• my COSCUP 2019 slide deck on how the NNAPI and GPU delegates work: https://www.slideshare.net/kstan2/tflite-nnapi-and-gpu-delegates
25. XNNPACK and CoreML Delegates
• “XNNPACK is a highly optimized library of floating-point neural network inference operators for ARM,
WebAssembly, and x86 platforms.”
• “XNNPACK is not intended for direct use by deep learning practitioners and researchers; instead it
provides low-level performance primitives for accelerating high-level machine learning frameworks, such
as TensorFlow Lite, TensorFlow.js, PyTorch, and MediaPipe.", https://github.com/google/XNNPACK
• NNPACK —> QNNPACK —> XNNPACK
• In TFLite, there is an XNNPACK delegate
• CoreML is Apple’s machine learning framework
• the only official way to use the Neural Engine, Apple’s NN accelerator, introduced with the A11
• nope, CoreML cannot use the A11 Neural Engine, https://www.slideshare.net/kstan2/why-you-cannot-use-neural-engine-to-run-your-nn-models-on-a11-devices
26. • convolution is at the core of current neural network models
• how convolution is implemented, either in SW or HW:
• “direct convolution”: 6 or 7 levels of nested for loops,
• im2col, then GEMM,
• other transforms, e.g., Winograd
• XNNPACK found a way to reuse GEMM efficiently (the indirect convolution algorithm)
XNNPACK
https://arxiv.org/pdf/1907.02129.pdf
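The first two items in the list above can be sketched in NumPy: a direct convolution with six nested loops, and an im2col-plus-GEMM version computing the same result. The shapes, stride 1, and no padding are simplifying assumptions for brevity:

```python
import numpy as np

def conv2d_direct(x, w):
    # "Direct convolution": nested loops over output position, filter taps,
    # and channels. x: (H, W, Cin), w: (KH, KW, Cin, Cout), stride 1, no padding.
    H, W, Cin = x.shape
    KH, KW, _, Cout = w.shape
    OH, OW = H - KH + 1, W - KW + 1
    y = np.zeros((OH, OW, Cout), dtype=x.dtype)
    for oh in range(OH):
        for ow in range(OW):
            for kh in range(KH):
                for kw in range(KW):
                    for ci in range(Cin):
                        for co in range(Cout):
                            y[oh, ow, co] += x[oh + kh, ow + kw, ci] * w[kh, kw, ci, co]
    return y

def conv2d_im2col(x, w):
    # im2col: copy each receptive field into a row, then do one big GEMM.
    H, W, Cin = x.shape
    KH, KW, _, Cout = w.shape
    OH, OW = H - KH + 1, W - KW + 1
    cols = np.empty((OH * OW, KH * KW * Cin), dtype=x.dtype)
    for oh in range(OH):
        for ow in range(OW):
            cols[oh * OW + ow] = x[oh:oh + KH, ow:ow + KW, :].ravel()
    return (cols @ w.reshape(KH * KW * Cin, Cout)).reshape(OH, OW, Cout)

# Both paths agree on a small example.
x = np.arange(48, dtype=np.float32).reshape(4, 4, 3)
w = np.ones((3, 3, 3, 2), dtype=np.float32)
assert np.allclose(conv2d_direct(x, w), conv2d_im2col(x, w))
```

im2col trades extra memory (each input pixel is copied into up to KH*KW rows) for a single highly optimized matrix multiplication, which is exactly the tension the indirect-GEMM work in the paper above addresses.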
27. Using XNNPACK in label_image.cc
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/examples/label_image/label_image.cc#L109-L116
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tools/evaluation/utils.h#L64-L88
29. Concluding remarks
• TFLite is getting more mature and comprehensive
• If you haven’t started using it, you may want to start with TFLite metadata and the Android code generator
• nope, there is no iOS code generator (yet)
• To speed up execution of NN models, use TFL delegates
• note that not all accelerators are created equal
• some are fp only; some are int/quant only