Introduction to TensorFlow Lite
1. What I Know about TensorFlow Lite
Koan-Sin Tan
freedom@computer.org
Hsinchu Coding Serfs Meeting
Dec 7th, 2017
2. • We heard about Android NN and TensorFlow Lite back in Google I/O 2017
• My COSCUP 2017 slide deck “TensorFlow on Android”
• https://www.slideshare.net/kstan2/tensorflow-on-android
• People knew a bit about the Android NN API before it was announced and released
• There was no information about TensorFlow Lite, at least to me, before it was released in November 2017
3. tf-lite and Android NN in Google I/O
• New TensorFlow runtime
• Optimized for mobile and embedded apps
• Runs TensorFlow models on device
• Leverages the Android NN API
• Soon to be open sourced
from the Google I/O 2017 video
4. Actual Android NN API
• Announced with Android 8.1 Preview 1
• Available to developers in the NDK
• yes, the NDK
• The Android Neural Networks API (NNAPI) is an Android C API designed for running computationally intensive operations for machine learning on mobile devices
• NNAPI is designed to provide a base layer of functionality for higher-level machine learning frameworks (such as TensorFlow Lite, Caffe2, or others) that build and train neural networks
• The API is available on all devices running Android 8.1 (API level 27) or higher
https://developer.android.com/ndk/images/nnapi/nnapi_architecture.png
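To make the "C API" point concrete, here is a minimal sketch (mine, not from the slides) of the NNAPI model/compilation/execution flow from <android/NeuralNetworks.h> on API level 27. The single-ADD graph, the 4-element shape, and the function RunAdd are made up for illustration; error handling is omitted.

// Minimal NNAPI flow (API level 27): model -> compilation -> execution.
// Builds a graph with one ADD of two 1-D float tensors (illustrative only).
#include <android/NeuralNetworks.h>

void RunAdd(const float* a, const float* b, float* out) {
  ANeuralNetworksModel* model = nullptr;
  ANeuralNetworksModel_create(&model);

  uint32_t dims[1] = {4};
  ANeuralNetworksOperandType tensor_type;
  tensor_type.type = ANEURALNETWORKS_TENSOR_FLOAT32;
  tensor_type.dimensionCount = 1;
  tensor_type.dimensions = dims;
  tensor_type.scale = 0.0f;
  tensor_type.zeroPoint = 0;

  ANeuralNetworksOperandType scalar_type;
  scalar_type.type = ANEURALNETWORKS_INT32;
  scalar_type.dimensionCount = 0;
  scalar_type.dimensions = nullptr;
  scalar_type.scale = 0.0f;
  scalar_type.zeroPoint = 0;

  ANeuralNetworksModel_addOperand(model, &tensor_type);  // operand 0: input a
  ANeuralNetworksModel_addOperand(model, &tensor_type);  // operand 1: input b
  ANeuralNetworksModel_addOperand(model, &scalar_type);  // operand 2: fused activation
  ANeuralNetworksModel_addOperand(model, &tensor_type);  // operand 3: output

  int32_t act = ANEURALNETWORKS_FUSED_NONE;
  ANeuralNetworksModel_setOperandValue(model, 2, &act, sizeof(act));

  uint32_t op_inputs[3] = {0, 1, 2};
  uint32_t op_outputs[1] = {3};
  ANeuralNetworksModel_addOperation(model, ANEURALNETWORKS_ADD,
                                    3, op_inputs, 1, op_outputs);

  uint32_t model_inputs[2] = {0, 1};
  uint32_t model_outputs[1] = {3};
  ANeuralNetworksModel_identifyInputsAndOutputs(model, 2, model_inputs,
                                                1, model_outputs);
  ANeuralNetworksModel_finish(model);

  ANeuralNetworksCompilation* compilation = nullptr;
  ANeuralNetworksCompilation_create(model, &compilation);
  ANeuralNetworksCompilation_finish(compilation);

  ANeuralNetworksExecution* execution = nullptr;
  ANeuralNetworksExecution_create(compilation, &execution);
  ANeuralNetworksExecution_setInput(execution, 0, nullptr, a, 4 * sizeof(float));
  ANeuralNetworksExecution_setInput(execution, 1, nullptr, b, 4 * sizeof(float));
  ANeuralNetworksExecution_setOutput(execution, 0, nullptr, out, 4 * sizeof(float));

  // Computation is asynchronous; wait on the returned event.
  ANeuralNetworksEvent* event = nullptr;
  ANeuralNetworksExecution_startCompute(execution, &event);
  ANeuralNetworksEvent_wait(event);

  ANeuralNetworksEvent_free(event);
  ANeuralNetworksExecution_free(execution);
  ANeuralNetworksCompilation_free(compilation);
  ANeuralNetworksModel_free(model);
}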
5. Android NN on Pixel 2
• Only the CPU fallback is available
• Actually, you can already see Android NN API-related code in AOSP after the Oreo MR1 (8.1) release
• user-level code, see https://android.googlesource.com/platform/frameworks/ml/+/oreo-mr1-release
• HAL, see https://android.googlesource.com/platform/hardware/interfaces/+/oreo-mr1-release/neuralnetworks/
6. TensorFlow Lite
• TensorFlow Lite is TensorFlow’s lightweight solution for mobile and embedded devices
• It enables on-device machine learning inference with low latency and a small binary size
• Low latency techniques: optimizing the kernels for mobile apps, pre-fused activations, and quantized kernels that allow smaller and faster (fixed-point math) models
• TensorFlow Lite also supports hardware acceleration with the Android Neural Networks API
https://www.tensorflow.org/mobile/tflite/
7. What does TensorFlow Lite contain?
• a set of core operators, both quantized and float, which have been tuned for mobile platforms
• pre-fused activations and biases to further enhance performance and quantized accuracy
• using custom operations in models is also supported
• a new model file format, based on FlatBuffers
• the primary difference is that FlatBuffers does not need a parsing/unpacking step to a secondary representation before you can access data (see the sketch after this list)
• the code footprint of FlatBuffers is an order of magnitude smaller than that of protocol buffers
• a new mobile-optimized interpreter
• key goals: keeping apps lean and fast
• a static graph ordering and a custom (less-dynamic) memory allocator to ensure minimal load, initialization, and execution latency
• an interface to the Android NN API, if available
https://www.tensorflow.org/mobile/tflite/
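To illustrate the "no parsing/unpacking" point above: with FlatBuffers you read the model in place, straight out of a buffer (e.g., an mmap'd .tflite file). A sketch, assuming the generated schema header under ${TF_ROOT}/tensorflow/contrib/lite/schema/; the accessor names come from the schema, the function InspectModel is mine:

// Sketch: FlatBuffers gives direct, in-place access; no deserialization pass.
#include <cstdio>
#include "tensorflow/contrib/lite/schema/schema_generated.h"

void InspectModel(const void* buffer) {  // e.g., an mmap'd .tflite file
  // GetModel is essentially a pointer adjustment, not a parse.
  const tflite::Model* model = tflite::GetModel(buffer);
  printf("schema version: %u\n", model->version());
  printf("subgraphs: %u\n", model->subgraphs()->size());
  printf("operator codes: %u\n", model->operator_codes()->size());
}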
8. why a new mobile-specific library?
• Innovation at the silicon layer is enabling new possibilities for hardware acceleration, and frameworks such as the Android Neural Networks API make it easy to leverage these
• Recent advances in real-time computer vision and spoken language understanding have led to mobile-optimized benchmark models being open sourced (e.g. MobileNets, SqueezeNet)
• Widely available smart appliances create new possibilities for on-device intelligence
• Interest in stronger user data privacy paradigms where user data does not need to leave the mobile device
• Ability to serve ‘offline’ use cases, where the device does not need to be connected to a network
https://www.tensorflow.org/mobile/tflite/
9. • A set of core operators, both quantized and float, many of which have been tuned for mobile platforms. These can be used to create and run custom models. Developers can also write their own custom operators and use them in models
• A new FlatBuffers-based model file format
• On-device interpreter with kernels optimized for faster execution on mobile
• TensorFlow converter to convert TensorFlow-trained models to the TensorFlow Lite format
• Smaller in size: TensorFlow Lite is smaller than 300KB when all supported operators are linked, and less than 200KB when using only the operators needed for InceptionV3 and MobileNet
• Pre-tested models
• Inception V3, MobileNet, On Device Smart Reply
• Quantized versions of the MobileNet model, which run faster than the non-quantized (float) version on CPU
• New Android demo app to illustrate the use of TensorFlow Lite with a quantized MobileNet model for object classification
• Java and C++ API support
https://www.tensorflow.org/mobile/tflite/
10. • Java API: A convenience wrapper around the C++ API on Android
• C++ API: Loads the TensorFlow Lite Model File and invokes the Interpreter. The same library is available on both Android and iOS
https://www.tensorflow.org/mobile/tflite/
11. • Let $TF_ROOT be the root of the TensorFlow tree
• source of tf-lite: ${TF_ROOT}/tensorflow/contrib/lite/
• https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/README.md
• examples
• two for Android, two for iOS
• APIs: ${TF_ROOT}/tensorflow/contrib/lite/g3doc/apis.md, https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/g3doc/apis.md
• no benchmark_model: well, there is one, https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/tools/benchmark_model.cc
• but it’s incomplete
• no command-line label_image (like https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/label_image)
12. • model: a .tflite model
• resolver: if there are no custom ops, the builtin op resolver is enough
• interpreter: we need it to compute the graph
• interpreter->AllocateTensors(): allocates stuff for you, e.g., input tensor(s)
• fill the input
• interpreter->Invoke(): run the graph
• process the output
// Load the model, build an interpreter with the builtin op resolver.
tflite::FlatBufferModel model(path_to_model);
tflite::ops::builtin::BuiltinOpResolver resolver;
std::unique_ptr<tflite::Interpreter> interpreter;
tflite::InterpreterBuilder(model, resolver)(&interpreter);
// Resize input tensors, if desired.
interpreter->AllocateTensors();
float* input = interpreter->typed_input_tensor<float>(0);
// Fill `input`.
interpreter->Invoke();
float* output = interpreter->typed_output_tensor<float>(0);
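The "resize input tensors, if desired" step can be made concrete with ResizeInputTensor() from interpreter.h. A sketch; the 1x224x224x3 shape is just an illustrative MobileNet-style input, not something the slides prescribe:

// Optional: override an input shape before allocating buffers.
int input_index = interpreter->inputs()[0];
interpreter->ResizeInputTensor(input_index, {1, 224, 224, 3});
interpreter->AllocateTensors();  // buffers must be (re)allocated after a resize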
13. beyond basic stuff
• https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/interpreter.h
• const char* GetInputName(int index): https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/interpreter.h#L157
• const char* GetOutputName(int index): https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/interpreter.h#L166
• int tensors_size(): https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/interpreter.h#L171
• TfLiteTensor* tensor(int tensor_index)
• int nodes_size(): https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/interpreter.h#L174
• const std::pair<TfLiteNode, TfLiteRegistration>* node_and_registration(int node_index)
• Yes, we should be able to enumerate/traverse tensors and nodes (see the sketch below)
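A sketch (mine, not from the slides) of such a traversal using the accessors listed above; the TfLiteTensor fields used (name, bytes) and TfLiteRegistration::builtin_code are assumed from context.h:

// Walk all tensors and nodes of a loaded interpreter.
#include <cstdio>
#include "tensorflow/contrib/lite/interpreter.h"

void Traverse(tflite::Interpreter* interpreter) {
  for (int i = 0; i < interpreter->tensors_size(); ++i) {
    TfLiteTensor* t = interpreter->tensor(i);
    printf("tensor %d: %s, %zu bytes\n", i,
           t->name ? t->name : "(unnamed)", t->bytes);
  }
  for (int i = 0; i < interpreter->nodes_size(); ++i) {
    const std::pair<TfLiteNode, TfLiteRegistration>* nr =
        interpreter->node_and_registration(i);
    printf("node %d: builtin_code %d\n", i, nr->second.builtin_code);
  }
}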
15. • logs of running label_image
• https://drive.google.com/file/d/11LAI_b1fVOM2GxOT_gOpOMpASaqe5m_U/view?usp=sharing
• builtin state dump function (usage sketched after this list)
• void PrintInterpreterState(Interpreter* interpreter): https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/optional_debug_tools.h#L25
• https://github.com/freedomtan/tensorflow/blob/label_image_tflite_pr/tensorflow/contrib/lite/examples/label_image/label_image.cc#L147
• converting TF operations to TF Lite operations is not trivial
• https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/g3doc/tf_ops_compatibility.md
• https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/nnapi_delegate.cc
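Calling the state dump is a one-liner. A sketch, assuming an interpreter built as on slide 12:

// Dump tensor/node state, e.g., right after AllocateTensors().
#include "tensorflow/contrib/lite/optional_debug_tools.h"

tflite::PrintInterpreterState(interpreter.get());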
16. speed of the quantized model
• It seems to be much better than the naive quantization we saw before
• On Nexus 9 (MobileNet 1.0/224)
• Quantized
• ./label_image -t 2: ~ 160 ms
• ./label_image -t 2 -c 100: ~ 60 ms
• Floating point
• ./label_image -t 2 -f 1 -m ./mobilenet_v1_1.0_224.tflite: ~ 300 ms
• ./label_image -t 2 -c 100 -f 1 -m ./mobilenet_v1_1.0_224.tflite: ~ 82 ms
• TFLiteCameraDemo: 130 - 180 ms
• Pixel 2
• TFLiteCameraDemo: ~ 100 ms
• I didn’t see a significant difference with or without the Android NN runtime
18. Fake Quantization
• How hard can it be? How much time is needed?
• Several pre-tested models are available
• https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/g3doc/models.md
• but only one of them (https://storage.googleapis.com/download.tensorflow.org/models/tflite/mobilenet_v1_224_android_quant_2017_11_08.zip) is a quantized one
• as we can guess from the related docs, retraining is kinda required to get the accuracy back
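For reference, the quantization scheme behind these models, as documented by gemmlowp, is the affine mapping real = scale × (quantized − zero_point); fake quantization simulates its rounding in float during training so the network can recover accuracy. A sketch of that round trip (the helper name FakeQuantize and the fixed uint8 range are my illustration):

// Affine quantization: real = scale * (quantized - zero_point).
// Fake quantization runs this quantize/dequantize round trip in float during
// training, so learned weights already tolerate the rounding error.
#include <algorithm>
#include <cmath>
#include <cstdint>

float FakeQuantize(float real, float scale, int32_t zero_point) {
  int32_t q = static_cast<int32_t>(std::round(real / scale)) + zero_point;
  q = std::min(255, std::max(0, q));  // clamp to the uint8 range
  return scale * (q - zero_point);    // dequantize back to float
}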
19. • BLAS part: eigen (http://eigen.tuxfamily.org/) and gemmlowp (https://github.com/google/gemmlowp)