A Peek into Google's Edge TPU

Koan-Sin Tan

freedom@computer.org

April 18th, 2019

Hsinchu Coding Serfs Meeting

Who Am I?
• An old programmer, learned to use “open
source” stuff on VAX-11/780 running 4.3BSD
before the term “open source” was coined

• TensorFlow Contributor

• Search “Koan-Sin” at https://github.com/tensorflow/tensorflow/releases

• PRs, https://github.com/tensorflow/tensorflow/pulls?utf8=%E2%9C%93&q=is%3Apr+author%3Afreedomtan+

• Contributing to TensorFlow is quite easy.
There are many typos :-)

• Interested in running NNs on edge devices, so I learned TFLite

• label_image for TFLite

Google Edge TPU
https://coral.withgoogle.com/products/

Google Edge TPU
• Announced at Google Cloud Next 2018 (July 2018)

• Available to general developers
right before TensorFlow Dev
Summit 2019 (Mar, 2019)

• USB: Coral Accelerator

• Dev Board: Coral Dev Board

• More are coming, e.g., PCI-E
Accelerator and SOM

• Supported framework: TFLite
https://coral.withgoogle.com/products/

• Updates released on April 11th, 2019

• Compiler: removed the restriction for specific architectures

• New TensorFlow Lite C++ API

• Updated Python API, mainly for multiple Edge TPUs

• Updated Mendel OS and Mendel Development Tool (MDT)

• Environmental Sensor Board, https://coral.withgoogle.com/products/environmental/

https://developers.googleblog.com/2019/04/updates-from-coral-new-compiler-and.html

https://coral.withgoogle.com/news/updates-04-2019/

A biology hobbyist on the Edge TPU team?
https://en.wikipedia.org/wiki/Coral https://en.wikipedia.org/wiki/Charles_Darwin
https://en.wikipedia.org/wiki/HMS_Beagle https://en.wikipedia.org/wiki/Gregor_Mendel

Coral USB Accelerator
• USB 3.1 (gen 1) port and cable (SuperSpeed, 5 Gb/s transfer speed)

• MobileNet V1 1.0 224 quantized: ~4.3 MiB; transfer time at 5 Gb/s:
  4.3 × 10^6 × 8 / (5 × 10^9) s ≈ 6.9 ms

• Recommended operating conditions

  Operating frequency    Max ambient temperature
  Default                35°C
  Maximum                25°C

• https://coral.withgoogle.com/tutorials/accelerator-datasheet/
• https://coral.withgoogle.com/tutorials/accelerator/
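
The model transfer-time estimate for the USB Accelerator can be checked with a few lines of Python. This is a back-of-the-envelope sketch: it assumes the raw 5 Gb/s SuperSpeed signalling rate with no protocol or encoding overhead, and the 480 Mb/s USB 2.0 High Speed rate is added here only for comparison (it is not on the slide).

```python
def transfer_time_ms(size_bytes: float, link_bits_per_s: float) -> float:
    """Lower bound on the time to push size_bytes over a link, in milliseconds."""
    return size_bytes * 8 / link_bits_per_s * 1e3


MODEL_BYTES = 4.3e6        # quantized MobileNet V1 1.0 224, figure from the slide
USB3_GEN1 = 5e9            # SuperSpeed raw signalling rate, bits/s
USB2_HIGH_SPEED = 480e6    # assumption added for comparison, bits/s

usb3_ms = transfer_time_ms(MODEL_BYTES, USB3_GEN1)
usb2_ms = transfer_time_ms(MODEL_BYTES, USB2_HIGH_SPEED)
print(f"USB 3.1 gen 1: {usb3_ms:.2f} ms")
print(f"USB 2.0:       {usb2_ms:.2f} ms")
```

Real transfers will be slower than these bounds, but the roughly tenfold gap between the two links is one way to read the "USB 2.0 hurts" caveat.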
• Software environment

• Linux computer with a USB Port

• Debian 6.0 or higher, or any
derivative thereof (such as Ubuntu
10.0+)

• System architecture of either x86_64
or ARM64 with ARMv8 instruction
set

• Some caveats

• USB 2.0 hurts

• With newer Ubuntu, you have to
modify the installation script

• actually, ARMv7 also works

Performance Setting for USB Accelerator
https://coral.withgoogle.com/tutorials/accelerator-datasheet/

Coral Dev Board
• Edge TPU Module (SOM)
◦ NXP i.MX 8M SOC (Quad-core
Cortex-A53, plus Cortex-M4F)
◦ Google Edge TPU ML accelerator
coprocessor
◦ Cryptographic coprocessor
◦ Wi-Fi 2x2 MIMO (802.11b/g/n/ac
2.4/5GHz)
◦ Bluetooth 4.1
◦ 8GB eMMC
◦ 1GB LPDDR4
• USB connections
◦ USB Type-C power port (5V DC)
◦ USB 3.0 Type-C OTG port
◦ USB 3.0 Type-A host port
◦ USB 2.0 Micro-B serial console port
• Audio connections
◦ 3.5mm audio jack (CTIA compliant)
◦ Digital PDM microphone (x2)
◦ 2.54mm 4-pin terminal for stereo speakers
• Video connections
◦ HDMI 2.0a (full size)
◦ 39-pin FFC connector for MIPI DSI
display (4-lane)
◦ 24-pin FFC connector for MIPI CSI-2
camera (4-lane)
• MicroSD card slot
• Gigabit Ethernet port
• 40-pin GPIO expansion header
• Supports Mendel Linux (derivative of Debian)
https://coral.withgoogle.com/tutorials/devboard-datasheet/
https://www.blog.google/products/google-cloud/bringing-intelligence-to-the-edge-with-cloud-iot/

Mendel Linux?
• https://pypi.org/project/mendel-development-tool/

• https://coral.googlesource.com/mdt.git

• 404, several weeks ago

• now it’s there

• actually, there is a lot more information at https://coral.googlesource.com/, let’s look at it later
https://pypi.org/project/mendel-development-tool/

Mendel Linux
• It’s Debian-based, so the apt tools can tell us many things

• And take a look at /etc/apt/sources.list. Yup, it’s there

• https://packages.cloud.google.com/apt/dists/mendel-bsp-enterprise-beaker/main

• https://packages.cloud.google.com/apt/dists/mendel-beaker/main

Mendel Linux
• https://packages.cloud.google.com/apt/dists/

mendel-animal
mendel-beaker
mendel-bsp-enterprise-animal
mendel-bsp-enterprise-beaker
mendel-bsp-enterprise-chef
mendel-bsp-enterprise-unstable
mendel-chef
mendel-chef-unstable
mendel-core-animal
mendel-core-beaker
mendel-core-chef
mendel-core-unstable
mendel-unstable
mendel-upstream-stretch

Performance?
https://coral.withgoogle.com/tutorials/edgetpu-faq/

Let’s start from the first demo
• USB getting started guide:

• https://coral.withgoogle.com/tutorials/accelerator/
• BasicEngine->{ClassificationEngine, DetectionEngine}, ImprintingEngine

• BasicEngine is a single line:

• from edgetpu.swig.edgetpu_cpp_wrapper import BasicEngine
• SWIG: yes, the more-than-20-year-old SWIG

• _edgetpu_cpp_wrapper.so
What is in the Engines
• BasicEngine

• input and output related

• Classification

• still I/O related

• classification specific: resizing the input image and what to output

[Class diagram: ClassificationEngine and DetectionEngine derive from BasicEngine; ImprintingEngine is separate]

ClassificationEngine
    ClassifyWithImage(img, threshold=0.1, top_k=3, resample=Image.NEAREST)
    ClassifyWithInputTensor(input_tensor, threshold=0.0, top_k=3)
    __dict__
    …

BasicEngine
    RunInference(input)
    get_input_tensor_shape()
    get_all_output_tensors_sizes()
    get_num_of_output_tensors()
    get_output_tensor_size()
    required_input_array_size()
    total_output_array_size()
    model_path()
    get_raw_output()
    get_inference_time()
    device_path()
    __dict__
    …
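
The classification-specific "what to output" step (apply a score threshold, then keep the top_k results) can be sketched in plain Python. This only illustrates the idea behind the threshold and top_k parameters of ClassifyWithInputTensor(); the helper name top_k_above_threshold and the sample scores are made up for this example, and it is not the library's actual code.

```python
import heapq


def top_k_above_threshold(scores, threshold=0.0, top_k=3):
    """Return up to top_k (label_id, score) pairs with score >= threshold, best first."""
    candidates = [(i, s) for i, s in enumerate(scores) if s >= threshold]
    return heapq.nlargest(top_k, candidates, key=lambda pair: pair[1])


# Hypothetical output scores for a five-class model.
scores = [0.02, 0.61, 0.05, 0.30, 0.02]
results = top_k_above_threshold(scores, threshold=0.1, top_k=3)
print(results)
```

With a threshold of 0.1 only two of the five scores survive, so fewer than top_k results come back.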

Performance!
• no existing way to reproduce those numbers

• classify_image.py uses ClassificationEngine.ClassifyWithImage()

• ClassifyWithImage() -> ClassifyWithInputTensor() -> RunInference()

• preprocessing: image resize time

• post-processing: top_k and finding labels/classes

• BasicEngine.get_inference_time() returns something I cannot understand

• modified label_image.py (and object_detection) for TFLite

• the results are quite close to those numbers
https://github.com/freedomtan/edge_tpu_python_scripts
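
One way to see where the time goes is to time pre-processing, inference, and post-processing separately. The sketch below wraps three stand-in functions with time.perf_counter(); preprocess, infer, and postprocess are hypothetical placeholders (not Edge TPU API calls) that would be replaced by the image resize, RunInference(), and the top_k/label lookup.

```python
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1e3


# Hypothetical stand-ins for the three stages inside ClassifyWithImage().
def preprocess(image):
    return image[::2]                      # stands in for the image resize


def infer(tensor):
    return [sum(tensor), len(tensor)]      # stands in for RunInference()


def postprocess(output):
    return max(range(len(output)), key=output.__getitem__)  # stands in for top_k


image = list(range(256)) * 100
tensor, t_pre = timed(preprocess, image)
output, t_inf = timed(infer, tensor)
label, t_post = timed(postprocess, output)
print(f"pre {t_pre:.3f} ms, inference {t_inf:.3f} ms, post {t_post:.3f} ms")
```

Reporting the three numbers separately makes it clearer whether a measured time includes more than the inference itself.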

numbers in a git repo
• numbers and scripts

model                                             time (ms)
inception_v1_224_quant.tflite                     412.79
inception_v1_224_quant_edgetpu.tflite             4.00
inception_v4_299_quant.tflite                     3328.34
inception_v4_299_quant_edgetpu.tflite             100.33
mobilenet_ssd_v1_coco_quant_postprocess.tflite    391.34
…_postprocess_edgetpu.tflite                      14.83
(name missing)                                    355.48
(name missing)                                    16.92
mobilenet_ssd_v2_face_quant…                      369.02
mobilenet_ssd_v2_face_quant…                      7.78
mobilenet_v1_1.0_224_quant.tflite                 184.99
mobilenet_v1_1.0_224_quant_edgetpu.tflite         2.22
mobilenet_v2_1.0_224_quant.tflite                 160.94
mobilenet_v2_1.0_224_quant_edgetpu.tflite         2.56
• benchmarks/basic_engine_benchmarks.py[Added - diff]
• benchmarks/classification_benchmarks.py[Added - diff]
• benchmarks/detection_benchmarks.py[Added - diff]
• benchmarks/imprinting_benchmarks.py[Added - diff]
• benchmarks/multiple_tpus_performance_analysis.py[Added - diff]
• benchmarks/reference/basic_engine_reference_aarch64.csv[Added - diff]
• benchmarks/reference/basic_engine_reference_rp3b+.csv[Added - diff]
• benchmarks/reference/basic_engine_reference_rp3b.csv[Added - diff]
• benchmarks/reference/basic_engine_reference_x86_64.csv[Added - diff]
• benchmarks/reference/classification_reference_aarch64.csv[Added - diff]
• benchmarks/reference/classification_reference_rp3b+.csv[Added - diff]
• benchmarks/reference/classification_reference_rp3b.csv[Added - diff]
• benchmarks/reference/classification_reference_x86_64.csv[Added - diff]
• benchmarks/reference/detection_reference_aarch64.csv[Added - diff]
• benchmarks/reference/detection_reference_rp3b+.csv[Added - diff]
• benchmarks/reference/detection_reference_rp3b.csv[Added - diff]
• benchmarks/reference/detection_reference_x86_64.csv[Added - diff]
• benchmarks/reference/imprinting_reference_aarch64.csv[Added - diff]
• benchmarks/reference/imprinting_reference_rp3b+.csv[Added - diff]
• benchmarks/reference/imprinting_reference_rp3b.csv[Added - diff]
• benchmarks/reference/imprinting_reference_x86_64.csv[Added - diff]
https://coral.googlesource.com/edgetpu/+/refs/heads/release-chef
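
The paired timings on this slide (a plain quantized .tflite model and its _edgetpu.tflite counterpart) can be turned into speedup factors. The sketch below uses only the four pairs whose model names are complete in the slide text; treating each pair as "without Edge TPU" versus "with Edge TPU" is an assumption based on the file names.

```python
# (plain quantized model, Edge TPU-compiled model) timings copied from the slide.
timings = {
    "inception_v1_224_quant": (412.79, 4.00),
    "inception_v4_299_quant": (3328.34, 100.33),
    "mobilenet_v1_1.0_224_quant": (184.99, 2.22),
    "mobilenet_v2_1.0_224_quant": (160.94, 2.56),
}

# Speedup = time of the plain model divided by time of the compiled model.
speedups = {name: plain / compiled for name, (plain, compiled) in timings.items()}

for name, factor in sorted(speedups.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:28s} {factor:6.1f}x")
```

The ratio is unitless, so it does not depend on which time unit the table uses.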

Comparing with NCS 2
Inference time (ms)
model                    Coral: Edge TPU   NCS 2 (fp16)   iPhone Xs Max (Neural Engine accelerated, fp16)
MobileNet V1 1.0/224     2.74              12.11          1.74
MobileNet V2 1.0/224     2.87              14.87          2.15
Inception V3             43.27             52.25          8.65
ResNet 50                42.41             33.1           6.91
SqueezeNet 1.1           1.90              3.99           1.75
MobileNet V1 0.25/128    1.11              4.08           1.16
SSD MobileNet V1 COCO    10.05             23.53          -
SSD MobileNet V2 COCO    12.48             39.11          -
Mobilenet V1/V2 and SSD Mobilenet V1/V2 are quite good
• Edge TPU: my scripts, https://github.com/freedomtan/edge_tpu_python_scripts
• NCS 2: ./benchmark_app -d MYRIAD -niter 50 -nireq 10 ..
• iPhone Xs Max: my CoreML benchmark, https://github.com/freedomtan/coremlbenchmark
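
To see at a glance where each accelerator is ahead, the Edge TPU and NCS 2 rows of the comparison can be compared model by model. The numbers below are copied from the slide; the variable names are made up for this sketch, and a ratio above 1 means the Edge TPU was faster.

```python
models = [
    "MobileNet V1 1.0/224", "MobileNet V2 1.0/224", "Inception V3", "ResNet 50",
    "SqueezeNet 1.1", "MobileNet V1 0.25/128", "SSD MobileNet V1 COCO",
    "SSD MobileNet V2 COCO",
]
edge_tpu = [2.74, 2.87, 43.27, 42.41, 1.90, 1.11, 10.05, 12.48]
ncs2 = [12.11, 14.87, 52.25, 33.1, 3.99, 4.08, 23.53, 39.11]

# NCS 2 time divided by Edge TPU time, per model.
ratios = {m: n / e for m, e, n in zip(models, edge_tpu, ncs2)}
ncs2_wins = [m for m, r in ratios.items() if r < 1.0]

for model, ratio in ratios.items():
    print(f"{model:24s} NCS 2 / Edge TPU = {ratio:.2f}")
print("NCS 2 faster on:", ncs2_wins)
```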

Mobilenet V1 on EdgeTPU and NCS2
[Figure: “Mobilenet V1: Edge TPU and NCS2”; axis: time (ms), 0 to 14; series: ncs2 and coral mobilenet_v1 0.25, 0.5, 0.75, 1.0]

inference time (ms)        size=128x128   size=160x160   size=192x192   size=224x224
ncs2 mobilenet_v1_0.25     3.83           3.95           4.06           4.4
ncs2 mobilenet_v1_0.5      4.98           4.86           5.51           6.51
ncs2 mobilenet_v1_0.75     6.04           6.67           7.93           9.4
ncs2 mobilenet_v1_1.0      7.43           8.68           10.13          12.2
coral mobilenet_v1_0.25    1.07           1.24           1.30           1.47
coral mobilenet_v1_0.5     1.16           1.40           1.53           1.95
coral mobilenet_v1_0.75    1.29           1.70           1.80           2.16
coral mobilenet_v1_1.0     1.50           1.95           2.15           2.85

Edge TPU’s canned model
• It’s said that the Edge TPU supports TFLite

• well, it does not run TFLite models directly
https://www.tensorflow.org/lite/images/convert/workflow.svg
https://coral.withgoogle.com/docs/edgetpu/models-intro/

Edge TPU’s canned model
• What do you mean by a single custom op?
The compiler creates a single custom op for all Edge TPU compatible ops; anything else stays the same
https://coral.withgoogle.com/docs/edgetpu/models-intro/
[Figure: MobileNet V1 graph; nodes: input, edgetpu-custom-op, Softmax; tensor shapes: 1×224×224×3, 1×1001]
[Figure: SSD MobileNet V1 graph; nodes: normalized_input_image_tensor, edgetpu-custom-op, TFLite_Detection_PostProcess with outputs TFLite_Detection_PostProcess, :1, :2, :3; tensor shapes: 1×300×300×3, 1×1917×91, 1917×4, 1×10×4, 1×10, 1×10, 1]
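
To make the "single custom op" idea concrete, here is a toy sketch of the partitioning: the ops the accelerator can run are folded into one edgetpu-custom-op node and everything after them is left untouched. The function name, the op lists, and the SUPPORTED set are illustrative assumptions, not the Edge TPU compiler's actual logic or op coverage; the sketch also assumes the compatible ops form one contiguous run at the start of the graph.

```python
def fuse_supported_prefix(ops, supported):
    """Replace the leading run of supported ops with one 'edgetpu-custom-op' node."""
    n = 0
    while n < len(ops) and ops[n] in supported:
        n += 1
    fused = ["edgetpu-custom-op"] if n else []
    return fused + ops[n:]


# Assumed set of accelerator-compatible ops, for illustration only.
SUPPORTED = {"Conv2D", "DepthwiseConv2D", "AveragePool2D", "Reshape"}

# Simplified, hypothetical op sequences for a classifier and a detector.
classifier = ["Conv2D", "DepthwiseConv2D", "Conv2D", "AveragePool2D",
              "Conv2D", "Reshape", "Softmax"]
detector = ["Conv2D", "DepthwiseConv2D", "Conv2D", "Reshape",
            "TFLite_Detection_PostProcess"]

print(fuse_supported_prefix(classifier, SUPPORTED))
print(fuse_supported_prefix(detector, SUPPORTED))
```

The resulting two-node graphs mirror the shape of the slide's diagrams: one custom op followed by whatever stays on the CPU.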

Beyond Python
• _edgetpu_cpp_wrapper.so

• TensorFlow Lite runtime and others

• let’s take a look at _wrap_new_BasicEngine: aiy::BasicEngine::BasicEngine()
• aiy::BasicEngine::RunInference() -> aiy::BasicEngine::RunInferenceHelper() -> tflite::Interpreter::Invoke()
• unresolved edgetpu::EdgeTpuManager::GetSingleton()

• libedgetpu.so

• OpenSSL, Edge TPU context, communicating with the Edge TPU via USB or PCI

• edgetpu::EdgeTpuManager::GetSingleton()
• platforms::darwinn::tflite::EdgeTpuManagerDirect::GetSingleton()

Edge TPU C++ API
• Released on April 11th, 2019

• binaries for x86_64, aarch64, and armeabi-v7a

• a simple header file

• two simple examples

• some doc at https://coral.withgoogle.com/docs/edgetpu/api-cpp/

• Native build on Dev Board

• the Dev Board has a quad-core Cortex-A53, so surely we can build code on it

• a small aarch64 patch https://github.com/tensorflow/tensorflow/commit/5520a9d82e5,
https://github.com/tensorflow/tensorflow/pull/16175

• https://github.com/freedomtan/edgetpu-native, label_image for tflite ported

Edge TPU C++ API
• class EdgeTpuManager
• static EdgeTpuManager* GetSingleton();
• 3 different NewEdgeTpuContext() overloads, each returning std::unique_ptr<EdgeTpuContext>
• std::vector<DeviceEnumerationRecord> EnumerateEdgeTpu()
• TfLiteStatus SetVerbosity(int verbosity)
• std::string Version()
• let’s take a look at ‘-v’ logs

• https://drive.google.com/drive/folders/1-MhGIgWHuhbKM6XrhPqyuLJDzoLD1t2g?usp=sharing

• in short, USB ones seem to have more overhead
https://github.com/freedomtan/edgetpu-native/blob/label_image/libedgetpu/edgetpu.h#L110-L158

Imprinting Engine
[Figure: model graph; input (1×224×224×3) -> edgetpu-custom-op (1×1×1×1024) -> L2Normalization (1×1×1×1024) -> Conv2D, weights 5×1×1×1024, bias 5 (1×1×1×5) -> Reshape (1×5) -> Softmax (1×5) -> Output]
• Yes, let’s check what it is

• The Imprinting Engine implements a low-shot learning technique
called ‘Imprinted Weights’ [1][2]

• Can be used to retrain classifiers on-device (either on USB
Accelerator or Dev Board), no back-propagation gradient involved.

• Why?

• Transfer-learning happens on-device, at near-realtime speed.

• You don't need to recompile the model.

• Limitations

• Training data size is limited to a max of 200 images per class.

• It is most suitable for datasets that have a small intra-class variation.

• The last fully-connected layer runs on the CPU, not the Edge TPU. So it will be slightly less efficient than running a pre-compiled model on the Edge TPU.

• if you are interested in it, check the paper and aiy::learn::imprinting::ImprintingEngine::Train(unsigned char const*, int, int)
[1] https://coral.withgoogle.com/docs/edgetpu/retrain-classification-ondevice/

[2] https://arxiv.org/abs/1712.07136
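
The core of imprinted weights [2] fits in a few lines: L2-normalize each embedding of the new class, average them, and L2-normalize the mean; the result becomes that class's weight vector in the last layer, with no gradient step involved. The sketch below uses plain Python lists and tiny 3-dimensional embeddings for readability; the function names are illustrative, it assumes no embedding is all zeros, and it is not the ImprintingEngine implementation.

```python
import math


def l2_normalize(vec):
    """Scale vec to unit Euclidean length (vec must not be all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def imprint_weights(embeddings):
    """Weight vector for a new class: normalized mean of normalized embeddings."""
    normalized = [l2_normalize(e) for e in embeddings]
    dim = len(normalized[0])
    mean = [sum(e[i] for e in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def score(weights, embedding):
    """Cosine similarity between a class weight vector and a query embedding."""
    return sum(w * x for w, x in zip(weights, l2_normalize(embedding)))


# Two made-up training embeddings for one new class.
class_a = imprint_weights([[3.0, 0.0, 0.0], [0.0, 4.0, 0.0]])
print(class_a)
print(score(class_a, [1.0, 1.0, 0.0]))
```

Because both the weights and the query are unit length, the score is a cosine similarity, which is why a query pointing the same way as the class mean scores close to 1.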
[Figure: model graph; nodes: input, edgetpu-custom-op, AvgPool; tensor shapes: 1×224×224×3, 1×1×1×1024]

PCIe device?
• it’s Linux

• `uname -a`: Linux hopeful-nexus 4.9.51-imx #1 SMP PREEMPT Thu Jan 31 01:58:26 UTC 2019 aarch64 GNU/Linux

• there is /proc/config.gz

• $ zcat /proc/config.gz | grep -i edge
• CONFIG_SND_GOOGLE_EDGETPU_CARD=y
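
The same check can be scripted. This is a minimal Python sketch that mirrors zcat /proc/config.gz | grep -i edge; it assumes the running kernel exposes /proc/config.gz, and the helper name grep_kernel_config is made up for this example.

```python
import gzip
import os


def grep_kernel_config(pattern, path="/proc/config.gz"):
    """Return config lines containing pattern (case-insensitive substring match)."""
    needle = pattern.lower()
    with gzip.open(path, "rt") as config:
        return [line.rstrip("\n") for line in config if needle in line.lower()]


# Only query the live kernel config when the file is actually present.
if os.path.exists("/proc/config.gz"):
    for line in grep_kernel_config("edge"):
        print(line)
```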

PCIe Device
• apex driver is in gasket
• https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/staging/gasket
• It was already upstreamed last year
• https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/drivers/staging/gasket/apex_driver.c

Global Unichip Corp
USB Vendor id 0x1a6e = “Global Unichip Corp”
PCI Vendor id 0x1ac1 = “Global Unichip Corp”
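
On Linux, a USB vendor ID can be looked up in sysfs without extra tools. The sketch below lists the devices under /sys/bus/usb/devices whose idVendor file reads 1a6e; the function name and the optional root argument (there so the code can be pointed at a fake directory tree) are choices made for this example.

```python
import glob
import os


def find_usb_devices(vendor_id, root="/sys/bus/usb/devices"):
    """Return sysfs device directories whose idVendor matches vendor_id (hex string)."""
    matches = []
    for vendor_file in sorted(glob.glob(os.path.join(root, "*", "idVendor"))):
        with open(vendor_file) as f:
            if f.read().strip().lower() == vendor_id.lower():
                matches.append(os.path.dirname(vendor_file))
    return matches


# Prints an empty list when no matching device (or no sysfs) is present.
print(find_usb_devices("1a6e"))
```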

USB Accelerator opened
https://twitter.com/generuso/status/1111733195244998656

MCU on USB Accelerator
https://www.seeedstudio.com/Coral-USB-Accelerator-p-2899.html

Power Consumption of the
USB Accelerator
• 4.94 V × 0.18 A ≈ 0.9 W

• running Mobilenet-SSD
https://twitter.com/exsiva/status/1108692847719407616

Architecture of Edge TPU?
• Nope, I didn’t read it. Just for your reference

• https://patents.google.com/patent/US20190050717A1/

Concluding Remarks
• Edge TPU is quite good for small models that you can convert to canned ones

• Quantized UINT8

• not so good for some common larger models, e.g., Inception V3 and ResNet 50

• your USB and CPU could be problems

• on-device re-training looks promising

• NCS 2 supports many more models for now

• How about NVIDIA Jetson Nano? Dunno, let’s wait and see. I don’t believe GPUs will win in the long run.
