TFLite NNAPI and GPU Delegates
TensorFlow is the most popular machine learning framework nowadays. TensorFlow Lite (TFLite), open sourced in late 2017, is TensorFlow’s runtime designed for mobile devices, especially Android phones. TFLite is getting more and more mature. Among the most interesting components introduced recently are its GPU delegate and the new NNAPI delegate. The GPU delegate uses OpenGL ES compute shaders on Android and Metal shaders on iOS. The original NNAPI delegate was an all-or-nothing design: if one of the ops in the compute graph was not supported by NNAPI, the whole graph was not delegated. The new one is a per-op design: when an op in a graph is not supported by NNAPI, that op automatically falls back to the CPU runtime. I’ll give a quick review of TFLite and its interpreter, then walk the audience through example usage of the two delegates and the important parts of their source code.

  1. 1. TFLite NNAPI and GPU Delegates Koan-Sin Tan freedom@computer.org Aug 18th, 2019 COSCUP 2019, Taipei, Taiwan
  2. 2. • disclaimer: Opinions Are My Own • feel free to interrupt me if you have any questions • questions in English, Taiwanese, and Mandarin are fine • note that I’m going to skip memory-related code in this talk because of the time constraint. Memory management, including locality and zero-copy, is always a crucial part of high-performance computing
  3. 3. who i am • Used open source before the term “open source” was coined • A software guy who learned to use Unix and open source software on a VAX-11/780 running 4.3BSD • Used to be a programming language junkie • Worked on various system software, e.g., CPU scheduling and power management of non-CPU components • Recently, working on NN performance on edge devices • Contributed from time to time to TensorFlow Lite • started the command line label_image for TFLite https://github.com/tensorflow/tensorflow/releases/tag/v2.0.0-alpha0 http://gunkies.org/w/images/c/c1/DEC-VAX-11-780.jpg
  4. 4. Delegation • Delegation: one of the commonly used old mechanisms mentioned in the GoF book • presumably, you know this well already • in case no, delegate definitions from dictionaries work figure from GoF, https://learning.oreilly.com/library/view/design-patterns-elements/0201633612/ch01.html#ch01lev3sec4
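The delegation pattern the slide refers to can be sketched in a few lines of C++. This is a generic GoF-style illustration with made-up names (`Window`, `AreaDelegate`), not anything from the TFLite code base:

```cpp
#include <memory>
#include <utility>

// Delegation as in GoF: the delegator forwards a request to a helper
// object instead of implementing the behavior itself.
struct AreaDelegate {
  virtual ~AreaDelegate() = default;
  virtual double Area() const = 0;
};

struct RectangleDelegate : AreaDelegate {
  double w, h;
  RectangleDelegate(double w, double h) : w(w), h(h) {}
  double Area() const override { return w * h; }
};

// Window does not compute its own area; it delegates to whatever
// AreaDelegate it was given, so the behavior can be swapped at runtime.
class Window {
 public:
  explicit Window(std::unique_ptr<AreaDelegate> d) : delegate_(std::move(d)) {}
  double Area() const { return delegate_->Area(); }

 private:
  std::unique_ptr<AreaDelegate> delegate_;
};
```

Swapping in a different `AreaDelegate` subclass changes `Window`'s behavior without touching `Window` itself — the same idea TFLite uses to hand graph execution to another executor.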
  5. 5. So, what is a TFLite delegate? • “A TensorFlow Lite delegate is a way to delegate part or all of graph execution to another executor.” • Why delegates? • running computation-intensive NN models on mobile devices is resource demanding for mobile CPUs; processing power and energy consumption could be problems • and matrix multiplication, which is the core of convolution and fully-connected ops, is highly parallel • Thus, some devices have hardware accelerators, such as GPUs or DSPs, that provide better performance and higher energy efficiency thru Android NNAPI • To use NNAPI, TFLite has an NNAPI delegate • Why I want to share what I know • used TFLite, contributed some code, e.g., label_image for TFLite • wrote quick-and-dirty TFLite GPU delegate benchmarks https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/g3doc/performance/delegates.md
  6. 6. What is TFLite • A lightweight inference engine • originally for Android and similar platforms; extended to micro-controllers (e.g., the ARM Cortex-M series) • Interpreter-based (what other choices do they have?) • ops are organized as a directed acyclic graph (DAG) • execute / interpret ops one by one if no delegates are involved https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/core/subgraph.cc#L734-L798
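The "interpret ops one by one" loop can be modeled with a toy structure. The names here (`ToyNode`, `ToyInterpreter`) are illustrative; the real loop lives in `Subgraph::Invoke()` in subgraph.cc and walks an execution plan of already topologically ordered nodes:

```cpp
#include <functional>
#include <vector>

// A toy model of TFLite's execution: nodes are already in topologically
// sorted order, and Invoke() just walks them one by one, each op reading
// and writing shared tensor storage.
struct ToyNode {
  std::function<void(std::vector<float>&)> invoke;  // mutates the tensors
};

struct ToyInterpreter {
  std::vector<float> tensors;
  std::vector<ToyNode> execution_plan;

  void Invoke() {
    for (auto& node : execution_plan) node.invoke(tensors);  // one by one
  }
};
```

A delegate changes this picture only in that a whole run of nodes is collapsed into a single node whose `invoke` hands the work to the accelerator.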
  7. 7. TfLiteContext • TfLiteContext: reporting facilities and access to global objects, including all the tensors • TfLiteNode: a single node or operation • TfLiteRegistration: the implementation of a conceptual operation https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L411-L485 ResizeTensor() ReportError() AddTensors() GetNodeAndRegistration() ReplaceNodeSubsetsWithDelegateKernels GetExternalContext() SetExternalContext() … tensors_size tensors impl_ recommended_num_threads allow_fp32_relax_to_fp16 profiler … TfLiteContext
  8. 8. TfLiteNode • TfLiteContext: reporting facilities and access to global objects, including all the tensors • TfLiteNode: a single node or operation • TfLiteRegistration: the implementation of a conceptual operation https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L377-L409 inputs outputs intermediates temporaries user_data builtin_data custom_initial_data custom_initial_data_size delegate … TfLiteNode
  9. 9. TfLiteRegistration • TfLiteContext: reporting facilities and access to global objects, including all the tensors • TfLiteNode: a single node or operation • TfLiteRegistration: the implementation of a conceptual operation https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L487-L544 init() free() prepare() invoke() profiling_string() … builtin_code custom_name version … TfLiteRegistration
  10. 10. To know more • Reading [1][2] and creating a custom op will help you understand TfLiteRegistration, TfLiteNode, and TfLiteContext more deeply [1] https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/g3doc/guide/inference.md#write-a-custom-operator [2] https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/g3doc/guide/ops_custom.md
  11. 11. TfLiteDelegate: the interface • In case you didn’t notice it yet, TFLite is mainly written in C++ • C API for FFI from other high-level languages • I hacked a Smalltalk one • many classes are structs with no member functions so that they can be used from the C API easily https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L563-L602 Prepare() CopyFromBufferHandle() CopyToBufferHandle() FreeBufferHandler() … data_ flags … TfLiteDelegate
  12. 12. How do TFLite delegates work? • Let's say we have a simple model graph such as the following: • Let's assume that there is a delegate "MyDelegate," which has a faster implementation for Conv2D and Mean operations. The resulting main graph will be updated to look like the one below. https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/g3doc/performance/delegates.md
  13. 13. What does a real model look like? • With the NNAPI delegate rewrite from Nov 2018, a subgraph delegated to an “accelerator” shows up as a single op (named TfLiteNnapiDelegate) in TFLite now • subgraph • all-or-nothing —> per op [figures: two MobileNet V1 graph visualizations — the original per-op graph (Conv2D / DepthwiseConv2D / AveragePool2D / Squeeze / Softmax, from a 1×224×224×3 input to a 1×1001 output) vs. the delegated graph collapsed into one TfLiteNnapiDelegate node] http://localhost:8080/, http://localhost:8090/
  14. 14. delegates in TFLite • NNAPI delegate • mainly for Android • GPU delegate: NNAPI, which was introduced in Android O MR1 (late 2017), is not popular (yet) • GL ES compute shader on Android • Metal shader on iOS • FlexDelegate: eager mode to run some ops • useful when not all ops are supported by TFLite or accelerators (thru something like the NNAPI or GPU delegate) • not in the TensorFlow repo: EdgeTPU delegate
  15. 15. NNAPI-enabled devices ~ 25.8% around May 7, 2019 https://developer.android.com/about/dashboards
  16. 16. GL ES compute shader capable devices ~ 50% https://developer.android.com/about/dashboards
  17. 17. Android NN API • Announced/published with Android 8.1 Preview 1 • Available to developers in the NDK • yes, the NDK • The Android Neural Networks API (NNAPI) is an Android C API designed for running computationally intensive operations for machine learning on mobile devices • NNAPI is designed to provide a base layer of functionality for higher-level machine learning frameworks (such as TensorFlow Lite, Caffe2, or others) that build and train neural networks • The API is available on all devices running Android 8.1 (API level 27) or higher https://developer.android.com/ndk/images/nnapi/nnapi_architecture.png
  18. 18. So, what is a delegate supposed to implement? • Understanding how to add a delegate helps • define a kernel node, which means implementing TfLiteRegistration • create an instance of TfLiteDelegate, then register the kernel node in Prepare()

typedef struct TfLiteDelegate {
  void* data_;
  TfLiteStatus (*Prepare)(TfLiteContext* context, struct TfLiteDelegate* delegate);
  TfLiteStatus (*CopyFromBufferHandle)(TfLiteContext* context, struct TfLiteDelegate* delegate, TfLiteBufferHandle buffer_handle, TfLiteTensor* tensor);
  TfLiteStatus (*CopyToBufferHandle)(TfLiteContext* context, struct TfLiteDelegate* delegate, TfLiteBufferHandle buffer_handle, TfLiteTensor* tensor);
  void (*FreeBufferHandle)(TfLiteContext* context, struct TfLiteDelegate* delegate, TfLiteBufferHandle* handle);
  int64_t flags;
} TfLiteDelegate;

typedef struct _TfLiteRegistration {
  void* (*init)(TfLiteContext* context, const char* buffer, size_t length);
  void (*free)(TfLiteContext* context, void* buffer);
  TfLiteStatus (*prepare)(TfLiteContext* context, TfLiteNode* node);
  TfLiteStatus (*invoke)(TfLiteContext* context, TfLiteNode* node);
  const char* (*profiling_string)(const TfLiteContext* context, const TfLiteNode* node);
  int32_t builtin_code;
  const char* custom_name;
  int version;
} TfLiteRegistration;
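To make the wiring concrete, here is a self-contained toy version. The `Mock*` structs are heavily simplified stand-ins I made up for illustration — not the real TFLite headers — but the shape is the same: the delegate's `Prepare()` hands the runtime a registration whose `invoke()` runs the delegated subgraph as if it were one op:

```cpp
#include <string>

struct MockContext;
struct MockNode { void* user_data = nullptr; };

// Simplified stand-in for TfLiteRegistration: the kernel the delegate
// installs for the subgraph it claims.
struct MockRegistration {
  int (*invoke)(MockContext*, MockNode*) = nullptr;
  std::string custom_name;
};

// Simplified stand-in for TfLiteContext: just a slot where the runtime
// keeps the installed delegate kernel.
struct MockContext {
  MockRegistration delegate_kernel;
};

// Simplified stand-in for TfLiteDelegate.
struct MockDelegate {
  int (*Prepare)(MockContext*, MockDelegate*) = nullptr;
};

int MyDelegateInvoke(MockContext*, MockNode*) {
  return 7;  // pretend we ran Conv2D + Mean on an accelerator
}

int MyDelegatePrepare(MockContext* ctx, MockDelegate*) {
  MockRegistration reg;
  reg.invoke = MyDelegateInvoke;
  reg.custom_name = "MyDelegate";
  // The real code registers reg via
  // context->ReplaceNodeSubsetsWithDelegateKernels(); here we just store it.
  ctx->delegate_kernel = reg;
  return 0;  // kTfLiteOk
}
```

After `Prepare()` runs, the interpreter no longer sees the individual ops of the claimed subgraph — it sees one node whose `invoke()` is the delegate's.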
  19. 19. NNAPI delegate • C++ code: instead of a C-style one • derived from TfLiteDelegate • Some private data structures • extra member functions corresponding to the private data structures https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.h#L29-L161 Prepare() CopyFromBufferHandle() CopyToBufferHandle() FreeBufferHandler() … data_ flags … TfLiteDelegate Prepare() CopyFromBufferHandle() CopyToBufferHandle() FreeBufferHandler() GetOptions() RegisterNnapiMemory() GetTensorMemoryMap() … data_ flags acceleration_name (options) (memory_registration) … StatefulNnApiDelegate
  20. 20. data • execution_preference • power/perf tradeoff: not widely supported as far as I can tell • accelerator_name: e.g., “fallback” and “hvx” • cache_dir • model_token • tensor_memory_map: MemoryRegistration struct Data { // Preferred Power/perf trade-off. Options::ExecutionPreference execution_preference; // Selected NNAPI accelerator name. std::string accelerator_name; // The cache dir for NNAPI model. std::string cache_dir; // The unique token string for NNAPI model. std::string model_token; // Tensor to ANeuralNetworksMemory mapping. std::vector<MemoryRegistration> tensor_memory_map; }; // Encapsulates all fields related to memory registration for internal // bookkeeping only. struct MemoryRegistration { ANeuralNetworksMemory* memory; CopyToHostTensorFnPtr callback; void* callback_context; };
  21. 21. TfLiteRegistration for nnapi_delegate_kernel • init() • free() • prepare() • invoke() • no profiling_string() • builtin_code = … • custom_name https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.cc#L3575-L3607 init() free() prepare() invoke() profiling_string() … builtin_code custom_name version … TfLiteRegistration
  22. 22. Init() of NNAPI Delegate Kernel • mainly NNAPI initialization, ANeuralNetworksCompilation_*() • and building the graph • if NNAPI >= 1.2, it checks whether there is a “real” NNAPI device • one interesting conversion is INT8 -> UINT8 https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.cc#L2571-L2672
  23. 23. INT8 —> UINT8 conversion • The original TFLite and NNAPI use asymmetric UINT8 quantization • the asymmetric one provides more flexibility, but symmetric INT8 is usually more hardware friendly • more and more INT8 code in TFLite • NNAPI doesn’t change as fast as TFLite, so conversion is needed • See the quantization paper for TFLite [1] and MLIR’s quantization doc [2] [1] Jacob, B. et al., ”Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference”, https://arxiv.org/abs/1712.05877 [2] https://github.com/tensorflow/mlir/blob/master/g3doc/Quantization.md
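Since INT8 and UINT8 both cover 256 values with the same scale, the conversion reduces to shifting the quantized value and the zero point by 128. A minimal sketch (illustrative helper names; the real conversion is inside nnapi_delegate.cc):

```cpp
#include <cstdint>

// Map a signed INT8 quantized value to the UINT8 representation NNAPI
// expects: add 128 to the value and to the zero point; the scale is
// unchanged, so the represented real value is identical.
inline uint8_t Int8ToUint8(int8_t v) {
  return static_cast<uint8_t>(static_cast<int32_t>(v) + 128);
}

inline int32_t ShiftZeroPoint(int32_t int8_zero_point) {
  // INT8 zero points lie in [-128, 127], so the result stays in [0, 255].
  return int8_zero_point + 128;
}
```

Dequantizing either representation with its own zero point yields the same real number, which is why the conversion is lossless.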
  24. 24. Invoke() of NNAPI Delegate Kernel • mainly memory management and ANeuralNetworksExecution_*() • To dig deeper we have to go through more TFLite and NNAPI data structures • asking NNAPI to do the work for you is quite trivial when everything is well prepared https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.cc#L2683-L2872
  25. 25. DoPrepare • for NNAPI >= 1.2 (Android Q and later), if no real accelerator is there, i.e., only the NNAPI CPU fallback, computation is not offloaded • Check every node to see if it is supported • NNAPI delegate registration: previous pages • Request TFLite to partition the graph and make each independent node subset a new nnapi_delegate_kernel https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.cc#L3353-L3457
  26. 26. partition graph • at the end of DoPrepare(), ReplaceNodeSubsetsWithDelegateKernels() is called • DoPrepare() -> Subgraph::ReplaceNodeSubsetsWithDelegateKernels() -> tflite::PartitionGraphIntoIndependentNodeSubsets() -> tflite::Partition() https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/core/subgraph.cc#L298-L363
  27. 27. tflite::Partition() does most of the partitioning job • part of Partition() https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/graph_info.cc#L67-L118
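The core idea of the partitioning step — grouping runs of delegate-supported nodes in execution order into subsets, each of which becomes one delegate kernel — can be sketched as follows. This is a deliberate simplification (illustrative function name; the real tflite::Partition() also tracks tensor dependencies between nodes, not just adjacency):

```cpp
#include <vector>

// Given per-node "supported by the delegate" flags in execution order,
// group contiguous supported nodes into subsets. Each subset is replaced
// by one delegate kernel; unsupported nodes stay on the CPU runtime.
std::vector<std::vector<int>> PartitionSupportedRuns(
    const std::vector<bool>& supported) {
  std::vector<std::vector<int>> subsets;
  std::vector<int> current;
  for (int i = 0; i < static_cast<int>(supported.size()); ++i) {
    if (supported[i]) {
      current.push_back(i);          // extend the current delegated run
    } else if (!current.empty()) {
      subsets.push_back(current);    // unsupported node ends the run
      current.clear();
    }
  }
  if (!current.empty()) subsets.push_back(current);
  return subsets;
}
```

With flags {supported, supported, unsupported, supported}, this yields two subsets — {0, 1} and {3} — i.e., two delegate kernels with a CPU op in between, which is exactly the per-op fallback behavior of the new NNAPI delegate.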
  28. 28. GPU GL Delegate TfLiteRegistration • TfLiteRegistration in DelegatePrepare() • init() • no free() • prepare() is quite simple • invoke(): simply calls node->Invoke() • context->ReplaceNodeSubsetsWithDelegateKernels() https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/gl_delegate.cc#L392-L431
  29. 29. GPU GL Delegate • TfLiteDelegate • Prepare • CopyFromBufferHandle • CopyToBufferHandle • class Delegate • TFLiteGpuDelegateCreate() https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/gl_delegate.cc#L75-L457 https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/gl_delegate.cc#L464-L470 https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/gl_delegate.cc#L75-L457
  30. 30. GPU Metal Delegate TfLiteRegistration • TfLiteRegistration in DelegatePrepare() • init() • no free() • prepare() is quite simple • invoke(): simply calls node->Invoke() • context->ReplaceNodeSubsetsWithDelegateKernels() https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/gl_delegate.cc#L392-L431
  31. 31. GPU Metal Delegate • TfLiteDelegate • Prepare: yup, just Prepare() • class Delegate, which is quite large • NewGpuDelegate() https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/metal_delegate.mm#L525-L532 https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/metal_delegate.mm#L620-L624 https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/metal_delegate.mm#L163-L613
  32. 32. GPU delegate kernels • GPU backends require initialization involving shader compilation and optimization by the driver before inference • PHWC4: P stands for plane • Reshape is expensive on GPU • RGBA is better than RGB on GPU • a tensor of shape [B,H,W,5], for instance, is twice as expensive as [B,H,W,4], but about the same as [B,H,W,8], so the architect can tune around those 4-channel boundaries rather than trying to optimize on other boundaries • https://arxiv.org/pdf/1907.01989.pdf
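The 4-channel cost model above follows directly from padding the channel dimension up to a multiple of 4, matching the RGBA texel layout. A sketch of the arithmetic (illustrative helper names, not the real GPU delegate code):

```cpp
#include <cstdint>

// In a PHWC4-style layout, channels are padded to a multiple of 4, so
// the memory/compute cost of a [B,H,W,C] tensor scales with ceil(C/4)
// 4-channel slices, not with C itself.
inline int64_t AlignedChannels(int64_t c) { return ((c + 3) / 4) * 4; }

inline int64_t Phwc4Elements(int64_t b, int64_t h, int64_t w, int64_t c) {
  return b * h * w * AlignedChannels(c);
}
```

This is why C=5 costs the same as C=8 but twice as much as C=4: both 5 and 8 round up to two 4-channel slices.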
  33. 33. Flex Delegate • Another delegate is the one that provides a selected set of ops in Eager mode • It’s much easier to check what it does https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/flex/delegate.cc#L143-L148 https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/flex/kernel.cc#L561-L573
  34. 34. Edge TPU’s canned models • supported ops are packed into a single op for the Edge TPU. The compiler creates a single custom op for all Edge TPU compatible ops; anything else stays the same https://coral.withgoogle.com/docs/edgetpu/models-intro/ [figures: MobileNet V1 compiled to a single edgetpu-custom-op (input 1×224×224×3 —> Softmax 1×1001); SSD MobileNet V1 compiled to edgetpu-custom-op followed by TFLite_Detection_PostProcess]
  35. 35. Edge TPU C++ API https://coral.withgoogle.com/docs/edgetpu/api-intro/
  36. 36. EdgeTPU Delegate • There is a dynamic delegate plugin interface. Currently it’s only used by the EdgeTPU delegate https://coral.withgoogle.com/docs/edgetpu/api-intro/
  37. 37. There are still many trivial bugs in TensorFlow • There are many typos in comments of TensorFlow code • Many things are not well documented • There are many, many warnings when building TensorFlow from source code • a trivial fix in May 2019 by me https://github.com/tensorflow/tensorflow/pull/28618