Koan-Sin Tan,
freedom@computer.org
COSCUP, Aug 2nd, 2020
TensorFlow Runtime
A Peek into the Future of TensorFlow
1
• disclaimer: opinions are my own

• feel free to interrupt me if you have any questions during the presentation

• questions can be asked in Taiwanese, English, or Mandarin

• most of the TFRT materials are adapted from the TFRT deep dive at an MLIR design meeting [1] and from the TFRT docs [2]

• code around Aug 1, 2020 (git commit ecf1c20 [3])

[1] TFRT Deep Dive,  slides - recording, https://mlir.llvm.org/talks/

[2] https://github.com/tensorflow/runtime/tree/master/documents

[3] https://github.com/tensorflow/runtime/commit/ecf1c20
2
• Used open source before the term “open source” was coined
• A software guy; learned to use Unix and open-source software on a VAX-11/780 running 4.3BSD
• Used to be a programming language junkie
• Worked on various system software, e.g., CPU scheduling and power management of non-CPU components
• Recently, working on NN performance on edge devices
• Contribute from time to time to TensorFlow Lite
• started a command-line label_image for TFLite
who i am
https://gunkies.org/w/images/c/c1/DEC-VAX-11-780.jpg
3
What is TFRT
• TensorFlow Runtime (TFRT) is one of two new MLIR-based runtimes that have emerged in 2020 so far.

• The other one is the Intermediate Representation Execution Environment (IREE). So far, TFRT seems to have the better design documentation

• Both of them have mobile / edge environments in mind.

• I haven't seen mobile acceleration code in TFRT yet.

• IREE already has some Vulkan-related code and some simple code that works on Android

• ResNet GPU inference is 28% faster with TFRT

• https://github.com/tensorflow/runtime, https://youtu.be/15tiQoPpuZ8
4
Build it
• if you follow the instructions described in README.md, it should just work, at least on x86_64 Linux.

• however, it's not tested on non-Linux environments yet

• ssize_t and int64_t (see the sketch at the end of this list)

• on Mac OS X: ssize_t is long, int64_t is long long
• the current code mixes the use of ssize_t and int64_t

• test: one of the acclaimed features of TFRT, like MLIR, is its use of LLVM FileCheck

• with my hacks, shape-related (ssize_t) tests are not fixed yet

• it's not tested on non-x86 platforms, such as aarch64, either

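To illustrate the ssize_t/int64_t issue above, here is a minimal sketch. This is not TFRT code; the NumElements helper is made up. It shows why both types being 64-bit is not enough: on Mac OS X they are distinct types, so a template instantiated for one does not accept a container of the other.

// Why mixing ssize_t and int64_t breaks the macOS build:
// on Mac OS X, ssize_t is 'long' while int64_t is 'long long'.
// Both are 64-bit, but they are different types to the compiler.
#include <cstdint>
#include <sys/types.h>
#include <vector>

// A shape helper instantiated on the element type of its argument.
template <typename T>
T NumElements(const std::vector<T>& dims) {
  T n = 1;
  for (T d : dims) n *= d;
  return n;
}

int main() {
  std::vector<int64_t> shape = {2, 3, 4};
  int64_t n = NumElements(shape);  // fine everywhere
  // On macOS the next line would not compile: std::vector<int64_t> is not
  // std::vector<ssize_t>, because long long != long. On 64-bit Linux both
  // are 'long', so the mismatch goes unnoticed there.
  // ssize_t m = NumElements<ssize_t>(shape);
  return n == 24 ? 0 : 1;
}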
5
• The three key directories under the TFRT root directory are

• lib: Contains core TFRT infrastructure code

• backends: Contains device specific infrastructure and op/kernel implementations

• include: Contains public header files for core TFRT infrastructure
6
Walking thru the tutorial
• unfortunately, it's not easy to jump directly into the source code without some background knowledge

• so we’ll walk thru the tutorial [1]

• What's in the tutorial

• print hello world

• print integer

• adding kernels

[1] https://github.com/tensorflow/runtime/blob/master/documents/tutorial.md
7
using tfrt and tfrt_test
hello.mlir
func @hello() {
  %chain = tfrt.new.chain
  // Create a string containing "hello world" and store it in %hello.
  %hello = "tfrt_test.get_string"() { string_attr = "hello world" } : () -> !tfrt.string
  // Print the string in %hello.
  "tfrt_test.print_string"(%hello, %chain) : (!tfrt.string, !tfrt.chain) -> !tfrt.chain
  tfrt.return
}
The @hello function above shows how to create and print a string. The text after each ':' specifies the types involved:

• () -> !tfrt.string means that tfrt_test.get_string takes no arguments and returns a !tfrt.string. tfrt is an MLIR dialect prefix (or namespace) for TFRT

• (!tfrt.string, !tfrt.chain) -> !tfrt.chain means that tfrt_test.print_string takes two arguments (!tfrt.string and !tfrt.chain) and returns a !tfrt.chain. chain [1] is a TFRT abstraction to manage dependencies

[1] https://github.com/tensorflow/runtime/blob/master/documents/explicit_dependency.md
8
hello world in MLIR
func @stringconstant() -> !llvm<"[12 x i8]"> {
  %1 = llvm.constant("Hello world!") : !llvm<"i8*">
  // CHECK: ret [12 x i8] c"Hello world!"
  llvm.return %1 : !llvm<"i8*">
}

func @main() {
  %0 = llvm.constant(0) : !llvm.i64
  %1 = call @stringconstant() : () -> !llvm<"[12 x i8]">
  %2 = llvm.getelementptr %1[%0] : (!llvm<"[12 x i8]">, !llvm.i64) -> !llvm<"i8*">
  %3 = llvm.bitcast %2 : !llvm<"i8*"> to !llvm<"i8*">
  %32 = llvm.call @puts(%2) : (!llvm<"i8*">) -> !llvm.i32
  return
}

func @puts(!llvm<"i8*">) -> !llvm.i32
• MLIR's “standard dialect” doesn't have I/O functions

• there is the LLVM dialect, so of course we can use it to call standard libc functions
9
Hello integer
func @hello_integers() {
  %chain = tfrt.new.chain
  // Create an integer containing 42.
  %forty_two = tfrt.constant.i32 42
  // Print 42.
  tfrt.print.i32 %forty_two, %chain
  tfrt.return
}
• as stated in the tutorial, we can run other functions in the same module

• we can turn to more basic ones, such as integers or floating-point numbers

• @hello_integers shows how to create and print integers

• This example does not have the verbose type information we saw in @hello because there are custom parsers for the tfrt.constant.i32 and tfrt.print.i32 kernels in basic_kernels.td
10
basic_kernels.td
• .td (table description?) files are for LLVM TableGen

[1] TableGen, https://llvm.org/docs/TableGen/
class ConstantOp<string suffix, Type baseType, Attr attr>
  : TFRT_Op<"constant." # suffix, [NoSideEffect]> {
  let summary = "host executor constant value constructor";
  let arguments = (ins attr:$value);
  let results = (outs baseType);
}

class PrintOp<string suffix, Type type> : TFRT_Op<"print." # suffix> {
  let summary = "tfrt.print operation";
  let description = [{
    An operation takes a number input and a chain input.
    It prints the number to stdout and returns a chain output.
    The chain input must be the second operand.

    Example:
      %2 = tfrt.print.i32 %0, %1
  }];
  let arguments = (ins type, TFRT_ChainType);
  let results = (outs TFRT_ChainType);
  let assemblyFormat = "operands attr-dict";
  let verifier = ?;
}
https://github.com/tensorflow/runtime/blob/master/include/tfrt/basic_kernels/opdefs/basic_kernels.td#L376-L390
https://github.com/tensorflow/runtime/blob/master/include/tfrt/basic_kernels/opdefs/basic_kernels.td#L58-L64
11
Define kernels
12
user defined kernels
func @print_coordinate() {
  %chain = tfrt.new.chain
  %two = tfrt.constant.i32 2
  %four = tfrt.constant.i32 4
  %coordinate = "my.create_coordinate"(%two, %four) : (i32, i32) -> !my.coordinate
  "my.print_coordinate"(%coordinate, %chain) : (!my.coordinate, !tfrt.chain) -> !tfrt.chain
  tfrt.return
}
coordinate.mlir shows several TFRT features:

• MLIR types that begin with exclamation mark (!) are user-defined types like !my.coordinate,
compared to built-in types like i32

• Kernels are just C++ functions with a name in MLIR: my.print_coordinate is the MLIR name for
the C++ PrintCoordinate function

• Kernels may pass arbitrary user-defined types: my.create_coordinate passes a custom Coordinate struct to my.print_coordinate
13
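For reference, the C++ side behind coordinate.mlir looks roughly like the sketch below, adapted from the tutorial: a plain struct, two plain C++ functions, and a registration step that binds the MLIR kernel names to those functions via TFRT_KERNEL. Treat the header paths and exact signatures as approximations and check tutorial.md and include/tfrt/host_context/kernel_utils.h for the real code.

// Sketch of the kernels behind coordinate.mlir (details approximate).
#include <cstdint>
#include <cstdio>

#include "tfrt/host_context/chain.h"            // tfrt::Chain
#include "tfrt/host_context/kernel_registry.h"  // tfrt::KernelRegistry
#include "tfrt/host_context/kernel_utils.h"     // TFRT_KERNEL

// The arbitrary user-defined type passed between kernels (!my.coordinate).
struct Coordinate {
  int32_t x = 0;
  int32_t y = 0;
};

// C++ function exposed to MLIR as my.create_coordinate.
static Coordinate CreateCoordinate(int32_t x, int32_t y) {
  return Coordinate{x, y};
}

// C++ function exposed to MLIR as my.print_coordinate. It returns a Chain so
// the runtime can order the side effect.
static tfrt::Chain PrintCoordinate(Coordinate coordinate) {
  printf("(%d, %d)\n", coordinate.x, coordinate.y);
  return tfrt::Chain();
}

// Registration maps the MLIR kernel names to the C++ functions.
void RegisterCoordinateKernels(tfrt::KernelRegistry* registry) {
  registry->AddKernel("my.create_coordinate", TFRT_KERNEL(CreateCoordinate));
  registry->AddKernel("my.print_coordinate", TFRT_KERNEL(PrintCoordinate));
}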
to dig into some code we need
more system information
14
Host Runtime
15
• A TensorFlow user passes into TFRT a TensorFlow graph created via high-level TensorFlow APIs, and

• TFRT then calls the MLIR-based graph
compiler to optimize and lower the
graph into BEF, a Binary Executable
Format for TFRT graph execution (MLIR
is the compiler infrastructure that we
use to represent TFRT host programs). 

• The blue arrows in the simplified
TensorFlow training stack diagram
show this flow.
16
• In the README.md we are told to build two binaries: tfrt_translate and bef_executor

• tfrt_translate

• The tfrt_translate program does round trip
translation between MLIR and BEF, similar
to an assembler and disassembler.

• bef_executor

• The bef_executor program is the
execution driver of BEF files. It reads in a
BEF file, sets up the runtime, and
asynchronously executes function(s) in
that file.
17
TFRT Host Runtime
• Foundation of TFRT: schedules work on the host and devices

• Clean separation between host and device runtimes:

• Host runtime does not know anything about devices, just their runtimes (sets of kernels) 

• Key design points:

• Fully asynchronous - kernel executions cannot block

• Excellent error propagation in the presence of asynchrony

• Performance as a first-class concern, for graph and eager

• Outline:

• Common runtime infrastructure

• Graph execution

• Op-by-op execution (“eager”)
18
Key Abstraction: AsyncValue

• Container for data or resources

• Not Tensor specific

• A “future” type, fulfilled with exactly one value, or an error

• Lock-free, low memory overhead, type erased, reference counted

• Helper class AsyncValueRef<T> provides type safety when the contained type is known

• AsyncValues enable efficient asynchronous compute

• Asynchronous functions return unavailable AsyncValues

• Caller can schedule dependent computations with AsyncValue::AndThen()

• Caller need not block until the AsyncValue becomes available

https://github.com/tensorflow/runtime/blob/master/include/tfrt/host_context/async_value.h
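To make the AndThen() pattern concrete, here is a rough usage sketch: an asynchronous function that returns an unavailable AsyncValue immediately and fulfills it later, plus a caller that chains dependent work instead of blocking. The names (MakeUnconstructedAsyncValueRef, HostContext::EnqueueWork, AndThen, emplace) follow async_value_ref.h and host_context.h as I understand them; treat the exact signatures as assumptions.

// Sketch only: AsyncValue producer and consumer.
#include "tfrt/host_context/async_value_ref.h"
#include "tfrt/host_context/host_context.h"

tfrt::AsyncValueRef<int> AsyncAdd(tfrt::HostContext* host, int a, int b) {
  // Allocate an AsyncValue that has no value yet.
  auto result = tfrt::MakeUnconstructedAsyncValueRef<int>(host);
  // Fulfill it on a worker thread; exactly one value (or an error) is set.
  host->EnqueueWork([result = result.CopyRef(), a, b] { result.emplace(a + b); });
  return result;
}

void Caller(tfrt::HostContext* host) {
  tfrt::AsyncValueRef<int> sum = AsyncAdd(host, 1, 2);
  // The caller does not block; it schedules dependent computation instead.
  sum.AndThen([sum = sum.CopyRef()] {
    if (sum.IsError()) {
      // Errors propagate through the same AsyncValue.
      return;
    }
    int v = sum.get();
    (void)v;  // use the now-available value
  });
}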
19
Kernels
• Kernel: unit of computation scheduled by the runtime

• Similar to kernel concept in current TensorFlow

• Kernels accept AsyncValue inputs and produce AsyncValue outputs

• Runtime coordinates dataflow of AsyncValues between kernels

• Outputs may not be immediately available, unlike current TensorFlow

• Runtime generally does not understand kernel semantics
// Kernel that adds two integers.
// AsyncKernelFrame holds the kernel’s arguments and results.
static void TFRTAdd(AsyncKernelFrame* frame) {
  // Fetch the kernel’s 0th argument.
  AsyncValue* arg1 = frame->GetArgAt(0);
  // Fetch the kernel’s 1st argument.
  AsyncValue* arg2 = frame->GetArgAt(1);
  int v1 = arg1->get<int>();
  int v2 = arg2->get<int>();
  // Set the kernel’s 0th result.
  frame->EmplaceResultAt<int>(0, v1 + v2);
}
https://github.com/tensorflow/runtime/blob/master/documents/tfrt_host_runtime_design.md
https://github.com/tensorflow/runtime/blob/master/lib/basic_kernels/integer_kernels.cc#L39-L45
https://github.com/tensorflow/runtime/blob/master/include/tfrt/host_context/kernel_utils.h#L61-L149
20
Host Program
• Host programs encode a dataflow graph

• Similar to GraphDef in current TensorFlow

• Expressed in MLIR. Typically compiler generated

• Designed for low-level dispatch efficiency

• Designed for compiler transformations and analysis, e.g., 

• Use dataflow analysis for buffer reuse
func @sample_function() -> i32 {
  %one = tfrt.constant.i32 1        // Make AsyncValue with value 1
  %two = tfrt.constant.i32 2        // Make AsyncValue with value 2
  %three = tfrt.add.i32 %one, %two  // Make AsyncValue with value 3 (1+2)
  %ch0 = tfrt.new.chain
  tfrt.print.i32 %three, %ch0       // Print AsyncValue %three
  tfrt.return %three : i32          // Return AsyncValue %three
}
21
TFRT Binary Executable Format (BEF)
• BEF encodes a hardware-specific lowered graph
function

• Primary interface between compiler and runtime 

• Designed for efficient execution

• Low overhead: execute program by reading mmap’d
byte array 

• Persistent and stable: Compile once offline, run many times online. Great for inference use-cases

• Composed of sections, similar to ELF. Each section
has its own format 

• Extensible: BEF is versioned, reader ignores unknown
sections, new versions may define new sections 
 https://github.com/tensorflow/runtime/blob/master/documents/binary_executable_format.md
22
BEF Executor
• BEF Executor evaluates a BEF dataflow graph “executor” style:

• Not a bytecode-like interpreter: no concept of program counter

• “Strict” execution by default: run a kernel only when all its inputs are available

• Executor features:

• Lock-free: atomics instead of mutexes

• Non-blocking: defer dependent work with AsyncValue::AndThen

• Supports “non-strict” execution: may run a kernel when some of its
inputs are available

• Good for efficiently forwarding unavailable inputs to outputs

• Key concepts:

• BEF: dataflow graph

• Kernel: dataflow node

• AsyncValues: dataflow edge
https://github.com/tensorflow/runtime/blob/master/lib/bef_executor/bef_interpreter.cc#L223-L254
23
Host Runtime Summary 

24
How about Core Runtime?
• Surely, we could do a similar walkthrough, but that would take more time

• Two things

• Op Execution API, Execute()

• BEF Executor can handle it too
void CoreRuntime::Impl::Execute(const ExecutionContext& exec_ctx,
                                string_view op_name, OpHandler* op_handler,
                                MutableArrayRef<TensorHandle> arguments,
                                const OpAttrsRef& attrs,
                                MutableArrayRef<TensorHandle> results,
                                AsyncValueRef<Chain>* chain) {
  // Ask the op_handler to execute the op. If successful, we're done.
  auto op_handle = op_handler->MakeOp(op_name);
  if (op_handle) {
    op_handle.get()(exec_ctx, arguments, attrs, results, chain);
    return;
  }
  // Otherwise, we fail with an 'unknown op' error.
  auto err =
      EmitErrorAsync(exec_ctx, "op '" + op_name.str() + "' is not supported");
  for (auto& result : results) result = TensorHandle(err.CopyRef());
  if (chain) *chain = std::move(err);
}
25
https://github.com/tensorflow/runtime/blob/master/lib/core_runtime/core_runtime.cc#L124-L143
https://github.com/tensorflow/runtime/blob/master/documents/tfrt_op_by_op_execution_design.md
BEF Executor for “op” graph
• corert.executeop

• sample (two variants of the same example below; the second threads a chain into corert.get_op_handler and annotates each corert.executeop with its number of results, the ": 1" suffix)
26
https://github.com/tensorflow/runtime/blob/master/lib/core_runtime/kernels.cc
func @example() -> !tfrt.chain {
  %cpu = corert.get_op_handler("cpu")
  // Create TensorHandles
  %lhs = corert.executeop(%cpu)
    "test.create_dense_tensor"() { shape = [1, 1], values = [-1.0 : f32] }
  %rhs = corert.executeop(%cpu)
    "test.create_dense_tensor"() { shape = [1, 1], values = [-2.0 : f32] }
  %result = corert.executeop(%cpu) "test.add" (%lhs, %rhs)
  %ch0 = tfrt.new.chain
  %ch1 = corert.print_tensorhandle(%result, %ch0)
  tfrt.return %ch1 : !tfrt.chain
}

func @example() -> !tfrt.chain {
  %ch0 = tfrt.new.chain
  %cpu = corert.get_op_handler %ch0 "cpu"
  // Create TensorHandles
  %lhs = corert.executeop(%cpu)
    "test.create_dense_tensor"() { shape = [1, 1], values = [-1.0 : f32] } : 1
  %rhs = corert.executeop(%cpu)
    "test.create_dense_tensor"() { shape = [1, 1], values = [-2.0 : f32] } : 1
  %result = corert.executeop(%cpu) "test.add" (%lhs, %rhs) : 1
  %ch1 = "corert.print_tensorhandle"(%result, %ch0) : (!corert.tensorhandle, !tfrt.chain) -> !tfrt.chain
  tfrt.return %ch1 : !tfrt.chain
}
Device Runtime
CPU
27
//===----------------------------------------------------------------------===//
// CPU Relu kernels
//===----------------------------------------------------------------------===//

// Computes B = Relu(A).
template <typename T>
static AsyncValueRef<Chain> Relu(const DenseHostTensor& A, DenseHostTensor* B,
                                 const ExecutionContext& exec_ctx) {
  auto fn = [](auto& a, auto& b) { return a.cwiseMax(static_cast<T>(0)); };
  return ::tfrt::compat::UnaryEigenKernelAsync<T, T>(A, B, std::move(fn),
                                                     exec_ctx);
}

//===----------------------------------------------------------------------===//
// CPU BiasAdd kernels
//===----------------------------------------------------------------------===//

// A special case of tf.add where bias is restricted to be 1-D.
// Currently only support NHWC data format.
template <typename T, size_t RANK>
static AsyncValueRef<Chain> BiasAdd(const DenseHostTensor& input,
                                    const DenseHostTensor& bias,
                                    DenseHostTensor* output,
                                    const ExecutionContext& exec_ctx) {
  DHTIndexableView<T, RANK> input_view(&input);
  MutableDHTIndexableView<T, RANK> output_view(output);
  DHTIndexableView<T, 1> bias_view(&bias);
  const auto& shape_input = input_view.FixedShape();
  const auto& shape_bias = bias_view.FixedShape();
  const auto& shape_output = output_view.FixedShape();
  if (shape_input != shape_output) {
    return EmitErrorAsync(exec_ctx, "unexpected output shape");
  }
  if (shape_bias[0] != shape_input[RANK - 1]) {
    return EmitErrorAsync(exec_ctx, "bias shape does not match input shape");
  }
  // Reshape bias to the shape of input. Broadcast along the last axis of input.
  Eigen::array<Eigen::Index, RANK> reshape_dims;
  Eigen::array<Eigen::Index, RANK> broadcast_dims;
  for (size_t i = 0; i < RANK - 1; ++i) {
    reshape_dims[i] = static_cast<Eigen::Index>(1);
    broadcast_dims[i] = static_cast<Eigen::Index>(shape_input[i]);
  }
  reshape_dims[RANK - 1] = static_cast<Eigen::Index>(shape_bias[0]);
  broadcast_dims[RANK - 1] = static_cast<Eigen::Index>(1);
  auto input_t = AsEigenConstTensor(input_view);
  auto bias_t = AsEigenConstTensor(bias_view);
  auto output_t = AsEigenTensor(output_view);
  auto expr = input_t + bias_t.reshape(reshape_dims).broadcast(broadcast_dims);
  return AsyncAssign(
      exec_ctx.host()->GetOrCreateSharedContext<EigenHostContext>(),
      std::move(output_t), std::move(expr),
      KeepBuffers::alive(&input, &bias, output));
}
https://github.com/tensorflow/runtime/blob/master/backends/cpu/lib/kernels/cpu_kernels.h
Dialects we can see now
• tfrt: we know what this is for

• tfrt_test: to test tfrt

• tfrt_data: tf.data, to deal with input pipeline

• tfrt_dht: dense host tensor

• corert: Core Runtime, eager execution

• ts: tensor shape

• coo: COOrdinate list sparse tensor

• eigen: wrapper around the eigen library

• btf: binary tensor format

• cuda: you know what cuda means :-)
28
Concluding Remarks
• MLIR related talks and publications, https://mlir.llvm.org/talks/

• We have only scratched the surface of the TFRT host runtime and core runtime. There are more details:

• threading model: thread pool / work queue,

• memory allocation: tcmalloc for server, other small allocators for embedded systems,

• non-strict execution, and

• registers: BEF executor is a register machine

• we didn't touch other important components, such as the device runtimes, esp. the GPU part, and the distributed environment
29
Fin
30
Device Runtime Design Principles 

• A thin wrapper of low-level (driver) APIs, exposing device capabilities to graph compiler

• Memory Allocation

• Async host <-> device transfer, and kernel execution

• Dependency management

• Focus on mechanism instead of policy

• E.g. No built-in special-purpose streams for GPU support:
• For pure eager execution, can default to one stream for everything 

• For tf.function execution, compiler can pick streams
31
