4. A collaborative effort
• Over the past seven years, Facebook has learned a great deal about how best to
collaborate with the hardware community
• Our work to help found and drive the Open Compute Project has been instrumental in
allowing us to build highly scalable, efficient networking and storage technologies for our
data centers
• We’ve applied this thinking to how we work with telecom operators and the connectivity
ecosystem overall with the Telecom Infra Project, as we work to get more people around the
world better connected to the internet
• As we look ahead, we now want to take these learnings and apply them to how we work with
our silicon partners on AI and ML
• We created Glow, an open source framework, to be community driven. This approach allows
partners to more rapidly design and optimize new silicon products for AI and ML by
leveraging community-driven compiler software
• Cadence, Esperanto, Intel, Marvell, and Qualcomm Technologies, Inc., a subsidiary
of Qualcomm Incorporated, have committed to supporting Glow in future silicon
products
5. How Glow works
• Glow is designed to target a wide range of hardware accelerators
• The hardware-independent parts of the compiler focus on math-related
optimizations that are not tied to a specific hardware model
• It also contains a number of utilities and building blocks that can be
configured to support multiple hardware targets, including
• a powerful linear algebra optimizer
• an extensive test suite
• a CPU-based reference implementation for testing the accuracy of hardware
accelerators
• a memory allocator
• an instruction scheduler
• and more
8. High-Level IR
• The high-level IR is a dataflow node-based graph representation
• similar to a graph that you may find inside Caffe or in ONNX format
• When we load a neural network model from a file, we construct this graph
by directly translating each operator into one or more nodes
• The graph is strongly typed: every input and output has a known tensor type,
consisting of the tensor's shape and element type, and the types of nodes are
verified by the compiler
9. High-Level IR
• The Glow graph is structured as a module that contains multiple functions that
contain multiple nodes
• Nodes inside functions are able to reference Placeholders and Constants which
are owned by the module
• Placeholders and Constants, which are similar to global variables in C programs, are nodes
that are shared between the functions
• A module may have multiple functions
• For example, one module could contain both an inference function and the gradient of that
inference function
• The gradient function could perform training of the placeholder weights, and the
inference function could read from those same weights
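• A minimal sketch of this module/function structure, with names modeled on Glow's
C++ builder API (the exact signatures are assumptions; Glow's own examples are
authoritative):

    #include "glow/ExecutionEngine/ExecutionEngine.h"

    using namespace glow;

    int main() {
      ExecutionEngine EE;
      Module &mod = EE.getModule();            // owns Placeholders and Constants
      Function *F = mod.createFunction("inference");

      // Placeholders and Constants live in the module, so a second function
      // (e.g. a gradient/training function) could reference the same nodes.
      Placeholder *input =
          mod.createPlaceholder(ElemKind::FloatTy, {1, 32}, "input",
                                /* isTrainable */ false);

      // Nodes are strongly typed: shapes and element kinds are verified.
      PlaceholderBindings bindings;
      auto *fc = F->createFullyConnected(bindings, "fc", input, 10);
      auto *relu = F->createRELU("relu", fc);
      F->createSave("save", relu);
      return 0;
    }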
10. High-Level IR
• Variable Visibility
• Glow variables are similar to PyTorch and TensorFlow variables
• They are persistent tensors that live across different executions of the neural network
• Variables are annotated with Public or Private labels. These labels specify whether
the node is visible outside of the graph
• If the node is public, then it means that C++ code from outside the graph may access the
variable directly and change its content before or after the execution of the program
• This means that the optimizer is not allowed to delete unused public variables or change their
dimensions
• In the case of private variables, the optimizer is allowed to delete unused variables, transpose
them, perform constant propagation, and so on
11. High-Level IR
• Constants
• special nodes that represent tensors that
are a part of the graph
• These nodes can be used to represent
things like the weights of neural
networks
• Constants are immutable during the
execution of the program, but graph
optimizations can access the constants
and modify them
• This feature is useful for transformations
that prepare the weights by transposing
them or quantizing them before the
execution of the program
• Placeholders
• symbolic nodes that are not backed by a
concrete tensor during the compilation of
the program
• Inputs and outputs of Glow programs
should be modeled using Placeholder
nodes
• Concrete tensors are attached to
placeholder nodes during the execution
of the program
• Unlike constants, the optimizer can't
inspect or mutate the content of
Placeholder nodes
• The same program could be compiled
using different bound tensors without
changing the semantics of the program
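• Continuing the sketch above, attaching a concrete tensor to a placeholder at
execution time might look like this (API names again assumed from Glow's examples):

    // Compile once; the same compiled program can then run with different
    // tensors bound to the same placeholder, without changing its semantics.
    EE.compile(CompilationMode::Infer);

    Tensor in(ElemKind::FloatTy, {1, 32});
    in.getHandle<float>().clear(0.5f);              // fill with sample data
    updateInputPlaceholders(bindings, {input}, {&in});
    EE.run(bindings);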
12. High-Level IR
• Glow functions contain nodes that represent the
different operations of a neural network
• The function owns the nodes and has access to the
placeholders and constants in the module
• Consider the compute graph that represents the expression “saveD = A / B”
• Glow lowers the nodes that compute the gradient of
the expression and the stochastic gradient descent
(SGD) node into a sequence of low-level operators (Div,
Mul, Add and Save)
• The different compiler backends do not need to implement
support for the DivGrad, ReLUGrad or SGD nodes
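• Concretely, the calculus behind this example is simple (a sketch, not Glow syntax):

    d(A/B)/dA = 1/B        d(A/B)/dB = -A / (B * B)

• so DivGrad is expressible with ordinary Div and Mul nodes, and the SGD update
w <- w - lr * dw with Mul and Add, which is why backends only ever see Div, Mul,
Add and Save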
13. Node Lowering
• Instead of compiling high-level operators directly, Glow performs
“node lowering”
• In this phase, the compiler breaks the high-level operator nodes into
low-level linear algebra operator nodes
• For example, the FullyConnected layer is represented as a matrix
multiplication followed by a broadcasted add
• Different compiler backends do not have to implement the FullyConnected
layer and a dozen other high-level opcodes, just the low-level matrix
multiplication
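• In symbols, for batched input X, weights W, and bias b (standard notation, not
Glow-specific):

    FullyConnected(X; W, b) = X * W + broadcast(b)

• i.e., a MatMul node followed by an add node that broadcasts the bias row across
the batch dimension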
14. Node Lowering
• In Glow, lowering is performed as part of the high-level graph as
described above, prior to moving to low-level IR
• This is done for several reasons
• First, the new lowered graph may allow for additional graph-level
optimizations
• Second, the new graph structure may affect the decisions of the instruction
scheduler
• And third, after lowering we allow the backends to perform additional target-
specific optimizations on the lowered graph
15. Low-Level IR
• After optimizing the graph with target-independent optimizations,
and lowering from high-level operator nodes to linear algebra
operator nodes, the code is further lowered into the low-level IR in a
phase that is called "IRGen" (which stands for IR generation)
• This is a one-to-many translation where each high-level node is translated
into one or more instructions
• During IRGen, constants and placeholders are converted into
WeightVars
• These WeightVars are annotated with Mutable or Constant labels, depending
on the source and whether the weights are modified during the execution of
the program
16. Low-Level IR
• The low-level IR enables a different kind of target-independent
optimization that is not possible with the high-level graph format
• This is an instruction-based representation that operates on tensors that are
referenced by address
• This gives the compiler the ability to perform low-level memory
optimizations that are not possible at the high-level, because memory is not
represented directly
• Hiding the latency of memory operations is important for utilizing the
execution units of the hardware effectively, and the instruction-based
representation allows the compiler to create a schedule that hides the
latency of the memory operations
17. Low-Level IR
• The IR is not an SSA-based (Static Single Assignment) representation,
because the IR does not support control flow
• The IR is strongly typed and each instruction operand kind has known
parameter types
• It is designed to be used as an in-memory form, though it can be
dumped to a human-readable, assembly-like format
18. Low-Level IR
• A function in IR form contains two sections:
'declare' and 'program'
• In the first section of the IR we declare a number
of memory regions that live throughout the
lifetime of the program
• This is similar to global variables in C
• The second part of the IR is a list of instructions
• There are two kinds of memory regions which
correspond to these two sections:
• global memory regions (found in 'declare')
• and locally allocated regions (found in 'program')
• The locally allocated memory regions are similar to
'alloca' in LLVM IR
• Memory regions are strongly typed, which
means that the type of tensor that the region
represents is known
• Note that the 'alloc' instruction does not
allocate memory; it just marks the lifetime
of the activation
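• A schematic listing in the spirit of the examples in Glow's documentation (the
exact textual syntax may differ between versions):

    declare {                                  ; global memory regions
      %input  = weight float<8 x 28 x 28 x 1>
      %filter = weight float<16 x 5 x 5 x 1>
      %result = weight float<8 x 10>
    }

    program {                                  ; instructions + local regions
      %buf  = alloc float<8 x 28 x 28 x 16>    ; marks the start of the lifetime
      %cv   = convolution @out %buf, @in %input, @in %filter
      %buf2 = alloc float<8 x 28 x 28 x 16>
      %rl   = relu @out %buf2, @in %buf
      ...
      dealloc @out %buf                        ; marks the end of the lifetime
    }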
19. Low-Level IR
• Instructions operate on either global
variables or locally allocated buffers
• Each operand is annotated with one of
the qualifiers '@in'/'@out'/'@inout'
• '@in' means that the buffer is read from
• '@out' means that the buffer is written
into
• And '@inout' means that the instruction
may read and write into the buffer
• These operand qualifiers help the
optimizer decide when it is legal to
perform certain optimizations, such as
copy elimination or buffer sharing
20. How Glow works
• Glow lowers a traditional neural network dataflow graph into a
two-phase strongly-typed intermediate representation (IR). The pipeline, stage by stage:
1. The graph is either loaded via the graph loader (from ONNX or Caffe2
format), or constructed via the C++ interface
2. The high-level IR allows the optimizer to perform domain-specific
optimizations
3. The lowering phase is designed to reduce the input space and allow new
hardware backends to focus on a small number of linear algebra primitives
4. IRGen translates the lowered graph into the low-level instruction-based IR
5. The lower-level instruction-based address-only IR allows the compiler to
perform memory-related optimizations, such as instruction scheduling, static
memory allocation and copy elimination
6. Additional rounds of optimizations occur, both target independent and
target specific
7. At the lowest level, the optimizer performs machine-specific code
generation to take advantage of specialized hardware features
22. Glow Quantization
• Glow is able to convert floating-point-based networks into signed 8-
bit integer networks
• The canonical quantization representation uses signed integers, though it
is possible to support other quantization formats
• Arithmetic using small integers is more efficient than the computation of full-
width floating-point numbers, and additionally decreases memory usage
• Glow uses profile-guided quantization, observing execution during
inference to estimate the possible numeric range for each stage of the
neural network
• Training-based quantization is considered future work
23. Tensor Representation
• In Glow, tensors are typed and can represent floats, quantized non-
floating-point values (currently Int8, i.e. 8-bit signed integers), and
index types
• To convert from the 8-bit integer range of [-128..127] to the floating-point
number that they represent, Glow uses the following conversion formula:
• Float value = (Int8 input - offset) * scale
• Activations, weights, and variables all use the same type-system and
represent information in a uniform way
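• A minimal sketch of the conversion in both directions (generic C++ following the
formula above, not Glow's API):

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    // Float value = (Int8 input - offset) * scale
    float dequantize(int8_t q, float scale, int32_t offset) {
      return static_cast<float>(static_cast<int32_t>(q) - offset) * scale;
    }

    // Invert the formula and clamp to the Int8 range [-128..127]
    int8_t quantize(float f, float scale, int32_t offset) {
      int32_t q = static_cast<int32_t>(std::round(f / scale)) + offset;
      return static_cast<int8_t>(std::clamp(q, -128, 127));
    }

• With the ResNet50 values from the next slide (scale = 0.0364, offset = -66),
dequantize(127, 0.0364f, -66) ≈ 7.03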
24. Network Conversion
• Glow’s quantization conversion works
using a two-phase process
• First, we statically instrument the
network with special profiling nodes
that record the ranges of activations
that flow in the network, optimize the
network including these profiling nodes,
and then run inference
• Then, we recompile the network using
this profile information to convert the
network into a quantized form,
allowing for static optimization of the
quantized graph
• We convert portions of the network
into islands of integer computation
and aim to generate outputs in the
range that the original floating-point
network produces
• A quantized subgraph from ResNet50 (worked example): with Scale = 0.0364 and
Offset = -66, the observed float range [Min, Max] = [-2.259, 7.031] maps onto
the full Int8 range:
• 7.031 = (input - (-66)) * 0.0364 → input = 127.159 → 127 (Int8)
• -2.259 = (input - (-66)) * 0.0364 → input = -128.060 → -128 (Int8)
• Float value = (Int8 input - offset) * scale
25. Compiler Optimizations
• There are a few classes of optimizations and parameters to optimize
• First, we attempt to minimize the number of conversions between floating-point tensors and
integer tensors, in both directions
• Some operations, such as 'transpose' and 'concat' operate on both types, and changing the representation can
minimize conversions
• Second, the neural network contains 'rescale' nodes that change the range of the integers
• These nodes are required to convert between numeric ranges that mimic the original floating-point network
• However, in many cases, it is possible to fold the rescale operations into numeric-producing operations, and
eliminate them
• Third, it's possible to rescale the values in the network in order to allow fast hardware
implementations of the quantized operations
• Normalizing both sides of the 'max' operation to the same scale, for example, allows hardware
to perform a simple, efficient comparison, as sketched below
• For more on specific graph optimizations, see the Glow documentation
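• Why a shared scale makes 'max' cheap: dequantization is monotonic for a positive
scale, so (a one-line sketch)

    max((a - o) * s, (b - o) * s) = (max(a, b) - o) * s   for s > 0

• the hardware can therefore take the integer max directly and skip the conversion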
26. Glow CPU Backend
• Brief introduction to Glow
• Glow IR
• Glow Quantization
• Glow CPU Backend
27. Introduction
• The CPU Backend is a JIT ("Just In Time") compiler that generates
code in memory on demand for the host CPU
• The host CPU can be x86, Arm, or anything else that LLVM can target
• The Glow interpreter goes over the low-level IR one instruction at a
time and executes a switch statement that dispatches a C++
implementation for each instruction (sketched below). This is suboptimal for two reasons
• First, each low-level instruction is executed via a function call, and after
each one we return to the dispatch switch-loop
• Second, the C++ implementation of a low-level instruction has no
knowledge of the specific situation in which the instruction is being executed
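• A minimal sketch of that dispatch pattern (the instruction kinds and handler
names are hypothetical, not Glow's actual classes):

    #include <vector>

    enum class Kind { Relu, ElementAdd, MatMul };   // illustrative opcodes
    struct Instruction { Kind kind; /* operands elided */ };

    void interpret(const std::vector<Instruction> &program) {
      for (const auto &I : program) {
        // One trip through the switch per instruction; each handler is a
        // generic routine with no knowledge of shapes or aliasing.
        switch (I.kind) {
        case Kind::Relu:       /* fwdRelu(I); */ break;
        case Kind::ElementAdd: /* fwdElementAdd(I); */ break;
        case Kind::MatMul:     /* fwdMatMul(I); */ break;
        }
      }
    }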
28. Introduction
• The JIT, on the other hand, generates a single stream of highly
optimized instructions that don't go back to the interpreter
• Each instruction is optimized based on specific information on the context in
which the instruction is executed
• When a matrix multiplication is compiled, the JIT knows exactly the dimensions of the
matrices being multiplied and where the tensors are placed in memory
• The JIT knows whether the buffers do or do not alias, and exactly the number of iterations
of the loop
• This knowledge enables much better code generation and vectorization
• The JIT is also able to eliminate all calls to 'malloc', because the memory is
statically allocated
• The whole network is allocated by a single malloc call
29. How the JIT Works
• The JIT accepts the low-level IR, and allocates concrete memory addresses for the
AllocActivation instructions in the module
• After this process the allocator knows the maximum number of bytes that the network
consumes
• The allocator assigns offsets for each alloc activation within the buffer
• Then, the JIT performs a single call to 'malloc' to allocate the heap buffer (see the sketch below)
• At this point each activation and each weight has a concrete address on the heap
• Next, the JIT creates new LLVM functions and prepares for code generation
• The compiler goes over each low-level instruction and generates a sequence of LLVM-IR instructions
• After the LLVM module is generated, the compiler calls the LLVM optimizer to
optimize the generated module and the code generator to generate efficient
machine code
• At this point the compilation phase is complete, and the network is ready for execution
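• A toy sketch of the static allocation idea (a simple bump allocator; Glow's real
allocator is more sophisticated, e.g. it can reuse freed regions):

    #include <cstdlib>
    #include <vector>

    // Assign a fixed offset to each activation, then back everything with one
    // heap buffer. Sizes are known at compile time, so no per-tensor malloc.
    struct Alloc { size_t size; size_t offset; };

    size_t assignOffsets(std::vector<Alloc> &allocs) {
      size_t top = 0;
      for (auto &A : allocs) {
        A.offset = top;   // bump allocation: no reuse, for illustration only
        top += A.size;
      }
      return top;         // maximum number of bytes the network consumes
    }

    int main() {
      std::vector<Alloc> acts = {{1024, 0}, {4096, 0}, {512, 0}};
      size_t total = assignOffsets(acts);
      char *heap = static_cast<char *>(std::malloc(total)); // the single malloc
      // Each activation now has a concrete address: heap + acts[i].offset.
      std::free(heap);
    }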
30. Usage of the Standard Library
• During the compilation process, each Glow low-level instruction is
converted into a sequence of LLVM-IR instructions
• One way to implement this lowering is to use the LLVM IRBuilder to generate the low-level
programs directly
• However, implementing and maintaining low-level LLVM-IR implementations of so many
operations is not scalable
• Instead, the CPU backend compiles a small standard library into LLVM bitcode that it
ships with the compiler
• During the compilation process, Glow loads the bitcode from disk and specializes the operator
implementations for the specific context
• Glow replaces function arguments that represent the dimensions of some tensor or buffer
addresses with constants that LLVM can optimize to generate efficient code
• Most operators are very simple and the LLVM vectorizer is able to generate very efficient code
• The convolution and matrix multiplication operations are hand-optimized in C++ using the
clang extended OpenCL vector syntax, and LLVM does a good job allocating registers and
encoding the instructions, removing the need to use inline assembly
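• A sketch of the kind of kernel that lives in such a standard library (modeled on
the spirit of Glow's bundled bitcode; the function name and signature here are
illustrative, not Glow's actual API):

    #include <cstddef>

    // Compiled once to LLVM bitcode and shipped with the compiler. During
    // compilation, specialization replaces `n` (and the buffer addresses)
    // with constants, so LLVM can unroll and vectorize freely.
    extern "C" void element_add_f(float *dst, const float *lhs,
                                  const float *rhs, size_t n) {
      for (size_t i = 0; i < n; i++) {
        dst[i] = lhs[i] + rhs[i];
      }
    }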
31. Operator Stacking
• One important optimization that the CPU backend implements is stacking of data-parallel
operators
• Consider a sequence of operators that operate one element at a time, for example a
ReLU, Add, Sub
• Iterating over a large buffer multiple times is inefficient because it forces the CPU to reload the
same memory on each pass, evicting useful data from the cache
• Instead, Glow stacks operators and performs several data-parallel operators one after the other on
the same memory location (see the sketch at the end of this slide)
• Operator stacking is similar to operator fusion
• However, when fusing multiple operators (e.g. Conv and ReLU fused together), all backends that
want to support this fused operator must implement a specific kernel for each permutation of
operators
• In contrast, Glow’s stacking automatically creates such kernels; all of the possible permutations of
data-parallel nodes are automatically fused into a fast kernel
31