Glow Compiler
2018
issue.hsu@gmail.com
Outline
• Brief introduction to Glow
• Glow IR
• Glow Quantization
• Glow CPU Backend
2
Brief introduction to
Glow
Brief introduction to
Glow
Glow IR
Glow Quantization
Glow CPU Backend
3
A collaborative effort
• Over the past seven years, FB has learned a great deal about how to best
collaborate with the hardware community
• Our work to help found and drive the Open Compute Project has been instrumental in
allowing us to build highly scalable, efficient networking and storage technologies for our
data centers
• We’ve applied this thinking to how we work with telecom operators and the connectivity
ecosystem overall with the Telecom Infra Project, as we work to get more people around the
world better connected to the internet
• As we look ahead, we now want to take these learnings and apply them to how we work with
our silicon partners on AI and ML
• We created Glow, an open source framework, to be community driven. This approach allows
partners to more rapidly design and optimize new silicon products for AI and ML by
leveraging community-driven compiler software
• Cadence, Esperanto, Intel, Marvell, and Qualcomm Technologies, Inc., a subsidiary
of Qualcomm Incorporated, have committed to supporting Glow in future silicon
products
4
How Glow works
• Glow is designed to target a wide range of hardware accelerators
• The hardware-independent parts of the compiler focus on math-related
optimizations that are not tied to a specific hardware model
• It also contains a number of utilities and building blocks that can be
configured to support multiple hardware targets, including
• a powerful linear algebra optimizer
• an extensive test suite
• a CPU-based reference implementation for testing the accuracy of hardware
accelerators
• a memory allocator
• an instruction scheduler
• etc…
5
How Glow works
6
Glow Intermediate
Representation
Brief introduction to
Glow
Glow IR
Glow Quantization
Glow CPU Backend
7
High-Level IR
• The high-level IR is a dataflow node-based graph representation
• similar to a graph that you may find inside Caffe or in ONNX format
• When we load a neural network model from some file we construct
this graph with a direct translation of one operator to one or more
nodes
• The graph is strongly typed, which means that inputs and outputs
have a known tensor type
• A tensor type consists of the tensor's shape and element type, and the
types of nodes are verified by the compiler
8
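A minimal sketch of what "strongly typed" means here, in plain C++ (illustrative only; the names do not match Glow's real Type and NodeValue classes): every value carries a tensor type built from a shape and an element kind, and node verification checks that operand types agree.

    #include <cassert>
    #include <cstddef>
    #include <vector>

    // Illustrative stand-in for Glow's tensor types, not the real API.
    enum class ElemKind { FloatTy, Int8QTy, Int64ITy };

    struct TensorType {
      ElemKind elemKind;        // element type
      std::vector<size_t> dims; // shape

      bool operator==(const TensorType &o) const {
        return elemKind == o.elemKind && dims == o.dims;
      }
    };

    // The compiler verifies nodes: e.g. an element-wise Add node requires
    // both inputs and the result to carry identical tensor types.
    void verifyAdd(const TensorType &lhs, const TensorType &rhs,
                   const TensorType &res) {
      assert(lhs == rhs && lhs == res && "Add operand types must match");
    }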
High-Level IR
• The Glow graph is structured as a module that contains multiple functions that
contain multiple nodes
• Nodes inside functions are able to reference Placeholders and Constants which
are owned by the module
• Placeholders and Constants, which are similar to global variables in C programs, are nodes
that are shared between the functions
• A module may have multiple functions
• For example, one module could contain both an inference function and the gradient of that
inference function
• The gradient function could perform training of the placeholder weights, and the
inference function could read from those same weights
9
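A toy model of this ownership structure in standard C++ (the names are invented for illustration; the real Glow API differs): the module owns the shared storage, and each function only holds references to it.

    #include <memory>
    #include <string>
    #include <vector>

    // Toy model of Glow's ownership structure, not the real API.
    struct Placeholder { std::string name; };  // module-owned storage

    struct Function {
      std::string name;
      std::vector<Placeholder *> referenced;   // shared, module-owned
    };

    struct Module {
      std::vector<std::unique_ptr<Placeholder>> placeholders; // owned here
      std::vector<std::unique_ptr<Function>> functions;

      Placeholder *createPlaceholder(std::string n) {
        placeholders.push_back(
            std::make_unique<Placeholder>(Placeholder{std::move(n)}));
        return placeholders.back().get();
      }
      Function *createFunction(std::string n) {
        functions.push_back(
            std::make_unique<Function>(Function{std::move(n), {}}));
        return functions.back().get();
      }
    };

    int main() {
      Module mod;
      auto *weights = mod.createPlaceholder("W"); // like a C global
      Function *infer = mod.createFunction("inference");
      Function *grad = mod.createFunction("inference_grad");
      infer->referenced.push_back(weights); // inference reads W
      grad->referenced.push_back(weights);  // training updates the same W
    }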
High-Level IR
• Variable Visibility
• Glow variables are similar to PyTorch and TensorFlow variables
• They are persistent tensors that live across different executions of the neural network
• Variables are annotated with Public or Private labels. These labels specify whether
the node is visible outside of the graph
• If the node is public, then it means that C++ code from outside the graph may access the
variable directly and change its content before or after the execution of the program
• This means that the optimizer is not allowed to delete unused public variables or change their
dimensions
• In the case of private variables, the optimizer is allowed to delete unused variables, transpose,
perform constant propagation, etc.
10
High-Level IR
• Constants
• special nodes that represent tensors that
are a part of the graph
• These nodes can be used to represent
things like the weights of neural
networks
• Constants are immutable during the
execution of the program, but graph
optimizations can access the constants
and modify them
• This feature is useful for transformations
that prepare the weights by transposing
them or quantizing them before the
execution of the program
• Placeholders
• symbolic nodes that are not backed by a
concrete tensor during the compilation of
the program
• Inputs and outputs of Glow programs
should be modeled using Placeholder
nodes
• Concrete tensors are attached to
placeholder nodes during the execution
of the program
• Unlike constants, the optimizer can't
inspect or mutate the content of
Placeholder nodes
• The same program could be compiled
using different bound tensors without
changing the semantics of the program
11
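The distinction can be summarized in a few lines of illustrative C++ (again a toy model, not Glow's classes): a constant is backed by a concrete tensor at compile time, so the optimizer may read and rewrite its payload, while a placeholder carries only a type until a tensor is bound at run time.

    #include <cstddef>
    #include <vector>

    struct Tensor { std::vector<float> data; };

    // Backed by a concrete tensor during compilation; graph optimizations
    // may transpose or quantize the payload ahead of execution.
    struct Constant { Tensor payload; };

    // Type information only; the optimizer can neither inspect nor mutate
    // its contents, because no tensor exists during compilation.
    struct Placeholder { std::vector<size_t> dims; };

    // At execution time a concrete tensor is attached; the same compiled
    // program can run with different bound tensors.
    struct Binding { const Placeholder *slot; Tensor *concrete; };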
High-Level IR
• Glow functions contain nodes that represent the
different operations of a neural network
• The function owns the nodes and has access to the
placeholders and constants in the module
• The image on the right-hand side depicts the compute
graph that represents the expression “saveD = A / B”
• Glow lowers the nodes that compute the gradient of
the expression and the stochastic gradient descent
(SGD) node into a sequence of low-level operators (Div,
Mul, Add and Save)
• The different compiler backends do not need to implement
support for the DivGrad, ReLUGrad or SGD nodes
12
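For concreteness, here is the math that lets DivGrad disappear; with D = A / B and an incoming gradient ∂L/∂D, the chain rule needs only elementary element-wise nodes (a sketch of the algebra, not Glow's literal node sequence):

    \frac{\partial L}{\partial A} = \frac{\partial L}{\partial D} \cdot \frac{1}{B}
    \qquad
    \frac{\partial L}{\partial B} = -\,\frac{\partial L}{\partial D} \cdot \frac{A}{B^{2}}

The SGD step itself is equally plain, W \leftarrow W - \eta \cdot \frac{\partial L}{\partial W}, i.e. one Mul and one Add/Sub per weight tensor.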
Node Lowering
• Instead of compiling high-level operators directly, Glow performs
“node lowering”
• In this phase, the compiler breaks the high-level operator nodes into
low-level linear algebra operator nodes
• For example, the FullyConnected layer is represented as a matrix
multiplication followed by a broadcasted add
• Different compiler backends do not have to implement the FullyConnected
layer and a dozen other high-level opcodes, just the low-level matrix
multiplication
13
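A sketch of what this buys a backend, in plain C++: once FullyConnected (out = in · W + b) is lowered, the backend only needs the two loops below rather than a dedicated FC kernel.

    #include <cstddef>

    // out[m x n] = in[m x k] * w[k x n]
    void matmul(const float *in, const float *w, float *out,
                size_t m, size_t k, size_t n) {
      for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
          float sum = 0;
          for (size_t t = 0; t < k; t++)
            sum += in[i * k + t] * w[t * n + j];
          out[i * n + j] = sum;
        }
    }

    // out[i][j] += bias[j]: the bias row is broadcast across all rows.
    void broadcastAdd(float *out, const float *bias, size_t m, size_t n) {
      for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++)
          out[i * n + j] += bias[j];
    }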
Node Lowering
• In Glow, lowering is performed as part of the high-level graph as
described above, prior to moving to low-level IR
• This is due to a number of reasons
• First, the new lowered graph may allow for additional graph-level
optimizations
• Second, the new graph structure may affect the decisions of the instruction
scheduler
• And third, after lowering we allow the backends to perform additional
target-specific optimizations on the lowered graph
14
Low-Level IR
• After optimizing the graph with target-independent optimizations,
and lowering from high-level operator nodes to linear algebra
operator nodes, the code is further lowered into the low-level IR in a
phase that is called "IRGen" (which stands for IR generation)
• This is a one-to-many translation where each high-level node is translated
into one or more instructions
• During IRGen, constants and placeholders are converted into
WeightVars
• These WeightVars are annotated with Mutable or Constant labels, depending
on the source and whether the weights are modified during the execution of
the program
15
Low-Level IR
• The low-level IR enables a different kind of target-independent
optimization that is not possible with the high-level graph format
• This is an instruction-based representation that operates on tensors that are
referenced by address
• This gives the compiler the ability to perform low-level memory
optimizations that are not possible at the high-level, because memory is not
represented directly
• Hiding the latency of memory operations is important for utilizing the
execution units of the hardware effectively, and the instruction-based
representation allows the compiler to create a schedule that hides the
latency of the memory operations
16
Low-Level IR
• The IR is not a Static Single Assignment (SSA) based representation,
because the IR does not support control flow
• The IR is strongly typed and each instruction operand kind has known
parameter types
• It is designed to be used as an in-memory form, though it can be
dumped to a human-readable, assembly-like format
17
Low-Level IR
• A function in IR form contains two sections:
'declare' and 'program'
• In the first section of the IR we declare a number
of memory regions that live throughout the
lifetime of the program
• This is similar to global variables in C
• The second part of the IR is a list of instructions
• There are two kinds of memory regions which
correspond to these two sections:
• global memory regions (found in 'declare')
• and locally allocated regions (found in 'program')
• The locally allocated memory regions are similar to
'alloca' in LLVM IR
• Memory regions are strongly typed, which
means that the type of tensor that the
region represents is known
• Note that the 'alloc' instruction does not
allocate memory; it just marks the lifetime
of the activation
18
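An illustrative dump of the two sections (the layout is approximate, not verbatim Glow output): WeightVars live in 'declare', locally allocated activations in 'program'.

    declare {
      %input = WeightVar float<8 x 28 x 28 x 1> mutable   ; bound at run time
      %filter = WeightVar float<16 x 5 x 5 x 1> constant  ; network weights
    }
    program {
      %act = alloc float<8 x 28 x 28 x 16>   ; marks lifetime only, no malloc
      %conv = convolution @out %act, @in %input, @in %filter
      dealloc %act
    }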
Low-Level IR
• Instructions operate on either global
variables or locally allocated buffers
• Each operand is annotated with one of
the qualifiers '@in'/'@out'/'@inout'
• '@in' means that the buffer is read from
• '@out' means that the buffer is written
into
• And '@inout' means that the instruction
may read and write into the buffer
• These operand qualifiers help the
optimizer decide when it is legal to
perform certain optimizations, such as
copy elimination or buffer sharing
19
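A hypothetical instruction stream (not verbatim Glow IR) showing how the qualifiers license copy elimination: %dst is only written by the copy and only read afterwards, so the optimizer may substitute %src for %dst and drop the copy entirely.

    %dst = alloc float<64>
    copy @out %dst, @in %src      ; candidate for elimination
    relu @out %res, @in %dst      ; the only later use reads %dst
    ; after copy elimination:
    relu @out %res, @in %src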
How Glow works
20
Glow lowers a traditional neural network dataflow graph into a
two-phase strongly-typed intermediate representation (IR):
1. The graph is either loaded via the graph loader (from ONNX or
Caffe2 format), or constructed via the C++ interface
2. The high-level IR allows the optimizer to perform domain-specific
optimizations
3. The lowering phase is designed to reduce the input space and allow
new hardware backends to focus on a small number of linear algebra
primitives
4. Additional rounds of optimizations occur, both target independent
and target specific
5. IRGen translates the graph into the low-level IR
6. The lower-level instruction-based address-only IR allows the compiler
to perform memory-related optimizations, such as instruction scheduling,
static memory allocation and copy elimination
7. At the lowest level, the optimizer performs machine-specific code
generation to take advantage of specialized hardware features
Glow Quantization
Brief introduction to
Glow
Glow IR
Glow Quantization
Glow CPU Backend
21
Glow Quantization
• Glow is able to convert floating-point-based networks into signed
8-bit integer networks
• The canonical quantization representation uses signed integers, though it
is possible to support other quantization formats
• Arithmetic using small integers is more efficient than the computation of full-
width floating-point numbers, and additionally decreases memory usage
• Glow uses profile-guided quantization, observing execution during
inference to estimate the possible numeric range for each stage of the
neural network
• Training-based quantization is considered future work
22
Tensor Representation
• In Glow, tensors are typed and can represent floats, quantized
non-floating-point values (currently Int8, i.e. 8-bit signed
integers), and index types
• To convert from the 8-bit integer range of [-128..127] to the floating-point
number that they represent, Glow uses the following conversion formula:
• Float value = (Int8 input - offset) * scale
• Activations, weights, and variables all use the same type-system and
represent information in a uniform way
23
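The formula is straightforward to code up; a self-contained sketch of the round trip (clipping to the int8 range on the way in):

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <cstdio>

    // Float value = (Int8 input - offset) * scale
    float dequantize(int8_t q, int32_t offset, float scale) {
      return (static_cast<int32_t>(q) - offset) * scale;
    }

    int8_t quantize(float v, int32_t offset, float scale) {
      int32_t q = static_cast<int32_t>(std::lround(v / scale)) + offset;
      return static_cast<int8_t>(std::max(-128, std::min(127, q))); // clip
    }

    int main() {
      // Scale/offset taken from the ResNet-50 example on the next slide.
      const float scale = 0.0364f;
      const int32_t offset = -66;
      int8_t q = quantize(7.031f, offset, scale);            // -> 127
      printf("%d -> %f\n", q, dequantize(q, offset, scale)); // ~7.025
    }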
Network Conversion
• Glow’s quantization conversion works
using a two-phase process
• First, we statically instrument the
network with special profiling nodes
that record the ranges of activations
that flow in the network, optimize the
network including these profiling nodes,
and then run inference
• Then, we recompile the network using
this profile information to convert the
network into a quantized form,
allowing for static optimization of the
quantized graph
• We convert portions of the network
into islands of integer computation
and aim to generate outputs in the
range that the original floating-point
network produces
24
A quantized subgraph from ResNet-50 (figure), profiled with
Min = -2.259, Max = 7.031, giving Scale = 0.0364 and Offset = -66
Float value = (Int8 input - offset) * scale
7.031 = (input - (-66)) * 0.0364, so input = 127.159, stored as 127 (int8)
-2.259 = (input - (-66)) * 0.0364, so input = -128.060, stored as -128 (int8)
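How a profile becomes quantization parameters, assuming the profiled range [min, max] is mapped linearly onto [-128, 127] (this assumption reproduces the numbers above):

    #include <cmath>
    #include <cstdio>

    int main() {
      // Profiled activation range from the subgraph above.
      const float min = -2.259f, max = 7.031f;

      // Spread the range over the 256 representable int8 values.
      float scale = (max - min) / 255.0f;                   // ~0.0364
      // Solve min = (-128 - offset) * scale for the offset.
      int offset = (int)std::lround(-128.0f - min / scale); // ~-66

      printf("scale = %.4f, offset = %d\n", scale, offset);
    }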
Compiler Optimizations
• There are a few classes of optimizations and parameters to optimize
• First, we attempt to minimize the number of conversions between floating-point tensors and
integer tensors, in both directions
• Some operations, such as 'transpose' and 'concat', operate on both types, and changing the representation can
minimize conversions
• Second, the neural network contains 'rescale' nodes that change the range of the integers
• These nodes are required to convert between numeric ranges that mimic the original floating-point network
• However, in many cases, it is possible to fold the rescale operations into numeric-producing operations, and
eliminate them
• Third, it's possible to rescale the values in the network in order to allow fast hardware
implementations of the quantized operations
• Normalizing both sides of the 'max' operation to the same scale allows hardware to
perform a simple, efficient comparison (see the sketch after this slide)
• For more specific graph optimizations, check here
25
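A sketch of the 'max' case (illustrative, not Glow's implementation): once both operands are rescaled into one (scale, offset) pair, the dequantized order equals the raw int8 order, so the hardware can compare quantized values directly.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    // Re-express a value quantized as (scaleFrom, offsetFrom) in the
    // (scaleTo, offsetTo) representation.
    int8_t rescale(int8_t q, float scaleFrom, int32_t offsetFrom,
                   float scaleTo, int32_t offsetTo) {
      float v = (q - offsetFrom) * scaleFrom;                   // dequantize
      int32_t r = (int32_t)std::lround(v / scaleTo) + offsetTo; // requantize
      return (int8_t)std::max(-128, std::min(127, r));
    }

    // With matching parameters, max((a-o)*s, (b-o)*s) == (max(a,b)-o)*s,
    // so no conversion to floating point is needed at all.
    int8_t quantizedMax(int8_t a, int8_t b) { return std::max(a, b); }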
Glow CPU Backend
Brief introduction to
Glow
Glow IR
Glow Quantization
Glow CPU Backend
26
Introduction
• The CPU Backend is a JIT ("Just In Time") compiler that generates
code in memory on demand for the host CPU
• The host CPU can be x86, ARM, or anything else that LLVM can target
• The Glow interpreter goes over the low-level IR one instruction at a
time and executes a switch statement that dispatches a C++
implementation for each instruction. This is suboptimal
• First, after each low-level instruction is executed via a function call, we
return to the dispatch switch-loop
• Second, the C++ implementation of the low-level instruction has no
knowledge of the specific situation in which the instruction is being executed
27
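A sketch of that dispatch loop in plain C++ (illustrative, not Glow's sources), which makes both costs visible: the switch runs once per instruction, and each kernel is a generic function that knows nothing about its particular call site.

    #include <vector>

    enum class Kind { Relu, Add, MatMul };
    struct Instruction { Kind kind; /* operands elided */ };

    // Generic kernels: they know nothing about the shapes, strides or
    // aliasing of this particular call site.
    void executeRelu(const Instruction &) {}
    void executeAdd(const Instruction &) {}
    void executeMatMul(const Instruction &) {}

    void interpret(const std::vector<Instruction> &program) {
      for (const Instruction &I : program) {
        switch (I.kind) { // back to this loop after every instruction
        case Kind::Relu:   executeRelu(I);   break;
        case Kind::Add:    executeAdd(I);    break;
        case Kind::MatMul: executeMatMul(I); break;
        }
      }
    }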
Introduction
• The JIT, on the other hand, generates a single stream of highly
optimized instructions that don't go back to the interpreter
• Each instruction is optimized based on specific information on the context in
which the instruction is executed
• When a matrix multiplication is compiled, the JIT knows exactly the dimensions of the
matrices that are being multiplied and where the tensors are placed in memory
• The JIT knows whether or not the buffers alias, and exactly the number of iterations
of the loop
• This knowledge enables much better code generation and vectorization
• The JIT is also able to eliminate all calls to 'malloc', because the memory is
statically allocated
• The whole network is allocated by a single malloc call
28
How the JIT Works
• The JIT accepts the low-level IR, and allocates concrete memory addresses for the
AllocActivation instructions in the module
• After this process the allocator knows the maximum number of bytes that the network
consumes
• The allocator assigns offsets for each alloc activation within the buffer
• Then, the JIT performs a single call to 'malloc' to allocate the heap
• At this point each activation and each weight has a concrete address on the heap
• Next, the JIT opens new LLVM functions and prepares for code generation
• The compiler goes over each low-level instruction and generates a sequence of LLVM-IR
• After the LLVM module is generated, the compiler calls the LLVM optimizer to
optimize the generated module and the code generator to generate efficient
machine code
• At this point the compilation phase is complete, and the network is ready for execution
29
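A sketch of the allocation step (illustrative; a real allocator also reuses the space of activations whose lifetimes have ended): offsets are assigned statically, then one malloc turns every offset into a concrete address that code generation can bake in.

    #include <cstdlib>
    #include <vector>

    struct Activation { size_t size; size_t offset; };

    // Assign each activation an offset inside one flat buffer and return
    // the maximum number of bytes the network consumes.
    size_t assignOffsets(std::vector<Activation> &acts) {
      size_t total = 0;
      for (Activation &a : acts) {
        a.offset = total;
        total += a.size;
      }
      return total;
    }

    int main() {
      std::vector<Activation> acts = {{1024, 0}, {4096, 0}, {512, 0}};
      size_t total = assignOffsets(acts);
      char *heap = static_cast<char *>(std::malloc(total)); // single malloc
      char *conv1Out = heap + acts[1].offset; // concrete address for codegen
      (void)conv1Out;
      std::free(heap);
    }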
Usage of the Standard Library
• During the compilation process, each Glow low-level instruction is
converted into a sequence of LLVM-IR instructions
• One way to implement this lowering is to use the IRBuilder to generate low-level
programs
• This is insane. Implementing and maintaining the low-level implementations of so many
operations using the LLVM-IR is not scalable
• Instead, the CPU backend compiles a small standard library into LLVM bitcode that it
ships with the compiler
• During the compilation process, Glow loads the bitcode from disk and specializes the operator
implementations for the specific context
• Glow replaces function arguments that represent the dimensions of some tensor or buffer
addresses with constants that LLVM can optimize to generate efficient code
• Most operators are very simple and the LLVM vectorizer is able to generate very efficient code
• The convolution and matrix multiplication operations are hand-optimized in C++ using the
clang extended OpenCL vector syntax, and LLVM does a good job allocating registers and
encoding the instructions, removing the need to use inline assembly
30
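A sketch of such a standard-library kernel (the function name is invented; Glow's actual bitcode library differs): it is compiled to LLVM bitcode ahead of time, and at JIT time the size argument is replaced with a constant, after which LLVM can unroll and vectorize the loop for that exact tensor.

    // Element-wise add over n floats. When the JIT substitutes a constant
    // for `n`, LLVM sees a fixed trip count and vectorizes aggressively.
    extern "C" void elementwise_add_f(float *dst, const float *lhs,
                                      const float *rhs, unsigned long n) {
      for (unsigned long i = 0; i < n; i++)
        dst[i] = lhs[i] + rhs[i];
    }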
Operator Stacking
• One important optimization that the CPU backend implements is stacking of data-parallel
operators
• Consider a sequence of operators that operate one element at a time, for example
ReLU, Add, and Sub
• Iterating over a large buffer multiple times is inefficient because it requires the CPU to load the
memory multiple times, each time invalidating the whole cache
• Instead, Glow stacks operators and performs a few data-parallel operators one after the other on
the same memory location
• Operator stacking is similar to operator fusion
• However, when fusing multiple operators (e.g. Conv and ReLU fused together), all backends that
want to support this fused operator must implement a specific kernel for each permutation of
operators
• In contrast, Glow’s stacking automatically creates such kernels; all of the possible permutations of
data-parallel nodes are automatically fused into a fast kernel
31
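A sketch of the difference in plain C++ (illustrative; Glow generates the stacked kernel automatically): the unstacked version walks the buffer three times, the stacked version once, with each element staying in a register through all three operators.

    #include <algorithm>
    #include <cstddef>

    // Unstacked: three passes over the buffer, three trips through memory.
    void unstacked(float *buf, const float *add, const float *sub, size_t n) {
      for (size_t i = 0; i < n; i++) buf[i] = std::max(buf[i], 0.0f); // ReLU
      for (size_t i = 0; i < n; i++) buf[i] += add[i];                // Add
      for (size_t i = 0; i < n; i++) buf[i] -= sub[i];                // Sub
    }

    // Stacked: one pass; the element never leaves the register file
    // between the three data-parallel operators.
    void stacked(float *buf, const float *add, const float *sub, size_t n) {
      for (size_t i = 0; i < n; i++) {
        float v = std::max(buf[i], 0.0f);
        v += add[i];
        v -= sub[i];
        buf[i] = v;
      }
    }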
End
Thanks!
Brief introduction to
Glow
Glow IR
Glow Quantization
Glow CPU Backend
32
Reference
• Glow: A community-driven approach to AI infrastructure
• Glow: Graph Lowering Compiler Techniques for Neural Networks
• https://github.com/pytorch/glow/
• https://github.com/pytorch/glow/issues/1575
33