5. Tape-out costs for ASICs are exorbitant: roughly a 10x cost gap between 16nm and 65nm
Risky bet to design hardware accelerators for ever-changing applications
• Does deep learning constitute a stable workload that justifies an ASIC-based hardware accelerator?
Flexibility vs. Efficiency Tradeoffs
Specialization challenge
Origins of NPUs
6. Highlights:
• Custom ASIC deployed in Google datacenters since 2015
• A 65,536 (256x256) 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TOPS
• Targets mainstream NN applications (MLPs, CNNs, and LSTMs)
• Shows 30-80x better TOPS/Watt than the NVIDIA K80 GPU
What makes the TPU efficient?
• Integer inference (saves 6-30x energy over 16-bit FP)
• A large number of MACs (25x over the K80)
• A large amount of on-chip memory (3.5x over the K80)
TPU: Google’s Entry in the Deep Learning Acceleration Race
Origins of NPUs
[1] Jouppi et al., In-Datacenter Performance Analysis of a Tensor Processing Unit, ISCA 2017.
8. Problem: Reading a large SRAM uses much more power than arithmetic
Solution: Use “systolic execution” to save energy by reducing reads and writes of the Unified Buffer (a back-of-the-envelope sketch follows this slide)
A systolic array is a two-dimensional collection of arithmetic units, each of which independently computes a partial result as a function of inputs coming from other arithmetic units upstream of it.
The flow is similar to blood being pumped through the human circulatory system by the heart, which is the origin of the “systolic” name.
Systolic Execution in Matrix Array
Origins of NPUs
[1] Kung, Why Systolic Architectures?, IEEE Computer, 1982.
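To make the energy argument concrete, here is a rough model of our own (not a figure from the paper): a naive MAC loop fetches two operands from SRAM per multiply, while a systolic array fetches each operand once and forwards it between PEs.

```python
# Rough SRAM-traffic model for an N x N matrix multiply (illustrative only).
N = 256
naive_reads = 2 * N**3         # two operand fetches per MAC
systolic_reads = 2 * N**2      # each input/weight enters the array once
print(naive_reads // systolic_reads)  # -> 256: buffer reads cut by a factor of N
```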
11. In the TPU, the systolic array is rotated
• Weights are loaded from the top, and the input data flows into the array from the left
• Weights are preloaded and take effect with the advancing wave alongside the first data of a new block (see the simulation sketch after this slide)
Pros & Cons
• Principled: efficiently makes use of limited memory bandwidth and balances computation against bandwidth availability
• Specialized: the computation needs to fit the PE organization/functions
  - Pro: improved efficiency, simple design, high concurrency/performance
  - Pro: does more with a lower memory-bandwidth requirement
  - Con: not generally applicable, because the computation must fit the PE functions/organization
Systolic array architecture
Origins of NPUs
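A minimal cycle-level simulation of this weight-stationary dataflow, written from the description above (the array size, skew, and drain timing are our own illustrative choices): weights sit still in the PEs, inputs enter from the left one row per cycle, and partial sums flow down each column.

```python
import numpy as np

def systolic_matvec(x, W):
    """Weight-stationary systolic pass computing y = x @ W.

    PE (r, c) holds W[r, c]. Input x[r] is injected at the left edge of
    row r at cycle r (the skew), moves one PE to the right per cycle,
    and partial sums move one PE down per cycle.
    """
    R, C = W.shape
    x_reg = np.zeros((R, C))            # input latched by each PE
    p_reg = np.zeros((R, C))            # partial sum latched by each PE
    y = np.zeros(C)
    for t in range(R + C - 1):
        nx, np_ = np.zeros((R, C)), np.zeros((R, C))
        for r in range(R):
            for c in range(C):
                # input from the left neighbour (or the edge injection)
                xin = x_reg[r, c - 1] if c > 0 else (x[r] if t == r else 0.0)
                # partial sum from the PE above (or zero at the top row)
                pin = p_reg[r - 1, c] if r > 0 else 0.0
                nx[r, c] = xin
                np_[r, c] = pin + xin * W[r, c]
        x_reg, p_reg = nx, np_          # synchronous register update
        for c in range(C):
            if t == (R - 1) + c:        # column c drains at the bottom
                y[c] = p_reg[R - 1, c]
    return y

x = np.random.rand(4)
W = np.random.rand(4, 3)
assert np.allclose(systolic_matvec(x, W), x @ W)
```

Note that each x[r] enters the array exactly once and is reused by every column, which is exactly the buffer-read reduction claimed on the previous slide.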
15. Four modules: fetch, load, compute, store
Three-stage pipeline: load, compute, store
Two-level ISA: provides the right tradeoff between expressiveness and code compactness (a sketch follows this slide)
• LOAD, GEMM, ALU, STORE instructions (CISC-like instructions)
  - Multi-cycle compute and memory operations
• RISC micro-ops perform single-cycle tensor operations
Parameterizability
Exposing task-level pipeline parallelism (TLPP): TLPP is based on the access/execute decoupling paradigm [1].
VTA Hardware Architecture Features
VTA Architecture and Performance
[1] Smith, Decoupled Access/Execute Computer Architectures, ISCA 1982.
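A toy sketch of the two-level ISA idea (the field names and indexing scheme are simplified illustrations, not VTA's actual encoding): one multi-cycle CISC-like GEMM instruction wraps a stream of single-cycle RISC micro-ops.

```python
from dataclasses import dataclass

@dataclass
class GemmInsn:          # CISC level: one multi-cycle tensor instruction
    uop_begin: int       # slice of the micro-op buffer to execute
    uop_end: int

@dataclass
class Uop:               # RISC level: one single-cycle tensor operation
    acc_idx: int         # accumulator register-file index
    inp_idx: int         # input-buffer tile index
    wgt_idx: int         # weight-buffer tile index

def execute(insn, uops, acc, inp, wgt):
    """Interpret one CISC GEMM instruction by replaying its RISC micro-ops.
    (Real VTA additionally wraps hardware loops around the micro-op
    sequence, adjusting indices with per-loop strides; omitted here.)"""
    for u in uops[insn.uop_begin:insn.uop_end]:
        # one micro-op = one matrix-multiply-accumulate on a tile
        acc[u.acc_idx] += inp[u.inp_idx] @ wgt[u.wgt_idx].T
```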
23. How the GEMM core performs computation over data stored in the input, weight, and accumulator memories.
No control flow in micro-ops: loops need to be fully unrolled
Two types of compute micro-ops: ALU and GEMM operations (one GEMM micro-op is spelled out in NumPy after this slide)
VTA GEMM core
VTA Architecture and Performance
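For reference, one GEMM micro-op in NumPy terms, using the tile shapes of the stock configuration (BATCH=1, BLOCK_IN=16, BLOCK_OUT=16; treat the exact values as assumptions, since they are configurable):

```python
import numpy as np

BATCH, BLOCK_IN, BLOCK_OUT = 1, 16, 16   # stock VTA tensor-intrinsic shape
inp = np.random.randint(-128, 128, (BATCH, BLOCK_IN), dtype=np.int8)
wgt = np.random.randint(-128, 128, (BLOCK_OUT, BLOCK_IN), dtype=np.int8)
acc = np.zeros((BATCH, BLOCK_OUT), dtype=np.int32)

# One GEMM micro-op: acc += inp x wgt^T, accumulated at 32-bit precision
acc += inp.astype(np.int32) @ wgt.astype(np.int32).T
```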
24. vta/hardware/xilinx sources
• vta.cc: defines the VTA modules and their behavioral models.
• vta.h: type definitions based on ap_int, plus function prototypes.
• `vta/include/vta/hw_spec.h`
• These parameters come from `vta/config/vta_config.json`, which is processed by `vta/config/vta_config.py`.
Vivado High-Level Synthesis, Xilinx
VTA Architecture and Performance
25. • The shape of the tensor intrinsic
• Clock frequency
• Pipelining
• Data type width
• On-chip buffer sizes
LOG_INP_WIDTH, LOG_WGT_WIDTH, and LOG_OUT_WIDTH are designed to be equal (an example configuration follows this slide).
H/W parameters
VTA Architecture and Performance
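For illustration, a configuration in the spirit of `vta/config/vta_config.json` (the key names follow the LOG_* convention above, but the exact key set and defaults vary by TVM release, so treat them as assumptions):

```python
# Illustrative VTA configuration; verify key names against your TVM release.
vta_config = {
    "TARGET": "pynq",          # PYNQ FPGA target
    "LOG_INP_WIDTH": 3,        # 2**3 = 8-bit inputs
    "LOG_WGT_WIDTH": 3,        # 8-bit weights (kept equal to the input width)
    "LOG_ACC_WIDTH": 5,        # 32-bit accumulators
    "LOG_BATCH": 0,            # BATCH = 1
    "LOG_BLOCK": 4,            # BLOCK_IN = BLOCK_OUT = 16
    "LOG_UOP_BUFF_SIZE": 15,   # 32 KiB micro-op buffer
    "LOG_INP_BUFF_SIZE": 15,   # 32 KiB input buffer
    "LOG_WGT_BUFF_SIZE": 18,   # 256 KiB weight buffer
    "LOG_ACC_BUFF_SIZE": 17,   # 128 KiB accumulator buffer
}
```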
26. Pipelining Tasks to Hide Memory Latency
VTA Architecture and Performance
[1] Smith, Decoupled Access/Execute Computer Architectures, ISCA 1982.
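A minimal sketch of the idea, with plain Python threads and queues standing in for VTA's hardware modules and dependency FIFOs (the stage functions are hypothetical stand-ins): load, compute, and store run concurrently and synchronize only through bounded queues, so memory transfers overlap with computation.

```python
import threading, queue, time

def fetch_from_dram(i):      # stand-in for a DMA load
    time.sleep(0.01); return i

def gemm(x):                 # stand-in for the compute stage
    time.sleep(0.01); return x * x

def write_to_dram(y):        # stand-in for a DMA store
    time.sleep(0.01); print("stored", y)

load_q  = queue.Queue(maxsize=2)   # bounded queues play the role of the
store_q = queue.Queue(maxsize=2)   # dependency tokens between stages

def load_stage(n):
    for i in range(n):
        load_q.put(fetch_from_dram(i))
    load_q.put(None)               # end-of-stream marker

def compute_stage():
    while (x := load_q.get()) is not None:
        store_q.put(gemm(x))
    store_q.put(None)

def store_stage():
    while (y := store_q.get()) is not None:
        write_to_dram(y)

threads = [threading.Thread(target=load_stage, args=(8,)),
           threading.Thread(target=compute_stage),
           threading.Thread(target=store_stage)]
for t in threads: t.start()
for t in threads: t.join()
```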
28. NNVM
• A graph-level IR; fusing operators at this level improves efficiency.
VTA Runtime
• JIT compilation of VTA binaries (instruction streams and micro-kernel code)
• Manages shared memory
• Performs synchronization to hand off execution to VTA
VTA's two-level ISA
• High-level CISC ISA
  - Defines multi-cycle, variable-latency operations
  - DMA loads, deep learning operators
• Low-level, fixed-latency RISC ISA
  - Low-level matrix-matrix operations
VTA micro-architecture
• Keeps the detailed design of the deep learning hardware flexible.
Components at each level of the TVM stack
VTA Architecture and Performance
29. VTA’s JIT runtime enables cooperative execution of deep learning workloads between a CPU host and the accelerator.
• 1) Enable heterogeneous execution: one challenge for fixed-function accelerators is model evolution, because most such accelerators are built for fixed models. Heterogeneous execution overcomes this limitation by scheduling operators onto the targets (e.g., CPUs or VTAs) with the best affinity for each operator type.
  - Example: it is well known that the first convolutional layer in most CNNs has low arithmetic intensity and performs well on CPUs.
  - Another motivation behind heterogeneous execution is providing a fallback mechanism for emerging operators that VTA does not yet support.
• 2) Lower compiler design complexity
• 3) Overcome physical limitations
• 4) Reduce binary bloat
• 5) Future-proofing: systems trends point toward heterogeneous multi-accelerator and scale-out acceleration.
JIT Runtime System
VTA Architecture and Performance
30. Full evaluation on a PYNQ FPGA board (Z1)
TVM can offload most convolution operations to the FPGA (40x speedup on the off-loadable layers)
Full Stack Evaluation (TVM)
VTA Architecture and Performance
31. For comparable systems, VTA provides a significant performance edge over conventional CPU- and GPU-based inference
Evaluation over multiple CPU-, GPU-, and FPGA-equipped edge systems
VTA Architecture and Performance
40. To be mapped onto the VTA hardware intrinsics, the schedule must contain the following:
• DMA copy operations: operations that copy data from global scope into local scope
• Vector ALU operations: the vector add must be carried out by VTA's vector ALU
VTA has the following three on-chip SRAMs:
• env.inp_scope (read-only)
  - Stores the input matrix
  - Shape: (env.BATCH, env.BLOCK_IN), type env.inp_dtype
• env.wgt_scope (read-only)
  - Stores the weight matrix
  - Shape: (env.BLOCK_OUT, env.BLOCK_IN), type env.wgt_dtype
• env.acc_scope (read/write SRAM buffer): a general-purpose register file
  - Stores the accumulator matrix
  - Shape: (env.BATCH, env.BLOCK_OUT), type env.acc_dtype
What the Default Schedule Lacks
Code Analysis and Tutorial Overview
42. The pattern commonly used with hardware accelerators:
• Move data from DRAM into the VTA on-chip buffers
  - The pragma tells the compiler to execute the copy operation in bulk via DMA
ALU Operations
• VTA has a built-in ALU that operates on tensors held in the accumulator buffer.
• The vector-addition loop must be explicitly marked to use VTA's ALU (see the schedule sketch after this slide).
DMA Transfers and ALU Operations
Code Analysis and Tutorial Overview
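A condensed sketch in the spirit of VTA's vector-add tutorial: `env.acc_scope`, `env.dma_copy`, and `env.alu` come from the VTA environment, but exact signatures vary across TVM versions, so treat the details as assumptions.

```python
import vta
import tvm
from tvm import te

env = vta.get_env()
m = 64  # number of (BATCH, BLOCK_OUT) tiles; illustrative

# Placeholders tiled to VTA's tensor-intrinsic shape
A = te.placeholder((m, env.BATCH, env.BLOCK_OUT), name="A", dtype=env.acc_dtype)
B = te.placeholder((m, env.BATCH, env.BLOCK_OUT), name="B", dtype=env.acc_dtype)

# Stage the inputs into on-chip buffers, add, then copy the result out
A_buf = te.compute(A.shape, lambda *i: A(*i), name="A_buf")
B_buf = te.compute(B.shape, lambda *i: B(*i), name="B_buf")
C_buf = te.compute(A.shape, lambda *i: A_buf(*i) + B_buf(*i), name="C_buf")
C = te.compute(A.shape, lambda *i: C_buf(*i).astype(env.inp_dtype), name="C")

s = te.create_schedule(C.op)
# Pin the staged buffers to VTA's accumulator SRAM
for buf in (A_buf, B_buf, C_buf):
    s[buf].set_scope(env.acc_scope)
# Mark the copies as bulk DMA transfers, and the add loop as a VTA ALU op
s[A_buf].pragma(s[A_buf].op.axis[0], env.dma_copy)
s[B_buf].pragma(s[B_buf].op.axis[0], env.dma_copy)
s[C].pragma(s[C].op.axis[0], env.dma_copy)
s[C_buf].pragma(s[C_buf].op.axis[0], env.alu)
```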
43. The lowered TVM schedule code, adapted for VTA
[Figure: annotated lowered-schedule listing, marking where CPU accesses touch the VTA buffers, where the computation runs on the ALU, and where the results are stored]
Code Analysis and Tutorial Overview
45. Compile into a TVM function
Create the function with tvm.build
• Takes the schedule, the desired signature of the function (inputs and outputs), and the target language
Save as a module
• Save the module to a file
• Load it again later
• Works as ahead-of-time compilation
• The executable can be cross-compiled and shipped to another environment (via RPC)
Load the module (the compile-save-load flow is sketched after this slide)
TVM Compilation
Code Analysis and Tutorial Overview
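A hedged sketch of the flow above, continuing the schedule from slide 42 (API names follow the classic VTA getting-started tutorial; exact signatures, the board address, and the port are assumptions that vary by TVM version and setup):

```python
from tvm import rpc
from tvm.contrib import utils

# Build: schedule + desired signature (inputs/outputs) + target
my_vadd = vta.build(s, [A, B, C], "ext_dev", env.target_host, name="my_vadd")

# Save the module to a file, then ship it to the board over RPC
temp = utils.tempdir()
my_vadd.save(temp.relpath("vadd.o"))
remote = rpc.connect("192.168.2.99", 9091)   # hypothetical PYNQ address/port
remote.upload(temp.relpath("vadd.o"))
f = remote.load_module("vadd.o")             # load it later on the remote
```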
46. Because compiled TVM modules are exposed through a C API, they can ultimately be invoked from any language.
TVM provides an array access API based on DLPack (handy for quick testing and prototyping, sketched below):
• Create a remote context (pynq)
• tvm.nd.array copies data into that context
• f() executes the actual computation
• asnumpy() copies the result back and formats it so it can be interpreted
Running the function
Code Analysis and Tutorial Overview
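Continuing the sketches above (`remote`, `f`, and the placeholders come from the earlier snippets; the remote-context and asnumpy APIs differ slightly across TVM versions):

```python
import numpy as np

ctx = remote.ext_dev(0)                        # remote VTA device context
shape = (m, env.BATCH, env.BLOCK_OUT)          # shape from the schedule sketch
a = tvm.nd.array(np.random.randint(-128, 128, size=shape).astype(A.dtype), ctx)
b = tvm.nd.array(np.random.randint(-128, 128, size=shape).astype(B.dtype), ctx)
c = tvm.nd.array(np.zeros(shape, dtype=C.dtype), ctx)
f(a, b, c)                                     # run the actual computation on VTA
print(c.asnumpy())                             # copy back and view as NumPy
```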
51. TVM and VTA analysis
• Analyzing how each IR pass transforms the code (in a debugging environment)
  - TVM IR passes
  - VTA IR passes
• Analyzing JIT compilation and the runtime
Deep learning compiler stack development
• Developing and integrating the lowering stage (with Miseon Yu and Youngju Kim)
• A[1024] + B[1024] = C[1024]  # vector addition
Analysis of the operations in the ResNet-18 network
• Analyzing which parts can be transformed to run on VTA
Agenda
Future Plans