3. Co-design for Nice Computer Systems
"Nice" means:
- High Speed
- Low Power and Energy
- Reliable and Dependable
- Fun ☺
Takamaeda Laboratory
CASYS (Computer Architecture and Systems Lab.)
Software
- Compiler for Machine Learning Acceleration
- Compiler for Hardware Design
Hardware / Architecture
- Machine Learning Chip
- Annealing Processor
- FPGA Accelerator
Application / Algorithm
- Machine Learning / Deep Learning
- Combinatorial Optimization
- Image Processing
4. Hardware / Architecture
[Figures: chip block diagrams]
- Deep Learning Accelerator Chip
- BRein Memory: In-Memory Binary Neural Network Chip (SRAM-based PIM, processing-in-memory: decoders, registers, and processing units operating on incoming/outgoing weights and input/output neuron IDs)
- QUEST: Log-Quantized Neural Network Chip (per-core PE array with W_MEM, A_MEM, B_MEM, O_MEM, DMAC, TCI I/F, sequencer, micro-controller, instruction memory, and sync table; cores connected by neighbor links (n/e/w/s) and a global network; each 32-element PE column has linear- and log-domain datapaths)
- Deep Learning Accelerator Architecture (processing elements as a MAC array, input/weight/output buffers, DMA data mover, DRAM controller, and predictor)
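QUEST's log quantization replaces multiplications by bit shifts. A minimal Python sketch of the idea (illustrative only, not the chip's actual datapath or encoding):

```python
import math

def log_quantize(x, exp_min=-8, exp_max=7):
    """Quantize |x| to the nearest power of two, keeping the sign.
    Returns (sign, exponent); exponent None encodes zero."""
    if x == 0:
        return (0, None)
    sign = 1 if x > 0 else -1
    exp = round(math.log2(abs(x)))
    return (sign, max(exp_min, min(exp_max, exp)))

def log_mac(acc, activation, w_sign, w_exp):
    """Multiply-accumulate with a log-quantized weight: the multiply
    is an arithmetic shift instead of a full multiplier."""
    if w_exp is None:
        return acc
    prod = activation << w_exp if w_exp >= 0 else activation >> -w_exp
    return acc + w_sign * prod

print(log_quantize(0.25))     # (1, -2): 0.25 = 2**-2
print(log_mac(0, 12, 1, -2))  # 12 * 0.25 -> 12 >> 2 = 3
```

Trading a multiplier for a shifter is what makes a log-domain PE column far cheaper than a linear one.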
Dead Neuron Prediction: Runtime Neuron Pruning Architecture
[Figure: a Predictor attached to the Main Graph, with per-layer Dead Neuron Predictors]
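The skipping idea can be sketched as follows; the low-precision preview (shift-truncated operands) is an illustrative assumption, not the published mechanism:

```python
def predict_dead(inputs, weights, bias, shift=4):
    """Cheap preview with truncated operands: if the coarse
    pre-activation is non-positive, predict a zero (dead) ReLU output."""
    coarse = sum((x >> shift) * (w >> shift) for x, w in zip(inputs, weights))
    return coarse + (bias >> (2 * shift)) <= 0

def layer_with_skip(inputs, weight_rows, biases):
    """ReLU layer that skips the full-precision MAC for predicted-dead
    neurons. Mispredictions trade accuracy for saved work."""
    out = []
    for weights, bias in zip(weight_rows, biases):
        if predict_dead(inputs, weights, bias):
            out.append(0)                 # skip the expensive MAC entirely
        else:
            u = sum(x * w for x, w in zip(inputs, weights)) + bias
            out.append(max(0, u))
    return out

print(layer_with_skip([100, 200], [[-300, -300], [300, 300]], [0, 0]))
# -> [0, 90000]: the first neuron is predicted dead and never computed
```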
FPGA Accelerator
- Multi-FPGA-based Parallel Computer
- Low-Power Cloud Computing
- Edge Computing
5. Application / Algorithm / Software
[Figure: "Original", "Binary", and "Binary w/ Dither" output channels; pseudo-color on a binary neural network by error diffusion]
Dither NN: Accurate Binary Neural Network by Error Diffusion
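The error-diffusion idea can be sketched in one dimension (a simplification; Dither NN applies it to feature maps):

```python
def binarize_plain(xs):
    """Standard sign binarization: each value independently -> {-1, +1}."""
    return [1 if x >= 0 else -1 for x in xs]

def binarize_dither(xs):
    """Error-diffusion binarization: the quantization error of each
    element is carried into the next, so the running average of the
    binary outputs tracks the running average of the inputs."""
    out, err = [], 0.0
    for x in xs:
        v = x + err
        b = 1 if v >= 0 else -1
        err = v - b          # diffuse the residual into the next element
        out.append(b)
    return out

xs = [0.3, 0.3, 0.3, 0.3]
print(binarize_plain(xs))    # [1, 1, 1, 1]   -> mean 1.0
print(binarize_dither(xs))   # [1, -1, 1, 1]  -> mean 0.5, closer to 0.3
```

Plain binarization saturates every small positive value to +1; dithering keeps the local average close to the original, which is what recovers accuracy.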
DeltaNet: Accurate Binary Neural Network by Neighbor Comparison
[Figure: DeltaNet vs. a standard binary net; each Σ/f node binarizes against a neighboring neuron instead of against zero]
Comparison-based activation keeps the partial-order information of the neurons on a binary NN.
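A toy contrast between the two activations (the 1-D right-neighbor choice is illustrative, not the paper's exact formulation):

```python
def sign(v):
    return 1 if v >= 0 else -1

def standard_binary_act(us):
    """Each pre-activation is compared against zero."""
    return [sign(u) for u in us]

def delta_act(us):
    """Each pre-activation is compared against its right neighbor, so
    the binary outputs record which of the two neurons was larger."""
    return [sign(us[i] - us[i + 1]) for i in range(len(us) - 1)]

us = [5.0, 2.0, 7.0]
print(standard_binary_act(us))   # [1, 1, 1] -- relative order lost
print(delta_act(us))             # [1, -1]   -- 5 > 2 and 2 < 7 kept
```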
Veriloggen: Multi-paradigm Hardware Synthesis Compiler on Python
- Veriloggen.Core (RTL): hardware metaprogramming, with intrinsic RTL embedding for fine-grained control
- Stream: dataflow-style high-level synthesis of stream computing units
- Thread: Python-to-FSM high-level synthesis for control logic, stream control, and DMA control
- Bus + DMA (AXI4 Master/Slave): DMA burst transfers to DRAM through an AXI4 interconnect, controlled from the CPU
NNgen: Neural Network Hardware Synthesis Compiler for FPGAs
You can develop model-specific hardware from a neural network definition without writing any hardware description.
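As a toy illustration of the concept (this is NOT NNgen's API or output format), a layer description can be turned into Verilog by string metaprogramming, so the model author never writes HDL by hand:

```python
def emit_dot_product_module(name, n_inputs, width=8):
    """Emit a Verilog module for an n-input dot product -- a stand-in
    for one fully-connected neuron in a generated accelerator."""
    xs = [f"input signed [{width - 1}:0] x{i}" for i in range(n_inputs)]
    ws = [f"input signed [{width - 1}:0] w{i}" for i in range(n_inputs)]
    out = f"output signed [{2 * width + 3}:0] y"   # headroom for the sum
    ports = ",\n  ".join(xs + ws + [out])
    expr = " + ".join(f"x{i} * w{i}" for i in range(n_inputs))
    return f"module {name} (\n  {ports}\n);\n  assign y = {expr};\nendmodule\n"

rtl = emit_dot_product_module("fc_layer", 3)
print(rtl)
```

The real compiler works on a whole dataflow graph and shares hardware between layers, but the "definition in Python, RTL out" flow is the same.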
9. Edge Computing with "Intelligence of Things"
Cloud Computing
- Inference and control based on massive data ☺
- Large latency from data transmission to decision ☹
Edge Computing
- Low-latency, real-time control ☺
- Constrained power and compute capability ☹
Energy-efficient, high-performance deep learning edge devices are needed to perform advanced processing with reflex-like responsiveness.
10. Neural Network
Perceptron
- Multiplies each input by a weight, sums the products, and passes the result through an activation function to produce the output
Deep Neural Network (DNN)
- Perceptrons stacked in many layers, composed of convolutional layers, fully-connected layers, etc.

y = f(u),  u = Σ_{i=0}^{n} w_i x_i

[Figure: a perceptron with inputs x1..x3, weights w1..w3, activation f(u), and output y; convolution and fully-connected layers]
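The perceptron equations translate directly into Python (a step function is chosen here as the example activation f):

```python
def perceptron(xs, ws, f):
    """y = f(u), with u = sum_i w_i * x_i."""
    u = sum(w * x for w, x in zip(ws, xs))
    return f(u)

step = lambda u: 1 if u >= 0 else 0      # example activation f
print(perceptron([1.0, 0.5, -2.0], [0.8, 0.4, 0.3], step))
# u = 0.8 + 0.2 - 0.6 = 0.4 -> f(u) = 1
```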
26. NNgen DNN Architecture
[Figure: NNgen Accelerator IP-core (IP-XACT). A Main Thread drives one thread per operator (conv2d 3x3, Parallel: 3x3x4x4x2x2; max_pool 2x2, Parallel: 4; matmul, Parallel: 4x4), each with its own Arg and Stream. A Computing Unit Pool (Mul, Acc, AddTree units) and a RAM Pool (BRAMs, width 8x4-bit) connect through Substream, Memory, and DMA interconnects to a DMA Controller with AXI4 Master/Slave interfaces and a Config Register; an AXI4 Interconnect links the core to DRAM and the CPU.]
Highlighted: stream computing units corresponding to the operators that appear in the model.
27. NNgen DNN Architecture
[Same architecture figure as slide 26.] Highlighted: the fine-grained computing units reused across the operators.
28. NNgen DNN Architecture
[Same architecture figure as slide 26.] Highlighted: the custom NoC connecting the computing units.
29. NNgen DNN Architecture
[Same architecture figure as slide 26.] Highlighted: the on-chip RAM.
30. NNgen DNN Architecture
[Same architecture figure as slide 26.] Highlighted: the custom NoC between computing units and RAMs.
31. NNgen DNN Architecture
[Same architecture figure as slide 26.] Highlighted: the data transfer mechanism (AXI4 master + DMA).
32. NNgen DNN Architecture
[Same architecture figure as slide 26.] Highlighted: the top-level control FSM, the stream-computation control FSMs, and the operating parameters that the top FSM sets dynamically.
33. NNgen DNN Architecture
[Same architecture figure as slide 26.] Highlighted: the external control interface.
46. Circuit Sharing: Computing Units and RAMs
Computing units (multipliers, etc.) and RAMs are shared across different operator circuits.
- Each operator declares its requirements for computing units (Substreams) and RAMs
[Same architecture figure as slide 26, highlighting the shared RAM pool and the shared computing-unit pool.]
pool.py
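The declare-and-share pattern can be sketched generically (a toy model, not NNgen's actual pool.py):

```python
class ResourcePool:
    """Hand out hardware resources (e.g. multipliers, BRAMs) so that
    operators with non-overlapping lifetimes reuse the same instances."""
    def __init__(self):
        self.instances = []   # every resource ID ever allocated
        self.free = []        # IDs currently not in use

    def acquire(self, count):
        """Satisfy a declared requirement, reusing free units first."""
        taken = []
        for _ in range(count):
            if self.free:
                taken.append(self.free.pop())   # reuse an existing unit
            else:
                rid = len(self.instances)
                self.instances.append(rid)      # allocate a new unit
                taken.append(rid)
        return taken

    def release(self, ids):
        self.free.extend(ids)

pool = ResourcePool()
conv_units = pool.acquire(8)    # conv2d declares it needs 8 multipliers
pool.release(conv_units)        # conv2d finishes
fc_units = pool.acquire(4)      # matmul reuses 4 of the same multipliers
print(len(pool.instances))      # 8 in total: matmul allocated nothing new
```

Because operators run one at a time under the top-level FSM, the pool's peak size (not the sum of all declarations) determines the circuit area.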
50. What NNgen Can Do
Hardware generation for basic layers and models
- Layers: conv2d, matmul, max_pool, add, concat, slice, batchnorm, ...
- Models: VGG, ResNet, ...
Inserting software processing into hardware processing
- Extern layer: raises an interrupt and delegates intermediate processing to software
Parallelization and optimization via parameters
- Data types, degree of parallelism, memory sizes
Post-training quantization of trained models to arbitrary bit widths
- Scaling factors and bit-shift amounts are determined automatically from input-data statistics
Hardware generation from trained models via ONNX
- Pretrained models (e.g. from torchvision) can be turned into hardware with no RTL
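The statistics-based quantization step can be sketched as follows (a simplified power-of-two scheme; the constants and rounding are illustrative, not NNgen's exact algorithm):

```python
import math

def choose_shift(samples, target_bits=8):
    """Pick the smallest right-shift (power-of-two scale) that fits the
    observed activation range into a signed target_bits integer."""
    max_abs = max(abs(v) for v in samples)
    limit = 2 ** (target_bits - 1) - 1        # 127 for int8
    if max_abs <= limit:
        return 0
    return math.ceil(math.log2(max_abs / limit))

def requantize(v, shift):
    """Apply the shift with round-to-nearest."""
    if shift == 0:
        return v
    return (v + ((1 << shift) >> 1)) >> shift

shift = choose_shift([30000, -12000, 25000])  # calibration statistics
print(shift, requantize(30000, shift))        # 8 117
```

A shift (rather than an arbitrary multiplier) is chosen because it is nearly free in hardware, which is why the deck pairs scaling with bit-shift amounts.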
51. What We Want NNgen to Do
Support for hardware-oriented layers
- Depth-wise convolution, grouped convolution, ...
Support for sparse layers
- Efficient execution of pruned conv and FC layers: CSC/CSR formats, operation skipping
Support for prediction-based computation skipping
- Dead Neuron Prediction
Support for Bayesian neural networks
- Helping realize trustworthy AI systems
More efficient circuit resources and larger-scale parallelism
- Circuit sharing between operators with different characteristics: sharing convs with different kernel sizes is under development
- A scalable architecture whose circuit resources grow only slowly with the degree of parallelism
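The sparse-layer support above amounts to skipping zero weights; CSR-format matrix-vector multiply is the standard way to do that, sketched minimally:

```python
def dense_to_csr(rows):
    """Compress a dense matrix: keep only nonzero values with their
    column indices, plus per-row offsets (CSR format)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in rows:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def csr_matvec(values, col_idx, row_ptr, x):
    """y = A @ x touching only the stored nonzeros (zeros are skipped)."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

A = [[0, 2, 0],
     [1, 0, 3]]
vals, cols, ptr = dense_to_csr(A)
print(csr_matvec(vals, cols, ptr, [4, 5, 6]))  # [10, 22]
```

The hardware challenge the slide alludes to is doing this irregular indexing efficiently in a streaming datapath.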
Pull requests of any size are welcome!
Proposals for industry-academia collaboration and joint research with our lab are also welcome!