DSF2018 Presentation Slides

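Tri-state Deep Learning with Simultaneous Low-bit Quantization and Pruning, and Its Design Method
Hiroki Nakahara
Department of Information and Communications Engineering, School of Engineering, Tokyo Institute of Technology
DSF2018
@ Pacifico Yokohama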

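Outline
• Background: deep learning for embedded systems
• Optimization methods for Convolutional Neural Networks (CNNs)
• Low-bit quantization → mixed precision
• Pruning
• Tri-state CNN
• Low-bit quantization + mixed precision + pruning
• GUINNESS DREI: a deep learning development environment dedicated to FPGAs
• Demo: object detection for autonomous driving
• Summary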

Deep Learning for Embedded Systems
• Robotics, autonomous driving, surveillance cameras, drones, etc.

Object Detection
[Figure: detection example with boxes labeled "Son", "Baby", and "Daughter"]
J. Redmon and A. Farhadi, "YOLOv3: An Incremental Improvement," arXiv, 2018

Semantic Segmentation
E. Shelhamer, J. Long and T. Darrell, "Fully Convolutional Networks for Semantic Segmentation," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 39, No. 4, 2017, pp. 640-651.

OpenPose (Pose Estimation)
Z. Cao, T. Simon, S.-E. Wei and Y. Sheikh, "Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields," CVPR, 2017.

DepthMap (Depth Estimation)
D. Eigen, C. Puhrsch and R. Fergus, "Depth Map Prediction from a Single Image using a Multi-Scale Deep Network," arXiv:1406.2283, 2014.

Requirements for Embedded Systems
Cloud                               Embedded
Many classes (1000s)                Few classes (<10)
Large workloads                     Frame rates (15-30 FPS)
High efficiency (Performance/W)     Low cost & low power (1W-5W)
Server form factor                  Custom form factor
J. Freeman (Intel), "FPGA Acceleration in the era of high level design", HEART2017

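Deep Learning Inference Devices
[Figure: devices compared on two axes, flexibility and power-performance efficiency: CPU (Raspberry Pi3), GPU (Jetson TX2), FPGA (UltraZed), ASIC (Movidius)]
• Flexibility: R&D cost, especially for supporting new algorithms
• Power-performance efficiency
• FPGA → a good balance between flexibility and power-performance efficiency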

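Object Detection Task
• Performs classification and localization simultaneously for multiple objects
• Evaluation method (from Pascal VOC):
Ground truth annotation
Detection results:
  >50% overlap of bounding box (BBox) with ground truth
  One BBox for each object
  Confidence value for each object, e.g. Person (50%)
$$precision = \frac{\#\,\text{correct detections}}{\#\,\text{all detections}}$$
$$recall = \frac{\#\,\text{correct detections}}{\#\,\text{all objects}}$$
Average Precision (AP):
$$AP = \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1\}} P_{interp}(r)$$
where $P_{interp}(r)$ is the interpolated precision at recall level $r$.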
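The Pascal VOC metrics on this slide can be sketched in Python as follows. This is a minimal illustration, not the official VOC evaluation code: the function names and the toy detection list are hypothetical, and each detection is assumed to have already been marked correct (>50% overlap with a ground-truth box) or incorrect.

```python
def precision_recall_curve(is_correct, num_objects):
    """is_correct: detections sorted by descending confidence (True = correct).
    Returns a list of (recall, precision) points, one per detection."""
    points, correct = [], 0
    for rank, hit in enumerate(is_correct, start=1):
        correct += int(hit)
        points.append((correct / num_objects, correct / rank))
    return points


def average_precision_11pt(points):
    """11-point interpolated AP: mean over r in {0, 0.1, ..., 1} of the
    maximum precision observed at any recall >= r (0 if there is none)."""
    total = 0.0
    for i in range(11):
        r = i / 10
        candidates = [p for rec, p in points if rec >= r - 1e-9]
        total += max(candidates, default=0.0)
    return total / 11


# Toy example (hypothetical data): 4 detections, 2 ground-truth objects.
detections = [True, False, True, False]
curve = precision_recall_curve(detections, num_objects=2)
ap = average_precision_11pt(curve)
```

Sorting by confidence before accumulating is what turns a single precision/recall pair into a curve; the 11-point average then summarizes that curve as one number.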

YOLOv2 (You Only Look Once version 2)
• A single-shot object detection algorithm
• Estimates class probabilities and bounding boxes for each grid cell
[Figure: Input Image (Frame) → CNN (CONV+Pooling → Feature maps → CONV+Pooling) → Class score and Bounding Box → Detection]
J. Redmon and A. Farhadi, "YOLO9000: Better, Faster, Stronger," arXiv preprint arXiv:1612.08242, 2016.

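Convolutional Neural Network (CNN)
• Composed of operations such as convolution, pooling, and fully connected layers, together with feature maps
• High accuracy on image recognition tasks
• Many applications
Source: https://www.mathworks.com/discovery/convolutional-neural-network.html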

2D Convolution
[Figure: a 3×3 (binary) kernel slides over the input feature map; each window produces one value y of the output feature map]
$$y = \sum_{i=0}^{2} \sum_{j=0}^{2} X_{i,j} \times W_{i,j}$$
• Accounts for almost all of the computation time in YOLOv2
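The multiply-accumulate pattern on this slide can be written out as a short Python sketch. It assumes stride 1 and no padding, and the 4×4 input and ±1 kernel are hypothetical values chosen only for illustration.

```python
def conv2d(feature_map, kernel):
    """Stride-1, no-padding 2D convolution as used in CNNs
    (cross-correlation: the kernel is not flipped)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(feature_map) - kh + 1
    out_w = len(feature_map[0]) - kw + 1
    output = [[0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            # y = sum over i, j of X[i][j] * W[i][j] for the current window
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += feature_map[r + i][c + j] * kernel[i][j]
            output[r][c] = acc
    return output


# Hypothetical 4x4 input and 3x3 binary (+1/-1) kernel.
x = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
w = [[1, -1, 1],
     [-1, 1, -1],
     [1, -1, 1]]
y = conv2d(x, w)
```

Every output value costs nine multiply-accumulates here, which is why the convolution layers dominate the run time.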

CNN Optimization
Source: http://www.isfpga.org/fpga2017/slides/D1_S1_InvitedTalk.pdf

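Binarized CNN
[Figure: neuron with inputs x1, x2, ..., xn, weights w1, w2, ..., wn, and bias w0; the sum Y passes through the sign function fsgn(Y) to give the output z]
Multiplication of ±1 values (left) corresponds to EXNOR on {0, 1} values (right):
x1  x2  Y        x1  x2  Y
-1  -1  +1       0   0   1
-1  +1  -1       0   1   0
+1  -1  -1       1   0   0
+1  +1  +1       1   1   1
M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, Y. Bengio, "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1," Computer Research Repository (CoRR), Mar. 2016, http://arxiv.org/pdf/1602.02830v3.pdf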
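The correspondence between ±1 multiplication and EXNOR in the truth tables can be checked with a small Python sketch. The encoding (-1 as bit 0, +1 as bit 1), the helper names, the bias value, and the choice to map Y = 0 to +1 are assumptions made for illustration.

```python
def dot_pm1(x, w):
    """Reference: dot product of two vectors with elements in {-1, +1}."""
    return sum(xi * wi for xi, wi in zip(x, w))


def encode(values):
    """Pack a {-1, +1} vector into an integer, one bit per element."""
    bits = 0
    for k, v in enumerate(values):
        if v == 1:
            bits |= 1 << k
    return bits


def dot_xnor(x_bits, w_bits, n):
    """Same dot product on the bit encoding. EXNOR marks positions where
    x and w agree; each agreement contributes +1 and each disagreement -1,
    so the result is 2 * agree - n."""
    mask = (1 << n) - 1
    agree = bin(~(x_bits ^ w_bits) & mask).count("1")
    return 2 * agree - n


def sign(y):
    """Binary activation f_sgn; mapping y == 0 to +1 is an assumption."""
    return 1 if y >= 0 else -1


x = [+1, -1, -1, +1, +1]
w = [+1, +1, -1, -1, +1]
bias = 0  # w0 in the figure; hypothetical value
y = dot_xnor(encode(x), encode(w), len(x)) + bias
z = sign(y)
```

In hardware the multipliers become EXNOR gates and the sum becomes a population count, which is far cheaper than full-precision multiply-accumulate.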

Benefits of Binarization
[Figure: the binarized neuron and its ±1 / {0, 1} truth tables, as on the previous slide]
• EXNORs → a large number of multiply-accumulate circuits can be implemented
• Binary precision → the parameters can be held in on-chip memory

Near-memory Implementation through Binarization
[Figure: on-chip memory compared with off-chip memory]
• High bandwidth (left)
• Low power consumption (right)
J. Emer et al., "Tutorial on Hardware Architectures for Deep Neural Networks," MICRO-49, 2016.
J. Dean, "Numbers everyone should know"
Source: https://gist.github.com/2841832

Relationship between the Number of Neurons (Number of Feature Maps) and Recognition Accuracy in a Binarized CNN
Source: Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, K. Vissers, "FINN: A Framework for Fast, Scalable Binarized Neural Network Inference"

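Mixed-precision CNN
• An essential technique for object detectors
• Front stage: binary-precision CNN … saves area and gains speed
• Back stage: multi-bit-precision CNN … solves the regression problem (bounding box estimation)
[Figure: Input Image (Frame) → CONV+Pooling → Feature maps → CONV+Pooling → Class score and Bounding Box → Detection; the front part is labeled "binary" and the back part "half"]
H. Nakahara et al., "A Lightweight YOLOv2: A Binarized CNN with A Parallel Support Vector Regression for an FPGA," Int'l Symp. on FPGA (ISFPGA), 2018.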

Analysis of Image Classification by a CNN
[Figure: Input image ("5") → feature extraction layers (CONV+Pooling, feature maps, CONV+Pooling, ...) → classification layers, which separate the digit classes 0 to 9]

Hypothesis
• Might location information be preserved even after binarization?
[Figure: the same network, with the feature extraction layers feeding both classification (digit classes 0 to 9) and regression]

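Problem
• A low-precision NN cannot solve regression problems
• Example: sin(x) regression using a NN (3 layers)
[Figure: three panels comparing the NN output with sin(x): (a) Float32 for activations and weights, (b) Float32 activations with binary weights, (c) all binarized. The Float32 NN follows sin(x); the binarized NN does not, which leads to mislocalization]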

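Effect of Mixed Precision
• VGG11, training images: Pascal VOC2007 (car, person, others)
Network: Integer Conv2D → Binary Conv2D → Max Pooling → Binary Conv2D → Binary Conv2D → Binary Conv2D → Max Pooling → Binary Conv2D → Binary Conv2D → Binary Conv2D → Average Pooling → Fully Connect
Configuration                                          Result
All layers binarized                                   86.9%
Last layer Float32, remaining layers binarized         93.47%
Last 2 layers Float32, remaining layers binarized      97.29%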

Sparse Neural Network
• Effective for compression and speedup
S. Han et al., "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding," ICLR 2016.

Histogram of Fully Connected Layer Weights
[Figure: histogram of the weights; x-axis: weight value (-1 to 1), y-axis: frequency (0 to 200,000)]
[Figure annotation] ⇒ does not affect recognition accuracy → prune

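Retraining & Pruning
• After pruning, retrain and then prune again
• Finish when 99% recognition accuracy is maintained and pruning no longer makes progress
[Figure: loop alternating between pruning and retraining]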
Retraining the Model (Multi-bit Net)
2018/9/11
The NN model shrinks as training proceeds
Layer \ Pruning           #0          #1          #2          #3          #4          #5          #6
1                      4,096       2,337       2,043       1,569       1,458       1,385       1,374
2                      4,096         538         277         225         209         160         157
3                      4,096          19          17          14          14          14          14
4                         10          10          10          10          10          10          10
N-Total               12,298       2,904       2,347       1,818       1,691       1,569       1,555
N-Ratio (%)            100.0        23.6        19.1        14.8        13.8        12.8        12.6
E-Total           33,595,392   1,267,718     570,790     356,315     307,788     223,980     218,056
E-Ratio (%)            100.0        3.77        1.70        1.06        0.92        0.67        0.65
Tomoya Fujii, Shimpei Sato, Hiroki Nakahara, "A Threshold Neuron Pruning for a Binarized Deep Neural Network on an FPGA," IEICE Transactions 101-D(2): 376-386 (2018)
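The totals in this table can be reproduced from the per-layer counts. The sketch below assumes that N counts neurons and E counts edges between consecutive fully connected layers (connections from the network input into layer 1 are not counted); this reading is inferred from the numbers, which it matches, rather than stated on the slide.

```python
def totals(layer_sizes):
    """N-Total: sum of neurons; E-Total: edges between consecutive layers."""
    n_total = sum(layer_sizes)
    e_total = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    return n_total, e_total


before = [4096, 4096, 4096, 10]   # column #0
after = [1374, 157, 14, 10]       # column #6

n0, e0 = totals(before)           # 12,298 neurons and 33,595,392 edges
n6, e6 = totals(after)            # 1,555 neurons and 218,056 edges
n_ratio = round(100 * n6 / n0, 1)  # 12.6 (%)
e_ratio = round(100 * e6 / e0, 2)  # 0.65 (%)
```

Because the edge count is a product of adjacent layer widths, it falls much faster than the neuron count: pruning to 12.6% of the neurons leaves only 0.65% of the edges.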

Retraining the Model (Binary Net)
Layer \ Pruning           #0          #1          #2          #3          #4          #5          #6
1                      4,096       2,259       1,578       1,458       1,457       1,457       1,457
2                      4,096       3,853       3,826       3,754       3,716       3,716       3,534
3                      4,096       3,438       1,149       1,059         498         373         193
4                         10          10          10          10          10          10          10
N-Total               12,298       9,560       6,563       6,281       5,681       5,556       5,194
N-Ratio (%)            100.0        77.7        53.4        51.1        46.2        45.2        42.2
E-Total           33,595,392  21,984,921  10,444,992   9,459,408   7,269,760   6,804,010   5,833,030
E-Ratio (%)            100.0       65.44       31.09       28.16       21.64       20.25       17.36
Tomoya Fujii, Shimpei Sato, Hiroki Nakahara, "A Threshold Neuron Pruning for a Binarized Deep Neural Network on an FPGA," IEICE Transactions 101-D(2): 376-386 (2018)

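What about CNNs? → They can be pruned too
• Example: AlexNet (weight distributions of the 5 layers)
[Figure: network of five convolutional stages: conv & pool (k=11, s=4) → conv & pool (k=5, s=1) → conv (k=3, s=1) → conv (k=3, s=1) → conv & average pool (k=3, s=1), shown with the weight distribution of each layer]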

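Tri-state Weight Neural Network
[Figure: a sparse kernel applied to the input feature map; only the non-zero weights are multiplied and accumulated, y = X0,1 × W0 + X1,0 × W1 + X2,2 × W2, and the zero-weight positions are skipped]
[Equation garbled in extraction: it involves terms with subscript "hid" and a threshold, where the threshold is a constant]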

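Mixed-precision CNN with Ternary Weights
[Figure: a sparse kernel applied to the input feature map; only the non-zero weights are multiplied and accumulated, y = X0,1 × W0 + X1,0 × W1 + X2,2 × W2, and the zero-weight positions are skipped]
[Equation garbled in extraction: it involves terms with subscript "hid" and a threshold, where the threshold is a constant]
Rounding to ±1 gives ternary values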
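The thresholding rule on these two slides is not fully legible here, so the following Python sketch is only one plausible reading: weights whose magnitude does not exceed a constant threshold become 0, the others are kept (tri-state) or rounded to ±1 (ternary). The strict comparison, the function names, and the sample values are assumptions.

```python
def tri_state(hidden_weights, threshold):
    """Zero out weights with |w| <= threshold; keep the rest unchanged."""
    return [w if abs(w) > threshold else 0.0 for w in hidden_weights]


def ternarize(hidden_weights, threshold):
    """Tri-state weights with the surviving values rounded to +1 or -1."""
    return [(1 if w > 0 else -1) if abs(w) > threshold else 0
            for w in hidden_weights]


w_hid = [0.8, -0.05, 0.3, -0.6, 0.01]   # hypothetical hidden weights
print(tri_state(w_hid, 0.1))   # [0.8, 0.0, 0.3, -0.6, 0.0]
print(ternarize(w_hid, 0.1))   # [1, 0, 1, -1, 0]
```

The zero state is what combines pruning with low-bit quantization: a zero weight costs neither storage nor a multiply-accumulate.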

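2D Convolution for Tri-state Weight
[Figure: the input feature map is convolved with sparse 3×3 weight kernels and passed through f to produce the output feature map]
Example weight kernels:
 0  0  0         0  0 -w
 w  0 -w         0  0  0
 0  0  0         0  w  0
• Introduce an operation that skips "0" weights
→ The binarized CNN circuit can be reused
→ The speedup depends on the fraction skipped (= the fraction of "0"s)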

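Multiply-Accumulate with Indirect Memory Access
Weight memory:
idx   Non-zero weight   Relative address
1     w1                adr1
2     w2                adr2
:     :                 :
Feature map memory:
Address   Data
000…0     X1
000…1     X2
:         :
111…1     Xn
[Figure: the counter value Xtmp is the base address; the feature map memory is read at Xtmp + adr1, Xtmp + adr2, ...]
For a ternary CNN, an EXNOR gate is used (figure annotation)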
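A software analogue of this indirect-access scheme is sketched below: only non-zero weights are stored, each with an address relative to the window origin, and the counter value supplies the base address. The flat row-major memory layout, stride 1, no padding, and all names are assumptions for illustration, not the circuit itself.

```python
def compress_kernel(kernel, row_stride):
    """Keep only non-zero weights, each with its address relative to the
    window origin in a row-major flattened feature map."""
    table = []
    for i, row in enumerate(kernel):
        for j, w in enumerate(row):
            if w != 0:
                table.append((w, i * row_stride + j))
    return table


def sparse_conv2d(feature_map, kernel):
    """Stride-1, no-padding convolution that skips zero weights."""
    height, width = len(feature_map), len(feature_map[0])
    kh, kw = len(kernel), len(kernel[0])
    memory = [v for row in feature_map for v in row]  # flat feature-map memory
    table = compress_kernel(kernel, width)
    output = []
    for r in range(height - kh + 1):
        out_row = []
        for c in range(width - kw + 1):
            base = r * width + c           # counter value (window origin)
            acc = 0
            for w, rel in table:           # one MAC per non-zero weight only
                acc += w * memory[base + rel]
            out_row.append(acc)
        output.append(out_row)
    return output


x = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
w = [[0, 0, -1],
     [0, 0, 0],
     [0, 1, 0]]
y = sparse_conv2d(x, w)
```

For this kernel each output needs 2 multiply-accumulates instead of 9, which matches the point that the speedup depends on the fraction of zero weights.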

Uses of Indirect Memory Access
• Covers most 2D convolutions
→ A wide range of applications
→ Kernels of different sizes
Dilated Convolution → object detection
Deformable Convolution → shape recognition

Kernel Parallelization
• Exploits a property of convolution (weight sharing)

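Tri-state CNN Circuit Architecture
[Figure: block diagram with the following labeled parts]
• Counter
• Ternary weight and corr. address memory: w0, addr (w0); w1, addr (w1); ...
• Address, feature map memory (read a corr. row at a time), register
• Binary weight
• Seq. acc. unit: bitwise EXNOR, adder (+), register, adder (+) with bias, sign bit (binary activation)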

GUINNESS (Trial Version)
• Stands for GUI based Neural Network Synthesizer
• Training of binarized CNNs (with GPU support)
• C/C++ code generation → system synthesis with SDSoC
• No need to write RTL
• Trial version: https://github.com/HirokiNakahara
Featured by Avnet:
https://www.avnet.co.jp/ip/XILINX/GUINNESS.aspx

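GUINNESS DREI
• Tabs switch between project management, CNN configuration, training settings, and FPGA code generation
[Figure: GUI screenshot with the tabs indicated]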

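Integration with Other Frameworks
• Sharing of trained models through ONNX (Open Neural Network Exchange)
• Conversion to tri-state is possible through distillation
[Figure: a trained model is compressed into a tri-state CNN by GUINNESS DREI]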

User Case Study
• SYNKOM Edge AI Solution: https://synkom.co.jp/edge-ai/

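Summary
• A variety of deep-learning-based algorithms have become feasible
• CNN optimization techniques
• Mixed precision
• Pruning
• GUINNESS DREI
• Training of tri-state CNNs and FPGA code generation
• Toward support for diverse frameworks through distillation and ONNX
• Implementation of the YOLOv2 object detection algorithm
• Faster than a GPU, with lower power consumption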
