2. DNN Trends and CV Trends (1/34)
• Trends before the DNN era
– Perceptron, MLP, Neocognitron, BackProp, CNN
– Just before the DNN boom, image recognition relied on local features
(Timeline figure, 1960–2020: Perceptron (1960s) → Neocognitron (1980) → Backprop (1986) → CNN (1998) → Deep Learning (2010s); HOG/SIFT and BoF/SVM dominate the 2000s; the 1st, 2nd, and 3rd AI booms span the timeline.)
F. Rosenblatt, "Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms," 1961.
D. Rumelhart et al., "Learning representations by back-propagating errors," Nature, 1986. (Backprop)
K. Fukushima, "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position," Biological Cybernetics, 1980.
Y. LeCun et al., "Gradient-based learning applied to document recognition," Proceedings of the IEEE, 1998.
3. DNN Trends and CV Trends (2/34)
• Perceptron, MLP, Neocognitron/ConvNet
– Perceptron
• Composed of a weighted linear sum of the inputs (connection weights) and an activation function
– MLP: Multi-Layer Perceptron
• Stacks multiple Perceptron layers
– Neocognitron/ConvNet
• Introduces the concept of convolution; especially effective for image processing, where neighboring pixels are strongly correlated
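The weighted-sum-plus-activation structure above can be sketched in a few lines of Python. This is a minimal illustration (not any particular historical implementation); the AND-gate weights are chosen purely for the example:

```python
import numpy as np

def perceptron(x, w, b):
    """Single Perceptron unit: a weighted linear sum of the inputs
    followed by a step activation function."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Example: with suitable weights the unit computes logical AND
w, b = np.array([1.0, 1.0]), -1.5
print(perceptron(np.array([1.0, 1.0]), w, b))  # fires (1) only when both inputs are 1
```

Stacking such units into layers, with a differentiable activation (e.g. sigmoid) in place of the step, gives the MLP that backpropagation can train.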
(Figures: Perceptron / Neocognitron / Convolutional Neural Net)
K. Fukushima, "Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position," Biological Cybernetics 36, pp. 193-202, 1980. https://www.rctn.org/bruno/public/papers/Fukushima1980.pdf
Y. LeCun et al., "Gradient-Based Learning Applied to Document Recognition," Proceedings of the IEEE, 1998. http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf
9. DNN Trends and CV Trends (8/34)
• More complex and automated architectures (2016–)
– Post-ResNet architectures
• ResNeXt, DenseNet, SENet, ...
– Automation and efficiency (Neural Architecture Search, EfficientNet)
• (P)NASNet, EfficientNet, ...
SENet (SE-block): J. Hu et al., "Squeeze-and-Excitation Networks," TPAMI 2019. https://arxiv.org/abs/1709.01507
ResNeXt: S. Xie et al., "Aggregated Residual Transformations for Deep Neural Networks," CVPR 2017. https://arxiv.org/abs/1611.05431
DenseNet: G. Huang et al., "Densely Connected Convolutional Networks," CVPR 2017. https://arxiv.org/abs/1608.06993
NASNet: B. Zoph et al., "Learning Transferable Architectures for Scalable Image Recognition," CVPR 2018. https://arxiv.org/abs/1707.07012
PNASNet: C. Liu et al., "Progressive Neural Architecture Search," CVPR 2018. https://arxiv.org/abs/1712.00559
EfficientNet: M. Tan et al., "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks," ICML 2019. https://arxiv.org/pdf/1905.11946.pdf
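The SE-block's squeeze → excitation → scale pipeline can be sketched with NumPy. This is a simplified sketch assuming a single (C, H, W) feature map and plain FC weight matrices `w1`, `w2` (hypothetical names) with a channel-reduction bottleneck; the real SENet operates on batches inside a CNN with learned weights:

```python
import numpy as np

def se_block(x, w1, w2):
    """Squeeze-and-Excitation on one feature map x of shape (C, H, W).
    w1: (C//r, C) reduction weights, w2: (C, C//r) expansion weights."""
    # Squeeze: global average pooling over the spatial dimensions -> (C,)
    s = x.mean(axis=(1, 2))
    # Excitation: bottleneck FC -> ReLU -> FC -> sigmoid gate per channel
    z = np.maximum(w1 @ s, 0.0)
    gate = 1.0 / (1.0 + np.exp(-(w2 @ z)))
    # Scale: reweight each channel of the original map
    return x * gate[:, None, None]

# Example: 8 channels with reduction ratio r = 4
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5, 5))
y = se_block(x, rng.normal(size=(2, 8)), rng.normal(size=(8, 2)))
print(y.shape)  # (8, 5, 5): same shape, channels rescaled by the gates
```

Because the gates lie in (0, 1), the block can only attenuate channels, letting the network emphasize informative ones relative to the rest.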
12. DNN Trends and CV Trends (10'/34)
Object detection timeline (2018–2020)
– SNIP / SNIPER: enlarge regions where many object candidates lie
• B. Singh et al., "An Analysis of Scale Invariance in Object Detection," CVPR 2018. https://arxiv.org/abs/1711.08189
• B. Singh et al., "SNIPER: Efficient Multi-Scale Training," NeurIPS 2018. https://papers.nips.cc/paper/8143-sniper-efficient-multi-scale-training.pdf
– M2Det: uses multi-level, multi-resolution feature maps
• Q. Zhao et al., "M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network," AAAI 2019. https://arxiv.org/abs/1811.04533
– CornerNet: regresses two points (top-left, bottom-right)
• H. Law et al., "CornerNet: Detecting Objects as Paired Keypoints," ECCV 2018. https://arxiv.org/abs/1808.01244
– CenterNet: regresses the xy size from the center
• X. Zhou et al., "Objects as Points," arXiv 2019. https://arxiv.org/abs/1904.07850
– CenterNet: regresses the center + two points and the xy size
• K. Duan et al., "CenterNet: Keypoint Triplets for Object Detection," ICCV 2019. https://arxiv.org/abs/1904.08189
Detection accuracy improved from 41.7 (Mask R-CNN) to 47.0 (CenterNet)!
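The "regress xy size from the center" idea of Objects as Points can be sketched as a decoding step. This is a simplified sketch over hypothetical arrays (a center-score heatmap and a width/height map); the actual CenterNet additionally applies local-maximum suppression and sub-pixel center offsets:

```python
import numpy as np

def decode_centers(heatmap, wh, score_thresh=0.5):
    """Turn center scores into boxes.
    heatmap: (H, W) per-pixel center confidence.
    wh: (2, H, W) regressed width/height at each pixel."""
    boxes = []
    for y, x in zip(*np.where(heatmap > score_thresh)):
        w, h = wh[0, y, x], wh[1, y, x]
        # A w x h box centered on the detected peak (x, y), plus its score
        boxes.append((float(x - w / 2), float(y - h / 2),
                      float(x + w / 2), float(y + h / 2),
                      float(heatmap[y, x])))
    return boxes

# Example: one confident center at (x=3, y=4) with a 4x2 box
hm = np.zeros((8, 8)); hm[4, 3] = 0.9
wh = np.zeros((2, 8, 8)); wh[:, 4, 3] = [4.0, 2.0]
print(decode_centers(hm, wh))  # [(1.0, 3.0, 5.0, 5.0, 0.9)]
```

Anchor-free decoding like this avoids the anchor-box enumeration and NMS tuning of earlier detectors.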
14. DNN Trends and CV Trends (12/34)
– Fields fusing CV and NLP (natural language processing)
• Image Captioning
• Visual Question Answering (VQA)
• Visual Dialog
[Image Captioning] Takes an image as input and outputs a sentence.
O. Vinyals et al., "Show and Tell: A Neural Image Caption Generator," CVPR 2015. https://arxiv.org/pdf/1411.4555.pdf
[VQA] Takes an image (Visual) and a question (Q) as input and returns an answer (A). https://visualqa.org/
[Visual Dialog] Answers through repeated rounds of dialog about an image. https://visualdialog.org/
15. DNN Trends and CV Trends (13/34)
• Video recognition models (N-dimensional convolution)
– 2D: Two-Stream ConvNets (optical-flow images prepared alongside RGB)
– 2D + recurrent: CNN+LSTM (recently used almost only as a benchmark baseline)
– 3D: 3D CNN (the leading choice when enough data is available)
– (2+1)D: separable convolution (an option for small datasets)
(Figure: 3D convolution (C3D, 3D ResNet, I3D): a 3D filter produces 3D convolution maps, which feed the classifier.)
D. Tran et al., "A Closer Look at Spatiotemporal Convolutions for Action Recognition," CVPR 2018.
16. DNN Trends and CV Trends (13'/34)
• Video recognition tasks
– DenseCaptioning: generates descriptive sentences from video
• R. Krishna et al., "Dense-Captioning Events in Videos," ICCV 2017. https://arxiv.org/abs/1705.00754
– SSAD: estimates temporal intervals (the start and end of an action)
• T. Lin et al., "Single Shot Temporal Action Detection," ACM MM 2017. https://arxiv.org/abs/1710.06236
– Local & Global Diffusion (LGD): extracts local and global features simultaneously (high-accuracy video recognition)
• Z. Qiu et al., "Learning Spatio-Temporal Representation with Local and Global Diffusion," CVPR 2019. https://arxiv.org/abs/1906.05571
34. DNN Trends and CV Trends (30/34)
• Applications to robotics @CVPR 2019
– Many works use point clouds and 6D object detection
– Manipulation: DenseFusion performs 6D detection and grasping (left figure)
– Autonomous driving: point-cloud tracking + shape completion (right figure)
C. Wang et al., "DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion," CVPR 2019. http://openaccess.thecvf.com/content_CVPR_2019/papers/Wang_DenseFusion_6D_Object_Pose_Estimation_by_Iterative_Dense_Fusion_CVPR_2019_paper.pdf
S. Giancola et al., "Leveraging Shape Completion for 3D Siamese Tracking," CVPR 2019. http://openaccess.thecvf.com/content_CVPR_2019/papers/Giancola_Leveraging_Shape_Completion_for_3D_Siamese_Tracking_CVPR_2019_paper.pdf
(Figure: point-cloud tracking + shape completion; the example below reconstructs a vehicle.)
35. DNN Trends and CV Trends (31/34)
• Applications to fashion @CVPR 2019
– Fashion annotation has advanced
– DeepFashion2 (left figure)
• An enhanced version of DeepFashion with more detailed labels
– FCDBv2 (right figure)
• An enhanced version of FCDB, data-cleansed with a binary classifier
Y. Ge et al., "DeepFashion2: A Versatile Benchmark for Detection, Pose Estimation, Segmentation and Re-Identification of Clothing Images," CVPR 2019. http://openaccess.thecvf.com/content_CVPR_2019/papers/Ge_DeepFashion2_A_Versatile_Benchmark_for_Detection_Pose_Estimation_Segmentation_and_CVPR_2019_paper.pdf
H. Kataoka, K. Abe, M. Minoguchi, A. Nakamura, Y. Satoh, "Ten-million-order Human Database for World-wide Fashion Culture Analysis," CVPR 2019 Workshop on FFSS-USAD.
36. DNN Trends and CV Trends (32/34)
• Applications of video recognition
– Automatic video labeling
– Monitoring, cooking-activity analysis, home robotics, ...
• Video datasets are in fierce competition (figure below)
EPIC-KITCHENS: D. Damen et al., "Scaling Egocentric Vision: The EPIC-KITCHENS Dataset," ECCV 2018. https://epic-kitchens.github.io/2018
Kinetics: W. Kay et al., "The Kinetics Human Action Video Dataset," arXiv:1705.06950, 2017. https://deepmind.com/research/open-source/open-source-datasets/kinetics/
AVA: C. Gu et al., "AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions," CVPR 2018. https://research.google.com/ava/download.html
Moments in Time: M. Monfort et al., "Moments in Time Dataset: one million videos for event understanding," arXiv:1801.03150, 2018. http://moments.csail.mit.edu/
HACS: H. Zhao et al., "HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization," arXiv:1712.09374, 2017. http://hacs.csail.mit.edu/
Something-Something v2 dataset: https://20bn.com/datasets/something-something