Trends in Convolutional Neural Networks @ WBAFL Casual Talk #2

Whole Brain Architecture Future Leaders (WBAFL) Casual Talk #2 (2016.2.7)
SHIMADA Daiki
Master's program, Graduate School of Science and Engineering, Hosei University

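About me
SHIMADA Daiki
@sheema_sheema (Twitter)
•  M1 (first-year master's student), Graduate School of Science and Engineering, Hosei University
•  Estimating the attitude of students attending lectures through image analysis
•  Proposing deep-learning-related methods
•  Vice-representative of the Whole Brain Architecture Future Leaders
•  Running the organization as a whole (we are actively recruiting organizers!)
•  Speaker at the 2nd study session in 2014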

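What I will talk about today
Research trends in Convolutional Neural Networks (CNNs)
•  CNN: the standard deep learning method in the image domain
•  A rapid-fire run through 26 CNN-related papers!!
•  Goal: get a sense of what is possible now and which directions research is taking
•  For the details, please see the referenced papers
•  Some of the work introduced here does not use CNNs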

Contents
1.  Evolution of CNN architectures / optimization methods
2.  Feature analysis / visualization
3.  Object detection and segmentation
4.  Image generation and super-resolution
5.  Toward 3D tasks
6.  Taking on video
7.  Toward more "human-like" machine perception
8.  Multimodal applications
9.  CNNs and reinforcement learning
10.  What's Next? Post-ImageNet ...

1. Evolution of CNN architectures / optimization methods

Evolution of CNN architectures: discovery of convolutional networks
Neocognitron (1980) [1]
•  Introduced the idea of local connectivity
•  Alternates two kinds of processing: feature extraction and information aggregation
LeNet (1998) [2]
•  Settled into the form of convolution plus pooling (subsampling)
•  Trained with backpropagation (BP)
[1] K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics 36, 1980.
[2] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 1998.
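The convolution-then-pooling pattern described for LeNet can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: a single channel, no padding, stride 1, and a hypothetical 2x2 filter. It is not the LeNet implementation, and the "convolution" here is the cross-correlation that deep learning libraries usually compute.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fit are dropped."""
    h = feature_map.shape[0] // size
    w = feature_map.shape[1] // size
    trimmed = feature_map[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "image"
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])      # hypothetical 2x2 filter
features = conv2d_valid(image, kernel)             # feature extraction, shape (5, 5)
pooled = max_pool2d(features)                      # information aggregation, shape (2, 2)
```

The same small filter is reused at every position (local connectivity with shared weights), and pooling then summarizes each neighborhood.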

Evolution of CNN architectures: pooling, activation functions, regularization
Average/max pooling, local contrast normalization (2009) [3]
•  Brought in ideas from non-CNN image recognition
ReLU (2011) [4]
•  Made the activation function simpler
Dropout (2012) [5]
•  Introduced a regularization technique to prevent overfitting
[3] K. Jarrett, K. Kavukcuoglu, M. Ranzato, Y. LeCun. What is the best multi-stage architecture for object recognition? ICCV, 2009.
[4] X. Glorot, A. Bordes, Y. Bengio. Deep Sparse Rectifier Neural Networks. AISTATS, 2011.
[5] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv: 1207.0580, 2012.
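As a rough illustration of ReLU and Dropout, the sketch below uses NumPy. The dropout shown is the "inverted" variant (survivors are rescaled during training), which is a common formulation and an assumption here; [5] instead rescales at test time. The function names and the drop probability are hypothetical.

```python
import numpy as np

def relu(x):
    """ReLU: pass positive values through, clamp negatives to zero."""
    return np.maximum(x, 0.0)

def dropout(x, drop_prob, rng, training=True):
    """Inverted dropout: zero each unit with probability `drop_prob`
    and rescale the survivors so the expected activation is unchanged."""
    if not training or drop_prob == 0.0:
        return x
    keep_mask = rng.random(x.shape) >= drop_prob
    return x * keep_mask / (1.0 - drop_prob)

rng = np.random.default_rng(0)
activations = relu(np.array([-1.0, 0.0, 2.0]))
dropped = dropout(np.ones(1000), drop_prob=0.5, rng=rng)
```

At test time `training=False` is passed and the input is returned unchanged.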

Evolution of CNN architectures: deeper and more complex convolutions
AlexNet (2012) [6]
•  Success in large-scale generic object recognition
•  Brought together data augmentation and the component techniques developed so far (8 layers)
Network in Network, global average pooling (2013) [7]
•  Introduced nonlinearity into the convolutional layers
•  Proposed doing away with the fully connected part (global average pooling)
[6] A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS, 2012.
[7] M. Lin, Q. Chen, S. Yan. Network In Network. arXiv: 1312.4400, 2013.
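Global average pooling can be sketched as follows: each feature map is reduced to one number by averaging over its spatial positions, so no fully connected layer is needed to obtain a fixed-length vector. The (channels, height, width) layout is an assumption made for this example.

```python
import numpy as np

def global_average_pooling(feature_maps):
    """Reduce (channels, height, width) to (channels,) by spatial averaging."""
    return feature_maps.mean(axis=(1, 2))

# Two toy 2x2 feature maps: values 0..3 and 4..7.
feature_maps = np.arange(8, dtype=float).reshape(2, 2, 2)
pooled_vector = global_average_pooling(feature_maps)
```

With one feature map per class, the pooled vector can be fed directly to a softmax.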

Evolution of CNN architectures: deeper and more complex convolutions
VGG-Net (2014) [8]
•  Up to 19 layers for generic object recognition
•  Stacks small (3x3) convolutions in many stages
[8] K. Simonyan, A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv: 1409.1556, 2014.
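A short piece of arithmetic shows why stacking 3x3 convolutions is attractive: two stacked 3x3 layers see the same 5x5 region as one 5x5 layer, with fewer weights. The helper names and the channel count of 64 are hypothetical, and biases are ignored.

```python
def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return num_layers * (kernel_size - 1) + 1

def conv_weight_count(kernel_size, channels, num_layers=1):
    """Number of weights when every layer maps `channels` -> `channels`."""
    return num_layers * kernel_size * kernel_size * channels * channels

channels = 64  # hypothetical channel count
rf_two_3x3 = stacked_receptive_field(3, 2)                     # same as one 5x5
weights_two_3x3 = conv_weight_count(3, channels, num_layers=2)
weights_one_5x5 = conv_weight_count(5, channels)
print(rf_two_3x3, weights_two_3x3, weights_one_5x5)
```

The stack also inserts an extra nonlinearity between the two layers, which a single large filter does not have.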
l  22層のアーキテクチャ
l  auxiliary classiﬁers , Inception module
GoogLeNet / Inception (2014 ~∼ 2015) [9, 10]
[9] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A.
Rabinovich. Going deeper with convolutions. arXiv: 1409.4842, 2014.
[10] C. Szegedy, V. Vanhoucke, S. Ioﬀe, J. Shlens, Z. Wojna. Rethinking the Inception Architecture
for Computer Vision. arXiv: 1512.00567, 2015.

Evolution of CNN architectures: diversification of architectures
SPP-Net (2014) [11]
•  Accepts input images of various sizes
•  Avoids resizing at the CNN input
All Convolutional Net, guided BP (2014) [12]
•  Replaces pooling with stride-2 convolutions
•  Visualizes features of very high layers with guided backpropagation
[11] K. He, X. Zhang, S. Ren, J. Sun. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. arXiv: 1406.4729, 2014.
[12] J. T. Springenberg, A. Dosovitskiy, T. Brox, M. Riedmiller. Striving for Simplicity: The All Convolutional Net. arXiv: 1412.6806, 2014.

Evolution of CNN architectures: diversification of training methods
Exemplar CNN (2014) [13]
•  Unsupervised representation learning that exploits data augmentation
Triplet Network (2014) [14]
•  Trains CNN features so that, in Euclidean space, features of the same class move closer together and features of different classes move farther apart
[13] A. Dosovitskiy, P. Fischer, J. T. Springenberg, M. Riedmiller, T. Brox. Discriminative Unsupervised Feature Learning with Exemplar Convolutional Neural Networks. arXiv: 1406.6909, 2014.
[14] E. Hoffer, N. Ailon. Deep metric learning using Triplet network. arXiv: 1412.6622, 2014.
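The "closer if same class, farther if different class" objective can be sketched with a margin-based triplet loss. Note the assumption: this hinge form is a widely used variant and is not necessarily the exact loss of [14]. The margin value and the toy embeddings are hypothetical.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther from the anchor
    than the positive (Euclidean distance); positive otherwise."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

anchor = np.array([0.0, 0.0])
same_class = np.array([0.0, 1.0])    # distance 1 from the anchor
other_class = np.array([3.0, 4.0])   # distance 5 from the anchor
good_loss = triplet_margin_loss(anchor, same_class, other_class)
bad_loss = triplet_margin_loss(anchor, other_class, same_class)
```

In training, the three inputs would be CNN embeddings of an anchor image, a same-class image, and a different-class image.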

Evolution of CNN architectures: toward very deep architectures
Batch Normalization (2015) [15]
•  A normalization step with learnable parameters
•  An essential technique for training complex architectures from scratch
[15] S. Ioffe, C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv: 1502.03167, 2015.
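The training-time forward pass of batch normalization can be sketched as below, assuming a (batch, features) input. The running statistics used at inference time and the backward pass are omitted, and `eps` is an assumed small constant.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then apply the
    learnable scale `gamma` and shift `beta`."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.array([[1.0, 2.0], [3.0, 6.0]])   # toy mini-batch: 2 samples, 2 features
normalized = batch_norm_train(x, gamma=np.ones(2), beta=np.zeros(2))
```

`gamma` and `beta` are the "parameters" in "normalization with parameters": they let the network undo the normalization if that is useful.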
l  152層からなる超多層アーキテクチャ
l  途中の特徴マップを何層か先にバイパスしてやる
Residual Network; ResNet (2015) [16]
[16] K. He, X. Zhang, S. Ren, J. Sun. Deep Residual Learning for Image Recognition. arXiv:
1512.03385, 2015.
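The bypass idea can be sketched as y = relu(F(x) + x). To keep the example short, F is built from two dense layers rather than the convolutions and batch normalization used in [16]; that substitution and the weight names are assumptions.

```python
import numpy as np

def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F(x) = w2 @ relu(w1 @ x)."""
    hidden = np.maximum(w1 @ x, 0.0)
    return np.maximum(w2 @ hidden + x, 0.0)

x = np.array([1.0, -2.0, 3.0])
zero_weights = np.zeros((3, 3))
# With a zero residual branch the block just passes relu(x) through.
passthrough = residual_block(x, zero_weights, zero_weights)
```

Because the shortcut carries the input forward unchanged, the layers only need to learn a correction on top of it.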

Learning rate adaptation methods for stochastic gradient descent
AdaGrad [17], RMSProp [18], AdaDelta [19], Adam [20]
•  No single method can be called the best across the board (everything other than AdaGrad does relatively well...?)
•  The appropriate hyperparameters differ depending on the dataset and the problem
[17] J. Duchi, E. Hazan, Y. Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research 12, 2011.
[18] T. Tieleman, G. Hinton. Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning 4, 2012.
[19] M. D. Zeiler. ADADELTA: An Adaptive Learning Rate Method. arXiv: 1212.5701, 2012.
[20] D. Kingma, J. Ba. Adam: A Method for Stochastic Optimization. arXiv: 1412.6980, 2014.
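To make the differences concrete, here is a sketch of single update steps for AdaGrad, RMSProp and Adam on a scalar parameter (AdaDelta is left out for brevity). The default hyperparameters and the `state` dictionary are assumptions for this example, not values recommended by the talk.

```python
import numpy as np

def adagrad_step(w, grad, state, lr=0.01, eps=1e-8):
    """AdaGrad: divide by the root of the sum of all past squared gradients."""
    state["sum_sq"] = state.get("sum_sq", 0.0) + grad ** 2
    return w - lr * grad / (np.sqrt(state["sum_sq"]) + eps)

def rmsprop_step(w, grad, state, lr=0.001, decay=0.9, eps=1e-8):
    """RMSProp: like AdaGrad, but with a decaying average of squared gradients."""
    state["avg_sq"] = decay * state.get("avg_sq", 0.0) + (1 - decay) * grad ** 2
    return w - lr * grad / (np.sqrt(state["avg_sq"]) + eps)

def adam_step(w, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: bias-corrected decaying averages of the gradient and its square."""
    state["t"] = state.get("t", 0) + 1
    state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * grad
    state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

# One step on f(w) = w**2 from w = 1.0, where the gradient is 2.0.
w_adagrad = adagrad_step(1.0, 2.0, {})
w_rmsprop = rmsprop_step(1.0, 2.0, {})
w_adam = adam_step(1.0, 2.0, {})
```

AdaGrad's accumulated sum only grows, so its effective step size keeps shrinking; the decaying averages in RMSProp and Adam avoid that.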

2. Feature analysis / visualization

CNN feature analysis / visualization
Deconvnet for visualizing
•  Projects feature maps back to input space using deconvolution and unpooling
[21] M. D. Zeiler, R. Fergus. Visualizing and understanding convolutional networks. arXiv: 1311.2901, 2013.

l  正則化⼿手法の導⼊入でより綺麗麗に再構成できるように
⼊入⼒力力画像の最適化
15
[22] A. Mahendran, A. Vedaldi. Understanding Deep Image Representations by Inverting Them.
arXiv: 1412.0035, 2014.

l  ⼈人間からすると違いは分からないが，CNNは間違える
l  そういったものはAdversarial exampleと呼ばれる
CNNを “だます”
16
[24] I. J. Goodfellow, J. Shlens, C. Szegedy. Explaining and Harnessing Adversarial Examples.
arXiv: 1412.6572, 2014.
ostrich !! ostrich !!
[23] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, R. Fergus. Intriguing
properties of neural networks. arXiv: 1312.6199, 2013.
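One way to construct an adversarial example is the fast gradient sign method of [24]: move every input dimension by a small step in the direction that increases the loss. The sketch applies it to a toy linear classifier with a logistic loss, whose gradient is available in closed form. The weights, input and epsilon are hypothetical values chosen so that the prediction flips.

```python
import numpy as np

def fgsm_perturb(x, grad_wrt_x, epsilon):
    """Shift each input dimension by `epsilon` in the loss-increasing direction."""
    return x + epsilon * np.sign(grad_wrt_x)

# Toy linear classifier: score = w . x, predicted label = sign(score).
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.2, -0.1, 0.4])
y = 1.0                                      # true label in {-1, +1}

score = w @ x                                # positive: correctly classified
# Gradient of the logistic loss log(1 + exp(-y * w.x)) with respect to x.
grad = -y * w / (1.0 + np.exp(y * score))
x_adv = fgsm_perturb(x, grad, epsilon=0.25)
adv_score = w @ x_adv                        # negative: now misclassified
```

No coordinate moves by more than `epsilon`, yet the sign of the score changes. For a CNN the gradient would come from backpropagation to the input pixels.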

l  ⾼高い確信度度で分類する意味不不明画像も作れる
CNNを “だます”
17
[25] A. Nguyen, J. Yosinski, J. Clune. Deep Neural Networks are Easily Fooled: High Conﬁdence
Predictions for Unrecognizable Images. arXiv: 1412.1897, 2014.

3. Object detection and segmentation

Object detection
R-CNN (2013)
•  Candidate object regions obtained with conventional CV techniques are fed to a CNN
[26] R. Girshick, J. Donahue, T. Darrell, J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv: 1311.2524, 2013.

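Object detection
Fast R-CNN (2015/4)
•  Merges the multi-stage training into a single stage (classification and bounding-box regression are solved jointly)
•  Crops ROIs on the CNN feature map (ROI pooling)
•  The candidate object regions themselves still have to be extracted with CV techniques
[27] R. Girshick. Fast R-CNN. arXiv: 1504.08083, 2015.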
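ROI pooling can be sketched as: crop the region from the feature map, split it into a fixed grid, and max-pool each cell, so regions of any size yield outputs of the same shape. The version below handles one channel and one non-empty ROI given in feature-map coordinates; the rounding of cell borders is a simplifying assumption and may differ from the original implementation.

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool the ROI (row0, col0, row1, col1; end-exclusive) into a fixed grid."""
    r0, c0, r1, c1 = roi
    region = feature_map[r0:r1, c0:c1]
    out_h, out_w = output_size
    row_edges = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    col_edges = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    out = np.empty(output_size)
    for i in range(out_h):
        for j in range(out_w):
            # Every cell covers at least one row and one column.
            row_end = max(row_edges[i + 1], row_edges[i] + 1)
            col_end = max(col_edges[j + 1], col_edges[j] + 1)
            out[i, j] = region[row_edges[i]:row_end, col_edges[j]:col_end].max()
    return out

feature_map = np.arange(36, dtype=float).reshape(6, 6)
pooled_roi = roi_max_pool(feature_map, roi=(1, 1, 5, 4))   # a 4x3 region -> 2x2
```

Because the crop happens on the shared feature map, the convolutional layers run only once per image rather than once per region.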

Object detection
Faster R-CNN (2015/6)
•  The extraction of candidate object regions is also done by a CNN (Region Proposal Network)
[28] S. Ren, K. He, R. Girshick, J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv: 1506.01497, 2015.

Segmentation
Fully Convolutional Networks (FCN)
•  Replaces the fully connected part of the CNN with convolutions
•  Upsamples with deconvolution
[29] J. Long, E. Shelhamer, T. Darrell. Fully Convolutional Networks for Semantic Segmentation. arXiv: 1411.4038, 2014.

l  Poolingで選択された場所を覚えておいて，アップサンプル,
⽐比較的⾼高速にセグメンテーション出来る（らしい）
SegNet
23
[30] V. Badrinarayanan, A.
Handa, R. Cipolla. SegNet:
A Deep Convolutional
Encoder-‐‑‒Decoder
Architecture for Robust
Semantic Pixel-‐‑‒Wise
Labelling. arXiv:
1505.07293, 2015.
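The "remember where the maximum was" idea can be sketched with a pooling function that also returns the argmax position of each window, plus an unpooling function that writes the pooled values back to those positions. A single channel and non-overlapping windows are assumed; the function names are hypothetical.

```python
import numpy as np

def max_pool_with_indices(x, size=2):
    """Return pooled maxima and, per window, the flat index of the maximum."""
    h, w = x.shape[0] // size, x.shape[1] // size
    windows = (x[:h * size, :w * size]
               .reshape(h, size, w, size)
               .transpose(0, 2, 1, 3)
               .reshape(h, w, size * size))
    return windows.max(axis=2), windows.argmax(axis=2)

def max_unpool(pooled, indices, size=2):
    """Place each pooled value at its remembered position; zeros elsewhere."""
    h, w = pooled.shape
    windows = np.zeros((h, w, size * size), dtype=pooled.dtype)
    rows, cols = np.indices((h, w))
    windows[rows, cols, indices] = pooled
    return (windows.reshape(h, w, size, size)
            .transpose(0, 2, 1, 3)
            .reshape(h * size, w * size))

x = np.array([[1.0, 2.0, 0.0, 0.0],
              [3.0, 4.0, 0.0, 5.0],
              [0.0, 0.0, 9.0, 0.0],
              [7.0, 0.0, 0.0, 0.0]])
pooled, indices = max_pool_with_indices(x)
upsampled = max_unpool(pooled, indices)
```

The upsampled map is sparse; in an encoder-decoder network, subsequent convolutions would fill it in.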

l  セグメンテーション⼿手法に使われていたCRFとの合わせ技
l  CRFにおける平均場近似の処理理をRNNと解釈(CRF-‐‑‒RNN)，
CNNとCRFを同時に学習
CNN + 条件付き確率率率場(CRF)
24
[31] S. Zheng, S. Jayasumana, B. R. Paredes, V. Vineet, Z. Su, D. Du, C. Huang, P. H. S.
Torr. Conditional Random Fields as Recurrent Neural Networks. arXiv: 1502.03240, 2015.

l  セグメンテーション / 物体検出のための領領域候補抽出
l  中央の物体の”セグメント”と”物体の有無”をそれぞれ学習
Deep Mask
25
[32] P. O. Pinheiro, R. Collobert, P. Dollar. Learning to Segment Object Candidates. arXiv: 1506.06204,
2015.

Face recognition
DeepFace
•  Aligns the face region with a 3D model, then identifies it with a CNN
•  Face recognition performance nearly on par with humans
[33] Y. Taigman, M. Yang, M. A. Ranzato, L. Wolf. DeepFace: Closing the Gap to Human-Level Performance in Face Verification. CVPR, 2014.

An attention-like idea
Spatial Transformer Networks
•  Learns the transformation operations to apply to objects
[34] M. Jaderberg, K. Simonyan, A. Zisserman, K. Kavukcuoglu. Spatial Transformer Networks. arXiv: 1506.02025, 2015.

4. Image generation and super-resolution

Image generation
Deep Dream
•  Amplifies what the CNN "vaguely sees"
[35] Inceptionism: Going Deeper into Neural Networks. http://googleresearch.blogspot.ch/2015/06/inceptionism-going-deeper-into-neural.html
[36] K. Simonyan, A. Vedaldi, A. Zisserman. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. arXiv: 1312.6034, 2013.

Image generation
Morphing
•  Trained on 3D chair models so that images can be generated from the object type and viewpoint information
[37] A. Dosovitskiy, J. T. Springenberg, M. Tatarchenko, T. Brox. Learning to Generate Chairs, Tables and Cars with Convolutional Networks. arXiv: 1411.5928, 2014.

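Image generation
Style transfer
•  Optimizes the input using the CNN representation of the original image and a style matrix
(Figure labels: generated from layer-1 features; generated from layer-5 features)
[38] L. A. Gatys, A. S. Ecker, M. Bethge. A Neural Algorithm of Artistic Style. arXiv: 1508.06576, 2015.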
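The style matrix in [38] is a Gram matrix: the inner products between the flattened feature maps of one layer. A rough NumPy sketch follows; the (channels, height, width) layout is an assumption, and the normalization constants and the weighting across layers used in the paper are left out.

```python
import numpy as np

def gram_matrix(features):
    """Channel-by-channel inner products of flattened feature maps."""
    channels = features.shape[0]
    flat = features.reshape(channels, -1)
    return flat @ flat.T

def style_loss(generated_features, style_features):
    """Unnormalized squared difference between the two Gram matrices."""
    diff = gram_matrix(generated_features) - gram_matrix(style_features)
    return float(np.sum(diff ** 2))

style = np.array([[[1.0, 0.0], [0.0, 1.0]],
                  [[1.0, 1.0], [1.0, 1.0]]])   # 2 channels, 2x2 each
gram = gram_matrix(style)
# Flipping the spatial layout leaves the Gram matrix unchanged.
flipped = style[:, ::-1, ::-1]
loss_flipped = style_loss(flipped, style)
```

Since spatial positions are summed out, the Gram matrix records which features co-occur but not where, which is why it captures style rather than content.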

Image generation
Style transfer
•  Style transfer with a model that combines a CNN and an MRF
[39] C. Li, M. Wand. Combining Markov Random Fields and Convolutional Neural Networks for Image Synthesis. arXiv: 1601.04589, 2016.

Image generation
Image generation with DCGAN and vector arithmetic
•  Produces high-quality images with adversarial networks
[40] A. Radford, L. Metz, S. Chintala. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv: 1511.06434, 2015.

Super-resolution
Super-Resolution CNN (SRCNN)
•  Software named waifu2x [42] has also appeared
[41] C. Dong, C. C. Loy, K. He, X. Tang. Image Super-Resolution Using Deep Convolutional Networks. arXiv: 1501.00092, 2015.
[42] waifu2x. http://waifu2x.udp.jp/index.ja.html

Super-resolution
Deblurring (motion blur removal)
•  A CNN estimates the "motion kernel" within each patch, and an MRF then estimates the motion blur over the whole image
[43] J. Sun, W. Cao, Z. Xu, J. Ponce. Learning a Convolutional Neural Network for Non-uniform Motion Blur Removal. arXiv: 1503.00593, 2015.

Automatic colorization
Automatic Colorization CNN
•  Makes good use of the "hypercolumns" idea [45]
(Figure panels: original, CNN, human (Reddit))
[44] Automatic Colorization, http://tinyclouds.org/colorize/
[45] B. Hariharan, P. Arbeláez, R. Girshick, J. Malik. Hypercolumns for Object Segmentation and Fine-grained Localization. arXiv: 1411.5752, 2014.

5. Toward 3D tasks

Toward 3D tasks
DeepStereo
•  Interpolates viewpoints with two networks: a Selection Tower (depth estimation) and a Color Tower (color estimation)
[46] J. Flynn, I. Neulander, J. Philbin, N. Snavely. DeepStereo: Learning to Predict New Views from the World's Imagery. arXiv: 1506.06825, 2015.

Toward 3D tasks
DeepStereo
[46] J. Flynn, I. Neulander, J. Philbin, N. Snavely. DeepStereo: Learning to Predict New Views from the World's Imagery. arXiv: 1506.06825, 2015.
[47] DeepStereo: Learning to Predict New Views from the World's Imagery - YouTube, https://www.youtube.com/watch?v=cizgVZ8rjKA

Toward 3D tasks
Stereo matching
•  Computes the patch similarity between the two images from CNN features
[48] J. Žbontar, Y. LeCun. Stereo Matching by Training a Convolutional Neural Network to Compare Image Patches. arXiv: 1510.05970, 2015.

Toward 3D tasks
Examples of 3D tasks from a single image
•  Solves the depth, surface normal, and semantic label tasks with a multi-scale CNN
(Figure labels: input, Eigen et al., proposal, ground truth)
[49] D. Eigen, R. Fergus. Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture. arXiv: 1411.4734, 2014.

6. Taking on video

Taking on video
Sports video classification
•  Classifies 487 kinds of sports (!?), with a Top-5 accuracy of roughly 80%
•  Processes each frame with a CNN (several architectures are proposed)
[50] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, F. Li. Large-scale Video Classification with Convolutional Neural Networks. CVPR, 2014.

7. Toward more "human-like" machine perception

Toward more "human-like" machine perception
MemNet: CNN for Memorability
•  Memorability: how easily an image sticks in memory
•  Memorability scores for a wide range of images were computed from psychological experiments, and the large-scale dataset LaMem was released
(Figure labels: high / low memorability)
[51] LaMem, http://memorability.csail.mit.edu/
[52] A. Khosla, A. S. Raju, A. Torralba, A. Oliva. Understanding and Predicting Image Memorability at a Large Scale. ICCV, 2015.

Toward more "human-like" machine perception
MemNet: CNN for Memorability
•  A CNN is trained to estimate memorability
•  Rank correlation: 0.64 (MemNet) vs. 0.68 (human)
[51] LaMem, http://memorability.csail.mit.edu/
[52] A. Khosla, A. S. Raju, A. Torralba, A. Oliva. Understanding and Predicting Image Memorability at a Large Scale. ICCV, 2015.
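For readers unfamiliar with the metric, a rank correlation compares the ordering of two score lists rather than their raw values. The sketch below computes Spearman's rank correlation for inputs without ties; treating the slide's "rank correlation" as Spearman's is an assumption, and the toy scores are made up for illustration.

```python
import numpy as np

def spearman_rank_correlation(a, b):
    """Pearson correlation of the ranks of `a` and `b` (assumes no ties)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

predicted = np.array([0.1, 0.5, 0.3, 0.9])   # hypothetical model scores
measured = np.array([1.0, 3.0, 2.0, 4.0])    # hypothetical human-derived scores
same_order = spearman_rank_correlation(predicted, measured)
reverse_order = spearman_rank_correlation(predicted, -measured)
```

A value of 1 means the two lists rank the images identically, and -1 means the rankings are exactly reversed.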

8. Multimodal applications

Multimodal applications
Image caption generation
•  The image caption generation task itself already existed
•  CNN (image representation) + LSTM (sentence generation; translation)
Google NIC [53], LRCN [54]
[53] O. Vinyals, A. Toshev, S. Bengio, D. Erhan. Show and Tell: A Neural Image Caption Generator. arXiv: 1411.4555, 2014.
[54] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, T. Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. arXiv: 1411.4389, 2014.

Image caption generation (top: Google NIC, bottom: LRCN)

l  画像⼊入⼒力力に加えて⽂文⼊入⼒力力ができるアーキテクチャ
画像に関する質問に答える (Visual Turing Test)
50
mQA [55]
Neural-‐‑‒Image QA [56]
[55] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, W. Xu. Are You Talking to a
Machine? Dataset and Methods for Multilingual Image Question
Answering. arXiv: 1505.05612, 2015.
[56] M. Malinowski, M. Rohrbach, M. Fritz. Ask Your Neurons: A Neural-‐‑‒
Based Approach to Answering Questions About Images. ICCV, 2015.

Results of mQA [55]

Neural-Image QA [56]
DAQUAR includes questions that even humans find hard to answer.
The system seems to rely almost entirely on language information (?)

l  Bidirectional RNNで⽂文章をエンコード, RNNで画像⽣生成
⽂文章から画像⽣生成
53
[57] E. Mansimov, E. Parisotto, J. L. Ba, R. Salakhutdinov. Generating
Images from Captions with Attention. arXiv: 1511.02793, 2015.

Cross-modal distributed representations of images and words
[58] R. Kiros, R. Salakhutdinov, R. S. Zemel. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. arXiv: 1411.2539, 2014.

9. CNNs and reinforcement learning

CNNs and reinforcement learning
Atari 2600 (Deep Q-Networks)
•  Uses a CNN to approximate the value function in Q-learning (DQN)
•  Good at Video Pinball and Breakout; quite poor at Ms. Pac-Man and Montezuma's Revenge
[59] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller. Playing Atari with Deep Reinforcement Learning. arXiv: 1312.5602, 2013.
[60] V. Mnih, et al. Human-level control through deep reinforcement learning. Nature, 2015.
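In Q-learning the network is regressed toward a bootstrapped target: the observed reward plus the discounted value of the best next action. The sketch below computes that target and the squared error for one transition. In DQN the Q-values would come from a CNN applied to game frames; here they are hypothetical numbers, and the experience replay and target network of [59, 60] are omitted.

```python
import numpy as np

def q_learning_target(reward, next_q_values, done, gamma=0.99):
    """Target r + gamma * max_a' Q(s', a'), or just r at the end of an episode."""
    if done:
        return reward
    return reward + gamma * float(np.max(next_q_values))

def td_squared_error(q_value, target):
    """Squared temporal-difference error that the network would minimize."""
    return (target - q_value) ** 2

next_q = np.array([0.5, 2.0])   # hypothetical Q(s', a') for two actions
target = q_learning_target(1.0, next_q, done=False, gamma=0.9)
terminal_target = q_learning_target(1.0, next_q, done=True, gamma=0.9)
loss = td_squared_error(q_value=2.0, target=target)
```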

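CNNs and reinforcement learning
AlphaGo
•  Chooses good moves using two networks (policy and value) together with Monte Carlo tree search (MCTS)
•  Feeds the board to the CNN as a 19x19 image
•  First trained with human moves as supervision, then trained by self-play
[61] D. Silver, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.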
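Treating the board as a 19x19 image can be sketched by encoding it as a stack of binary feature planes. The three planes below (own stones, opponent stones, empty points) are a deliberately minimal assumption; the input of [61] uses many more hand-designed planes. The constants and the function name are hypothetical.

```python
import numpy as np

EMPTY, BLACK, WHITE = 0, 1, 2

def board_to_planes(board, to_play):
    """Encode a 19x19 board as 3 binary planes seen from the player to move."""
    opponent = WHITE if to_play == BLACK else BLACK
    planes = np.stack([board == to_play, board == opponent, board == EMPTY])
    return planes.astype(np.float32)

board = np.zeros((19, 19), dtype=int)
board[3, 3] = BLACK
board[15, 15] = WHITE
planes = board_to_planes(board, to_play=BLACK)   # shape (3, 19, 19)
```

The resulting array has the same channels-by-height-by-width shape as a small image, so it can be passed to convolutional layers unchanged.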

CNNs and reinforcement learning
AlphaGo
•  The hardware conditions deserve attention, but it overwhelms other Go AIs
•  Won 5 out of 5 games against the European Go champion; will play a top professional in March
[61] D. Silver, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.
[62] Y. Tian, Y. Zhu. Better Computer Go Player with Neural Network and Long-term Prediction. arXiv: 1511.06410, 2015.

CNNs and reinforcement learning
Application to first-person games (Asynchronous DQN)
•  Extends DQN to asynchronous training
•  Runs 16 actor-learner threads on a single machine
[63] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu. Asynchronous Methods for Deep Reinforcement Learning. arXiv: 1602.01783, 2016.

10. What's Next?

What's Next?
Visual Genome
•  A large-scale image dataset from Fei-Fei Li's team
108,249 images
4.2 million region descriptions
1.7 million visual Q&A pairs
2.1 million object instances (75,729 unique objects)
1.8 million attributes (40,513 unique attributes)
[64] Visual Genome, https://visualgenome.org/
