11. MS COCO
• Images collected from Flickr
• 5 human-written captions per image
• Example image and its captions (http://cocodataset.org/#explore?id=409091):
• a lady blowing out candles on a cake
• the woman is blowing out her birthday cake candles
• a woman blowing candles on a frosted cake.
• two people blowing out candles on a cake.
• a girl is blowing out candles on a birthday cake.
[Chen+ 2015]
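To make the annotation format concrete, here is a minimal Python sketch of reading the five reference captions above with the official COCO API (pycocotools). The local annotation path is an assumption, and whether image 409091 sits in the train or val split is not stated on the slide.

from pycocotools.coco import COCO

# Assumed local path to the caption annotations downloaded from
# http://cocodataset.org/#download; the val2014 split is a guess.
coco = COCO("annotations/captions_val2014.json")

# Fetch every caption annotation attached to this image id and print it;
# MS COCO provides five human-written captions per image.
ann_ids = coco.getAnnIds(imgIds=[409091])
for ann in coco.loadAnns(ann_ids):
    print(ann["caption"])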
16. Paragraph captioning
• One paragraph-length description per image, annotated on a subset of Visual Genome images
[Krause+ 2017]
Two children are sitting at a table in a restaurant.
The children are one little girl and one little boy. The
little girl is eating a pink frosted donut with white icing
lines on top of it. The girl has blonde hair and is
wearing a green jacket with a black long sleeve shirt
underneath. The little boy is wearing a black zip up
jacket and is holding his finger to his lip but is not
eating. A metal napkin dispenser is in between them
at the table. The wall next to them is white brick. Two
adults are on the other side of the short white brick
wall. The room has white circular lights on the ceiling
and a large window in the front of the restaurant. It is
daylight outside.
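To see how much denser this annotation is than a single-sentence caption, a minimal Python sketch that splits the paragraph above into sentences and counts them (the naive regex splitter is an assumption; real preprocessing would use a proper sentence tokenizer):

import re

# The example paragraph from [Krause+ 2017] shown above.
paragraph = (
    "Two children are sitting at a table in a restaurant. The children are "
    "one little girl and one little boy. The little girl is eating a pink "
    "frosted donut with white icing lines on top of it. The girl has blonde "
    "hair and is wearing a green jacket with a black long sleeve shirt "
    "underneath. The little boy is wearing a black zip up jacket and is "
    "holding his finger to his lip but is not eating. A metal napkin "
    "dispenser is in between them at the table. The wall next to them is "
    "white brick. Two adults are on the other side of the short white brick "
    "wall. The room has white circular lights on the ceiling and a large "
    "window in the front of the restaurant. It is daylight outside."
)

# Naive split on sentence-final punctuation followed by whitespace.
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]
print(f"{len(sentences)} sentences, {len(paragraph.split())} words in total")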
54. References (1)
• [Vinyals+ 2015] Vinyals, Oriol, et al. "Show and tell: A neural image caption generator." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
• [Hodosh+ 2013] Hodosh, Micah, Peter Young, and Julia Hockenmaier. "Framing image description as a ranking task: Data, models and evaluation metrics." Journal of Artificial Intelligence Research 47 (2013): 853-899.
• [Young+ 2014] Young, Peter, et al. "From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions." Transactions of the Association for Computational Linguistics 2 (2014): 67-78.
• [Plummer+ 2016] Plummer, Bryan A., et al. "Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models." Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
• [Liu+ 2017] Liu, Chenxi, et al. "Attention correctness in neural image captioning." Proceedings of the AAAI Conference on Artificial Intelligence, 2017.
• [Chen+ 2015] Chen, Xinlei, et al. "Microsoft COCO captions: Data collection and evaluation server." arXiv preprint arXiv:1504.00325 (2015).
55. References (2)
• [Krishna+ 2017] Krishna, Ranjay, et al. "Visual Genome: Connecting language and vision using crowdsourced dense image annotations." International Journal of Computer Vision 123.1 (2017): 32-73.
• [Krause+ 2017] Krause, Jonathan, et al. "A hierarchical approach for generating descriptive image paragraphs." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
• [Yoshikawa+ 2017] Yoshikawa, Yuya, Yutaro Shigeto, and Akikazu Takeuchi. "STAIR Captions: Constructing a large-scale Japanese image caption dataset." arXiv preprint arXiv:1705.00823 (2017).
• [Rashtchian+ 2010] Rashtchian, Cyrus, et al. "Collecting image annotations using Amazon's Mechanical Turk." Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, 2010.
• [Funaki+ 2015] Funaki, Ruka, and Hideki Nakayama. "Image-mediated learning for zero-shot cross-lingual document retrieval." Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015.
56. References (3)
• [Zitnick+ 2013] Zitnick, C. Lawrence, and Devi Parikh. "Bringing semantics into focus using visual abstraction." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
• [Miyazaki+ 2016] Miyazaki, Takashi, and Nobuyuki Shimizu. "Cross-lingual image caption generation." Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016.
• [Elliott+ 2016] Elliott, Desmond, et al. "Multi30K: Multilingual English-German image descriptions." arXiv preprint arXiv:1605.00459 (2016).
• [Tran+ 2015] Tran, Du, et al. "C3D: Generic features for video analysis." arXiv preprint arXiv:1412.0767 (2014).
• [Kuehne+ 2011] Kuehne, Hildegard, et al. "HMDB: A large video database for human motion recognition." Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2011.
• [Soomro+ 2012] Soomro, Khurram, Amir Roshan Zamir, and Mubarak Shah. "UCF101: A dataset of 101 human actions classes from videos in the wild." arXiv preprint arXiv:1212.0402 (2012).
57. References (4)
• [Heilbron+ 2015] Caba Heilbron, Fabian, et al. "ActivityNet: A large-scale video benchmark for human activity understanding." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
• [Sigurdsson+ 2016] Sigurdsson, Gunnar A., et al. "Hollywood in Homes: Crowdsourcing data collection for activity understanding." Proceedings of the European Conference on Computer Vision (ECCV), 2016.
• [Sigurdsson+ 2018] Sigurdsson, Gunnar A., et al. "Charades-Ego: A large-scale dataset of paired third and first person videos." arXiv preprint arXiv:1804.09626 (2018).
• [Kay+ 2017] Kay, Will, et al. "The Kinetics human action video dataset." arXiv preprint arXiv:1705.06950 (2017).
• [Goyal+ 2017] Goyal, Raghav, et al. "The 'something something' video database for learning and evaluating visual common sense." Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
• [Gu+ 2017] Gu, Chunhui, et al. "AVA: A video dataset of spatio-temporally localized atomic visual actions." arXiv preprint arXiv:1705.08421 (2017).
58. References (5)
• [Monfort+ 2018] Monfort, Mathew, et al. "Moments in Time dataset: One million videos for event understanding." arXiv preprint arXiv:1801.03150 (2018).
• [Yoshikawa+ 2018] Yoshikawa, Yuya, Jiaqing Lin, and Akikazu Takeuchi. "STAIR Actions: A video dataset of everyday home actions." arXiv preprint arXiv:1804.04326 (2018).