11. MS COCO
• Images collected from Flickr
• 5 human-written captions per image
• Example image and its captions (http://cocodataset.org/#explore?id=409091):
• a lady blowing out candles on a cake
• the woman is blowing out her birthday cake candles
• a woman blowing candles on a frosted cake.
• two people blowing out candles on a cake.
• a girl is blowing out candles on a birthday cake.
[Chen+ 2015]
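To make the annotation format concrete, here is a minimal Python sketch of reading the five reference captions above with the official COCO API (pycocotools). The local annotation path is an assumption, and whether image 409091 sits in the train or val split is not stated on the slide.

from pycocotools.coco import COCO

# Assumed local path to the caption annotations downloaded from
# http://cocodataset.org/#download; the val2014 split is a guess.
coco = COCO("annotations/captions_val2014.json")

# Fetch every caption annotation attached to this image id and print it;
# MS COCO provides five human-written captions per image.
ann_ids = coco.getAnnIds(imgIds=[409091])
for ann in coco.loadAnns(ann_ids):
    print(ann["caption"])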
16. Paragraph captioning
• One paragraph-length description per image, annotated on a subset of Visual Genome images
[Krause+ 2017]
Two children are sitting at a table in a restaurant.
The children are one little girl and one little boy. The
little girl is eating a pink frosted donut with white icing
lines on top of it. The girl has blonde hair and is
wearing a green jacket with a black long sleeve shirt
underneath. The little boy is wearing a black zip up
jacket and is holding his finger to his lip but is not
eating. A metal napkin dispenser is in between them
at the table. The wall next to them is white brick. Two
adults are on the other side of the short white brick
wall. The room has white circular lights on the ceiling
and a large window in the front of the restaurant. It is
daylight outside.
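To see how much denser this annotation is than a single-sentence caption, a minimal Python sketch that splits the paragraph above into sentences and counts them (the naive regex splitter is an assumption; real preprocessing would use a proper sentence tokenizer):

import re

# The example paragraph from [Krause+ 2017] shown above.
paragraph = (
    "Two children are sitting at a table in a restaurant. The children are "
    "one little girl and one little boy. The little girl is eating a pink "
    "frosted donut with white icing lines on top of it. The girl has blonde "
    "hair and is wearing a green jacket with a black long sleeve shirt "
    "underneath. The little boy is wearing a black zip up jacket and is "
    "holding his finger to his lip but is not eating. A metal napkin "
    "dispenser is in between them at the table. The wall next to them is "
    "white brick. Two adults are on the other side of the short white brick "
    "wall. The room has white circular lights on the ceiling and a large "
    "window in the front of the restaurant. It is daylight outside."
)

# Naive split on sentence-final punctuation followed by whitespace.
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]
print(f"{len(sentences)} sentences, {len(paragraph.split())} words in total")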
54. References (1)
• [Vinyals+ 2015] Vinyals, Oriol, et al. "Show and tell: A neural image caption generator." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
• [Hodosh+ 2013] Hodosh, Micah, Peter Young, and Julia Hockenmaier. "Framing image description as a ranking task: Data, models and evaluation metrics." Journal of Artificial Intelligence Research 47 (2013): 853-899.
• [Young+ 2014] Young, Peter, et al. "From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions." Transactions of the Association for Computational Linguistics 2 (2014): 67-78.
• [Plummer+ 2016] Plummer, Bryan A., et al. "Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models." Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
• [Liu+ 2017] Liu, Chenxi, et al. "Attention correctness in neural image captioning." Proceedings of the AAAI Conference on Artificial Intelligence, 2017.
• [Chen+ 2015] Chen, Xinlei, et al. "Microsoft COCO captions: Data collection and evaluation server." arXiv preprint arXiv:1504.00325 (2015).
55. References (2)
• [Krishna+ 2017] Krishna, Ranjay, et al. "Visual Genome: Connecting language and vision using crowdsourced dense image annotations." International Journal of Computer Vision 123.1 (2017): 32-73.
• [Krause+ 2017] Krause, Jonathan, et al. "A hierarchical approach for generating descriptive image paragraphs." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
• [Yoshikawa+ 2017] Yoshikawa, Yuya, Yutaro Shigeto, and Akikazu Takeuchi. "STAIR Captions: Constructing a large-scale Japanese image caption dataset." arXiv preprint arXiv:1705.00823 (2017).
• [Rashtchian+ 2010] Rashtchian, Cyrus, et al. "Collecting image annotations using Amazon's Mechanical Turk." Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, 2010.
• [Funaki+ 2015] Funaki, Ruka, and Hideki Nakayama. "Image-mediated learning for zero-shot cross-lingual document retrieval." Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015.
56. References (3)
• [Zitnick+ 2013] Zitnick, C. Lawrence, and Devi Parikh. "Bringing semantics into focus using visual abstraction." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
• [Miyazaki+ 2016] Miyazaki, Takashi, and Nobuyuki Shimizu. "Cross-lingual image caption generation." Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016.
• [Elliott+ 2016] Elliott, Desmond, et al. "Multi30K: Multilingual English-German image descriptions." arXiv preprint arXiv:1605.00459 (2016).
• [Tran+ 2015] Tran, Du, et al. "C3D: Generic features for video analysis." arXiv preprint arXiv:1412.0767 (2014).
• [Kuehne+ 2011] Kuehne, Hildegard, et al. "HMDB: A large video database for human motion recognition." Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2011.
• [Soomro+ 2012] Soomro, Khurram, Amir Roshan Zamir, and Mubarak Shah. "UCF101: A dataset of 101 human actions classes from videos in the wild." arXiv preprint arXiv:1212.0402 (2012).
57. References (4)
• [Heilbron+ 2015] Caba Heilbron, Fabian, et al. "ActivityNet: A large-scale video benchmark for human activity understanding." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
• [Sigurdsson+ 2016] Sigurdsson, Gunnar A., et al. "Hollywood in Homes: Crowdsourcing data collection for activity understanding." Proceedings of the European Conference on Computer Vision (ECCV), 2016.
• [Sigurdsson+ 2018] Sigurdsson, Gunnar A., et al. "Charades-Ego: A large-scale dataset of paired third and first person videos." arXiv preprint arXiv:1804.09626 (2018).
• [Kay+ 2017] Kay, Will, et al. "The Kinetics human action video dataset." arXiv preprint arXiv:1705.06950 (2017).
• [Goyal+ 2017] Goyal, Raghav, et al. "The 'something something' video database for learning and evaluating visual common sense." Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
• [Gu+ 2017] Gu, Chunhui, et al. "AVA: A video dataset of spatio-temporally localized atomic visual actions." arXiv preprint arXiv:1705.08421 (2017).
58. References (5)
• [Monfort+ 2018] Monfort, Mathew, et al. "Moments in Time dataset: One million videos for event understanding." arXiv preprint arXiv:1801.03150 (2018).
• [Yoshikawa+ 2018] Yoshikawa, Yuya, Jiaqing Lin, and Akikazu Takeuchi. "STAIR Actions: A video dataset of everyday home actions." arXiv preprint arXiv:1804.04326 (2018).