Giro-i-Nieto, X. One Perceptron to Rule Them All: Language, Vision, Audio and Speech. In Proceedings of the 2020 International Conference on Multimedia Retrieval (pp. 7-8).
Tutorial page:
https://imatge.upc.edu/web/publications/one-perceptron-rule-them-all-language-vision-audio-and-speech-tutorial
Deep neural networks have boosted the convergence of multimedia data analytics into a unified framework shared by practitioners in natural language, vision and speech. Image captioning, lip reading or video sonorization are some of the first applications of a new and exciting field of research exploiting the generalization properties of deep neural representations. This tutorial first reviews the basic neural architectures to encode and decode vision, text and audio, and then reviews those models that have successfully translated information across modalities.
7. 7
#ShowAndTell Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption
generator." CVPR 2015.
Image Captioning with RNN
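To make the encode/decode pattern concrete, below is a minimal PyTorch sketch of a Show-and-Tell-style captioner: a CNN encodes the image into a vector that is fed to an LSTM as its first input, and the LSTM then predicts the caption word by word. The ResNet-18 backbone and all sizes are illustrative stand-ins, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ShowAndTellSketch(nn.Module):
    """CNN encoder + LSTM decoder, in the spirit of Vinyals et al. 2015."""
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        cnn = models.resnet18(weights=None)        # stand-in for the paper's CNN
        cnn.fc = nn.Linear(cnn.fc.in_features, embed_dim)
        self.encoder = cnn
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).unsqueeze(1)  # (B, 1, E): image as first token
        words = self.embed(captions)               # (B, T, E)
        hidden, _ = self.lstm(torch.cat([feats, words], dim=1))
        return self.out(hidden)                    # next-word logits at each step
```

Training minimizes cross-entropy between these logits and the shifted ground-truth caption.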
8. 8
#DeepImageSent Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions."
CVPR 2015 (Slides by Marc Bolaños)
Image Captioning with RNN
9. 9
Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua
Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015
Image Captioning with RNN & Attention
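The attention mechanism itself fits in a few lines: at every decoding step, the LSTM state scores each spatial location of the CNN feature map, and the decoder consumes the resulting weighted summary. This sketch is the additive ("soft") variant; module names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class SoftVisualAttention(nn.Module):
    """Additive attention over a grid of CNN features, as in Show, Attend and Tell.
    feats: (B, L, D) features at L spatial locations; h: (B, H) decoder state."""
    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_hid = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, h):
        e = self.score(torch.tanh(self.w_feat(feats) + self.w_hid(h).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)         # (B, L, 1): where to look
        context = (alpha * feats).sum(dim=1)    # (B, D): weighted image summary
        return context, alpha.squeeze(-1)       # alpha visualizes the attention map
```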
11. 11
Cornia, Marcella, Matteo Stefanini, Lorenzo Baraldi, and Rita Cucchiara. "Meshed-Memory Transformer for Image
Captioning." CVPR 2020. [tweet]
Image Captioning with Transformers
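A minimal transformer captioner follows the same split, assuming precomputed region or grid features as the decoder's memory; the meshed-memory connections that give the Cornia et al. model its name (and positional encodings) are omitted from this sketch.

```python
import torch
import torch.nn as nn

class TransformerCaptionerSketch(nn.Module):
    """Causal transformer decoder over image features used as memory."""
    def __init__(self, vocab_size, d_model=512, nhead=8, layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, img_feats, tokens):
        # img_feats: (B, R, d_model) region features; tokens: (B, T) word ids
        T = tokens.size(1)
        causal = torch.triu(                      # block attention to future words
            torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1)
        h = self.decoder(self.embed(tokens), img_feats, tgt_mask=causal)
        return self.out(h)                        # (B, T, vocab) logits
```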
12. 12
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense
captioning." CVPR 2016
Dense Captioning
13. 13
[Figure: example DenseCap outputs. XAVI: “man has short hair”, “man with short hair”. AMAIA: “a woman wearing a black shirt”, … BOTH: “two men wearing black glasses”.]
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense
captioning." CVPR 2016
Dense Captioning
14. 14
Recipe Generation
Salvador, Amaia, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. "Inverse Cooking: Recipe
Generation from Food Images." CVPR 2019.
15. 15
Recipe Generation
Title: Edamame corn salad
Ingredients
pepper, corn, onion, edamame, salt, vinegar, cilantro, avocado, oil
Instructions
- In a large bowl, combine edamame, corn, red onion, cilantro,
avocado, and red bell pepper.
- In a small bowl, whisk together olive oil, vinegar, salt, and
pepper.
- Pour dressing over edamame mixture and toss to coat.
- Cover and refrigerate for at least 1 hour before serving.
Salvador, Amaia, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. "Inverse Cooking: Recipe
Generation from Food Images." CVPR 2019.
16. 16
#Equalizer Burns, Kaylee, Lisa Anne Hendricks, Trevor Darrell, and Anna Rohrbach. "Women also Snowboard: Overcoming
Bias in Captioning Models." ECCV 2018.
Fighting Data Bias in Captioning
18. 18
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor
Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015. [code]
Video Captioning
19. 19
(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. Hierarchical Recurrent Neural
Encoder for Video Representation with Application to Captioning, CVPR 2016.
[Figure: hierarchical recurrent encoder. A first-layer LSTM reads each chunk of image frames over time (t = 1 … t = T); a second-layer LSTM unit consumes the hidden state at t = T of each chunk.]
Video Captioning
21. 21
Multimodal Machine Translation
Sulubacak, Umut, Ozan Caglayan, Stig-Arne Grönroos, Aku Rouhe, Desmond Elliott, Lucia Specia, and Jörg Tiedemann.
"Multimodal machine translation through visuals and speech." Machine Translation (2020): 1-51. [tweet]
22. 22
Sign Language Translation with RNN+Att
Camgoz, Necati Cihan, et al. Neural Sign Language Translation. CVPR 2018.
23. 23
Sign Language Translation with Transformers
Necati Cihan Camgoz, Oscar Koller, Simon Hadfield, Richard Bowden, “Sign Language Transformers: Joint
End-to-end Sign Language Recognition and Translation” CVPR 2020.
25. 25
Lip Reading
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level
Lipreading." (2016).
27. 27
Lipreading: Watch, Listen, Attend & Spell
[Figure: Watch, Listen, Attend & Spell architecture; separate streams extract audio features and image features.]
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
28. 28
Lipreading: Watch, Listen, Attend & Spell
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
Attention over the output states from audio and video is computed at each timestep.
30. 30
Image Captioning Grounded on Detected Objects
Lu, Jiasen and Yang, Jianwei and Batra, Dhruv and Parikh, Devi “Neural Baby Talk” CVPR 2018 [code]
31. 31
Lu, Jiasen and Yang, Jianwei and Batra, Dhruv and Parikh, Devi “Neural Baby Talk” CVPR 2018 [code]
Image Captioning Grounded on Detected Objects
32. 32
Akbari, Hassan, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, and Shih-Fu Chang. "Multi-level Multimodal
Common Semantic Space for Image-Phrase Grounding." CVPR 2019. [code]
Image Captioning Grounded on Heatmaps
33. 33
Outline
1. Generative Models
a. Text
b. Vision
2. Discriminative Models
a. Text
b. Vision
3. Representation Learning
4. Control Tasks
35. 35
Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial
text to image synthesis." ICML 2016.
Image Generation
36. 36
Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial
text to image synthesis." ICML 2016. [code]
Image Generation
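The core trick can be sketched as a conditional GAN generator: a learned sentence embedding is concatenated with the noise vector and deconvolved into an image (the matching-aware discriminator is left out). All sizes are illustrative.

```python
import torch
import torch.nn as nn

class TextConditionedGenerator(nn.Module):
    """Text-to-image generator sketch in the spirit of Reed et al. 2016."""
    def __init__(self, z_dim=100, txt_dim=128, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim + txt_dim, ch * 4, 4, 1, 0), nn.ReLU(),  # 4x4
            nn.ConvTranspose2d(ch * 4, ch * 2, 4, 2, 1), nn.ReLU(),           # 8x8
            nn.ConvTranspose2d(ch * 2, ch, 4, 2, 1), nn.ReLU(),               # 16x16
            nn.ConvTranspose2d(ch, 3, 4, 2, 1), nn.Tanh(),                    # 32x32 RGB
        )

    def forward(self, z, txt_emb):
        # condition the noise on the text by simple concatenation
        code = torch.cat([z, txt_emb], dim=1)[..., None, None]  # (B, Z+D, 1, 1)
        return self.net(code)
```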
37. 37
Image Synthesis
#StackGAN Zhang, Han, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas.
"Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks." ICCV 2017. [code]
38. 38
Image Synthesis with Cycle Consistency
#MirrorGAN Qiao, Tingting, Jing Zhang, Duanqing Xu, and Dacheng Tao. "Mirrorgan: Learning text-to-image generation by
redescription." CVPR 2019. [code]
40. 40
Justin Johnson, Agrim Gupta, Li Fei-Fei, “Image Generation from Scene Graphs” CVPR 2018
Image Generation via Scene Graphs
41. 41
#Text2Scene Tan, Fuwen, Song Feng, and Vicente Ordonez. "Text2Scene: Generating Compositional Scenes From Textual
Descriptions." CVPR 2019 [blog].
42. 42
#CRAFT Gupta, Tanmay, Dustin Schwenk, Ali Farhadi, Derek Hoiem, and Aniruddha Kembhavi. "Imagine this! scripts to
compositions to videos." ECCV 2018
Video Generation by Composition
43. 43
Saunders, B., Camgoz, N. C., & Bowden, R. (2020). Progressive Transformers for End-to-End Sign Language Production.
ECCV 2020.
Sign Language Generation with Transformers
44. 44
Lucas Ventura, Amanda Duarte, Xavier Giro-i-Nieto, “Can Everybody Sign Now? Exploring Sign Language
Video Generation from 2D Poses”. ECCV SLRTP Workshop 2020.
Sign Language Generation (pose 2 pixels)
45. 45
Outline
1. Generative Models
a. Text
b. Vision
2. Discriminative Models
a. Text
b. Vision
3. Representation Learning
4. Control Tasks
47. 47
Visual Question Answering
Antol, Stanislaw, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. "VQA:
Visual question answering." CVPR 2015.
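A sketch of the classic VQA baseline this paper popularized: an LSTM encodes the question, a precomputed CNN feature vector encodes the image, a pointwise product fuses the two, and a classifier picks from a fixed answer vocabulary. Sizes are illustrative.

```python
import torch
import torch.nn as nn

class VQABaselineSketch(nn.Module):
    def __init__(self, vocab_size, n_answers, img_dim=2048, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)
        self.lstm = nn.LSTM(300, hidden, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden)
        self.cls = nn.Linear(hidden, n_answers)

    def forward(self, img_feats, question):
        # img_feats: (B, img_dim) CNN features; question: (B, T) word ids
        _, (h, _) = self.lstm(self.embed(question))
        fused = torch.tanh(self.img_proj(img_feats)) * h[-1]  # pointwise fusion
        return self.cls(fused)                                # answer logits
```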
48. 48
Visual Question Answering (VQA)
Francisco Roldán, Issey Masuda, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto. "Visual
Question-Answering 2.0." ETSETB UPC TelecomBCN (2017).
49. 49
Noh, H., Seo, P. H., & Han, B. Image question answering using convolutional neural network with dynamic parameter
prediction. CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
Visual Question Answering (VQA)
50. 50
VQA: Dynamic Memory Networks
(Slides and Slidecast by Santi Pascual): Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for
Visual and Textual Question Answering." ICML 2016
51. 51
Visual Reasoning
#Clevr Johnson, Justin, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick.
"CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning." CVPR 2017
52. 52
Visual Reasoning: Programming
(Slides by Fran Roldan) Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Fei-Fei Li, Larry
Zitnick, Ross Girshick , “Inferring and Executing Programs for Visual Reasoning”. ICCV 2017
Program Generator Execution Engine
53. 53
Visual Reasoning: Relation Networks
#RN Santoro, Adam, David Raposo, David G. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy
Lillicrap. "A simple neural network module for relational reasoning." NIPS 2017.
Relation Networks concatenate all possible pairs of objects with an encoded question, and then find the
answer with an MLP.
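That pairing scheme fits in a few lines; dimensions and MLP depths below are illustrative.

```python
import torch
import torch.nn as nn

class RelationNetworkSketch(nn.Module):
    """g scores every ordered object pair (conditioned on the question);
    f maps the summed relations to an answer (Santoro et al. 2017)."""
    def __init__(self, obj_dim, q_dim, hidden=256, n_answers=10):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(2 * obj_dim + q_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, hidden), nn.ReLU())
        self.f = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                               nn.Linear(hidden, n_answers))

    def forward(self, objects, question):
        # objects: (B, N, D) object vectors; question: (B, Q) encoded question
        B, N, D = objects.shape
        o_i = objects.unsqueeze(2).expand(B, N, N, D)
        o_j = objects.unsqueeze(1).expand(B, N, N, D)
        q = question[:, None, None, :].expand(B, N, N, question.size(1))
        pairs = torch.cat([o_i, o_j, q], dim=-1)   # all N*N pairs + question
        relations = self.g(pairs).sum(dim=(1, 2))  # order-invariant aggregation
        return self.f(relations)
```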
54. 54
Visual Dialog
Das, Abhishek, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. "Visual
Dialog." CVPR 2017 [Project]
57. 57
Hate Speech Detection in Memes
Benet Oriol, Cristian Canton, Xavier Giro-i-Nieto, “Hate Speech in Pixels: Detection of Offensive Memes
towards Automatic Moderation”. NeurIPS 2019 AI for Good Workshop. [code]
Hate Speech Detection
60. 60
Niu, Yulei, Hanwang Zhang, Zhiwu Lu, and Shih-Fu Chang. "Variational Context: Exploiting Visual and Textual Context for
Grounding Referring Expressions." arXiv preprint arXiv:1907.03609 (2019).
Objects from Referring Expressions
61. 61
Video Objects from Referring Expressions
Li, Zhenyang, Ran Tao, Efstratios Gavves, Cees GM Snoek, and Arnold WM Smeulders. "Tracking by natural language
specification." CVPR 2017. [code]
62. 62
Video Object Detection with Transformers
Sadhu, A., Chen, K., & Nevatia, R. (2020). Video Object Grounding using Semantic Roles in Language Description. CVPR 2020.
63. 63
#Mattnet Yu, Licheng, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L. Berg. "Mattnet: Modular
attention network for referring expression comprehension." CVPR 2018. [code]
Segments from Referring Expressions
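For intuition, the simplest language-conditioned segmentation head tiles the sentence embedding over the visual feature map and convolves down to a mask. This is a deliberately reduced sketch, not MAttNet's modular attention; all sizes are illustrative.

```python
import torch
import torch.nn as nn

class LangGuidedSegHead(nn.Module):
    def __init__(self, vis_ch=256, txt_dim=256):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv2d(vis_ch + txt_dim, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 1, 1),               # per-pixel mask logits
        )

    def forward(self, vis, txt):
        # vis: (B, C, H, W) feature map; txt: (B, D) sentence embedding
        B, _, H, W = vis.shape
        txt_map = txt[:, :, None, None].expand(B, txt.size(1), H, W)
        return self.mix(torch.cat([vis, txt_map], dim=1))
```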
64. 64
Khoreva, Anna, Anna Rohrbach, and Bernt Schiele. "Video object segmentation with language referring expressions." ACCV
2018.
Video Objects from Referring Expressions
65. 65
Herrera-Palacio, Alba, Carles Ventura, and Xavier Giro-i-Nieto. "Video object linguistic grounding." ACM Multimedia
Workshops 2019.
Video Objects from Referring Expressions
66. 66
#RefVOS Bellver, Miriam, Carles Ventura, Carina Silberer, Ioannis Kazakos, Jordi Torres, and Xavier Giro-i-Nieto.
"RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation." arXiv preprint
arXiv:2010.00263 (2020).
Video Objects from Referring Expressions
68. 68
#SynthRef Ioannis Kazakos, Miriam Bellver, Carles Ventura, Carina Silberer, and Xavier Giro-i-Nieto, “Generation
of Synthetic Referring Expressions for Object Segmentation” (submitted)
Synthetic Expressions w/ Scene Graphs
70. Segments from Questions
Gan, Chuang, Yandong Li, Haoxiang Li, Chen Sun, and Boqing Gong. "VQS: Linking segmentations to questions and
answers for supervised attention in vqa and question-focused semantic segmentation." ICCV 2017.
73. 73
Joint Representations (Embeddings)
#Devise Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "Devise: A deep
visual-semantic embedding model." NIPS 2013
74. 74
Zero-shot learning
Socher, R., Ganjoo, M., Manning, C. D., & Ng, A., Zero-shot learning through cross-modal transfer. NIPS 2013 [slides] [code]
No images of “cat” in the training set... but they can still be recognised as “cats” thanks to the representations learned from text.
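At test time the zero-shot mechanism reduces to nearest-neighbour search in the joint space, as in this toy sketch (all tensors are random stand-ins):

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(img_emb, word_vecs, labels):
    """Match image embeddings against class word vectors, including classes
    never seen with images during training (cf. Socher et al. 2013)."""
    sims = F.cosine_similarity(img_emb[:, None, :], word_vecs[None, :, :], dim=-1)
    return [labels[i] for i in sims.argmax(dim=1)]

# toy usage: 2 images vs. 3 class word-vectors
img_emb = torch.randn(2, 300)     # images mapped into the word-vector space
word_vecs = torch.randn(3, 300)   # e.g. word vectors for the class names
print(zero_shot_classify(img_emb, word_vecs, ["dog", "cat", "truck"]))
```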
75. 75
Multimodal Retrieval
Kiros, Ryan, Ruslan Salakhutdinov, and Richard S. Zemel. "Unifying visual-semantic embeddings with multimodal neural
language models." NeurIPS 2014 Deep Learning Workshop.
76. 76
Multimodal Retrieval
Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal Scene Networks."
CVPR 2016.
78. 78
Image and text retrieval with joint embeddings.
Joint Neural Embeddings
#pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio
Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017 [video]
80. 80
Joint Neural Embeddings
[Figure: recipe text encoded with an LSTM and a bidirectional LSTM, mapped together with the image representation into the joint embedding.]
#pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio Torralba,
“Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017
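Both encoders are typically trained so that matching image/recipe pairs land close together in the joint space, for instance with a bidirectional max-margin ranking loss over a batch, where true pairs sit on the diagonal of the similarity matrix. A sketch (the margin and batch handling are illustrative; pic2recipe itself combines a cosine similarity loss with semantic regularization):

```python
import torch
import torch.nn.functional as F

def joint_embedding_loss(img_emb, txt_emb, margin=0.1):
    """Hinge ranking loss pulling matching pairs together in the joint space."""
    img = F.normalize(img_emb, dim=1)
    txt = F.normalize(txt_emb, dim=1)
    sims = img @ txt.t()                          # (B, B) cosine similarities
    pos = sims.diag().unsqueeze(1)                # similarity of true pairs
    cost_im = (margin + sims - pos).clamp(min=0)      # image -> wrong recipe
    cost_tx = (margin + sims - pos.t()).clamp(min=0)  # recipe -> wrong image
    off_diag = ~torch.eye(sims.size(0), dtype=torch.bool)
    return cost_im[off_diag].mean() + cost_tx[off_diag].mean()
```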
82. 82
Representations
#ViLBERT Lu, Jiasen, Dhruv Batra, Devi Parikh, and Stefan Lee. "Vilbert: Pretraining task-agnostic visiolinguistic
representations for vision-and-language tasks." NeurIPS 2019. [MIT talk by Devi Parikh] [demo]
Visual Task: predict the semantic categories of the masked image regions.
Language Task: predict the masked word (same as in language-only BERT).
83. 83
Representations
#ViLBERT Lu, Jiasen, Dhruv Batra, Devi Parikh, and Stefan Lee. "Vilbert: Pretraining task-agnostic visiolinguistic
representations for vision-and-language tasks." NeurIPS 2019. [MIT talk by Devi Parikh] [demo]
Multimodal Task: predict whether the image and the caption correspond.
85. 85
Representations
#VideoBERT Sun, Chen, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. "Videobert: A joint model for video
and language representation learning." ICCV 2019.
Rich representations can be used to retrieve matching video frames, which are encoded after vector
quantization.
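The "visual words" come from vector quantization: frame features are snapped to their nearest centroid in a learned codebook (hierarchical k-means in the paper), giving BERT a discrete video vocabulary. A toy sketch:

```python
import torch

def quantize(frame_emb, codebook):
    """Assign each frame feature to its nearest codebook centroid."""
    d = torch.cdist(frame_emb, codebook)   # (T, K) distances to K centroids
    return d.argmin(dim=1)                 # one discrete token id per frame

# toy usage: 10 frame embeddings, a 1024-entry codebook (stand-in values)
tokens = quantize(torch.randn(10, 512), torch.randn(1024, 512))
```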
87. 87
Learning Language from Video
Doughty, Hazel, Ivan Laptev, Walterio Mayol-Cuevas, and Dima Damen. "Action Modifiers: Learning from Adverbs in
Instructional Videos." CVPR 2020.
88. 88
Learning Language from Video
Surís, Dídac, Dave Epstein, Heng Ji, Shih-Fu Chang, and Carl Vondrick. "Learning to Learn Words from Visual Scenes." ECCV
2020.
90. 90
Platforms for Embodied AI
#Habitat Savva, Manolis, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub et
al. "Habitat: A platform for embodied ai research." ICCV 2019. [site]
91. 91
Navigation
Fried, Daniel, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor
Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. "Speaker-Follower Models for Vision-and-Language
Navigation." NeurIPS 2018.
92. 92
Navigation
#R2R Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., ... & van den Hengel, A. Vision-and-language
navigation: Interpreting visually-grounded navigation instructions in real environments. CVPR 2018. [tweet]
93. 93
Navigation
#RxR Alexander Ku and Peter Anderson and Roma Patel and Eugene Ie and Jason Baldridge, “Room-Across-Room:
Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding” EMNLP 2020.
94. 94
Navigation
Ünal, Emre, Ozan Arkan Can, and Yücel Yemez. "Visually Grounded Language Learning For Robot Navigation." ACMMM
Workshops 2019.
95. 95
Object manipulation
Hill, F., Lampinen, A. K., Schneider, R., Clark, S., Botvinick, M., McClelland, J. L., & Santoro, A. Environmental drivers of
systematicity and generalization in a situated agent. ICLR 2020. [talk]
97. 97
My take-home message
1. Generative Models
a. Text
b. Vision
2. Discriminative Models
a. Text
b. Vision
3. Representation Learning
4. Control Tasks
98. Xavier Giro-i-Nieto
@DocXavi
xavier.giro@upc.edu
Associate Professor
Universitat Politècnica de Catalunya
Barcelona Supercomputing Center
Was this tutorial helpful? Please consider citing:
Go raibh maith agat / Thank you
Giro-i-Nieto, X. One Perceptron to Rule Them All: Language,
Vision, Audio and Speech. In Proceedings of the 2020
International Conference on Multimedia Retrieval (pp. 7-8).