Xavier Giro-i-Nieto
@DocXavi
xavier.giro@upc.edu
Associate Professor
Universitat Politècnica de Catalunya
Barcelona Supercomputing Center
Slides @DocXavi
Tutorial:
One Perceptron to Rule Them All
Part III: Language & Vision
2
Acknowledgments
Mariona Carós
Benet Oriol
Amaia Salvador
Santiago Pascual
Marta R. Costa-jussà
Francisco Roldan
Issey Masuda
Ionut Sorodoc
Carina Silberer
Gemma Boleda
Carles Ventura
Ioannis Kazakos
Míriam Bellver
Alba M. Herrera
Amanda Duarte
3
4
Outline
1. Generative Models
a. Text
b. Vision
2. Discriminative Models
a. Text
b. Vision
3. Representation Learning
4. Control Tasks
6
Encoder Decoder
Representation
7
#ShowAndTell Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption
generator." CVPR 2015.
Image Captioning with RNN
8
#DeepImageSent Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions."
CVPR 2015 (Slides by Marc Bolaños)
Image Captioning with RNN
9
Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua
Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015
Image Captioning with RNN & Attention
10
Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua
Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015
Image Captioning with RNN & Attention
11
Cornia, Marcella, Matteo Stefanini, Lorenzo Baraldi, and Rita Cucchiara. "Meshed-Memory Transformer for Image
Captioning." CVPR 2020. [tweet]
Image Captioning with Transformers
12
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense
captioning." CVPR 2016
Dense Captioning
13
XAVI: “man has short hair”, “man with short hair”
AMAIA: “a woman wearing a black shirt”, “
BOTH: “two men wearing black glasses”
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense
captioning." CVPR 2016
Dense Captioning
14
Recipe Generation
Salvador, Amaia, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. "Inverse Cooking: Recipe
Generation from Food Images." CVPR 2019.
15
Recipe Generation
Title: Edamame corn salad
Ingredients
pepper, corn, onion, edamame, salt, vinegar, cilantro, avocado, oil
Instructions
- In a large bowl, combine edamame, corn, red onion, cilantro,
avocado, and red bell pepper.
- In a small bowl, whisk together olive oil, vinegar, salt, and
pepper.
- Pour dressing over edamame mixture and toss to coat.
- Cover and refrigerate for at least 1 hour before serving.
Salvador, Amaia, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. "Inverse Cooking: Recipe
Generation from Food Images." CVPR 2019.
16
#Equalizer Burns, Kaylee, Lisa Anne Hendricks, Trevor Darrell, and Anna Rohrbach. "Women also Snowboard: Overcoming
Bias in Captioning Models." ECCV 2018.
Fighting Data Bias in Captioning
17
#Equalizer Burns, Kaylee, Lisa Anne Hendricks, Trevor Darrell, and Anna Rohrbach. "Women also Snowboard: Overcoming
Bias in Captioning Models." ECCV 2018.
Fighting Data Bias in Captioning
18
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor
Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015. [code]
Video Captioning
19
(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. "Hierarchical Recurrent Neural
Encoder for Video Representation with Application to Captioning." CVPR 2016.
Figure labels: LSTM unit (2nd layer); Time; Image; t = 1 to t = T; hidden state at t = T; first chunk of data
Captioning: Video
20
Multimodal Machine Translation
Challenge on Multimodal Image Translation:
http://www.statmt.org/wmt17/multimodal-task.html#task1
21
Multimodal Machine Translation
Sulubacak, Umut, Ozan Caglayan, Stig-Arne Grönroos, Aku Rouhe, Desmond Elliott, Lucia Specia, and Jörg Tiedemann.
"Multimodal machine translation through visuals and speech." Machine Translation (2020): 1-51. [tweet]
22
Sign Language Translation with RNN+Att
Camgoz, Necati Cihan, et al. Neural Sign Language Translation. CVPR 2018.
23
Sign Language Translation with Transformers
Necati Cihan Camgoz, Oscar Koller, Simon Hadfield, Richard Bowden, “Sign Language Transformers: Joint
End-to-end Sign Language Recognition and Translation” CVPR 2020.
24
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level Lipreading."
(2016).
25
Lip Reading
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level
Lipreading." (2016).
26
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild."
CVPR 2017
27
Lipreading: Watch, Listen, Attend & Spell
Audio features
Image features
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
28
Lipreading: Watch, Listen, Attend & Spell
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
Attention over output states from audio and video is computed at each timestep.
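The attention step described on this slide can be illustrated with a minimal sketch. It assumes plain dot-product scoring and toy two-dimensional states; the scoring function, dimensions, and all variable names are illustrative assumptions, not the configuration used by Chung et al. At one decoder timestep, a separate attention distribution is computed over the video states and over the audio states, and the two context vectors are concatenated.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, states):
    """Dot-product attention: returns the weights and the weighted sum of states."""
    scores = [sum(q * s for q, s in zip(query, state)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    context = [sum(w * state[d] for w, state in zip(weights, states))
               for d in range(dim)]
    return weights, context


# Toy encoder outputs (hypothetical values): 3 video states and 4 audio states.
video_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_states = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0], [0.2, 0.1]]
decoder_state = [0.3, 0.9]  # decoder hidden state at the current output timestep

video_weights, video_context = attend(decoder_state, video_states)
audio_weights, audio_context = attend(decoder_state, audio_states)

# Concatenate both contexts; a real model would feed this to the output layer.
fused_context = video_context + audio_context
```

Repeating the two `attend` calls with the updated decoder state at every output step gives the per-timestep behaviour the slide refers to.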
29
Lipreading
Afouras, Triantafyllos, Joon Son Chung, and Andrew Zisserman. "Deep Lip Reading: a comparison of models and an online
application." Interspeech 2018.
30
Image Captioning Grounded on Detected Objects
Lu, Jiasen and Yang, Jianwei and Batra, Dhruv and Parikh, Devi “Neural Baby Talk” CVPR 2018 [code]
31
Lu, Jiasen and Yang, Jianwei and Batra, Dhruv and Parikh, Devi “Neural Baby Talk” CVPR 2018 [code]
Image Captioning Grounded on Detected Objects
32
Akbari, Hassan, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, and Shih-Fu Chang. "Multi-level Multimodal
Common Semantic Space for Image-Phrase Grounding." CVPR 2019. [code]
Image Captioning Grounded on Heatmaps
33
Outline
1. Generative Models
a. Text
b. Vision
2. Discriminative Models
a. Text
b. Vision
3. Representation Learning
4. Control Tasks
34
Encoder Decoder
Representation
35
Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial
text to image synthesis." ICML 2016.
Image Generation
36
Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial
text to image synthesis." ICML 2016. [code]
Image Generation
37
Image Synthesis
#StackGAN Zhang, Han, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas.
"Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks." ICCV 2017. [code]
38
Image Synthesis with Cycle Consistency
#MirrorGAN Qiao, Tingting, Jing Zhang, Duanqing Xu, and Dacheng Tao. "Mirrorgan: Learning text-to-image generation by
redescription." CVPR 2019. [code]
39
Image Synthesis with Cycle Consistency
#MirrorGAN Qiao, Tingting, Jing Zhang, Duanqing Xu, and Dacheng Tao. "Mirrorgan: Learning text-to-image generation by
redescription." CVPR 2019. [code]
40
Justin Johnson, Agrim Gupta, Li Fei-Fei, “Image Generation from Scene Graphs” CVPR 2018
Image Generation via Scene Graphs
41
#Text2Scene Tan, Fuwen, Song Feng, and Vicente Ordonez. "Text2Scene: Generating Compositional Scenes From Textual
Descriptions." CVPR 2019 [blog].
42
#CRAFT Gupta, Tanmay, Dustin Schwenk, Ali Farhadi, Derek Hoiem, and Aniruddha Kembhavi. "Imagine this! scripts to
compositions to videos." ECCV 2018
Video Generation by Composition
43
Saunders, B., Camgoz, N. C., & Bowden, R. (2020). Progressive Transformers for End-to-End Sign Language Production.
ECCV 2020.
Sign Language Generation with Transformers
44
Lucas Ventura, Amanda Duarte, Xavier Giro-i-Nieto, “Can Everybody Sign Now? Exploring Sign Language
Video Generation from 2D Poses”. ECCV SLRTP Workshop 2020.
Sign Language Generation (pose 2 pixels)
45
Outline
1. Generative Models
a. Text
b. Vision
2. Discriminative Models
a. Text
b. Vision
3. Representation Learning
4. Control Tasks
46
Encoder
Decoder
Representation
Encoder
Representation
47
Visual Question Answering
Antol, Stanislaw, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. "VQA:
Visual question answering." ICCV 2015.
48
Visual Question Answering (VQA)
Francisco Roldán, Issey Masuda, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto. "Visual
Question-Answering 2.0." ETSETB UPC TelecomBCN (2017).
49
Noh, H., Seo, P. H., & Han, B. Image question answering using convolutional neural network with dynamic parameter
prediction. CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
Visual Question Answering (VQA)
50
VQA: Dynamic Memory Networks
(Slides and Slidecast by Santi Pascual): Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for
Visual and Textual Question Answering." ICML 2016
51
Visual Reasoning
#Clevr Johnson, Justin, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick.
"CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning." CVPR 2017
52
Visual Reasoning: Programming
(Slides by Fran Roldan) Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Fei-Fei Li, Larry
Zitnick, Ross Girshick, “Inferring and Executing Programs for Visual Reasoning”. ICCV 2017
Program Generator, Execution Engine
53
Visual Reasoning: Relation Networks
#RN Santoro, Adam, David Raposo, David G. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy
Lillicrap. "A simple neural network module for relational reasoning." NIPS 2017.
Relation Networks concatenate every possible pair of objects with an encoded question, and an MLP then finds the
answer.
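The pairing scheme described on this slide can be sketched as follows, assuming the usual two-stage form in which a function `g` scores each (object, object, question) triple, the scores are summed, and a second function `f` maps the sum to answer scores. The tiny random single-layer perceptrons, the dimensions, and the helper names `make_mlp` and `relation_network` are illustrative assumptions, not the trained model from Santoro et al.

```python
import itertools
import math
import random


def make_mlp(in_dim, out_dim, rng):
    """Single-layer perceptron with ReLU and random weights (stand-in for a trained MLP)."""
    weights = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]

    def forward(x):
        return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]

    return forward


def relation_network(objects, question, g, f):
    """Apply g to every ordered pair of objects joined with the question, sum, then apply f."""
    total = None
    for o_i, o_j in itertools.product(objects, repeat=2):
        out = g(o_i + o_j + question)  # list concatenation: [o_i, o_j, q]
        total = out if total is None else [a + b for a, b in zip(total, out)]
    return f(total)


rng = random.Random(0)
objects = [[rng.random() for _ in range(4)] for _ in range(3)]  # 3 objects, 4-D each
question = [rng.random() for _ in range(2)]                     # 2-D question encoding
g = make_mlp(4 + 4 + 2, 8, rng)   # pairwise relation function
f = make_mlp(8, 5, rng)           # maps the summed relations to 5 answer scores

answer_scores = relation_network(objects, question, g, f)
```

Because the pairwise outputs are summed, the result does not depend on the order in which the objects are listed, which is one reason this design suits sets of detected objects.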
54
Visual Dialog
Das, Abhishek, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. "Visual
Dialog." CVPR 2017 [Project]
55
Visual Dialog
Caros, Mariona, Maite Garolera, Petia Radeva, and Xavier Giro-i-Nieto. "Automatic Reminiscence Therapy for
Dementia." ICMR 2020. [talk]
Demo @ ICMR 2020 (Wednesday 11:00am)
56
Visual Dialog
Caros, Mariona, Maite Garolera, Petia Radeva, and Xavier Giro-i-Nieto. "Automatic Reminiscence Therapy for
Dementia." ICMR 2020. [talk]
57
Hate Speech Detection in Memes
Benet Oriol, Cristian Canton, Xavier Giro-i-Nieto, “Hate Speech in Pixels: Detection of Offensive Memes
towards Automatic Moderation”. NeurIPS 2019 AI for Good Workshop. [code]
Hate Speech Detection
58
Outline
1. Generative Models
a. Text
b. Image
2. Discriminative Models
a. Text
b. Image
3. Representation Learning
4. Control Tasks
59
Encoder
Decoder
Representation
Encoder
Representation
60
Niu, Yulei, Hanwang Zhang, Zhiwu Lu, and Shih-Fu Chang. "Variational Context: Exploiting Visual and Textual Context for
Grounding Referring Expressions." arXiv preprint arXiv:1907.03609 (2019).
Objects from Referring Expressions
61
Video Objects from Referring Expressions
Li, Zhenyang, Ran Tao, Efstratios Gavves, Cees GM Snoek, and Arnold WM Smeulders. "Tracking by natural language
specification." CVPR 2017. [code]
62
Video Object Detection with Transformers
Sadhu, A., Chen, K., & Nevatia, R. (2020). Video Object Grounding using Semantic Roles in Language Description. CVPR 2020.
63
#Mattnet Yu, Licheng, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L. Berg. "Mattnet: Modular
attention network for referring expression comprehension." CVPR 2018. [code]
Segments from Referring Expressions
64
Khoreva, Anna, Anna Rohrbach, and Bernt Schiele. "Video object segmentation with language referring expressions." ACCV
2018.
Video Objects from Referring Expressions
65
Herrera-Palacio, Alba, Carles Ventura, and Xavier Giro-i-Nieto. "Video object linguistic grounding." ACM Multimedia
Workshops 2019.
Video Objects from Referring Expressions
66
#RefVOS Bellver, Miriam, Carles Ventura, Carina Silberer, Ioannis Kazakos, Jordi Torres, and Xavier Giro-i-Nieto.
"RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation." arXiv preprint
arXiv:2010.00263 (2020).
Video Objects from Referring Expressions
67
#RefVOS Bellver, Miriam, Carles Ventura, Carina Silberer, Ioannis Kazakos, Jordi Torres, and Xavier Giro-i-Nieto.
"RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation." arXiv preprint
arXiv:2010.00263 (2020).
Video Objects from Referring Expressions
68
#SynthRef Ioannis Kazakos, Miriam Bellver, Carles Ventura, Carina Silberer, and Xavier Giro-i-Nieto, “Generation
of Synthetic Referring Expressions for Object Segmentation” (submitted)
Synthetic Expressions w/ Scene Graphs
69
#SynthRef Ioannis Kazakos, Miriam Bellver, Carles Ventura, Carina Silberer, and Xavier Giro-i-Nieto, “Generation
of Synthetic Referring Expressions for Object Segmentation” (submitted)
70
Segments from Questions
Gan, Chuang, Yandong Li, Haoxiang Li, Chen Sun, and Boqing Gong. "VQS: Linking segmentations to questions and
answers for supervised attention in vqa and question-focused semantic segmentation." ICCV 2017.
71
Outline
1. Generative Models
a. Text
b. Image
2. Discriminative Models
a. Text
b. Image
3. Representation Learning
4. Control Tasks
72
Encoder Encoder
Representation
73
Joint Representations (Embeddings)
#Devise Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "Devise: A deep
visual-semantic embedding model." NIPS 2013
74
Zero-shot learning
Socher, R., Ganjoo, M., Manning, C. D., & Ng, A., Zero-shot learning through cross-modal transfer. NIPS 2013 [slides] [code]
No images of “cat” in the training set...
...but they can still be recognised as “cats” thanks to the representations learned from text.
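A minimal sketch of the idea on this slide, assuming that images have already been projected into the same space as word vectors and that classification is a nearest-neighbour search by cosine similarity. The three-dimensional vectors are made-up toy values, and the full method of Socher et al. involves more than this lookup; the point is only that a class with no training images ("cat") can still be predicted because its word vector exists.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy word vectors (hypothetical values). "cat" has a word vector learned from text,
# even though no cat images were seen when the image-to-word mapping was trained.
word_vectors = {
    "dog": [1.0, 0.0, 0.2],
    "truck": [0.0, 1.0, 0.0],
    "cat": [0.9, 0.1, 0.6],
}

# Hypothetical output of the image encoder for a test photo, in word-vector space.
image_embedding = [0.8, 0.1, 0.7]

# Predict the class whose word vector is closest to the image embedding.
predicted = max(word_vectors, key=lambda name: cosine(image_embedding, word_vectors[name]))
print(predicted)  # cat
```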
75
Multimodal Retrieval
Kiros, Ryan, Ruslan Salakhutdinov, and Richard S. Zemel. "Unifying visual-semantic embeddings with multimodal neural
language models." NeurIPS 2014 Deep Learning Workshop.
76
Multimodal Retrieval
Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal Scene Networks."
CVPR 2016.
77
Multimodal Retrieval
Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal Scene Networks."
CVPR 2016.
78
Image and text retrieval with joint embeddings.
Joint Neural Embeddings
#pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio
Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017 [video]
79
#pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio Torralba,
“Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017 [video]
Joint Neural Embeddings
80
Joint Neural Embeddings
Figure labels: joint embedding; LSTM; Bidirectional LSTM
#pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio Torralba,
“Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017
81
Representations
Sariyildiz, Mert Bulent, Julien Perez, and Diane Larlus. "Learning Visual Representations with Caption Annotations." ECCV
2020. [tweet]
82
Representations
#ViLBERT Lu, Jiasen, Dhruv Batra, Devi Parikh, and Stefan Lee. "Vilbert: Pretraining task-agnostic visiolinguistic
representations for vision-and-language tasks." NeurIPS 2019. [MIT talk by Devi Parikh] [demo]
Visual Task: Predict the visual categories for the masked image region.
Language Task: Predict the masked word (same as in language-only BERT).
83
Representations
#ViLBERT Lu, Jiasen, Dhruv Batra, Devi Parikh, and Stefan Lee. "Vilbert: Pretraining task-agnostic visiolinguistic
representations for vision-and-language tasks." NeurIPS 2019. [MIT talk by Devi Parikh] [demo]
Multimodal Task: Predict whether the image corresponds to the caption.
84
Representations
#VideoBERT Sun, Chen, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. "Videobert: A joint model for video
and language representation learning." ICCV 2019.
85
Representations
#VideoBERT Sun, Chen, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. "Videobert: A joint model for video
and language representation learning." ICCV 2019.
Rich representations can be used to retrieve matching video frames, which are encoded after vector quantization.
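Vector quantization, as mentioned on this slide, can be sketched as a nearest-centroid lookup that turns each continuous visual feature into a discrete token index. The centroids and features below are toy values, and the function names are illustrative; how the codebook is actually learned in VideoBERT is not reproduced here.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features, centroids):
    """Map each feature vector to the index of its nearest centroid (its visual token)."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(feature, centroids[k]))
        for feature in features
    ]


# Toy codebook (hypothetical): in practice it would be learned by clustering many features.
centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]

# Toy features for three consecutive video clips.
clip_features = [[0.1, -0.1], [4.8, 5.2], [0.9, 1.2]]

visual_tokens = quantize(clip_features, centroids)
print(visual_tokens)  # [0, 2, 1]
```

Once frames are discrete tokens, they can be placed in the same sequence as word tokens and handled by a BERT-style model.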
86
Representations
#VirTex Karan Desai, Justin Johnson, “VirTex: Learning Visual Representations from Textual Annotations” arXiv 2020
[tweet]
87
Learning Language from Video
Doughty, Hazel, Ivan Laptev, Walterio Mayol-Cuevas, and Dima Damen. "Action Modifiers: Learning from Adverbs in
Instructional Videos." CVPR 2020.
88
Learning Language from Video
Surís, Dídac, Dave Epstein, Heng Ji, Shih-Fu Chang, and Carl Vondrick. "Learning to Learn Words from Visual Scenes." ECCV
2020.
89
Outline
1. Generative Models
a. Text
b. Image
2. Discriminative Models
a. Text
b. Image
3. Representation Learning
4. Control Tasks
90
Platforms for Embodied AI
#Habitat Savva, Manolis, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub et
al. "Habitat: A platform for embodied ai research." ICCV 2019. [site]
91
Navigation
Fried, Daniel, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor
Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. "Speaker-Follower Models for Vision-and-Language
Navigation." NeurIPS 2018.
92
Navigation
#R2R Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., ... & van den Hengel, A. Vision-and-language
navigation: Interpreting visually-grounded navigation instructions in real environments. CVPR 2018. [tweet]
93
Navigation
#RxR Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge, “Room-Across-Room:
Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding” EMNLP 2020.
94
Navigation
Ünal, Emre, Ozan Arkan Can, and Yücel Yemez. "Visually Grounded Language Learning For Robot Navigation." ACM Multimedia
Workshops 2019.
95
Object manipulation
Hill, F., Lampinen, A. K., Schneider, R., Clark, S., Botvinick, M., McClelland, J. L., & Santoro, A. Environmental drivers of
systematicity and generalization in a situated agent. ICLR 2020. [talk]
96
Outline
1. Generative Models
a. Text
b. Image
2. Discriminative Models
a. Text
b. Image
3. Representation Learning
4. Control Tasks
97
My take home message
1. Generative Models
a. Text
b. Vision
2. Discriminative Models
a. Text
b. Vision
3. Feature Learning
4. Control Tasks
Xavier Giro-i-Nieto
@DocXavi
xavier.giro@upc.edu
Associate Professor
Universitat Politècnica de Catalunya
Barcelona Supercomputing Center
Go raibh maith agat / Thank you
Was this tutorial helpful? Please consider citing:
Giro-i-Nieto, X. One Perceptron to Rule Them All: Language, Vision, Audio and Speech. In Proceedings of the 2020
International Conference on Multimedia Retrieval (pp. 7-8).

More Related Content

What's hot

Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)Universitat Politècnica de Catalunya
 
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...Universitat Politècnica de Catalunya
 
Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...
Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...
Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...Universitat Politècnica de Catalunya
 
Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019
Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019
Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019Universitat Politècnica de Catalunya
 
Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)
Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)
Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)Universitat Politècnica de Catalunya
 
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...Universitat Politècnica de Catalunya
 
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC BarcelonaSelf-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC BarcelonaUniversitat Politècnica de Catalunya
 
Advanced Deep Architectures (D2L6 Deep Learning for Speech and Language UPC 2...
Advanced Deep Architectures (D2L6 Deep Learning for Speech and Language UPC 2...Advanced Deep Architectures (D2L6 Deep Learning for Speech and Language UPC 2...
Advanced Deep Architectures (D2L6 Deep Learning for Speech and Language UPC 2...Universitat Politècnica de Catalunya
 
Learning with Videos (D4L4 2017 UPC Deep Learning for Computer Vision)
Learning with Videos  (D4L4 2017 UPC Deep Learning for Computer Vision)Learning with Videos  (D4L4 2017 UPC Deep Learning for Computer Vision)
Learning with Videos (D4L4 2017 UPC Deep Learning for Computer Vision)Universitat Politècnica de Catalunya
 
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...Universitat Politècnica de Catalunya
 
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019Universitat Politècnica de Catalunya
 
Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...
Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...
Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...Universitat Politècnica de Catalunya
 
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019Universitat Politècnica de Catalunya
 

What's hot (20)

Deep Video Object Segmentation - Xavier Giro - UPC Barcelona 2019
Deep Video Object Segmentation - Xavier Giro - UPC Barcelona 2019Deep Video Object Segmentation - Xavier Giro - UPC Barcelona 2019
Deep Video Object Segmentation - Xavier Giro - UPC Barcelona 2019
 
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
 
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...
 
Video Analysis (D4L2 2017 UPC Deep Learning for Computer Vision)
Video Analysis (D4L2 2017 UPC Deep Learning for Computer Vision)Video Analysis (D4L2 2017 UPC Deep Learning for Computer Vision)
Video Analysis (D4L2 2017 UPC Deep Learning for Computer Vision)
 
Deep Learning for Video: Action Recognition (UPC 2018)
Deep Learning for Video: Action Recognition (UPC 2018)Deep Learning for Video: Action Recognition (UPC 2018)
Deep Learning for Video: Action Recognition (UPC 2018)
 
Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...
Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...
Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...
 
Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019
Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019
Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019
 
Deep Learning from Videos (UPC 2018)
Deep Learning from Videos (UPC 2018)Deep Learning from Videos (UPC 2018)
Deep Learning from Videos (UPC 2018)
 
Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)
Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)
Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)
 
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...
 
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC BarcelonaSelf-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
 
Advanced Deep Architectures (D2L6 Deep Learning for Speech and Language UPC 2...
Advanced Deep Architectures (D2L6 Deep Learning for Speech and Language UPC 2...Advanced Deep Architectures (D2L6 Deep Learning for Speech and Language UPC 2...
Advanced Deep Architectures (D2L6 Deep Learning for Speech and Language UPC 2...
 
Learning with Videos (D4L4 2017 UPC Deep Learning for Computer Vision)
Learning with Videos  (D4L4 2017 UPC Deep Learning for Computer Vision)Learning with Videos  (D4L4 2017 UPC Deep Learning for Computer Vision)
Learning with Videos (D4L4 2017 UPC Deep Learning for Computer Vision)
 
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
 
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
 
Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...
Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...
Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...
 
Deep Learning for Video: Language (UPC 2018)
Deep Learning for Video: Language (UPC 2018)Deep Learning for Video: Language (UPC 2018)
Deep Learning for Video: Language (UPC 2018)
 
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
 
Welcome (D1L1 2017 UPC Deep Learning for Computer Vision)
Welcome (D1L1 2017 UPC Deep Learning for Computer Vision)Welcome (D1L1 2017 UPC Deep Learning for Computer Vision)
Welcome (D1L1 2017 UPC Deep Learning for Computer Vision)
 
Speaker ID II (D4L1 Deep Learning for Speech and Language UPC 2017)
Speaker ID II (D4L1 Deep Learning for Speech and Language UPC 2017)Speaker ID II (D4L1 Deep Learning for Speech and Language UPC 2017)
Speaker ID II (D4L1 Deep Learning for Speech and Language UPC 2017)
 

Similar to Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)

One Perceptron to Rule them All: Deep Learning for Multimedia #A2IC2018
One Perceptron  to Rule them All: Deep Learning for Multimedia #A2IC2018One Perceptron  to Rule them All: Deep Learning for Multimedia #A2IC2018
One Perceptron to Rule them All: Deep Learning for Multimedia #A2IC2018Universitat Politècnica de Catalunya
 
One Perceptron to Rule Them All (Re-Work Deep Learning Summit, London 2017)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)

  • 1. Xavier Giro-i-Nieto @DocXavi xavier.giro@upc.edu Associate Professor Universitat Politècnica de Catalunya Barcelona Supercomputing Center Slides @DocXavi Tutorial: One Perceptron to Rule Them All Part III: Language & Vision
  • 4. 4 Outline 1. Generative Models a. Text b. Vision 2. Discriminative Models a. Text b. Vision 3. Representation Learning 4. Control Tasks
  • 7. 7 #ShowAndTell Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015. Image Captioning with RNN
  • 8. 8 #DeepImageSent Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015 (Slides by Marc Bolaños) Image Captioning with RNN
  • 9. 9 Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015 Image Captioning with RNN & Attention
  • 10. 10 Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015 Image Captioning with RNN & Attention
  • 11. 11 Cornia, Marcella, Matteo Stefanini, Lorenzo Baraldi, and Rita Cucchiara. "Meshed-Memory Transformer for Image Captioning." CVPR 2020. [tweet] Image Captioning with Transformers
  • 12. 12 Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016 Dense Captioning
  • 13. 13 XAVI: “man has short hair”, “man with short hair” AMAIA: “a woman wearing a black shirt” BOTH: “two men wearing black glasses” Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016 Dense Captioning
  • 14. 14 Recipe Generation Salvador, Amaia, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. "Inverse Cooking: Recipe Generation from Food Images." CVPR 2019.
  • 15. 15 Recipe Generation Title: Edamame corn salad Ingredients pepper, corn, onion, edamame, salt, vinegar, cilantro, avocado, oil Instructions - In a large bowl, combine edamame, corn, red onion, cilantro, avocado, and red bell pepper. - In a small bowl, whisk together olive oil, vinegar, salt, and pepper. - Pour dressing over edamame mixture and toss to coat. - Cover and refrigerate for at least 1 hour before serving. Salvador, Amaia, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. "Inverse Cooking: Recipe Generation from Food Images." CVPR 2019.
  • 16. 16 #Equalizer Burns, Kaylee, Lisa Anne Hendricks, Trevor Darrell, and Anna Rohrbach. "Women also Snowboard: Overcoming Bias in Captioning Models." ECCV 2018. Fighting Data Bias in Captioning
  • 17. 17 #Equalizer Burns, Kaylee, Lisa Anne Hendricks, Trevor Darrell, and Anna Rohrbach. "Women also Snowboard: Overcoming Bias in Captioning Models." ECCV 2018. Fighting Data Bias in Captioning
  • 18. 18 Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015. [code] Video Captioning
  • 19. 19 Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016. (Slides by Marc Bolaños) Captioning: Video [Figure labels: LSTM unit (2nd layer); time axis from t = 1 to t = T; image; hidden state at t = T; first chunk of data]
  • 20. 20 Multimodal Machine Translation Challenge on Multimodal Image Translation: http://www.statmt.org/wmt17/multimodal-task.html#task1
  • 21. 21 Multimodal Machine Translation Sulubacak, Umut, Ozan Caglayan, Stig-Arne Grönroos, Aku Rouhe, Desmond Elliott, Lucia Specia, and Jörg Tiedemann. "Multimodal machine translation through visuals and speech." Machine Translation (2020): 1-51. [tweet]
  • 22. 22 Sign Language Translation with RNN+Att Camgoz, Necati Cihan, et al. Neural Sign Language Translation. CVPR 2018.
  • 23. 23 Sign Language Translation with Transformers Necati Cihan Camgoz, Oscar Koller, Simon Hadfield, Richard Bowden, “Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation” CVPR 2020.
  • 24. 24 Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level Lipreading." (2016).
  • 25. 25 Lip Reading Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level Lipreading." (2016).
  • 26. 26 Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
  • 27. 27 Lipreading: Watch, Listen, Attend & Spell Audio features Image features Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
  • 28. 28 Lipreading: Watch, Listen, Attend & Spell Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017 Attention over output states from audio and video is computed at each timestep
  • 29. 29 Lipreading Afouras, Triantafyllos, Joon Son Chung, and Andrew Zisserman. "Deep Lip Reading: a comparison of models and an online application." Interspeech 2018.
  • 30. 30 Image Captioning Grounded on Detected Objects Lu, Jiasen and Yang, Jianwei and Batra, Dhruv and Parikh, Devi “Neural Baby Talk” CVPR 2018 [code]
  • 31. 31 Lu, Jiasen and Yang, Jianwei and Batra, Dhruv and Parikh, Devi “Neural Baby Talk” CVPR 2018 [code] Image Captioning Grounded on Detected Objects
  • 32. 32 Akbari, Hassan, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, and Shih-Fu Chang. "Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding." CVPR 2019. [code] Image Captioning Grounded on Heatmaps
  • 33. 33 Outline 1. Generative Models a. Text b. Vision 2. Discriminative Models a. Text b. Vision 3. Representation Learning 4. Control Tasks
  • 35. 35 Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial text to image synthesis." ICML 2016. Image Generation
  • 36. 36 Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial text to image synthesis." ICML 2016. [code] Image Generation
  • 37. 37 Image Synthesis #StackGAN Zhang, Han, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas. "Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks." ICCV 2017. [code]
  • 38. 38 Image Synthesis with Cycle Consistency #MirrorGAN Qiao, Tingting, Jing Zhang, Duanqing Xu, and Dacheng Tao. "Mirrorgan: Learning text-to-image generation by redescription." CVPR 2019. [code]
  • 39. 39 Image Synthesis with Cycle Consistency #MirrorGAN Qiao, Tingting, Jing Zhang, Duanqing Xu, and Dacheng Tao. "Mirrorgan: Learning text-to-image generation by redescription." CVPR 2019. [code]
  • 40. 40 Justin Johnson, Agrim Gupta, Li Fei-Fei, “Image Generation from Scene Graphs” CVPR 2018 Image Generation via Scene Graphs
  • 41. 41 #Text2Scene Tan, Fuwen, Song Feng, and Vicente Ordonez. "Text2Scene: Generating Compositional Scenes From Textual Descriptions." CVPR 2019 [blog].
  • 42. 42 #CRAFT Gupta, Tanmay, Dustin Schwenk, Ali Farhadi, Derek Hoiem, and Aniruddha Kembhavi. "Imagine this! scripts to compositions to videos." ECCV 2018 Video Generation by Composition
  • 43. 43 Saunders, B., Camgoz, N. C., & Bowden, R. (2020). Progressive Transformers for End-to-End Sign Language Production. ECCV 2020. Sign Language Generation with Transformers
  • 44. 44 Lucas Ventura, Amanda Duarte, Xavier Giro-i-Nieto, “Can Everybody Sign Now? Exploring Sign Language Video Generation from 2D Poses”. ECCV SLRTP Workshop 2020. Sign Language Generation (pose 2 pixels)
  • 45. 45 Outline 1. Generative Models a. Text b. Vision 2. Discriminative Models a. Text b. Vision 3. Representation Learning 4. Control Tasks
  • 47. 47 Visual Question Answering Antol, Stanislaw, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. "VQA: Visual question answering." ICCV 2015.
  • 48. 48 Visual Question Answering (VQA) Francisco Roldán, Issey Masuda, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto. "Visual Question-Answering 2.0." ETSETB UPC TelecomBCN (2017).
  • 49. 49 Noh, H., Seo, P. H., & Han, B. Image question answering using convolutional neural network with dynamic parameter prediction. CVPR 2016 Dynamic Parameter Prediction Network (DPPnet) Visual Question Answering (VQA)
  • 50. 50 VQA: Dynamic Memory Networks (Slides and Slidecast by Santi Pascual): Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016
  • 51. 51 Visual Reasoning #Clevr Johnson, Justin, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. "CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning." CVPR 2017
  • 52. 52 Visual Reasoning: Programming (Slides by Fran Roldan) Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Fei-Fei Li, Larry Zitnick, Ross Girshick, “Inferring and Executing Programs for Visual Reasoning”. ICCV 2017 Program Generator Execution Engine
  • 53. 53 Visual Reasoning: Relation Networks #RN Santoro, Adam, David Raposo, David G. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. "A simple neural network module for relational reasoning." NIPS 2017. Relation Networks concatenate all possible pairs of objects with an encoded question, and later find the answer with an MLP.
  • 54. 54 Visual Dialog Das, Abhishek, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. "Visual Dialog." CVPR 2017 [Project]
  • 55. 55 Visual Dialog Caros, Mariona, Maite Garolera, Petia Radeva, and Xavier Giro-i-Nieto. "Automatic Reminiscence Therapy for Dementia." ICMR 2020. [talk] Demo @ ICMR 2020 (Wednesday 11:00am)
  • 56. 56 Visual Dialog Caros, Mariona, Maite Garolera, Petia Radeva, and Xavier Giro-i-Nieto. "Automatic Reminiscence Therapy for Dementia." ICMR 2020. [talk]
  • 57. 57 Hate Speech Detection in Memes Benet Oriol, Cristian Canton, Xavier Giro-i-Nieto, “Hate Speech in Pixels: Detection of Offensive Memes towards Automatic Moderation”. NeurIPS 2019 AI for Good Workshop. [code] Hate Speech Detection
  • 58. 58 Outline 1. Generative Models a. Text b. Image 2. Discriminative Models a. Text b. Image 3. Representation Learning 4. Control Tasks
  • 60. 60 Niu, Yulei, Hanwang Zhang, Zhiwu Lu, and Shih-Fu Chang. "Variational Context: Exploiting Visual and Textual Context for Grounding Referring Expressions." arXiv preprint arXiv:1907.03609 (2019). Objects from Referring Expressions
  • 61. 61 Video Objects from Referring Expressions Li, Zhenyang, Ran Tao, Efstratios Gavves, Cees GM Snoek, and Arnold WM Smeulders. "Tracking by natural language specification." CVPR 2017. [code]
  • 62. 62 Video Object Detection with Transformers Sadhu, A., Chen, K., & Nevatia, R. (2020). Video Object Grounding using Semantic Roles in Language Description. CVPR 2020.
  • 63. 63 #Mattnet Yu, Licheng, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L. Berg. "Mattnet: Modular attention network for referring expression comprehension." CVPR 2018. [code] Segments from Referring Expressions
  • 64. 64 Khoreva, Anna, Anna Rohrbach, and Bernt Schiele. "Video object segmentation with language referring expressions." ACCV 2018. Video Objects from Referring Expressions
  • 65. 65 Herrera-Palacio, Alba, Carles Ventura, and Xavier Giro-i-Nieto. "Video object linguistic grounding." ACM Multimedia Workshops 2019. Video Objects from Referring Expressions
  • 66. 66 #RefVOS Bellver, Miriam, Carles Ventura, Carina Silberer, Ioannis Kazakos, Jordi Torres, and Xavier Giro-i-Nieto. "RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation." arXiv preprint arXiv:2010.00263 (2020). Video Objects from Referring Expressions
  • 67. 67 #RefVOS Bellver, Miriam, Carles Ventura, Carina Silberer, Ioannis Kazakos, Jordi Torres, and Xavier Giro-i-Nieto. "RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation." arXiv preprint arXiv:2010.00263 (2020). Video Objects from Referring Expressions
  • 68. 68 #SynthRef Ioannis Kazakos, Bellver, Miriam, Carles Ventura, Carina Silberer, and Xavier Giro-i-Nieto, “Generation of Synthetic Referring Expressions for Object Segmentation” (submitted) Synthetic Expressions w/ Scene Graphs
  • 69. 69 #SynthRef Ioannis Kazakos, Bellver, Miriam, Carles Ventura, Carina Silberer, and Xavier Giro-i-Nieto, “Generation of Synthetic Referring Expressions for Object Segmentation” (submitted)
  • 70. Segments from Questions Gan, Chuang, Yandong Li, Haoxiang Li, Chen Sun, and Boqing Gong. "VQS: Linking segmentations to questions and answers for supervised attention in vqa and question-focused semantic segmentation." ICCV 2017.
  • 71. 71 Outline 1. Generative Models a. Text b. Image 2. Discriminative Models a. Text b. Image 3. Representation Learning 4. Control Tasks
  • 73. 73 Joint Representations (Embeddings) #Devise Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "Devise: A deep visual-semantic embedding model." NIPS 2013
  • 74. 74 Zero-shot learning Socher, R., Ganjoo, M., Manning, C. D., & Ng, A. Zero-shot learning through cross-modal transfer. NIPS 2013 [slides] [code] There are no images of “cat” in the training set, but cats can still be recognised as “cats” thanks to the representations learned from text.
  • 75. 75 Multimodal Retrieval Kiros, Ryan, Ruslan Salakhutdinov, and Richard S. Zemel. "Unifying visual-semantic embeddings with multimodal neural language models." NeurIPS 2014 Deep Learning Workshop.
  • 76. 76 Multimodal Retrieval Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal Scene Networks." CVPR 2016.
  • 77. 77 Multimodal Retrieval Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal Scene Networks." CVPR 2016.
  • 78. 78 Image and text retrieval with joint embeddings. Joint Neural Embeddings #pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017 [video]
  • 79. 79 #pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017 [video] Joint Neural Embeddings
  • 80. 80 Joint Neural Embeddings joint embedding LSTM Bidirectional LSTM #pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017
  • 81. 81 Representations Sariyildiz, Mert Bulent, Julien Perez, and Diane Larlus. "Learning Visual Representations with Caption Annotations." ECCV 2020. [tweet]
  • 82. 82 Representations #ViLBERT Lu, Jiasen, Dhruv Batra, Devi Parikh, and Stefan Lee. "Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks." NeurIPS 2019. [MIT talk by Devi Parikh] [demo] Visual Task: Predict the visual categories for the masked video frame Language Task: Predict the masked word (same as in language-only BERT).
  • 83. 83 Representations #ViLBERT Lu, Jiasen, Dhruv Batra, Devi Parikh, and Stefan Lee. "Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks." NeurIPS 2019. [MIT talk by Devi Parikh] [demo] Multimodal Task: Predict whether the video frames correspond to the caption.
  • 84. 84 Representations #VideoBERT Sun, Chen, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. "Videobert: A joint model for video and language representation learning." ICCV 2019.
  • 85. 85 Representations #VideoBERT Sun, Chen, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. "Videobert: A joint model for video and language representation learning." ICCV 2019. Rich representations can be used to retrieve matching video frames, which are encoded after vector quantization.
  • 86. 86 Representations #VirTex Karan Desai, Justin Johnson, “VirTex: Learning Visual Representations from Textual Annotations” arXiv 2020 [tweet]
  • 87. 87 Learning Language from Video Doughty, Hazel, Ivan Laptev, Walterio Mayol-Cuevas, and Dima Damen. "Action Modifiers: Learning from Adverbs in Instructional Videos." CVPR 2020.
  • 88. 88 Learning Language from Video Surís, Dídac, Dave Epstein, Heng Ji, Shih-Fu Chang, and Carl Vondrick. "Learning to Learn Words from Visual Scenes." ECCV 2020.
  • 89. 89 Outline 1. Generative Models a. Text b. Image 2. Discriminative Models a. Text b. Image 3. Representation Learning 4. Control Tasks
  • 90. 90 Platforms for Embodied AI #Habitat Savva, Manolis, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub et al. "Habitat: A platform for embodied ai research." ICCV 2019. [site]
  • 91. 91 Navigation Fried, Daniel, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. "Speaker-Follower Models for Vision-and-Language Navigation." NeurIPS 2018.
  • 92. 92 Navigation #R2R Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., ... & van den Hengel, A. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. CVPR 2018. [tweet]
  • 93. 93 Navigation #RxR Alexander Ku and Peter Anderson and Roma Patel and Eugene Ie and Jason Baldridge, “Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding” EMNLP 2020.
  • 94. 94 Navigation Ünal, Emre, Ozan Arkan Can, and Yücel Yemez. "Visually Grounded Language Learning For Robot Navigation." ACMMM Workshops 2019.
  • 95. 95 Object manipulation Hill, F., Lampinen, A. K., Schneider, R., Clark, S., Botvinick, M., McClelland, J. L., & Santoro, A. Environmental drivers of systematicity and generalization in a situated agent. ICLR 2020. [talk]
  • 96. 96 Outline 1. Generative Models a. Text b. Image 2. Discriminative Models a. Text b. Image 3. Representation Learning 4. Control Tasks
  • 97. 97 My take home message 1. Generative Models a. Text b. Vision 2. Discriminative Models a. Text b. Vision 3. Feature Learning 4. Control Tasks
  • 98. Xavier Giro-i-Nieto @DocXavi xavier.giro@upc.edu Associate Professor Universitat Politècnica de Catalunya Barcelona Supercomputing Center Was this tutorial helpful? Please consider citing: Giro-i-Nieto, X. One Perceptron to Rule Them All: Language, Vision, Audio and Speech. In Proceedings of the 2020 International Conference on Multimedia Retrieval (pp. 7-8). Go raibh maith agat / Thank you
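
Slide 53 describes Relation Networks as concatenating all possible pairs of objects with an encoded question and then finding the answer with an MLP. The NumPy sketch below is one possible reading of that description, following the usual formulation in Santoro et al. in which a first MLP scores each (object, object, question) triple, the scores are summed over all pairs, and a second MLP produces the answer logits. The function names (`mlp`, `init_mlp`, `relation_network`), the layer sizes, and the random untrained weights are illustrative assumptions and do not come from the slides.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Random (untrained) weights for an MLP with the given layer sizes."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(x, weights):
    """Apply the layers in order, with a ReLU after every layer but the last."""
    for i, (W, b) in enumerate(weights):
        x = x @ W + b
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)
    return x

def relation_network(objects, question, g_weights, f_weights):
    """objects: (n, d) object features; question: (q,) encoded question."""
    n = objects.shape[0]
    left = np.repeat(objects, n, axis=0)    # o_i for every ordered pair (i, j)
    right = np.tile(objects, (n, 1))        # o_j for every ordered pair (i, j)
    q = np.tile(question, (n * n, 1))       # the same question for each pair
    pairs = np.concatenate([left, right, q], axis=1)   # (n*n, 2d + q)
    relations = mlp(pairs, g_weights)       # one relation vector per pair
    pooled = relations.sum(axis=0)          # order-independent aggregation
    return mlp(pooled, f_weights)           # answer logits

# Hypothetical sizes: 5 objects of dimension 8, a 4-d question, 3 answers.
rng = np.random.default_rng(0)
objects = rng.normal(size=(5, 8))
question = rng.normal(size=4)
g_weights = init_mlp(rng, [2 * 8 + 4, 16, 16])
f_weights = init_mlp(rng, [16, 16, 3])
logits = relation_network(objects, question, g_weights, f_weights)
```

Because the pair representations are summed before the final MLP, the output does not depend on the order in which the objects are listed, which is one reason this aggregation is a natural fit for sets of detected objects.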