Xavier Giro-i-Nieto
Associate Professor
Universitat Politecnica de Catalunya
@DocXavi
xavier.giro@upc.edu
Self-Supervised Audio-Visual Learning
Lecture 16
[course site]
Video-lecture
Acknowledgments
Amaia Salvador
Jordi Pons
Amanda Duarte
Dídac Surís
Margarita Geleta
Cristina Puntí
4
Outline
1. Motivation
2. Feature Learning
3. Cross-modal Translation
4. Embodied AI
5
[Figure: Encoder → Representation → Decoder]
6
Vision
Audio
Video
Synchronization among modalities captured by video is
exploited in a self-supervised manner.
Self-supervised Learning
7
[Figure: Encoder → Representation → Decoder; the representation is what is learned]
Self-supervised Learning
8
Outline
1. Motivation
2. Feature Learning
3. Cross-modal Translation
9
Self-supervised Feature Learning
Reference: Andrew Zisserman (PAISS 2018)
Self-supervised feature learning is a form of unsupervised learning where the
raw data provides the supervision.
● A pretext (or surrogate) task must be designed.
● By defining a proxy loss, the NN learns representations, which should be
valuable for the actual downstream task.
[Figure: Unlabeled data (X) → Representations learned without labels]
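To make the recipe concrete, here is a minimal PyTorch sketch of the generic pattern: an encoder and a small pretext head are trained with a proxy loss on pseudo-labels derived from the raw data itself, and the encoder is then reused for the downstream task. All names and shapes (pretext_head, make_pretext_batch, the toy encoder) are illustrative placeholders, not taken from any of the cited papers.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the generic self-supervised recipe (all choices are placeholders).
encoder = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())        # frame -> representation
pretext_head = nn.Linear(32, 10)                                      # proxy task, e.g. 10 pseudo-classes
proxy_loss = nn.CrossEntropyLoss()
opt = torch.optim.Adam(list(encoder.parameters()) + list(pretext_head.parameters()), lr=1e-3)

def make_pretext_batch():
    """Placeholder: derive (input, pseudo-label) pairs from the unlabeled data itself."""
    x = torch.randn(16, 3, 64, 64)                # unlabeled frames
    y = torch.randint(0, 10, (16,))               # pseudo-labels defined by the pretext task
    return x, y

for step in range(5):                             # pretext (pre-training) phase
    x, y = make_pretext_batch()
    loss = proxy_loss(pretext_head(encoder(x)), y)
    opt.zero_grad(); loss.backward(); opt.step()

# Downstream phase: keep the pretrained encoder, swap in a new head for the real task.
downstream_head = nn.Linear(32, 5)
```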
10
Self-supervised Feature Learning
Source: Ankesh Anand, “Contrastive Self-Supervised Learning” (2020)
11
Outline
1. Motivation
2. Feature Learning
a. Generative / Predictive Methods
b. Contrastive Methods
3. Cross-modal Translation
4. Embodied AI
12
Self-supervised Feature Learning
Source: Ankesh Anand, “Contrastive Self-Supervised Learning” (2020)
13
[Figure: Encoder + Encoder → Representation]
14
Prediction of Audio Features (stats)
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Based on the assumption that ambient sound in video is related to the visual
semantics.
15
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Pretext task: Use videos to train a CNN that predicts the audio statistics of
a frame.
Prediction of Audio Features (stats)
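As a rough sketch of this pretext task (not the authors' exact formulation, which works with sound-texture statistics that are further clustered), an image CNN can be trained to regress a fixed-length vector of summary statistics computed from the audio around each frame. The statistics function, its dimension D_STATS, and the backbone below are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Hedged sketch: regress a vector of audio summary statistics from a single frame.
D_STATS = 64

frame_cnn = models.resnet18(weights=None)
frame_cnn.fc = nn.Linear(frame_cnn.fc.in_features, D_STATS)   # predict audio stats from the frame

criterion = nn.MSELoss()
optimizer = torch.optim.SGD(frame_cnn.parameters(), lr=0.01, momentum=0.9)

def audio_statistics(waveform):
    """Placeholder: summarize the audio around a frame (crude band-wise energy stats)."""
    spec = torch.stft(waveform, n_fft=512, window=torch.hann_window(512),
                      return_complex=True).abs()                       # (freq, time)
    feats = torch.cat([spec.mean(dim=1), spec.std(dim=1)])             # per-band mean and std
    return feats[:D_STATS]

frames = torch.randn(8, 3, 224, 224)                   # video frames
waves = torch.randn(8, 16000)                          # ~1 s of mono audio per frame
targets = torch.stack([audio_statistics(w) for w in waves])

loss = criterion(frame_cnn(frames), targets)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```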
16
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Downstream task: Use the predicted audio stats to cluster images. Audio clusters are built with the K-means algorithm over the training set.
Cluster assignments at test time (one row = one cluster)
Prediction of Audio Features (stats)
17
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Although the CNN was not trained with class labels, local units with semantic
meaning emerge.
Prediction of Audio Features (stats)
18
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman.
"Visually indicated sounds." CVPR 2016.
Pretext task: Predict the cochleagram given a video frame.
Prediction of Audio Features (cochleagram)
19
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman.
"Visually indicated sounds." CVPR 2016.
Downstream task: Retrieve matching sounds for videos of people hitting objects
with a drumstick.
Prediction of Audio Features (cochleagram)
20
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman.
"Visually indicated sounds." CVPR 2016.
The Greatest Hits Dataset
Prediction of Audio Features (cochleagram)
21
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman.
"Visually indicated sounds." CVPR 2016.
Audio Clip Retrieval
Prediction of Audio Features (cochleagram)
22
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman.
"Visually indicated sounds." CVPR 2016.
23
[Figure: Encoder + Encoder → Representation]
24
#SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Teacher network: Visual Recognition (objects & scenes)
Prediction of Image Labels (distillation)
25
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video."
NIPS 2016.
Student network: Learn audio features for environmental sound recognition.
Prediction of Image Labels (distillation)
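The transfer can be sketched as a distillation loss: the frozen vision teacher's class posteriors (over objects and scenes) serve as soft targets that the audio student matches with a KL divergence. The toy 1D conv network and the class counts below are placeholders, not SoundNet's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 1000 + 365          # e.g. ImageNet objects + Places scenes (placeholder split)

# Toy 1D conv student over raw audio (a stand-in for SoundNet).
audio_student = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=64, stride=2), nn.ReLU(),
    nn.Conv1d(16, 32, kernel_size=32, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(32, NUM_CLASSES),
)

def distillation_loss(student_logits, teacher_probs):
    # KL(teacher || student): the audio student matches the vision teacher's posterior.
    return F.kl_div(F.log_softmax(student_logits, dim=1), teacher_probs, reduction="batchmean")

waveforms = torch.randn(4, 1, 22050)                  # raw audio clips
with torch.no_grad():
    # Stand-in for a frozen vision teacher's class posterior on the paired video frames.
    teacher_probs = torch.softmax(torch.randn(4, NUM_CLASSES), dim=1)

loss = distillation_loss(audio_student(waveforms), teacher_probs)
loss.backward()
```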
26
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS
2016.
27
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video."
NIPS 2016.
Learned audio features are good for environmental sound recognition.
Prediction of Image Labels (distillation)
28
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video."
NIPS 2016.
Visualization of the 1D filters over raw audio in conv1.
Prediction of Image Labels (distillation)
29
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video."
NIPS 2016.
Visualization of the 1D filters over raw audio in conv1.
Prediction of Image Labels (distillation)
30
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video."
NIPS 2016.
Visualization of the video frames that most strongly activate a neuron in a late layer (conv7)
Prediction of Image Labels (distillation)
31
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video."
NIPS 2016.
Visualization of the video frames that most strongly activate a neuron in a late layer (conv7)
Prediction of Image Labels (distillation)
32
Acoustic images are aligned in space and synchronized in time during learning.
Prediction of Acoustic Images (distillation)
Perez, A., Sanguineti, V., Morerio, P., & Murino, V. Audio-visual model distillation using acoustic images. WACV 2020. [code]
Acoustic-optical camera
33
Perez, A., Sanguineti, V., Morerio, P., & Murino, V. (2020). Audio-visual model distillation using acoustic images. WACV 2020.
[code]
[Figure: Teacher Network (RGB & Acoustic Images) → Student Network (Audio)]
34
[Figure: Encoder + Encoder → Match? ✔ or ❌]
35
Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
Binary Verification
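A minimal sketch of this audio-visual correspondence task (the backbones and fusion head are simplified assumptions): embed a frame and a 1 s audio spectrogram, concatenate the two embeddings, and classify whether they come from the same moment of the same video. Negatives are made by pairing the frame with audio from another sample in the batch.

```python
import torch
import torch.nn as nn

emb_dim = 128

vision_net = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                           nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, emb_dim))
audio_net = nn.Sequential(nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),   # log-spectrogram input
                          nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, emb_dim))
fusion_head = nn.Sequential(nn.Linear(2 * emb_dim, 128), nn.ReLU(), nn.Linear(128, 1))
bce = nn.BCEWithLogitsLoss()

frames = torch.randn(8, 3, 224, 224)
specs = torch.randn(8, 1, 257, 100)                 # log-spectrograms of the matching 1 s audio

v, a = vision_net(frames), audio_net(specs)
pos_logits = fusion_head(torch.cat([v, a], dim=1))                  # matching pairs
neg_logits = fusion_head(torch.cat([v, a.roll(1, dims=0)], dim=1))  # mismatched pairs (shifted audio)

logits = torch.cat([pos_logits, neg_logits]).squeeze(1)
labels = torch.cat([torch.ones(8), torch.zeros(8)])
loss = bce(logits, labels)
loss.backward()
```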
36
Each mini-column shows the five images that most activate a particular unit of the 512 in pool4 of the vision subnetwork, together with the corresponding heatmaps.
Binary Verification
Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
37
Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
Visual features used to train a linear classifier on ImageNet.
Contrastive Learning (verification)
38
Each mini-column shows sounds that most activate a particular unit of
the 512 in pool4 of the audio subnetwork.
Binary Verification
Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
39
Audio clips built by concatenating the five 1 s samples that most activate a particular unit of the 512 in pool4 of the audio subnetwork.
Binary Verification
Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
40
Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
Audio features achieve state-of-the-art performance.
Contrastive Learning (verification)
41
Owens, A., & Efros, A. A. Audio-visual scene analysis with self-supervised multisensory features. ECCV 2018.
Binary Verification
42
Yang, Karren, Bryan Russell, and Justin Salamon. "Telling Left From Right: Learning Spatial Correspondence of Sight and Sound."
CVPR 2020. [tweet]
Flipped right & left audio channels
Binary Verification
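A hedged sketch of this pretext (architecture and inputs are assumptions, not the paper's): the two stereo channels are randomly swapped, and a network that sees both the video clip and the two-channel audio is trained to predict whether a swap occurred, which forces it to relate the visual layout of the scene to spatial audio cues.

```python
import torch
import torch.nn as nn

# Hedged sketch of the flipped-channels pretext: predict whether L/R audio was swapped.
video_net = nn.Sequential(nn.Conv3d(3, 16, 3, padding=1), nn.ReLU(),
                          nn.AdaptiveAvgPool3d(1), nn.Flatten())        # (B, 16)
audio_net = nn.Sequential(nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),    # 2 channels: left/right spectrograms
                          nn.AdaptiveAvgPool2d(1), nn.Flatten())        # (B, 16)
classifier = nn.Linear(32, 1)
bce = nn.BCEWithLogitsLoss()

clips = torch.randn(4, 3, 8, 112, 112)          # short video clips
stereo_spec = torch.randn(4, 2, 128, 100)       # left/right audio spectrograms

flip = torch.rand(4) < 0.5                      # randomly swap channels for some samples
stereo_spec[flip] = stereo_spec[flip].flip(dims=[1])

logits = classifier(torch.cat([video_net(clips), audio_net(stereo_spec)], dim=1)).squeeze(1)
loss = bce(logits, flip.float())
loss.backward()
```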
43
#XDC Alwassel, H., Mahajan, D., Torresani, L., Ghanem, B., & Tran, D. Self-Supervised Learning by Cross-Modal Audio-Video Clustering. NeurIPS 2020
Binary Verification (clustering)
Iterative training of:
● K-means clustering to improve pseudo-labels.
● Backprop training of the visual (Ev) & audio (Ea) encoders.
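One iteration of this loop can be sketched as follows (a simplification that assumes pre-extracted clip features and uses scikit-learn's K-means): clusters computed from one modality's embeddings provide pseudo-labels that supervise the other modality's encoder with a cross-entropy loss, and the roles are swapped.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

K = 16                                            # number of clusters (placeholder)
feat_dim = 128
visual_encoder = nn.Linear(512, feat_dim)         # toy stand-ins for the real video/audio encoders
audio_encoder = nn.Linear(128, feat_dim)
visual_head = nn.Linear(feat_dim, K)              # predicts audio-derived pseudo-labels
audio_head = nn.Linear(feat_dim, K)               # predicts video-derived pseudo-labels
ce = nn.CrossEntropyLoss()

video_feats = torch.randn(256, 512)               # pre-extracted clip features (assumption)
audio_feats = torch.randn(256, 128)

# Step 1: K-means on each modality's current embeddings -> pseudo-labels.
with torch.no_grad():
    v_emb = visual_encoder(video_feats).numpy()
    a_emb = audio_encoder(audio_feats).numpy()
audio_pseudo = torch.tensor(KMeans(n_clusters=K, n_init=10).fit_predict(a_emb)).long()
video_pseudo = torch.tensor(KMeans(n_clusters=K, n_init=10).fit_predict(v_emb)).long()

# Step 2: cross-modal supervision, each encoder is trained on the *other* modality's clusters.
loss_v = ce(visual_head(visual_encoder(video_feats)), audio_pseudo)
loss_a = ce(audio_head(audio_encoder(audio_feats)), video_pseudo)
(loss_v + loss_a).backward()
```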
44
#XDC Alwassel, H., Mahajan, D., Torresani, L., Ghanem, B., & Tran, D. Self-Supervised Learning by Cross-Modal Audio-Video Clustering. NeurIPS 2020
Binary Verification (clustering)
45
#XDC Alwassel, H., Mahajan, D., Torresani, L., Ghanem, B., & Tran, D. Self-Supervised Learning by Cross-Modal Audio-Video Clustering. NeurIPS 2020
Binary Verification (clustering)
Downstream task: Video action recognition
Self-supervised models pretrained with XDC outperform supervised ones.
46
Outline
1. Motivation
2. Feature Learning
a. Generative / Predictive Methods
b. Contrastive Methods
3. Cross-modal Translation
4. Embodied AI
47
Self-supervised Feature Learning
Source: Ankesh Anand, “Contrastive Self-Supervised Learning” (2020)
48
[Figure: Encoder + Encoder → Representation]
49
Contrastive Learning
Source: Raul Gómez, “Understanding Ranking Loss, Contrastive Loss, Margin Loss, Triplet Loss, Hinge Loss and all
those confusing names” (2019)
50
Contrastive Learning (cross-modal)
#AVID Morgado, Pedro, Nuno Vasconcelos, and Ishan Misra. "Audio-visual instance discrimination with cross-modal
agreement." arXiv preprint arXiv:2004.12943 (2020). [code]
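The core cross-modal instance-discrimination objective can be sketched with an InfoNCE-style loss over a batch: each video embedding should score high against its own audio embedding and low against every other audio in the batch, and vice versa. This batch-wise version is a simplification; AVID itself draws negatives from a memory bank.

```python
import torch
import torch.nn.functional as F

def cross_modal_infonce(v, a, temperature=0.07):
    """v, a: (B, D) video and audio embeddings of the same B instances."""
    v = F.normalize(v, dim=1)
    a = F.normalize(a, dim=1)
    logits = v @ a.t() / temperature            # (B, B) similarity of every video to every audio
    targets = torch.arange(v.size(0))           # the matching pair sits on the diagonal
    # Symmetric loss: video -> audio retrieval and audio -> video retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

v = torch.randn(32, 128, requires_grad=True)
a = torch.randn(32, 128, requires_grad=True)
loss = cross_modal_infonce(v, a)
loss.backward()
```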
51
Contrastive Learning (cosine similarity+class)
Amanda Duarte, Dídac Surís, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal
Embeddings for Video and Audio Retrieval." ECCV Women in Computer Vision Workshop 2018.
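A hedged sketch of this kind of objective (weights, dimensions, and heads are assumptions): a cosine-similarity term pulls matching audio and visual embeddings together and pushes mismatched ones apart, while an auxiliary classifier over the embeddings predicts the class label, which regularizes the joint space.

```python
import torch
import torch.nn as nn

cosine_loss = nn.CosineEmbeddingLoss(margin=0.1)
aux_classifier = nn.Linear(128, 10)              # auxiliary class prediction head (10 classes assumed)
ce = nn.CrossEntropyLoss()

v = torch.randn(16, 128, requires_grad=True)     # visual embeddings
a = torch.randn(16, 128, requires_grad=True)     # audio embeddings
pair_target = torch.randint(0, 2, (16,)) * 2 - 1 # +1 = matching pair, -1 = mismatched pair
class_labels = torch.randint(0, 10, (16,))

loss = cosine_loss(v, a, pair_target) \
     + 0.5 * (ce(aux_classifier(v), class_labels) + ce(aux_classifier(a), class_labels))
loss.backward()
```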
52
[Figure: best match retrieved for an audio feature]
Contrastive Learning (cosine similarity+class)
Amanda Duarte, Dídac Surís, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal
Embeddings for Video and Audio Retrieval." ECCV Women in Computer Vision Workshop 2018.
53
[Figure: best match retrieved between visual and audio features]
Contrastive Learning (cosine similarity+class)
Amanda Duarte, Dídac Surís, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal
Embeddings for Video and Audio Retrieval." ECCV Women in Computer Vision Workshop 2018.
54
#AVTS Korbar, Bruno, Du Tran, and Lorenzo Torresani. "Cooperative Learning of Audio and Video Models from
Self-Supervised Synchronization." NIPS 2018.
Contrastive Learning (pairwise L2 hinge loss)
Positive A/V pair Negative A/V pair
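The pairwise hinge (contrastive) loss on the L2 distance between audio and video embeddings can be written as below: positive (in-sync) pairs are pulled together, while negative pairs are only pushed apart until they exceed a margin. The margin value is a generic hyperparameter, not the paper's exact setting.

```python
import torch

def pairwise_hinge_loss(v, a, is_positive, margin=1.0):
    """v, a: (B, D) embeddings; is_positive: (B,) bool, True for in-sync A/V pairs."""
    dist = torch.norm(v - a, dim=1)                               # L2 distance per pair
    pos_term = dist.pow(2)                                        # pull positives together
    neg_term = torch.clamp(margin - dist, min=0).pow(2)           # push negatives beyond the margin
    return torch.where(is_positive, pos_term, neg_term).mean()

v = torch.randn(8, 128, requires_grad=True)
a = torch.randn(8, 128, requires_grad=True)
is_positive = torch.tensor([True, True, True, True, False, False, False, False])
loss = pairwise_hinge_loss(v, a, is_positive)
loss.backward()
```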
55
#AVTS Korbar, Bruno, Du Tran, and Lorenzo Torresani. "Cooperative Learning of Audio and Video Models from
Self-Supervised Synchronization." NIPS 2018.
Contrastive Learning (pairwise hinge loss)
56
#AVTS Korbar, Bruno, Du Tran, and Lorenzo Torresani. "Cooperative Learning of Audio and Video Models from
Self-Supervised Synchronization." NIPS 2018.
Contrastive Learning (pairwise hinge loss)
57
#AVID #CMA Morgado, Pedro, Nuno Vasconcelos, and Ishan Misra. "Audio-visual instance discrimination with cross-modal
agreement." arXiv preprint arXiv:2004.12943 (2020). [code]
Contrastive Learning (within-modal)
58
Outline
1. Motivation
2. Feature Learning
3. Cross-modal Translation
a. Sound to Vision
b. Vision to Sound
4. Embodied AI
59
[Figure: Encoder → Representation → Decoder]
60
Image hallucination from sound
Lyu, Jeonghyun, Takashi Shinozaki, and Kaoru Amano. "Generating Images from Sounds Using Multimodal Features and
GANs." (2018).
61
Image hallucination from sound
Chih Wen Lin, “Generating Images from Audio.” NeurIPS 2018 Creativity Workshop.
Conditional image generation based on StackGAN (stage I).
62
Image hallucination from sound
Wan, C. H., Chuang, S. P., & Lee, H. Y. (2019, May). Towards audio to scene image synthesis using generative adversarial
network. ICASSP 2019.
63
Video hallucination from sound
#Sound2Sight Cherian, Anoop, Moitreya Chatterjee, and Narendra Ahuja. "Sound2sight: Generating visual dynamics
from sound and context." ECCV 2020.
64
Video hallucination from sound
#Sound2Sight Cherian, Anoop, Moitreya Chatterjee, and Narendra Ahuja. "Sound2sight: Generating visual dynamics
from sound and context." ECCV 2020.
65
Avatar animation with music (skeletons)
Shlizerman, E., Dery, L., Schoen, H., & Kemelmacher-Shlizerman, I. (2018). Audio to body dynamics. CVPR 2018.
66
Shlizerman, E., Dery, L., Schoen, H., & Kemelmacher-Shlizerman, I. (2018). Audio to body dynamics. CVPR 2018.
67
[Figure: Encoder + Encoder → Representation → Decoder]
68
Sound Source Localization
Senocak, Arda, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. "Learning to Localize Sound Source in Visual
Scenes." CVPR 2018.
69
Senocak, Arda, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. "Learning to Localize Sound Source in Visual Scenes."
CVPR 2018.
70
Sound Source Localization
Arandjelović, Relja, and Andrew Zisserman. "Objects that Sound." ECCV 2018.
71
Arandjelović, Relja, and Andrew Zisserman. "Objects that Sound." ECCV 2018.
72
Depth Prediction by echoes
Gao, R., Chen, C., Al-Halah, Z., Schissler, C., & Grauman, K. (2020). VisualEchoes: Spatial Image Representation
Learning through Echolocation. ECCV 2020.
73
Gao, R., Chen, C., Al-Halah, Z., Schissler, C., & Grauman, K. VisualEchoes: Spatial Image Representation Learning
through Echolocation. ECCV 2020.
74
Outline
1. Motivation
2. Feature Learning
3. Cross-modal Translation
a. Sound to Vision
b. Vision to Sound
4. Embodied AI
75
[Figure: Encoder → Representation → Decoder]
76
Piano Transcription (MIDI)
AS Koepke, O Wiles, Y Moses, A Zisserman, “Sight to Sound: An End-to-End Approach for Visual Piano Transcription”. ICASSP 2020.
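The prediction target here is essentially multi-label: for each frame (or short clip) of the keyboard, which of the 88 piano keys are pressed. A minimal sketch of such a head with a binary cross-entropy loss (the encoder, crop size, and label sparsity are assumptions):

```python
import torch
import torch.nn as nn

NUM_KEYS = 88

# Toy frame encoder + per-frame multi-label head over the 88 piano keys.
frame_encoder = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                              nn.AdaptiveAvgPool2d(1), nn.Flatten())
key_head = nn.Linear(32, NUM_KEYS)
bce = nn.BCEWithLogitsLoss()

frames = torch.randn(4, 3, 128, 512)                  # crops of the keyboard region (size assumed)
pressed = (torch.rand(4, NUM_KEYS) < 0.05).float()    # ground-truth pressed keys (from MIDI)

loss = bce(key_head(frame_encoder(frames)), pressed)
loss.backward()
```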
77
AS Koepke, O Wiles, Y Moses, A Zisserman, “Sight to Sound: An End-to-End Approach for Visual Piano Transcription”. ICASSP 2020.
78
Silent Video Sonorization (MIDI)
#FoleyMusic Gan, Chuang, Deng Huang, Peihao Chen, Joshua B. Tenenbaum, and Antonio Torralba. "Foley Music:
Learning to Generate Music from Videos." ECCV 2020.
79
#FoleyMusic Gan, Chuang, Deng Huang, Peihao Chen, Joshua B. Tenenbaum, and Antonio Torralba. "Foley Music:
Learning to Generate Music from Videos." ECCV 2020.
80
[Figure: Encoder + Encoder → Representation → Decoder]
81
Sound Separation
Chuang Gan, Deng Huang, Hang Zhao, Joshua B. Tenenbaum, Antonio Torralba, “Music Gesture for Visual Sound
Separation” CVPR 2020.
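Most visual sound-separation systems share one core mechanism: conditioned on the visual features of one player, the network predicts a time-frequency mask that is applied to the mixture spectrogram to recover that player's audio. A hedged sketch follows; the mask network and the conditioning scheme are simplifications, not the paper's design.

```python
import torch
import torch.nn as nn

F_BINS, T_FRAMES, V_DIM = 256, 100, 128

# Predict a per-source mask over the mixture spectrogram, conditioned on that source's visual features.
mask_net = nn.Sequential(nn.Linear(F_BINS + V_DIM, 512), nn.ReLU(),
                         nn.Linear(512, F_BINS), nn.Sigmoid())

mixture = torch.rand(2, F_BINS, T_FRAMES)             # magnitude spectrogram of the mixed audio
visual = torch.randn(2, V_DIM)                        # visual features of the target player
target = torch.rand(2, F_BINS, T_FRAMES)              # ground-truth spectrogram of that player

# Condition every time frame on the visual feature, predict a [0, 1] mask per frequency bin.
v_expand = visual.unsqueeze(2).expand(-1, -1, T_FRAMES)            # (B, V_DIM, T)
inp = torch.cat([mixture, v_expand], dim=1).transpose(1, 2)        # (B, T, F_BINS + V_DIM)
mask = mask_net(inp).transpose(1, 2)                               # (B, F_BINS, T)

separated = mask * mixture
loss = nn.functional.l1_loss(separated, target)
loss.backward()
```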
82
Chuang Gan, Deng Huang, Hang Zhao, Joshua B. Tenenbaum, Antonio Torralba, “Music Gesture for Visual Sound
Separation” CVPR 2020.
83
[Figure: Encoder + Encoder → Representation → Decoder + Decoder]
84
Source Separation + Segmentation
Rouditchenko, A., Zhao, H., Gan, C., McDermott, J., & Torralba, A. Self-supervised audio-visual co-segmentation. ICASSP 2019.
85
Source Separation + Segmentation
Rouditchenko, A., Zhao, H., Gan, C., McDermott, J., & Torralba, A. Self-supervised audio-visual co-segmentation. ICASSP 2019.
86
Source Separation + Segmentation
Rouditchenko, A., Zhao, H., Gan, C., McDermott, J., & Torralba, A. Self-supervised audio-visual co-segmentation. ICASSP 2019.
87
Visual & Audio Generation (cycle)
#CMCGAN Hao, W., Zhang, Z., & Guan, H. CMCGAN: A uniform framework for cross-modal visual-audio mutual generation. AAAI 2018.
88
Outline
1. Motivation
2. Feature Learning
3. Cross-modal Translation
a. Sound to Vision
b. Vision to Sound
4. Embodied AI
89
Audio-visual Navigation with Deep RL
#SoundSpaces Chen, Changan, Unnat Jain, Carl Schissler, Sebastia Vicenc Amengual Gari, Ziad Al-Halah, Vamsi Krishna Ithapu, Philip Robinson, and Kristen Grauman. "SoundSpaces: Audio-visual navigation in 3D environments." ECCV 2020.
90
#SoundSpaces Chen, Changan, Unnat Jain, Carl Schissler, Sebastia Vicenc Amengual Gari, Ziad Al-Halah, Vamsi Krishna Ithapu, Philip Robinson, and Kristen Grauman. "SoundSpaces: Audio-visual navigation in 3D environments." ECCV 2020.
91
Outline
1. Motivation
2. Feature Learning
3. Cross-modal Translation
a. Sound to Vision
b. Vision to Sound
4. Embodied AI
92
Take home message
1. Motivation
2. Feature Learning
3. Cross-modal Translation
a. Sound to Vision
b. Vision to Sound
4. Embodied AI
93
Questions?

More Related Content

What's hot

Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019Universitat Politècnica de Catalunya
 
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC BarcelonaSelf-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC BarcelonaUniversitat Politècnica de Catalunya
 
Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019
Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019
Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019Universitat Politècnica de Catalunya
 
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019Universitat Politècnica de Catalunya
 
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...Universitat Politècnica de Catalunya
 
Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...
Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...
Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...Universitat Politècnica de Catalunya
 
Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)
Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)
Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)Universitat Politècnica de Catalunya
 
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...Universitat Politècnica de Catalunya
 
Learning with Videos (D4L4 2017 UPC Deep Learning for Computer Vision)
Learning with Videos  (D4L4 2017 UPC Deep Learning for Computer Vision)Learning with Videos  (D4L4 2017 UPC Deep Learning for Computer Vision)
Learning with Videos (D4L4 2017 UPC Deep Learning for Computer Vision)Universitat Politècnica de Catalunya
 
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)Universitat Politècnica de Catalunya
 

What's hot (20)

Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
 
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC BarcelonaSelf-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
 
Deep Learning from Videos (UPC 2018)
Deep Learning from Videos (UPC 2018)Deep Learning from Videos (UPC 2018)
Deep Learning from Videos (UPC 2018)
 
Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019
Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019
Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019
 
Deep Learning for Video: Action Recognition (UPC 2018)
Deep Learning for Video: Action Recognition (UPC 2018)Deep Learning for Video: Action Recognition (UPC 2018)
Deep Learning for Video: Action Recognition (UPC 2018)
 
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
 
Neural Architectures for Video Encoding
Neural Architectures for Video EncodingNeural Architectures for Video Encoding
Neural Architectures for Video Encoding
 
Multimodal Deep Learning
Multimodal Deep LearningMultimodal Deep Learning
Multimodal Deep Learning
 
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...
 
Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...
Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...
Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...
 
Deep Audio and Vision - Eva Mohedano - UPC Barcelona 2018
Deep Audio and Vision - Eva Mohedano - UPC Barcelona 2018Deep Audio and Vision - Eva Mohedano - UPC Barcelona 2018
Deep Audio and Vision - Eva Mohedano - UPC Barcelona 2018
 
Deep Learning for Video: Language (UPC 2018)
Deep Learning for Video: Language (UPC 2018)Deep Learning for Video: Language (UPC 2018)
Deep Learning for Video: Language (UPC 2018)
 
Video Analysis (D4L2 2017 UPC Deep Learning for Computer Vision)
Video Analysis (D4L2 2017 UPC Deep Learning for Computer Vision)Video Analysis (D4L2 2017 UPC Deep Learning for Computer Vision)
Video Analysis (D4L2 2017 UPC Deep Learning for Computer Vision)
 
Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)
Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)
Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)
 
Deep Language and Vision by Amaia Salvador (Insight DCU 2018)
Deep Language and Vision by Amaia Salvador (Insight DCU 2018)Deep Language and Vision by Amaia Salvador (Insight DCU 2018)
Deep Language and Vision by Amaia Salvador (Insight DCU 2018)
 
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...
 
Deep Learning for Video: Object Tracking (UPC 2018)
Deep Learning for Video: Object Tracking (UPC 2018)Deep Learning for Video: Object Tracking (UPC 2018)
Deep Learning for Video: Object Tracking (UPC 2018)
 
Learning with Videos (D4L4 2017 UPC Deep Learning for Computer Vision)
Learning with Videos  (D4L4 2017 UPC Deep Learning for Computer Vision)Learning with Videos  (D4L4 2017 UPC Deep Learning for Computer Vision)
Learning with Videos (D4L4 2017 UPC Deep Learning for Computer Vision)
 
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
 
Welcome (D1L1 2017 UPC Deep Learning for Computer Vision)
Welcome (D1L1 2017 UPC Deep Learning for Computer Vision)Welcome (D1L1 2017 UPC Deep Learning for Computer Vision)
Welcome (D1L1 2017 UPC Deep Learning for Computer Vision)
 

Similar to Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020

Deep Audio and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Language)
Deep Audio and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Language)Deep Audio and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Language)
Deep Audio and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Language)Universitat Politècnica de Catalunya
 
One Perceptron to Rule Them All (Re-Work Deep Learning Summit, London 2017)
One Perceptron to Rule Them All (Re-Work Deep Learning Summit, London 2017)One Perceptron to Rule Them All (Re-Work Deep Learning Summit, London 2017)
One Perceptron to Rule Them All (Re-Work Deep Learning Summit, London 2017)Universitat Politècnica de Catalunya
 
One Perceptron to Rule them All: Deep Learning for Multimedia #A2IC2018
One Perceptron  to Rule them All: Deep Learning for Multimedia #A2IC2018One Perceptron  to Rule them All: Deep Learning for Multimedia #A2IC2018
One Perceptron to Rule them All: Deep Learning for Multimedia #A2IC2018Universitat Politècnica de Catalunya
 
Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...
Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...
Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...Universitat Politècnica de Catalunya
 
Fudan-NJUST at MediaEval 2014: Violent Scenes Detection Using Deep Neural Net...
Fudan-NJUST at MediaEval 2014: Violent Scenes Detection Using Deep Neural Net...Fudan-NJUST at MediaEval 2014: Violent Scenes Detection Using Deep Neural Net...
Fudan-NJUST at MediaEval 2014: Violent Scenes Detection Using Deep Neural Net...multimediaeval
 
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...Universitat Politècnica de Catalunya
 
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)Universitat Politècnica de Catalunya
 
A Survey on Cross-Modal Embedding
A Survey on Cross-Modal EmbeddingA Survey on Cross-Modal Embedding
A Survey on Cross-Modal EmbeddingYasuhide Miura
 
lec_11_self_supervised_learning.pdf
lec_11_self_supervised_learning.pdflec_11_self_supervised_learning.pdf
lec_11_self_supervised_learning.pdfAlamgirAkash3
 
Multimedia Information Retrieval: Bytes and pixels meet the challenges of hum...
Multimedia Information Retrieval: Bytes and pixels meet the challenges of hum...Multimedia Information Retrieval: Bytes and pixels meet the challenges of hum...
Multimedia Information Retrieval: Bytes and pixels meet the challenges of hum...maranlar
 
Visual recognition of human communications
Visual recognition of human communicationsVisual recognition of human communications
Visual recognition of human communicationsNAVER Engineering
 
Predicting Media Memorability with Audio, Video, and Text representations
Predicting Media Memorability with Audio, Video, and Text representationsPredicting Media Memorability with Audio, Video, and Text representations
Predicting Media Memorability with Audio, Video, and Text representationsAlison Reboud
 
QoMEX2014 - Analysing the Quality of Experience of Multisensory Media from Me...
QoMEX2014 - Analysing the Quality of Experience of Multisensory Media from Me...QoMEX2014 - Analysing the Quality of Experience of Multisensory Media from Me...
QoMEX2014 - Analysing the Quality of Experience of Multisensory Media from Me...Jacob Donley
 
Deep Learning for Computer Vision (3/4): Video Analytics @ laSalle 2016
Deep Learning for Computer Vision (3/4): Video Analytics @ laSalle 2016Deep Learning for Computer Vision (3/4): Video Analytics @ laSalle 2016
Deep Learning for Computer Vision (3/4): Video Analytics @ laSalle 2016Universitat Politècnica de Catalunya
 

Similar to Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020 (20)

Deep Audio and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Language)
Deep Audio and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Language)Deep Audio and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Language)
Deep Audio and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Language)
 
Audio and Vision (D4L6 2017 UPC Deep Learning for Computer Vision)
Audio and Vision (D4L6 2017 UPC Deep Learning for Computer Vision)Audio and Vision (D4L6 2017 UPC Deep Learning for Computer Vision)
Audio and Vision (D4L6 2017 UPC Deep Learning for Computer Vision)
 
One Perceptron to Rule Them All (Re-Work Deep Learning Summit, London 2017)
One Perceptron to Rule Them All (Re-Work Deep Learning Summit, London 2017)One Perceptron to Rule Them All (Re-Work Deep Learning Summit, London 2017)
One Perceptron to Rule Them All (Re-Work Deep Learning Summit, London 2017)
 
Audio and Vision (D2L9 Insight@DCU Machine Learning Workshop 2017)
Audio and Vision (D2L9 Insight@DCU Machine Learning Workshop 2017)Audio and Vision (D2L9 Insight@DCU Machine Learning Workshop 2017)
Audio and Vision (D2L9 Insight@DCU Machine Learning Workshop 2017)
 
One Perceptron to Rule them All: Deep Learning for Multimedia #A2IC2018
One Perceptron  to Rule them All: Deep Learning for Multimedia #A2IC2018One Perceptron  to Rule them All: Deep Learning for Multimedia #A2IC2018
One Perceptron to Rule them All: Deep Learning for Multimedia #A2IC2018
 
Deep Learning Summit (DLS01-7)
Deep Learning Summit (DLS01-7)Deep Learning Summit (DLS01-7)
Deep Learning Summit (DLS01-7)
 
Once Perceptron to Rule Them all: Deep Learning for Multimedia
Once Perceptron to Rule Them all: Deep Learning for MultimediaOnce Perceptron to Rule Them all: Deep Learning for Multimedia
Once Perceptron to Rule Them all: Deep Learning for Multimedia
 
Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...
Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...
Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...
 
Deep Language and Vision - Xavier Giro-i-Nieto - UPC Barcelona 2018
Deep Language and Vision - Xavier Giro-i-Nieto - UPC Barcelona 2018Deep Language and Vision - Xavier Giro-i-Nieto - UPC Barcelona 2018
Deep Language and Vision - Xavier Giro-i-Nieto - UPC Barcelona 2018
 
Fudan-NJUST at MediaEval 2014: Violent Scenes Detection Using Deep Neural Net...
Fudan-NJUST at MediaEval 2014: Violent Scenes Detection Using Deep Neural Net...Fudan-NJUST at MediaEval 2014: Violent Scenes Detection Using Deep Neural Net...
Fudan-NJUST at MediaEval 2014: Violent Scenes Detection Using Deep Neural Net...
 
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
 
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
 
A Survey on Cross-Modal Embedding
A Survey on Cross-Modal EmbeddingA Survey on Cross-Modal Embedding
A Survey on Cross-Modal Embedding
 
lec_11_self_supervised_learning.pdf
lec_11_self_supervised_learning.pdflec_11_self_supervised_learning.pdf
lec_11_self_supervised_learning.pdf
 
Deep Learning for Computer Vision: Video Analytics (UPC 2016)
Deep Learning for Computer Vision: Video Analytics (UPC 2016)Deep Learning for Computer Vision: Video Analytics (UPC 2016)
Deep Learning for Computer Vision: Video Analytics (UPC 2016)
 
Multimedia Information Retrieval: Bytes and pixels meet the challenges of hum...
Multimedia Information Retrieval: Bytes and pixels meet the challenges of hum...Multimedia Information Retrieval: Bytes and pixels meet the challenges of hum...
Multimedia Information Retrieval: Bytes and pixels meet the challenges of hum...
 
Visual recognition of human communications
Visual recognition of human communicationsVisual recognition of human communications
Visual recognition of human communications
 
Predicting Media Memorability with Audio, Video, and Text representations
Predicting Media Memorability with Audio, Video, and Text representationsPredicting Media Memorability with Audio, Video, and Text representations
Predicting Media Memorability with Audio, Video, and Text representations
 
QoMEX2014 - Analysing the Quality of Experience of Multisensory Media from Me...
QoMEX2014 - Analysing the Quality of Experience of Multisensory Media from Me...QoMEX2014 - Analysing the Quality of Experience of Multisensory Media from Me...
QoMEX2014 - Analysing the Quality of Experience of Multisensory Media from Me...
 
Deep Learning for Computer Vision (3/4): Video Analytics @ laSalle 2016
Deep Learning for Computer Vision (3/4): Video Analytics @ laSalle 2016Deep Learning for Computer Vision (3/4): Video Analytics @ laSalle 2016
Deep Learning for Computer Vision (3/4): Video Analytics @ laSalle 2016
 

More from Universitat Politècnica de Catalunya

The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...Universitat Politècnica de Catalunya
 
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-NietoTowards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-NietoUniversitat Politècnica de Catalunya
 
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in VideosGeneration of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in VideosUniversitat Politècnica de Catalunya
 
Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Universitat Politècnica de Catalunya
 
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Universitat Politècnica de Catalunya
 
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Universitat Politècnica de Catalunya
 
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Universitat Politècnica de Catalunya
 
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Universitat Politècnica de Catalunya
 
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Universitat Politècnica de Catalunya
 
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020Universitat Politècnica de Catalunya
 
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...Universitat Politècnica de Catalunya
 
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...Universitat Politècnica de Catalunya
 

More from Universitat Politècnica de Catalunya (20)

Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Deep Generative Learning for All
Deep Generative Learning for AllDeep Generative Learning for All
Deep Generative Learning for All
 
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
 
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-NietoTowards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
 
The Transformer - Xavier Giró - UPC Barcelona 2021
The Transformer - Xavier Giró - UPC Barcelona 2021The Transformer - Xavier Giró - UPC Barcelona 2021
The Transformer - Xavier Giró - UPC Barcelona 2021
 
Open challenges in sign language translation and production
Open challenges in sign language translation and productionOpen challenges in sign language translation and production
Open challenges in sign language translation and production
 
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in VideosGeneration of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
 
Discovery and Learning of Navigation Goals from Pixels in Minecraft
Discovery and Learning of Navigation Goals from Pixels in MinecraftDiscovery and Learning of Navigation Goals from Pixels in Minecraft
Discovery and Learning of Navigation Goals from Pixels in Minecraft
 
Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...
 
Intepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural NetworksIntepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural Networks
 
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
 
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
 
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
 
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
 
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
 
Curriculum Learning for Recurrent Video Object Segmentation
Curriculum Learning for Recurrent Video Object SegmentationCurriculum Learning for Recurrent Video Object Segmentation
Curriculum Learning for Recurrent Video Object Segmentation
 
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
 
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
 
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
 
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
 

Recently uploaded

20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxAleenaJamil4
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 

Recently uploaded (20)

20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptx
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 

Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020

  • 1. Xavier Giro-i-Nieto Associate Professor Universitat Politecnica de Catalunya @DocXavi xavier.giro@upc.edu Self-Supervised Audio-Visual Learning Lecture 16 [course site]
  • 4. 4 Outline 1. Motivation 2. Feature Learning 3. Cross-modal Translation 4. Embodied AI
  • 6. 6 Vision Audio Video Synchronization among modalities captured by video is exploited in a self-supervised manner. Self-supervised Learning
  • 8. 8 Outline 1. Motivation 2. Feature Learning 3. Cross-modal Translation
  • 9. 9 Self-supervised Feature Learning Reference: Andrew Zisserman (PAISS 2018) Self-supervised feature learning is a form of unsupervised learning where the raw data provides the supervision. ● A pretext (or surrogate) task must be designed. ● By defining a proxy loss, the NN learns representations, which should be valuable for the actual downstream task. Unlabeled data (X) Representations learned without labels
  • 10. 10 Self-supervised Feature Learning Source: Ankesh Anand, “Contrastive Self-Supervised Learning” (2020)”Source: Ankesh Anand, “Contrastive Self-Supervised Learning” (2020)”
  • 11. 11 Outline 1. Motivation 2. Feature Learning a. Generative / Predictive Methods b. Contrastive Methods 3. Cross-modal Translation 4. Embodied AI
  • 12. 12 Self-supervised Feature Learning Source: Ankesh Anand, “Contrastive Self-Supervised Learning” (2020)”Source: Ankesh Anand, “Contrastive Self-Supervised Learning” (2020)”
  • 14. 14 Prediction of Audio Features (stats) Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound provides supervision for visual learning." ECCV 2016 Based on the assumption that ambient sound in video is related to the visual semantics.
  • 15. 15 Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound provides supervision for visual learning." ECCV 2016 Pretext task: Use videos to train a CNN that predicts the audio statistics of a frame. Prediction of Audio Features (stats)
  • 16. 16 Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound provides supervision for visual learning." ECCV 2016 Downstream Task: Use the predicted audio stats to clusters images. Audio clusters built with K-means algorithm over the training set Cluster assignments at test time (one row=one cluster) Prediction of Audio Features (stats)
  • 17. 17 Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound provides supervision for visual learning." ECCV 2016 Although the CNN was not trained with class labels, local units with semantic meaning emerge. Prediction of Audio Features (stats)
  • 18. 18 Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016. Pretext task: Predict the cochleagram given a video frame. Prediction of Audio Features (cochleagram)
  • 19. 19 Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016. Downstream task: Retrieve matching sounds for videos of people hitting objects with a drumstick. Prediction of Audio Features (cochleagram)
  • 20. 20 Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016. The Greatest Hits Dataset Prediction of Audio Features (cochleagram)
  • 21. 21 Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016. Audio Clip Retrieval Prediction of Audio Features (cochleagram)
  • 22. 22 Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016.
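As a rough illustration of the retrieval setup described above, the sketch below regresses a toy-sized cochleagram from a single frame and retrieves the nearest real recording by L2 distance. The network, the sizes, and the random data are placeholders, not the authors' model.

```python
import torch
import torch.nn as nn

# Hypothetical: regress a (time x frequency) cochleagram from a frame,
# then retrieve the closest real recording by L2 distance on cochleagrams.
T, Fq = 30, 42
frame_encoder = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, T * Fq),
)

frame = torch.randn(1, 3, 128, 128)
predicted = frame_encoder(frame).view(1, T * Fq)

sound_bank = torch.randn(500, T * Fq)           # flattened cochleagrams of real recordings
distances = torch.cdist(predicted, sound_bank)  # (1, 500) pairwise L2 distances
best_match = distances.argmin(dim=1)            # index of the retrieved sound
```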
  • 24. 24 #SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Teacher network: Visual Recognition (objects & scenes) Prediction of Image Labels (distillation)
  • 25. 25 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Student network: Learn audio features for environmental sound recognition. Prediction of Image Labels (distillation)
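A minimal sketch of the SoundNet-style distillation objective, assuming toy sizes and a random stand-in for the frozen visual teacher: a small 1D-conv audio student is trained to match the teacher's class posteriors with a KL-divergence loss. The number of classes and the architecture are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical student: a small 1D-conv network over the raw waveform that
# mimics the class posteriors of a (frozen, pretrained) visual teacher.
num_classes = 401                                  # illustrative value
student = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=64, stride=8), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, num_classes),
)

waveform = torch.randn(8, 1, 22050)                # ~1 s of raw audio per clip (toy)
with torch.no_grad():
    teacher_probs = torch.softmax(torch.randn(8, num_classes), dim=1)  # stand-in for teacher output

student_logp = F.log_softmax(student(waveform), dim=1)
loss = F.kl_div(student_logp, teacher_probs, reduction="batchmean")    # distillation loss
loss.backward()
```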
  • 26. 26 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016.
  • 27. 27 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Learned audio features are good for environmental sound recognition. Prediction of Image Labels (distillation)
  • 28. 28 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Visualization of the 1D filters over raw audio in conv1. Prediction of Image Labels (distillation)
  • 29. 29 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Visualization of the 1D filters over raw audio in conv1. Prediction of Image Labels (distillation)
  • 30. 30 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Visualize video frames that most strongly activate a neuron in a late layer (conv7) Prediction of Image Labels (distillation)
  • 31. 31 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Visualize video frames that most strongly activate a neuron in a late layer (conv7) Prediction of Image Labels (distillation)
  • 32. 32 Acoustic images are aligned in space and synchronized in time during learning. Prediction of Acoustic Images (distillation) Perez, A., Sanguineti, V., Morerio, P., & Murino, V. Audio-visual model distillation using acoustic images. WACV 2020. [code] Acoustic-optical camera
  • 33. 33 Perez, A., Sanguineti, V., Morerio, P., & Murino, V. Audio-visual model distillation using acoustic images. WACV 2020. [code] Teacher Network (RGB & Acoustic Images) Student Network (Audio)
  • 35. 35 Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017. Binary Verification
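A hedged sketch of the binary verification setup: a vision subnetwork and an audio subnetwork produce embeddings that are concatenated and classified as corresponding (same video) or not. All layer sizes and the toy inputs are illustrative, not the L3-Net architecture.

```python
import torch
import torch.nn as nn

# Does this frame and this audio clip come from the same video? (sizes illustrative)
vision_net = nn.Sequential(nn.Conv2d(3, 16, 3, 2, 1), nn.ReLU(),
                           nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 128))
audio_net = nn.Sequential(nn.Conv2d(1, 16, 3, 2, 1), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 128))
fusion = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1))

frames = torch.randn(8, 3, 224, 224)
spectrograms = torch.randn(8, 1, 257, 100)
labels = torch.randint(0, 2, (8, 1)).float()   # 1 = corresponding pair, 0 = mismatched

logits = fusion(torch.cat([vision_net(frames), audio_net(spectrograms)], dim=1))
loss = nn.BCEWithLogitsLoss()(logits, labels)
loss.backward()
```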
  • 36. 36 Each mini-column shows five images that most activate a particular unit of the 512 in pool4 of the vision subnetwork, and the corresponding heatmap layer. Binary Verification Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
  • 37. 37 Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017. Visual features used to train a linear classifier on ImageNet. Contrastive Learning (verification)
  • 38. 38 Each mini-column shows sounds that most activate a particular unit of the 512 in pool4 of the audio subnetwork. Binary Verification Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
  • 39. 39 Audio clips containing five concatenated 1 s samples that most strongly activate a particular unit of the 512 in pool4 of the audio subnetwork. Binary Verification Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
  • 40. 40 Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017. Audio features achieve state-of-the-art performance. Contrastive Learning (verification)
  • 41. 41 Owens, A., & Efros, A. A. Audio-visual scene analysis with self-supervised multisensory features. ECCV 2018. Binary Verification
  • 42. 42 Yang, Karren, Bryan Russell, and Justin Salamon. "Telling Left From Right: Learning Spatial Correspondence of Sight and Sound." CVPR 2020. [tweet] Flipped right & left audio channels Binary Verification
  • 43. 43 #XDC Alwassel, H., Mahajan, D., Torresani, L., Ghanem, B., & Tran, D. Self-Supervised Learning by Cross-Modal Audio-Video Clustering. NeurIPS 2020 Binary verification (clustering) Iterative training of: ● K-means clustering to improve pseudo-labels. ● Backprop training of the visual (Ev) & audio (Ea) encoders.
  • 44. 44 #XDC Alwassel, H., Mahajan, D., Torresani, L., Ghanem, B., & Tran, D. Self-Supervised Learning by Cross-Modal Audio-Video Clustering. NeurIPS 2020 Binary verification (clustering)
  • 45. 45 #XDC Alwassel, H., Mahajan, D., Torresani, L., Ghanem, B., & Tran, D. Self-Supervised Learning by Cross-Modal Audio-Video Clustering. NeurIPS 2020 Binary verification (clustering) Downstream task: Video action recognition. Self-supervised models pretrained with XDC outperform supervised ones.
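A hedged sketch of one half-step of the alternating XDC-style scheme: k-means pseudo-labels computed on one modality's embeddings supervise the other modality's encoder, and the roles then swap. The encoders, sizes, and number of clusters are illustrative assumptions, not the paper's models.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

K = 16                                                            # illustrative cluster count
E_v = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))    # visual encoder (toy)
E_a = nn.Sequential(nn.Flatten(), nn.Linear(1 * 80 * 100, 128))   # audio encoder (toy)
classifier_v = nn.Linear(128, K)

videos = torch.randn(256, 3, 64, 64)
audios = torch.randn(256, 1, 80, 100)

# 1) Cluster audio embeddings to produce pseudo-labels.
with torch.no_grad():
    pseudo = KMeans(n_clusters=K, n_init=10).fit_predict(E_a(audios).numpy())
pseudo = torch.as_tensor(pseudo, dtype=torch.long)

# 2) Train the visual encoder to predict the audio-derived pseudo-labels.
loss = nn.CrossEntropyLoss()(classifier_v(E_v(videos)), pseudo)
loss.backward()
# The symmetric step (visual clusters supervising E_a) and re-clustering alternate.
```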
  • 46. 46 Outline 1. Motivation 2. Feature Learning a. Generative / Predictive Methods b. Contrastive Methods 3. Cross-modal Translation 4. Embodied AI
  • 47. 47 Self-supervised Feature Learning Source: Ankesh Anand, “Contrastive Self-Supervised Learning” (2020)
  • 49. 49 Contrastive Learning Source: Raul Gómez, “Understanding Ranking Loss, Contrastive Loss, Margin Loss, Triplet Loss, Hinge Loss and all those confusing names” (2019)
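For reference, here is a self-contained version of the classic margin-based contrastive (pair) loss of the kind surveyed in that post; the margin value and the embedding sizes are arbitrary choices for illustration.

```python
import torch
import torch.nn.functional as F

def contrastive_pair_loss(x, y, same, margin=1.0):
    """Margin-based contrastive loss for a batch of embedding pairs.

    same = 1 pulls the pair together; same = 0 pushes it at least `margin` apart.
    """
    d = F.pairwise_distance(x, y)                 # Euclidean distance per pair
    pos = same * d.pow(2)
    neg = (1 - same) * F.relu(margin - d).pow(2)
    return (pos + neg).mean()

x, y = torch.randn(8, 128), torch.randn(8, 128)
same = torch.randint(0, 2, (8,)).float()
loss = contrastive_pair_loss(x, y, same)
```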
  • 50. 50 Contrastive Learning (cross-modal) #AVID Morgado, Pedro, Nuno Vasconcelos, and Ishan Misra. "Audio-visual instance discrimination with cross-modal agreement." arXiv preprint arXiv:2004.12943 (2020). [code]
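A simplified sketch of cross-modal instance discrimination with an InfoNCE-style loss and in-batch negatives; AVID itself draws negatives from a memory bank, so treat this as an approximation of the idea rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def cross_modal_nce(video_emb, audio_emb, temperature=0.07):
    """Each video embedding should match its own audio embedding, and vice versa;
    every other item in the batch acts as a negative."""
    v = F.normalize(video_emb, dim=1)
    a = F.normalize(audio_emb, dim=1)
    logits = v @ a.t() / temperature            # (B, B) cosine similarities
    targets = torch.arange(v.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = cross_modal_nce(torch.randn(32, 128), torch.randn(32, 128))
```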
  • 51. 51 Contrastive Learning (cosine similarity+class) Amanda Duarte, Dídac Surís, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for Video and Audio Retrieval." ECCV Women in Computer Vision Workshop 2018.
  • 52. 52 Best match Audio feature Contrastive Learning (cosine similarity+class) Amanda Duarte, Dídac Surís, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for Video and Audio Retrieval." ECCV Women in Computer Vision Workshop 2018.
  • 53. 53 Best match Visual feature Audio feature Contrastive Learning (cosine similarity+class) Amanda Duarte, Dídac Surís, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for Video and Audio Retrieval." ECCV Women in Computer Vision Workshop 2018.
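A hedged sketch of a cosine-similarity-plus-classification objective of the kind named in the slide title: a cosine embedding loss aligns the audio and visual projections while auxiliary classifiers predict a class from each. The projection sizes, the class count, and the loss weighting are assumptions, not the paper's exact formulation; at test time, retrieval ranks candidates by cosine similarity in the shared space.

```python
import torch
import torch.nn as nn

num_classes = 10                                # illustrative
audio_proj = nn.Linear(512, 128)
video_proj = nn.Linear(512, 128)
classifier = nn.Linear(128, num_classes)

audio_feat = torch.randn(16, 512)
video_feat = torch.randn(16, 512)
same_video = torch.ones(16)                     # +1 for matching pairs, -1 for mismatched ones
class_labels = torch.randint(0, num_classes, (16,))

a, v = audio_proj(audio_feat), video_proj(video_feat)
align_loss = nn.CosineEmbeddingLoss()(a, v, same_video)
cls_loss = nn.CrossEntropyLoss()(classifier(v), class_labels) + \
           nn.CrossEntropyLoss()(classifier(a), class_labels)
loss = align_loss + 0.1 * cls_loss              # weighting is an assumption
loss.backward()
```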
  • 54. 54 #AVTS Korbar, Bruno, Du Tran, and Lorenzo Torresani. "Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization." NIPS 2018. Contrastive Learning (pairwise L2 hinge loss) Positive A/V pair Negative A/V pair
  • 55. 55 #AVTS Korbar, Bruno, Du Tran, and Lorenzo Torresani. "Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization." NIPS 2018. Contrastive Learning (pairwise hinge loss)
  • 56. 56 #AVTS Korbar, Bruno, Du Tran, and Lorenzo Torresani. "Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization." NIPS 2018. Contrastive Learning (pairwise hinge loss)
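A hedged sketch of a synchronization-based pairwise hinge objective in the spirit of AVTS: in-sync audio/video pairs are pulled together and out-of-sync ones are pushed beyond a margin ("hard" negatives come from the same video, shifted in time). The margin, the sizes, and the random embeddings are placeholders.

```python
import torch
import torch.nn.functional as F

def avts_like_loss(video_emb, audio_emb_sync, audio_emb_shifted, margin=0.99):
    d_pos = F.pairwise_distance(video_emb, audio_emb_sync)      # in-sync pairs: small distance
    d_neg = F.pairwise_distance(video_emb, audio_emb_shifted)   # out-of-sync pairs: large distance
    return (d_pos.pow(2) + F.relu(margin - d_neg).pow(2)).mean()

v = torch.randn(8, 128)        # video-clip embeddings
a_sync = torch.randn(8, 128)   # audio from the same time window (positives)
a_shift = torch.randn(8, 128)  # audio from another time window / video (negatives)
loss = avts_like_loss(v, a_sync, a_shift)
```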
  • 57. 57 #AVID #CMA Morgado, Pedro, Nuno Vasconcelos, and Ishan Misra. "Audio-visual instance discrimination with cross-modal agreement." arXiv preprint arXiv:2004.12943 (2020). [code] Contrastive Learning (within-modal)
  • 58. 58 Outline 1. Motivation 2. Feature Learning 3. Cross-modal Translation a. Sound to Vision b. Vision to Sound 4. Embodied AI
  • 60. 60 Image hallucination from sound Lyu, Jeonghyun, Takashi Shinozaki, and Kaoru Amano. "Generating Images from Sounds Using Multimodal Features and GANs." (2018).
  • 61. 61 Image hallucination from sound Chih Wen Lin, “Generating Images from Audio,” NeurIPS 2018 Creativity Workshop. Conditional image generation based on StackGAN (stage I).
  • 62. 62 Image hallucination from sound Wan, C. H., Chuang, S. P., & Lee, H. Y. (2019, May). Towards audio to scene image synthesis using generative adversarial network. ICASSP 2019.
  • 63. 63 Video hallucination from sound #Sound2Sight Cherian, Anoop, Moitreya Chatterjee, and Narendra Ahuja. "Sound2sight: Generating visual dynamics from sound and context." ECCV 2020.
  • 64. 64 Video hallucination from sound #Sound2Sight Cherian, Anoop, Moitreya Chatterjee, and Narendra Ahuja. "Sound2sight: Generating visual dynamics from sound and context." ECCV 2020.
  • 65. 65 Avatar animation with music (skeletons) Shlizerman, E., Dery, L., Schoen, H., & Kemelmacher-Shlizerman, I. (2018). Audio to body dynamics. CVPR 2018.
  • 66. 66 Shlizerman, E., Dery, L., Schoen, H., & Kemelmacher-Shlizerman, I. (2018). Audio to body dynamics. CVPR 2018.
  • 68. 68 Sound Source Localization Senocak, Arda, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. "Learning to Localize Sound Source in Visual Scenes." CVPR 2018.
  • 69. 69 Senocak, Arda, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. "Learning to Localize Sound Source in Visual Scenes." CVPR 2018.
  • 70. 70 Sound Source Localization Arandjelović, Relja, and Andrew Zisserman. "Objects that Sound." ECCV 2018.
  • 71. 71 Arandjelović, Relja, and Andrew Zisserman. "Objects that Sound." ECCV 2018.
  • 72. 72 Depth Prediction by echoes Gao, R., Chen, C., Al-Halah, Z., Schissler, C., & Grauman, K. (2020). VisualEchoes: Spatial Image Representation Learning through Echolocation. ECCV 2020.
  • 73. 73 Gao, R., Chen, C., Al-Halah, Z., Schissler, C., & Grauman, K. VisualEchoes: Spatial Image Representation Learning through Echolocation. ECCV 2020.
  • 74. 74 Outline 1. Motivation 2. Feature Learning 3. Cross-modal Translation a. Sound to Vision b. Vision to Sound 4. Embodied AI
  • 76. 76 Piano Transcription (MIDI) AS Koepke, O Wiles, Y Moses, A Zisserman, “Sight to Sound: An End-to-End Approach for Visual Piano Transcription”. ICASSP 2020.
  • 77. 77 AS Koepke, O Wiles, Y Moses, A Zisserman, “Sight to Sound: An End-to-End Approach for Visual Piano Transcription”. ICASSP 2020.
  • 78. 78 Silent Video Sonorization (MIDI) #FoleyMusic Gan, Chuang, Deng Huang, Peihao Chen, Joshua B. Tenenbaum, and Antonio Torralba. "Foley Music: Learning to Generate Music from Videos." ECCV 2020.
  • 79. 79 #FoleyMusic Gan, Chuang, Deng Huang, Peihao Chen, Joshua B. Tenenbaum, and Antonio Torralba. "Foley Music: Learning to Generate Music from Videos." ECCV 2020.
  • 81. 81 Sound Separation Chuang Gan, Deng Huang, Hang Zhao, Joshua B. Tenenbaum, Antonio Torralba, “Music Gesture for Visual Sound Separation” CVPR 2020.
  • 82. 82 Chuang Gan, Deng Huang, Hang Zhao, Joshua B. Tenenbaum, Antonio Torralba, “Music Gesture for Visual Sound Separation” CVPR 2020.
  • 84. 84 Source Separation + Segmentation Rouditchenko, A., Zhao, H., Gan, C., McDermott, J., & Torralba, A. Self-supervised audio-visual co-segmentation. ICASSP 2019.
  • 85. 85 Source Separation + Segmentation Rouditchenko, A., Zhao, H., Gan, C., McDermott, J., & Torralba, A. Self-supervised audio-visual co-segmentation. ICASSP 2019.
  • 86. 86 Source Separation + Segmentation Rouditchenko, A., Zhao, H., Gan, C., McDermott, J., & Torralba, A. Self-supervised audio-visual co-segmentation. ICASSP 2019.
  • 87. 87 Visual & Audio Generation (cycle) #CMCGAN Hao, W., Zhang, Z., & Guan, H. CMCGAN: A uniform framework for cross-modal visual-audio mutual generation. AAAI 2018.
  • 88. 88 Outline 1. Motivation 2. Feature Learning 3. Cross-modal Translation a. Sound to Vision b. Vision to Sound 4. Embodied AI
  • 89. 89 Audio-visual Navigation with Deep RL #SoundSpaces Chen, Changan, Unnat Jain, Carl Schissler, Sebastia Vicenc Amengual Gari, Ziad Al-Halah, Vamsi Krishna Ithapu, Philip Robinson, and Kristen Grauman. "SoundSpaces: Audio-visual navigation in 3d environments." ECCV 2020.
  • 90. 90 #SoundSpaces Chen, Changan, Unnat Jain, Carl Schissler, Sebastia Vicenc Amengual Gari, Ziad Al-Halah, Vamsi Krishna Ithapu, Philip Robinson, and Kristen Grauman. "SoundSpaces: Audio-visual navigation in 3d environments." ECCV 2020.
  • 91. 91 Outline 1. Motivation 2. Feature Learning 3. Cross-modal Translation a. Sound to Vision b. Vision to Sound 4. Embodied AI
  • 92. 92 Take home message 1. Motivation 2. Feature Learning 3. Cross-modal Translation a. Sound to Vision b. Vision to Sound 4. Embodied AI