Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks that were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks, and Q-networks for reinforcement learning have shaped a brand new scenario in signal processing. This course covers the basic principles of deep learning from both algorithmic and computational perspectives.
Self-supervised Feature Learning
Reference: Andrew Zisserman (PAISS 2018)
Self-supervised feature learning is a form of unsupervised learning where the raw data provides the supervision.
● A pretext (or surrogate) task must be designed.
● By defining a proxy loss, the NN learns representations that should be useful for the actual downstream task (a minimal training sketch follows below).
[Figure: unlabeled data X → representations learned without labels]
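For concreteness, a minimal PyTorch sketch of this recipe (all names and the rotation pretext task here are hypothetical stand-ins, not the method of any particular slide): pseudo-labels come from the raw data itself, a proxy loss trains the encoder, and the pretext head is discarded afterwards.

```python
import torch
import torch.nn as nn

# Hypothetical encoder and pretext head; in practice a CNN backbone + small head.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
pretext_head = nn.Linear(128, 4)  # e.g. predict one of 4 image rotations
opt = torch.optim.Adam(list(encoder.parameters()) + list(pretext_head.parameters()))

def pretext_batch(x):
    """Build (input, pseudo-label) pairs from unlabeled images by rotating them."""
    k = torch.randint(0, 4, (x.size(0),))  # the pseudo-label comes from the data itself
    x = torch.stack([torch.rot90(xi, int(ki), (1, 2)) for xi, ki in zip(x, k)])
    return x, k

for batch in torch.rand(10, 8, 3, 32, 32):  # stand-in for an unlabeled data loader
    inputs, pseudo = pretext_batch(batch)
    loss = nn.functional.cross_entropy(pretext_head(encoder(inputs)), pseudo)
    opt.zero_grad(); loss.backward(); opt.step()
# After pretraining, `encoder` is reused downstream; `pretext_head` is thrown away.
```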
Prediction of Audio Features (stats)
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound provides supervision for visual learning." ECCV 2016.
Based on the assumption that the ambient sound in a video is related to the visual semantics.
Pretext task: Use videos to train a CNN that predicts the audio statistics of
a frame.
Downstream task: Use the predicted audio statistics to cluster images. Audio clusters are built with the K-means algorithm over the training set; a minimal sketch follows below.
[Figure: cluster assignments at test time (one row = one cluster)]
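A minimal sketch of this pipeline, assuming precomputed audio-statistic vectors per frame (scikit-learn's KMeans and a tiny CNN are stand-ins for the paper's actual components): K-means turns the audio statistics into pseudo-labels, and the image CNN is trained to predict them from the frame alone.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

audio_stats = torch.randn(1000, 64).numpy()    # stand-in ambient-sound statistics, one per frame
kmeans = KMeans(n_clusters=30, n_init=10).fit(audio_stats)
pseudo_labels = torch.as_tensor(kmeans.labels_, dtype=torch.long)

cnn = nn.Sequential(                           # stand-in image CNN
    nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 30))
opt = torch.optim.Adam(cnn.parameters())

frames = torch.rand(1000, 3, 64, 64)           # stand-in video frames
for i in range(0, 1000, 32):                   # train the CNN to predict the audio cluster
    loss = nn.functional.cross_entropy(cnn(frames[i:i+32]), pseudo_labels[i:i+32])
    opt.zero_grad(); loss.backward(); opt.step()
```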
Although the CNN was not trained with class labels, local units with semantic
meaning emerge.
Prediction of Audio Features (cochleagram)
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016.
Pretext task: Predict the cochleagram given a video frame.
Downstream task: Retrieve matching sounds for videos of people hitting objects
with a drumstick.
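The retrieval step can be sketched as a nearest-neighbour lookup (assuming the trained network maps a query frame to a predicted audio feature vector; all names and sizes below are hypothetical):

```python
import torch

def retrieve_sound(pred_feat, bank_feats, bank_clips, k=1):
    """Return the k training clips whose audio features are closest (L2) to the prediction."""
    dists = torch.cdist(pred_feat.unsqueeze(0), bank_feats).squeeze(0)  # (N,)
    idx = dists.topk(k, largest=False).indices
    return [bank_clips[int(i)] for i in idx]

bank_feats = torch.randn(500, 128)                  # audio features of training clips (stand-in)
bank_clips = [f"clip_{i}.wav" for i in range(500)]  # hypothetical file names
pred_feat = torch.randn(128)                        # network output for a query frame
print(retrieve_sound(pred_feat, bank_feats, bank_clips, k=3))
```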
The Greatest Hits Dataset
[Figure: audio clip retrieval]
Prediction of Image Labels (distillation)
#SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.
Teacher network: visual recognition (objects & scenes).
Student network: Learn audio features for environmental sound recognition.
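The transfer can be sketched as follows, assuming a frozen visual teacher whose class posterior supervises a 1D-convolutional audio student through a KL-divergence loss (shapes and modules below are simplified stand-ins):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Linear(2048, 1000)                 # stand-in for a frozen ImageNet/Places CNN head
student = nn.Sequential(                        # stand-in 1D CNN over raw waveforms
    nn.Conv1d(1, 16, 64, stride=8), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, 1000))
opt = torch.optim.Adam(student.parameters())

frame_feats = torch.randn(8, 2048)              # visual features of video frames (stand-in)
waveforms = torch.randn(8, 1, 22050)            # raw audio of the same videos (stand-in)
with torch.no_grad():
    target = F.softmax(teacher(frame_feats), dim=1)        # teacher posterior over classes
log_pred = F.log_softmax(student(waveforms), dim=1)
loss = F.kl_div(log_pred, target, reduction="batchmean")   # student mimics the teacher
opt.zero_grad(); loss.backward(); opt.step()
```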
The learned audio features transfer well to environmental sound recognition.
Visualization of the 1D filters over raw audio in conv1.
Visualization of the video frames that most activate a neuron in a late layer (conv7).
Prediction of Acoustic Images (distillation)
Perez, A., Sanguineti, V., Morerio, P., & Murino, V. "Audio-visual model distillation using acoustic images." WACV 2020. [code]
Acoustic images, captured with an acoustic-optical camera, are aligned in space and synchronized in time during learning.
[Figure: a teacher network (RGB & acoustic images) distills into a student network (audio).]
Binary Verification
Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
Each mini-column shows the five images that most activate a particular unit of the 512 in pool4 of the vision subnetwork, together with the corresponding heatmap.
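A minimal sketch of the underlying audio-visual correspondence objective (subnetworks drastically simplified; negatives are formed by pairing each frame with audio from a different video in the batch):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vision_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))   # stand-in subnetworks
audio_net = nn.Sequential(nn.Flatten(), nn.Linear(1 * 257 * 100, 128))
fusion = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 1))
params = [*vision_net.parameters(), *audio_net.parameters(), *fusion.parameters()]
opt = torch.optim.Adam(params)

frames = torch.rand(8, 3, 64, 64)               # frames and their true audio (stand-ins)
specs = torch.rand(8, 1, 257, 100)              # log-spectrogram snippets
neg_specs = specs.roll(1, dims=0)               # mismatched audio -> negative pairs

v = vision_net(frames)
logit_pos = fusion(torch.cat([v, audio_net(specs)], dim=1))
logit_neg = fusion(torch.cat([v, audio_net(neg_specs)], dim=1))
logits = torch.cat([logit_pos, logit_neg]).squeeze(1)
labels = torch.cat([torch.ones(8), torch.zeros(8)])   # 1 = corresponding pair
loss = F.binary_cross_entropy_with_logits(logits, labels)
opt.zero_grad(); loss.backward(); opt.step()
```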
The learned visual features are used to train a linear classifier on ImageNet.
Each mini-column shows sounds that most activate a particular unit of
the 512 in pool4 of the audio subnetwork.
Audio clips containing the five concatenated 1 s samples that most activate a particular unit of the 512 in pool4 of the audio subnetwork.
The learned audio features achieve state-of-the-art performance.
Binary Verification
Owens, A., & Efros, A. A. "Audio-visual scene analysis with self-supervised multisensory features." ECCV 2018.
Yang, Karren, Bryan Russell, and Justin Salamon. "Telling Left From Right: Learning Spatial Correspondence of Sight and Sound." CVPR 2020. [tweet]
Pretext task: classify whether the right and left audio channels have been flipped.
Binary Verification (clustering)
#XDC Alwassel, H., Mahajan, D., Torresani, L., Ghanem, B., & Tran, D. "Self-Supervised Learning by Cross-Modal Audio-Video Clustering." NeurIPS 2020.
Iterative training of:
● K-means clustering to improve pseudo-labels.
● Backprop training of the visual (E_v) and audio (E_a) encoders.
A sketch of one such iteration follows below.
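The sketch is simplified to a single direction (audio clusters supervising the visual encoder; all modules and feature shapes are stand-ins):

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

E_v = nn.Linear(512, 128)                      # stand-in visual encoder
E_a = nn.Linear(128, 128)                      # stand-in audio encoder
head_v = nn.Linear(128, 64)                    # predicts audio-derived pseudo-labels
opt = torch.optim.Adam([*E_v.parameters(), *head_v.parameters()])

vis_feats = torch.randn(1000, 512)             # clip features (stand-ins)
aud_feats = torch.randn(1000, 128)

for it in range(3):                            # alternate clustering and backprop
    with torch.no_grad():                      # 1) cluster the audio embeddings
        z_a = E_a(aud_feats).numpy()
    labels = KMeans(n_clusters=64, n_init=10).fit(z_a).labels_
    pseudo = torch.as_tensor(labels, dtype=torch.long)
    for i in range(0, 1000, 64):               # 2) train the visual encoder on audio pseudo-labels
        loss = nn.functional.cross_entropy(head_v(E_v(vis_feats[i:i+64])), pseudo[i:i+64])
        opt.zero_grad(); loss.backward(); opt.step()
    # (symmetrically, visual clusters would supervise the audio encoder E_a)
```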
Downstream task: Video action recognition
Self-supervised models pretrained with XDC outperform supervised ones.
Outline
1. Motivation
2. Feature Learning
a. Generative / Predictive Methods
b. Contrastive Methods
3. Cross-modal Translation
4. Embodied AI
Contrastive Learning
Source: Raul Gómez, "Understanding Ranking Loss, Contrastive Loss, Margin Loss, Triplet Loss, Hinge Loss and all those confusing names" (2019).
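The losses surveyed in that post share one template. For instance, a minimal pairwise (margin) contrastive loss pulls matching embeddings together and pushes non-matching ones beyond a margin:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(x, y, match, margin=1.0):
    """x, y: (B, D) embeddings; match: (B,) with 1 for positive pairs, 0 for negative."""
    d = F.pairwise_distance(x, y)                    # Euclidean distance per pair
    pos = match * d.pow(2)                           # pull positives together
    neg = (1 - match) * F.relu(margin - d).pow(2)    # push negatives beyond the margin
    return (pos + neg).mean()

x, y = torch.randn(4, 16), torch.randn(4, 16)
match = torch.tensor([1.0, 1.0, 0.0, 0.0])
print(contrastive_loss(x, y, match))
```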
Contrastive Learning (cosine similarity + class)
Amanda Duarte, Dídac Surís, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for Video and Audio Retrieval." ECCV Women in Computer Vision Workshop 2018.
[Figures: cross-modal retrieval of the best-matching audio feature for a visual feature, and vice versa.]
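A sketch of this kind of joint embedding, assuming (as the "(cosine similarity + class)" label suggests) a cosine-similarity loss between paired audio/visual embeddings plus an auxiliary classification head; dimensions, heads, and the loss weighting are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

f_v = nn.Linear(512, 128)                      # stand-in visual projection
f_a = nn.Linear(128, 128)                      # stand-in audio projection
cls = nn.Linear(128, 10)                       # auxiliary class head
opt = torch.optim.Adam([*f_v.parameters(), *f_a.parameters(), *cls.parameters()])

vis, aud = torch.randn(8, 512), torch.randn(8, 128)
labels = torch.randint(0, 10, (8,))
zv, za = f_v(vis), f_a(aud)
emb_loss = F.cosine_embedding_loss(zv, za, torch.ones(8))   # align matching pairs
cls_loss = F.cross_entropy(cls(zv), labels) + F.cross_entropy(cls(za), labels)
loss = emb_loss + 0.1 * cls_loss               # the 0.1 weighting is a hypothetical choice
opt.zero_grad(); loss.backward(); opt.step()
```

At retrieval time, the best match is simply the item of the other modality with the highest cosine similarity in this shared space.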
Contrastive Learning (pairwise L2 hinge loss)
#AVTS Korbar, Bruno, Du Tran, and Lorenzo Torresani. "Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization." NIPS 2018.
[Figure: a positive (in-sync) A/V pair and a negative (out-of-sync) A/V pair.]
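A distinctive ingredient here is how negatives are sampled; a simplified sketch (names and shapes hypothetical): "easy" negatives take audio from a different video, "hard" negatives take out-of-sync audio from the same video, and a pairwise hinge loss like the one sketched earlier is applied to both kinds.

```python
import torch

def sample_pairs(video_clips, audio_clips, hard=False):
    """video_clips / audio_clips: (B, 2, ...) two time offsets per video, streams aligned.
    Returns (video, audio, match) for a pairwise contrastive/hinge loss."""
    B = video_clips.size(0)
    pos_a = audio_clips[:, 0]                      # in-sync audio -> positives
    if hard:
        neg_a = audio_clips[:, 1]                  # same video, shifted in time
    else:
        neg_a = audio_clips.roll(1, dims=0)[:, 0]  # audio from a different video
    video = torch.cat([video_clips[:, 0], video_clips[:, 0]])
    audio = torch.cat([pos_a, neg_a])
    match = torch.cat([torch.ones(B), torch.zeros(B)])
    return video, audio, match

video = torch.rand(4, 2, 3, 8, 64, 64)   # (B, offset, C, T, H, W), stand-in clips
audio = torch.rand(4, 2, 1, 257, 100)    # matching spectrogram snippets
v, a, m = sample_pairs(video, audio, hard=True)
```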
Image hallucination from sound
Lyu, Jeonghyun, Takashi Shinozaki, and Kaoru Amano. "Generating Images from Sounds Using Multimodal Features and GANs." (2018).
Chih Wen Lin, "Generating Images from Audio." NeurIPS 2018 Creativity Workshop.
Conditional image generation based on StackGAN (stage I).
Wan, C. H., Chuang, S. P., & Lee, H. Y. "Towards audio to scene image synthesis using generative adversarial network." ICASSP 2019.
Video hallucination from sound
#Sound2Sight Cherian, Anoop, Moitreya Chatterjee, and Narendra Ahuja. "Sound2Sight: Generating visual dynamics from sound and context." ECCV 2020.
Avatar animation with music (skeletons)
Shlizerman, E., Dery, L., Schoen, H., & Kemelmacher-Shlizerman, I. "Audio to body dynamics." CVPR 2018.
Sound Source Localization
Senocak, Arda, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. "Learning to Localize Sound Source in Visual Scenes." CVPR 2018.
Piano Transcription (MIDI)
Koepke, A. S., Wiles, O., Moses, Y., and Zisserman, A. "Sight to Sound: An End-to-End Approach for Visual Piano Transcription." ICASSP 2020.
Silent Video Sonorization (MIDI)
#FoleyMusic Gan, Chuang, Deng Huang, Peihao Chen, Joshua B. Tenenbaum, and Antonio Torralba. "Foley Music: Learning to Generate Music from Videos." ECCV 2020.
Sound Separation
Chuang Gan, Deng Huang, Hang Zhao, Joshua B. Tenenbaum, and Antonio Torralba. "Music Gesture for Visual Sound Separation." CVPR 2020.
Audio-visual Navigation with Deep RL
#SoundSpaces Chen, Changan, Unnat Jain, Carl Schissler, Sebastia Vicenc Amengual Gari, Ziad Al-Halah, Vamsi Krishna Ithapu, Philip Robinson, and Kristen Grauman. "SoundSpaces: Audio-visual navigation in 3D environments." ECCV 2020.