1. Thessaloniki, October 2020
GAN-based Video Summarization
Vasileios Mezaris
CERTH-ITI
Presentation at the AI4Media
Workshop on GANs for Media
Content Generation
Joint work with
E. Apostolidis, E. Adamantidou,
A. Metsai (CERTH-ITI);
I. Patras (QMUL)
Problem statement
Video summary: a short visual summary that encapsulates the flow of the story and the essential parts of the full-length video
[Figure: original video → video summary (storyboard)]
Problem statement
Applications of video summarization
Professional CMS: effective indexing, browsing, retrieval & promotion of media assets
Video sharing platforms: improved viewer experience, enhanced viewer engagement & increased content consumption
Other summarization scenarios: movie trailer production, sports highlights video generation, video synopsis of 24h surveillance recordings
Related work
Deep-learning approaches
Various supervised methods (i.e., learning from ground-truth, manually-generated summaries)
Using feedforward neural nets (e.g., CNNs) for identifying semantically-important video parts
Exploiting video-level metadata
Capturing the story flow using recurrent neural nets (e.g., LSTMs)
…and many more
Unsupervised algorithms that do not rely on human annotations, and build summaries
Using adversarial learning to: minimize the distance between videos and their summary-based reconstructions; maximize the mutual information between summary and video; learn a mapping from raw videos to human-like summaries based on summaries available online
…and a few more approaches (see tutorial at IEEE ICME 2020, https://www.slideshare.net/VasileiosMezaris/icme2020-tutorial-videosummarizationpart1)
+ No need for training data (limited, hard to produce)
+ Avoid the subjectivity & biases of manually-generated summaries
+ Adaptability to different types of video
GANs for unsupervised video summarization
Our starting point: the SUM-GAN architecture [1]
Main idea: build a keyframe selection mechanism by minimizing the distance between the deep representations of the original video and a reconstructed version of it based on the selected keyframes
Problem: how to define a good distance?
Solution: use a trainable discriminator network!
Goal: train the Summarizer to maximally confuse the Discriminator when distinguishing the original from the reconstructed video
SUM-GAN
[1] B. Mahasseni, M. Lam, S. Todorovic, "Unsupervised Video Summarization with Adversarial LSTM Networks", 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2982-2991.
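The adversarial objective can be illustrated with a toy loss computation. This is a generic GAN-style sketch, not the actual SUM-GAN code; the probabilities and loss form are assumptions for illustration only:

```python
import math

def bce(p, label):
    """Binary cross-entropy for one predicted probability p and target label."""
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

# Hypothetical discriminator outputs: probability that its input is the
# ORIGINAL video (rather than the summary-based reconstruction).
p_original, p_reconstructed = 0.9, 0.2

# The Discriminator wants: original -> 1, reconstruction -> 0.
d_loss = bce(p_original, 1) + bce(p_reconstructed, 0)

# The Summarizer wants the reconstruction to be classified as original,
# i.e. to maximally confuse the Discriminator.
g_loss = bce(p_reconstructed, 1)
```

When the Summarizer improves, p_reconstructed rises toward p_original, g_loss shrinks, and the reconstruction distance implicitly defined by the Discriminator becomes harder to exploit.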
SUM-GAN-sl
GANs for unsupervised video summarization
SUM-GAN-sl introduces two extensions [2]:
A linear compression layer that reduces the size of the CNN feature vectors
An incremental and fine-grained approach to train the model's components
[2] E. Apostolidis, A. Metsai, E. Adamantidou, V. Mezaris, I. Patras, "A Stepwise, Label-based Approach for Improving the Adversarial Training in Unsupervised Video Summarization", Proc. 1st Int. Workshop on AI for Smart TV Content Production, Access and Delivery (AI4TV'19) at ACM Multimedia 2019, Nice, France, October 2019.
Incremental approach to train the model's components
(regularization factor)
SUM-GAN-AAE
GANs for unsupervised video summarization
Adversarial learning driven by a deterministic attention auto-encoder
The VAE of the previous architecture was entirely replaced by an attention auto-encoder (AAE) network, forming the SUM-GAN-AAE architecture [3]
[3] E. Apostolidis, E. Adamantidou, A. Metsai, V. Mezaris, I. Patras, "Unsupervised Video Summarization via Attention-Driven Adversarial Learning", Proc. 26th Int. Conf. on Multimedia Modeling (MMM2020), Daejeon, Korea, Jan. 2020.
Attention auto-encoder
Processing pipeline:
Weighted feature vectors fed to the Encoder
Encoder's output (V) and Decoder's previous hidden state fed to the Attention component
For t > 1: use the hidden state of the previous Decoder's step (ht-1)
For t = 1: use the hidden state of the last Encoder's step (hE)
Attention weights (αt) computed using an energy score function followed by a soft-max function
αt multiplied with V to form the context vector vt'
vt' combined with Decoder's previous output yt-1
Decoder gradually reconstructs the video
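One decoding step of the attention pipeline above can be sketched as follows. This is a minimal pure-Python illustration: the simple dot-product energy score is a stand-in for SUM-GAN-AAE's learned energy function, and the toy dimensions are assumptions:

```python
import math

def attention_step(V, h_prev):
    """One decoding step of an attention auto-encoder (sketch).

    V      : list of T encoder output vectors (one per frame)
    h_prev : previous decoder hidden state (last encoder state at t = 1)
    """
    # Energy score per frame; a plain dot product stands in for the
    # learned energy score function of the actual model.
    e = [sum(vi * hi for vi, hi in zip(v, h_prev)) for v in V]
    # Soft-max turns energies into attention weights alpha that sum to 1.
    m = max(e)
    exp_e = [math.exp(x - m) for x in e]
    s = sum(exp_e)
    alpha = [x / s for x in exp_e]
    # Context vector v_t': attention-weighted combination of the rows of V.
    d = len(V[0])
    v_ctx = [sum(alpha[t] * V[t][k] for t in range(len(V))) for k in range(d)]
    return alpha, v_ctx

# Toy run: T = 3 frames, feature dimension 2.
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha, v_ctx = attention_step(V, [1.0, 0.0])
```

The context vector is then combined with the Decoder's previous output before the next decoding step, exactly as in the pipeline listed above.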
Video summarization practicalities
Model's I/O and summarization process
Input: the CNN feature vectors of the (sampled) video frames
Output: frame-level importance scores
Summarization process:
CNN features pass through the linear compression layer and the frame selector, and importance scores are computed at the frame level
Given a video segmentation (using KTS), fragment-level importance scores are calculated by averaging the scores of each fragment's frames
The summary is created by selecting the fragments that maximize the total importance score, provided that the summary length does not exceed 15% of the video duration, by solving the 0/1 Knapsack problem
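The fragment selection step can be sketched as a standard 0/1 knapsack over fragment durations and importance scores. The function below is a minimal illustration under assumed toy inputs, not the authors' implementation:

```python
def select_fragments(scores, durations, budget):
    """Pick the subset of fragments maximizing total importance score
    subject to a total-duration budget (0/1 knapsack, DP over duration).

    scores    : per-fragment importance (e.g., mean of frame scores)
    durations : per-fragment duration in frames (integers)
    budget    : max summary duration (e.g., 15% of the video's frames)
    """
    n = len(scores)
    # dp[w] = (best total score, chosen fragment indices) within duration w
    dp = [(0.0, [])] * (budget + 1)
    for i in range(n):
        # Iterate durations downwards so each fragment is used at most once.
        for w in range(budget, durations[i] - 1, -1):
            cand = dp[w - durations[i]][0] + scores[i]
            if cand > dp[w][0]:
                dp[w] = (cand, dp[w - durations[i]][1] + [i])
    return sorted(dp[budget][1])

# Toy example: 5 fragments, budget = 15% of a 100-frame video = 15 frames.
chosen = select_fragments([0.9, 0.2, 0.7, 0.4, 0.8], [6, 4, 5, 7, 4], 15)
```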
Experiments
Datasets
SumMe (https://gyglim.github.io/me/vsum/index.html#benchmark)
25 videos capturing multiple events (e.g. cooking and sports)
video length: 1 to 6 min
annotation: fragment-based video summaries
TVSum (https://github.com/yalesong/tvsum)
50 videos from 10 categories of TRECVid MED task
video length: 1 to 11 min
annotation: frame-level importance scores
Evaluation protocol
The generated summary should not exceed 15% of the video length
Similarity between an automatically generated summary (A) and a ground-truth summary (G) is expressed by the F-Score (%), with Precision and Recall measuring their temporal overlap (∩); || · || denotes duration:
P = ||A ∩ G|| / ||A||, R = ||A ∩ G|| / ||G||, F = 2 · P · R / (P + R)
These are the typical metrics for computing Precision and Recall at the frame level
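With frame-level binary indicators for the two summaries, the overlap-based metrics can be sketched as follows (a minimal illustration, assuming both summaries are given as 0/1 vectors over the same sampled frames):

```python
def summary_f_score(auto, gt):
    """Frame-level Precision / Recall / F-Score between an automatically
    generated summary and a ground-truth summary, both 0/1 lists."""
    overlap = sum(a & g for a, g in zip(auto, gt))  # frames in both summaries
    if overlap == 0:
        return 0.0
    precision = overlap / sum(auto)  # fraction of the auto summary that matches
    recall = overlap / sum(gt)       # fraction of the gt summary recovered
    return 2 * precision * recall / (precision + recall)

# Toy 6-frame video: the two summaries share 2 of their 3 selected frames.
f = summary_f_score([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
```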
Slight but important distinction w.r.t. what is eventually used as the ground-truth summary
Most used approach in the literature: the generated summary is compared against each of the N available user summaries of a video, yielding F-Score1, F-Score2, …, F-ScoreN; the final score is the maximum of these N values on SumMe, and their average on TVSum
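Under this protocol, the per-video score can be sketched as follows; a minimal, self-contained illustration in which the helper names and toy summaries are assumptions:

```python
def summary_f_score(auto, gt):
    """Frame-level F-Score between two 0/1 summary vectors."""
    overlap = sum(a & g for a, g in zip(auto, gt))
    if overlap == 0:
        return 0.0
    p, r = overlap / sum(auto), overlap / sum(gt)
    return 2 * p * r / (p + r)

def multi_user_score(auto, user_summaries, dataset):
    """Compare against each of the N user summaries, then reduce:
    maximum on SumMe, average on TVSum (the most used protocol)."""
    scores = [summary_f_score(auto, u) for u in user_summaries]
    return max(scores) if dataset == "SumMe" else sum(scores) / len(scores)

# Toy 4-frame video with N = 3 user summaries.
users = [[1, 1, 0, 0], [1, 0, 0, 1], [0, 1, 1, 0]]
auto = [1, 1, 0, 0]
s_summe = multi_user_score(auto, users, "SumMe")
s_tvsum = multi_user_score(auto, users, "TVSum")
```

The max reduction rewards agreeing with any one annotator (SumMe's diverse fragment-based summaries), while the average rewards agreeing with all of them (TVSum's frame-level scores).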
Alternative approach: a single ground-truth summary per video is used, and the generated summary is compared against it, yielding a single F-Score
Implementation details
Videos were down-sampled to 2 fps
Feature extraction was based on the pool5 layer of GoogLeNet trained on ImageNet
A linear compression layer reduces the size of these vectors from 1024 to 500
All components are 2-layer LSTMs with 500 hidden units; the frame selector is a bi-directional LSTM
Training is based on the Adam optimizer; the Summarizer's learning rate is 10^-4 and the Discriminator's is 10^-5
Each dataset was split into two non-overlapping sets: a training set with 80% of the data and a testing set with the remaining 20%
Experiments were run on 5 differently created random splits, and the average performance at the training-epoch level (i.e., for the same training epoch) over these runs is reported
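The evaluation setup above can be sketched as follows; a minimal illustration with an assumed list of video IDs and a fixed seed (the exact splits used in the paper are not reproduced here):

```python
import random

def make_splits(video_ids, n_splits=5, train_frac=0.8, seed=42):
    """Create n_splits random train/test partitions; within each split the
    two sets are non-overlapping (80% / 20%)."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_splits):
        ids = video_ids[:]
        rng.shuffle(ids)
        cut = int(len(ids) * train_frac)
        splits.append((ids[:cut], ids[cut:]))  # (train, test)
    return splits

def epoch_average(per_split_scores):
    """Average the test F-Scores over the splits at each training epoch,
    i.e., compare runs at the same epoch index."""
    return [sum(epoch) / len(epoch) for epoch in zip(*per_split_scores)]

splits = make_splits([f"video_{i}" for i in range(50)])  # e.g., 50 TVSum videos
avg = epoch_average([[50.0, 52.0], [48.0, 54.0]])        # 2 splits x 2 epochs (toy)
```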
Comparison with SoA unsupervised approaches based on multiple user summaries
Outcomes:
A few SoA methods are comparable to (or even worse than) a random summary generator
The best method on TVSum shows random-level performance on SumMe
The best method on SumMe performs worse than SUM-GAN-AAE and is less competitive on TVSum
Variational attention reduces SUM-GAN-sl's performance, due to the difficulty of effectively learning two latent spaces while the model's components are continuously updated during training
Replacing the VAE with the AAE leads to a noticeable performance improvement over SUM-GAN-sl
Note: SUM-GAN is not listed in this table as it follows the single gt-summary evaluation protocol
Evaluating the effect of the AAE component
Training efficiency: much faster and more stable training of the model
[Figure: loss curves for SUM-GAN-sl and SUM-GAN-AAE]
Comparison with SoA supervised approaches based on multiple user summaries
Outcomes:
The best methods on TVSum (MAVS and Tessellation-sup, respectively) seem adapted to this dataset, as they exhibit random-level performance on SumMe
Only a few supervised methods surpass the performance of a random summary generator on both datasets, with VASNet being the best among them
The performance of these methods ranges between 44.1-49.7 on SumMe, and 56.1-61.4 on TVSum
The unsupervised SUM-GAN-AAE model is comparable with SoA supervised methods
(+/- in the table indicate better/worse performance compared to SUM-GAN-AAE)
Adapting / re-purposing the content
Main requirements:
Target distribution platforms & devices have varying requirements (e.g., the optimal duration of a video differs from one platform to another)
Target audiences have different preferences / information needs
Video summarization: create editions of the content that are adapted to different platforms and audiences
Web application [4] for video summarization (try it with your video!): http://multimedia2.iti.gr/videosummarization/service/start.html
Demo video: https://youtu.be/LbjPLJzeNII
[4] C. Collyda, K. Apostolidis, E. Apostolidis, E. Adamantidou, A. Metsai, V. Mezaris, "A Web Service for Video Summarization", Proc. ACM Int. Conf. on Interactive Media Experiences (IMX 2020), Barcelona, Spain, June 2020.
Conclusions
Presented two new video summarization methods, making use of:
the learning efficiency of generative adversarial networks for unsupervised training
the effectiveness of attention mechanisms in spotting the most important parts of the video
Experimental evaluations on two benchmark datasets:
documented the positive contribution of the introduced attention auto-encoder component to the model's training and summarization performance
highlighted the competitiveness of the unsupervised SUM-GAN-AAE method against SoA video summarization techniques
Used GANs in a new web application for video summarization
Keep in mind: complete automation is sometimes not desired! (AI + human symbiosis is key)
Questions?
Contact: Dr. Vasileios Mezaris
Information Technologies Institute
Centre for Research and Technology Hellas
Thermi-Thessaloniki, Greece
Tel: +30 2311 257770
Email: bmezaris@iti.gr, web: http://www.iti.gr/~bmezaris/
This work was supported in part by the EU’s Horizon 2020 research and innovation programme under grant
agreement H2020-780656 ReTV.