1. Thessaloniki, October 2020
GAN-based Video Summarization
Vasileios Mezaris
CERTH-ITI
Presentation at the AI4Media
Workshop on GANs for Media
Content Generation
Joint work with
E. Apostolidis, E. Adamantidou,
A. Metsai (CERTH-ITI);
I. Patras (QMUL)
Problem statement
Video summary: a short visual summary that encapsulates the flow of the story and the essential parts of the full-length video
[Figure: original video → video summary (storyboard)]
Problem statement
Applications of video summarization
Professional CMS: effective indexing, browsing, retrieval & promotion of media assets
Video sharing platforms: improved viewer experience, enhanced viewer engagement & increased content consumption
Other summarization scenarios: movie trailer production, sports highlights video generation, video synopsis of 24h surveillance recordings
Related work
Deep-learning approaches
Various supervised methods (i.e., learning from ground-truth, manually-generated summaries)
Using feedforward neural nets (e.g., CNNs) for identifying semantically-important video parts
Exploiting video-level metadata
Capturing the story flow using recurrent neural nets (e.g., LSTMs)
…and many more
Unsupervised algorithms that do not rely on human annotations, and build summaries
Using adversarial learning to: minimize the distance between videos and their summary-based reconstructions; maximize the mutual information between summary and video; learn a mapping from raw videos to human-like summaries based on summaries available online
…and a few more approaches (see tutorial at IEEE ICME 2020, https://www.slideshare.net/VasileiosMezaris/icme2020-tutorial-videosummarizationpart1)
+ No need for training data (limited, hard to produce)
+ Avoid the subjectivity & biases of manually-generated summaries
+ Adaptability to different types of video
GANs for unsupervised video summarization
Our starting point: the SUM-GAN architecture [1]
Main idea: build a keyframe selection mechanism by minimizing the distance between the deep representations of the original video and a reconstructed version of it based on the selected keyframes
Problem: how to define a good distance?
Solution: use a trainable discriminator network!
Goal: train the Summarizer to maximally confuse the Discriminator when distinguishing the original from the reconstructed video
SUM-GAN
[1] B. Mahasseni, M. Lam, S. Todorovic, "Unsupervised Video Summarization with Adversarial LSTM Networks", 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2982-2991.
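The adversarial objective can be illustrated with a toy loss computation. This is a generic GAN-style sketch, not the actual SUM-GAN code; the probabilities and loss form are assumptions for illustration only:

```python
import math

def bce(p, label):
    """Binary cross-entropy for one predicted probability p and target label."""
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

# Hypothetical discriminator outputs: probability that its input is the
# ORIGINAL video (rather than the summary-based reconstruction).
p_original, p_reconstructed = 0.9, 0.2

# The Discriminator wants: original -> 1, reconstruction -> 0.
d_loss = bce(p_original, 1) + bce(p_reconstructed, 0)

# The Summarizer wants the reconstruction to be classified as original,
# i.e. to maximally confuse the Discriminator.
g_loss = bce(p_reconstructed, 1)
```

When the Summarizer improves, p_reconstructed rises toward p_original, g_loss shrinks, and the reconstruction distance implicitly defined by the Discriminator becomes harder to exploit.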
SUM-GAN-sl
GANs for unsupervised video summarization
SUM-GAN-sl introduces two extensions [2]:
A linear compression layer that reduces the size of the CNN feature vectors
An incremental and fine-grained approach to train the model's components
[2] E. Apostolidis, A. Metsai, E. Adamantidou, V. Mezaris, I. Patras, "A Stepwise, Label-based Approach for Improving the Adversarial Training in Unsupervised Video Summarization", Proc. 1st Int. Workshop on AI for Smart TV Content Production, Access and Delivery (AI4TV'19) at ACM Multimedia 2019, Nice, France, October 2019.
Incremental approach to train the model's components
(regularization factor)
SUM-GAN-AAE
GANs for unsupervised video summarization
Adversarial learning driven by a deterministic attention auto-encoder
The VAE of the previous architecture was entirely replaced by an attention auto-encoder (AAE) network, forming the SUM-GAN-AAE architecture [3]
[3] E. Apostolidis, E. Adamantidou, A. Metsai, V. Mezaris, I. Patras, "Unsupervised Video Summarization via Attention-Driven Adversarial Learning", Proc. 26th Int. Conf. on Multimedia Modeling (MMM2020), Daejeon, Korea, Jan. 2020.
Attention auto-encoder
Processing pipeline:
Weighted feature vectors fed to the Encoder
Encoder's output (V) and Decoder's previous hidden state fed to the Attention component
For t > 1: use the hidden state of the previous Decoder's step (ht-1)
For t = 1: use the hidden state of the last Encoder's step (hE)
Attention weights (αt) computed using an energy score function followed by a soft-max function
αt multiplied with V to form the context vector vt'
vt' combined with Decoder's previous output yt-1
Decoder gradually reconstructs the video
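One decoding step of the attention pipeline above can be sketched as follows. This is a minimal pure-Python illustration: the simple dot-product energy score is a stand-in for SUM-GAN-AAE's learned energy function, and the toy dimensions are assumptions:

```python
import math

def attention_step(V, h_prev):
    """One decoding step of an attention auto-encoder (sketch).

    V      : list of T encoder output vectors (one per frame)
    h_prev : previous decoder hidden state (last encoder state at t = 1)
    """
    # Energy score per frame; a plain dot product stands in for the
    # learned energy score function of the actual model.
    e = [sum(vi * hi for vi, hi in zip(v, h_prev)) for v in V]
    # Soft-max turns energies into attention weights alpha that sum to 1.
    m = max(e)
    exp_e = [math.exp(x - m) for x in e]
    s = sum(exp_e)
    alpha = [x / s for x in exp_e]
    # Context vector v_t': attention-weighted combination of the rows of V.
    d = len(V[0])
    v_ctx = [sum(alpha[t] * V[t][k] for t in range(len(V))) for k in range(d)]
    return alpha, v_ctx

# Toy run: T = 3 frames, feature dimension 2.
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha, v_ctx = attention_step(V, [1.0, 0.0])
```

The context vector is then combined with the Decoder's previous output before the next decoding step, exactly as in the pipeline listed above.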
Video summarization practicalities
Model's I/O and summarization process
Input: the CNN feature vectors of the (sampled) video frames
Output: frame-level importance scores
Summarization process:
CNN features pass through the linear compression layer and the frame selector, and importance scores are computed at the frame level
Given a video segmentation (using KTS), fragment-level importance scores are calculated by averaging the scores of each fragment's frames
The summary is created by selecting the fragments that maximize the total importance score, provided that the summary length does not exceed 15% of the video duration, by solving the 0/1 Knapsack problem
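The fragment selection step can be sketched as a standard 0/1 knapsack over fragment durations and importance scores. The function below is a minimal illustration under assumed toy inputs, not the authors' implementation:

```python
def select_fragments(scores, durations, budget):
    """Pick the subset of fragments maximizing total importance score
    subject to a total-duration budget (0/1 knapsack, DP over duration).

    scores    : per-fragment importance (e.g., mean of frame scores)
    durations : per-fragment duration in frames (integers)
    budget    : max summary duration (e.g., 15% of the video's frames)
    """
    n = len(scores)
    # dp[w] = (best total score, chosen fragment indices) within duration w
    dp = [(0.0, [])] * (budget + 1)
    for i in range(n):
        # Iterate durations downwards so each fragment is used at most once.
        for w in range(budget, durations[i] - 1, -1):
            cand = dp[w - durations[i]][0] + scores[i]
            if cand > dp[w][0]:
                dp[w] = (cand, dp[w - durations[i]][1] + [i])
    return sorted(dp[budget][1])

# Toy example: 5 fragments, budget = 15% of a 100-frame video = 15 frames.
chosen = select_fragments([0.9, 0.2, 0.7, 0.4, 0.8], [6, 4, 5, 7, 4], 15)
```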
Experiments
Datasets
SumMe (https://gyglim.github.io/me/vsum/index.html#benchmark)
25 videos capturing multiple events (e.g. cooking and sports)
video length: 1 to 6 min
annotation: fragment-based video summaries
TVSum (https://github.com/yalesong/tvsum)
50 videos from 10 categories of TRECVid MED task
video length: 1 to 11 min
annotation: frame-level importance scores
Evaluation protocol
The generated summary should not exceed 15% of the video length
Similarity between an automatically generated summary (A) and a ground-truth summary (G) is expressed by the F-Score (%), with Precision and Recall measuring their temporal overlap (∩); || · || denotes duration:
P = ||A ∩ G|| / ||A||, R = ||A ∩ G|| / ||G||, F = 2 · P · R / (P + R)
These are the typical metrics for computing Precision and Recall at the frame level
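With frame-level binary indicators for the two summaries, the overlap-based metrics can be sketched as follows (a minimal illustration, assuming both summaries are given as 0/1 vectors over the same sampled frames):

```python
def summary_f_score(auto, gt):
    """Frame-level Precision / Recall / F-Score between an automatically
    generated summary and a ground-truth summary, both 0/1 lists."""
    overlap = sum(a & g for a, g in zip(auto, gt))  # frames in both summaries
    if overlap == 0:
        return 0.0
    precision = overlap / sum(auto)  # fraction of the auto summary that matches
    recall = overlap / sum(gt)       # fraction of the gt summary recovered
    return 2 * precision * recall / (precision + recall)

# Toy 6-frame video: the two summaries share 2 of their 3 selected frames.
f = summary_f_score([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
```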
Slight but important distinction w.r.t. what is eventually used as the ground-truth summary
Most used approach in the literature: the generated summary is compared against each of the N available user summaries of a video, yielding F-Score1, F-Score2, …, F-ScoreN; the final score is the maximum of these N values on SumMe, and their average on TVSum
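Under this protocol, the per-video score can be sketched as follows; a minimal, self-contained illustration in which the helper names and toy summaries are assumptions:

```python
def summary_f_score(auto, gt):
    """Frame-level F-Score between two 0/1 summary vectors."""
    overlap = sum(a & g for a, g in zip(auto, gt))
    if overlap == 0:
        return 0.0
    p, r = overlap / sum(auto), overlap / sum(gt)
    return 2 * p * r / (p + r)

def multi_user_score(auto, user_summaries, dataset):
    """Compare against each of the N user summaries, then reduce:
    maximum on SumMe, average on TVSum (the most used protocol)."""
    scores = [summary_f_score(auto, u) for u in user_summaries]
    return max(scores) if dataset == "SumMe" else sum(scores) / len(scores)

# Toy 4-frame video with N = 3 user summaries.
users = [[1, 1, 0, 0], [1, 0, 0, 1], [0, 1, 1, 0]]
auto = [1, 1, 0, 0]
s_summe = multi_user_score(auto, users, "SumMe")
s_tvsum = multi_user_score(auto, users, "TVSum")
```

The max reduction rewards agreeing with any one annotator (SumMe's diverse fragment-based summaries), while the average rewards agreeing with all of them (TVSum's frame-level scores).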
Alternative approach: a single ground-truth summary per video is used, and the generated summary is compared against it, yielding a single F-Score
Implementation details
Videos were down-sampled to 2 fps
Feature extraction was based on the pool5 layer of GoogLeNet trained on ImageNet
A linear compression layer reduces the size of these vectors from 1024 to 500
All components are 2-layer LSTMs with 500 hidden units; the frame selector is a bi-directional LSTM
Training is based on the Adam optimizer; the Summarizer's learning rate is 10^-4 and the Discriminator's is 10^-5
Each dataset was split into two non-overlapping sets: a training set with 80% of the data and a testing set with the remaining 20%
Experiments were run on 5 differently created random splits, and the average performance at the training-epoch level (i.e., for the same training epoch) over these runs is reported
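The evaluation setup above can be sketched as follows; a minimal illustration with an assumed list of video IDs and a fixed seed (the exact splits used in the paper are not reproduced here):

```python
import random

def make_splits(video_ids, n_splits=5, train_frac=0.8, seed=42):
    """Create n_splits random train/test partitions; within each split the
    two sets are non-overlapping (80% / 20%)."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_splits):
        ids = video_ids[:]
        rng.shuffle(ids)
        cut = int(len(ids) * train_frac)
        splits.append((ids[:cut], ids[cut:]))  # (train, test)
    return splits

def epoch_average(per_split_scores):
    """Average the test F-Scores over the splits at each training epoch,
    i.e., compare runs at the same epoch index."""
    return [sum(epoch) / len(epoch) for epoch in zip(*per_split_scores)]

splits = make_splits([f"video_{i}" for i in range(50)])  # e.g., 50 TVSum videos
avg = epoch_average([[50.0, 52.0], [48.0, 54.0]])        # 2 splits x 2 epochs (toy)
```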
Comparison with SoA unsupervised approaches based on multiple user summaries
Outcomes:
A few SoA methods are comparable to (or even worse than) a random summary generator
The best method on TVSum shows random-level performance on SumMe
The best method on SumMe performs worse than SUM-GAN-AAE and is less competitive on TVSum
Variational attention reduces SUM-GAN-sl's performance, due to the difficulty of effectively learning two latent spaces while the model's components are continuously updated during training
Replacing the VAE with the AAE leads to a noticeable performance improvement over SUM-GAN-sl
Note: SUM-GAN is not listed in this table as it follows the single gt-summary evaluation protocol
Evaluating the effect of the AAE component
Training efficiency: much faster and more stable training of the model
[Figure: loss curves for SUM-GAN-sl and SUM-GAN-AAE]
Comparison with SoA supervised approaches based on multiple user summaries
Outcomes:
The best methods on TVSum (MAVS and Tessellation-sup, respectively) seem adapted to this dataset, as they exhibit random-level performance on SumMe
Only a few supervised methods surpass the performance of a random summary generator on both datasets, with VASNet being the best among them
The performance of these methods ranges between 44.1-49.7 on SumMe, and 56.1-61.4 on TVSum
The unsupervised SUM-GAN-AAE model is comparable with SoA supervised methods
(+/- in the table indicate better/worse performance compared to SUM-GAN-AAE)
Adapting / re-purposing the content
Main requirements:
Target distribution platforms & devices have varying requirements (e.g., the optimal duration of a video differs from one platform to another)
Target audiences have different preferences / information needs
Video summarization: create editions of the content that are adapted to different platforms and audiences
Web application [4] for video summarization (try it with your video!): http://multimedia2.iti.gr/videosummarization/service/start.html
Demo video: https://youtu.be/LbjPLJzeNII
[4] C. Collyda, K. Apostolidis, E. Apostolidis, E. Adamantidou, A. Metsai, V. Mezaris, "A Web Service for Video Summarization", Proc. ACM Int. Conf. on Interactive Media Experiences (IMX 2020), Barcelona, Spain, June 2020.
Conclusions
Presented two new video summarization methods, making use of:
the learning efficiency of generative adversarial networks for unsupervised training
the effectiveness of attention mechanisms in spotting the most important parts of the video
Experimental evaluations on two benchmark datasets:
documented the positive contribution of the introduced attention auto-encoder component to the model's training and summarization performance
highlighted the competitiveness of the unsupervised SUM-GAN-AAE method against SoA video summarization techniques
Used GANs in a new web application for video summarization
Keep in mind: complete automation is sometimes not desired! (AI + human symbiosis is key)
Questions?
Contact: Dr. Vasileios Mezaris
Information Technologies Institute
Centre for Research and Technology Hellas
Thermi-Thessaloniki, Greece
Tel: +30 2311 257770
Email: bmezaris@iti.gr, web: http://www.iti.gr/~bmezaris/
This work was supported in part by the EU’s Horizon 2020 research and innovation programme under grant
agreement H2020-780656 ReTV.