Slides for the paper "Performance over Random: A robust evaluation protocol for video summarization methods", authored by E. Apostolidis, E. Adamantidou, A. Metsai, V. Mezaris, and I. Patras, published in the Proceedings of ACM Multimedia 2020 (ACM MM), Seattle, WA, USA, Oct. 2020.
1.
Performance over Random: A Robust Evaluation Protocol for
Video Summarization Methods
E. Apostolidis1,2, E. Adamantidou1, A. I. Metsai1, V. Mezaris1, I. Patras2
1 CERTH-ITI, Thermi - Thessaloniki, Greece
2 School of EECS, Queen Mary University of London, London, UK
28th ACM Int. Conf. on Multimedia
Seattle, WA, USA, October 2020
2.
Outline
What’s the goal of video summarization?
How to evaluate video summarization?
Established evaluation protocol and its weaknesses
Proposed approach: Performance over Random
Experiments
Conclusions
3.
What’s the goal of video summarization?
Video summary: a short visual synopsis that encapsulates the flow of the story and the essential parts of the full-length video
(Figure: the original video is condensed into a summary, presented either as 1. a video storyboard or 2. a video skim)
4.
How to evaluate video summarization?
An evaluation approach along with a benchmark dataset for video summarization was
introduced in [11]
SumMe dataset (https://gyglim.github.io/me/vsum/index.html#benchmark)
25 videos capturing multiple events (e.g. cooking and sports)
video length: 1 to 6 min
annotation: fragment-based video summaries (15-18 per video)
Evaluating video skims
[11] M. Gygli, H. Grabner, H. Riemenschneider, L. Van Gool. 2014. Creating Summaries from User Videos. In Proc. of the 2014 European
Conf. on Computer Vision (ECCV), D. Fleet, T. Pajdla, B. Schiele, T. Tuytelaars (Eds.). Springer International Publishing, Cham, 505–520.
5.
How to evaluate video summarization?
Agreement between an automatically-generated (A) and a user-defined (U) summary is expressed by the F-Score (%), with (P)recision and (R)ecall measuring their temporal overlap (∩)
Precision and Recall are typically computed at the frame level (see the formulas below)
80% of video samples are used for training and the remaining 20% for testing
Typically, the generated summary should not exceed 15% of the video length
Evaluating video skims
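For completeness (the slide's formula graphic is not part of this transcript), these are the frame-level metrics commonly used with this protocol, where |A ∩ U| is the number of frames in the temporal overlap of the two summaries:

P = \frac{|A \cap U|}{|A|}, \qquad R = \frac{|A \cap U|}{|U|}, \qquad F\text{-Score} = \frac{2 \cdot P \cdot R}{P + R} \times 100\ (\%)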
6.
How to evaluate video summarization?
The same protocol was also used to evaluate summarization on another benchmark dataset [12]
TVSum dataset (https://github.com/yalesong/tvsum)
50 videos from 10 categories of the TRECVid MED task
video length: 1 to 11 min
annotation: frame-level importance scores (20 per video)
Evaluating video skims
[12] Y. Song, J. Vallmitjana, A. Stent, A. Jaimes. 2015. TVSum: Summarizing Web Videos Using Titles. In Proc. of the 2015 IEEE Conf. on
Computer Vision and Pattern Recognition (CVPR). 5179–5187.
7.
Established evaluation protocol
Most commonly used benchmark datasets: SumMe and TVSum
Alignment between automatically-created and user-defined summaries quantified by F-Score
Max of the computed values is kept for SumMe; Average of these values is kept for TVSum
Summary length should be less than 15% of the video duration
80% of data is used for training (plus validation) and the remaining 20% for testing
Most works perform evaluations using 5 different randomly-created data splits and report the
average performance
Variations of this setting (1 split, 10 splits, “few” splits, 5-fold cross-validation) also exist
Typical setting in the literature
8.
Studying the established protocol
Setting of the study
Considered aspects
Representativeness of results when evaluation relies on a small set of randomly-created splits
Reliability of performance comparisons that use different data splits for each algorithm
Used algorithms
Supervised dppLSTM [14] and VASNet [15] methods
Unsupervised DR-DSN [16], SUM-GAN-sl [17] and SUM-GAN-AAE [18] methods
First experiment: performance evaluation using a fixed set of 5 randomly-created data splits of SumMe and TVSum
Second experiment: performance evaluation using a fixed set of 50 randomly-created data splits of SumMe and TVSum (split creation sketched below)
Plus: comparison with the reported values in the corresponding papers
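To make the split setting concrete, here is a minimal sketch (not the authors' code) of how K random 80/20 train/test splits over a dataset's videos could be created; the function name and interface are illustrative.

import random

def make_random_splits(video_ids, k=5, train_ratio=0.8, seed=42):
    """Create k random train/test splits with an 80/20 ratio (illustrative sketch)."""
    rng = random.Random(seed)
    splits = []
    for _ in range(k):
        ids = list(video_ids)
        rng.shuffle(ids)
        cut = int(round(train_ratio * len(ids)))
        splits.append({"train": ids[:cut], "test": ids[cut:]})
    return splits

# e.g., SumMe has 25 videos, so each split holds 20 training and 5 test videos
summe_splits = make_random_splits([f"video_{i}" for i in range(1, 26)], k=5)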
10.
Studying the established protocol
Outcomes
Noticeable difference between the evaluation results on 5 and on 50 splits
Differences between 5 and 50 splits are often larger than the differences between methods
The methods' rankings differ between 5 and 50 splits; moreover, they do not match the ranking based on the reported results
Limited representativeness of results when the evaluation relies on a few data splits
Serious lack of reliability of comparisons that rely on a limited number of data splits
(Table: values denote F-Score (%); Rep. is the value reported in the relevant paper; best score in bold, second-best underlined)
12.
Studying the established protocol
Noticeable variability of performance over the set of splits
Variability follows a quite similar pattern for all methods
Hypothesis: different levels of difficulty for the used splits
13.
How to mitigate the observed weaknesses?
Check potential association between the method’s performance and a measure of how
challenging each data split is
Use these data splits and examine the performance of:
Random Summarizer
Average Human Summarizer
Reduce the impact of the used data splits
16.
Estimate random performance
For a given video of a test set:
1) Assign random frame-level importance scores, drawn from a uniform distribution
2) Compute fragment-level importance scores from the frame-level ones
3) Create the summary of the random summarizer by selecting fragments via Knapsack, respecting the summary length budget (a code sketch follows below)
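A minimal sketch (not the authors' code) of the three steps above for one video. It assumes the video is already segmented into fragments given as inclusive (start, end) frame ranges, that a fragment's score is the mean of its frame scores (an assumption; the slide only says "fragment-level importance scores"), and that the budget is 15% of the video length, as in the established protocol.

import random

def random_summary(n_frames, fragments, budget_ratio=0.15, seed=None):
    rng = random.Random(seed)
    # 1) Random frame-level importance scores drawn from a uniform distribution
    frame_scores = [rng.random() for _ in range(n_frames)]
    # 2) Fragment-level importance scores (here: mean score of the frames in each fragment)
    frag_scores = [sum(frame_scores[s:e + 1]) / (e - s + 1) for s, e in fragments]
    frag_lengths = [e - s + 1 for s, e in fragments]
    # 3) Select fragments with a 0/1 Knapsack: maximize total fragment score
    #    subject to a summary length of at most 15% of the video duration
    capacity = int(budget_ratio * n_frames)
    n = len(fragments)
    dp = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, v = frag_lengths[i - 1], frag_scores[i - 1]
        for c in range(capacity + 1):
            dp[i][c] = dp[i - 1][c]
            if w <= c and dp[i - 1][c - w] + v > dp[i][c]:
                dp[i][c] = dp[i - 1][c - w] + v
    selected, c = [], capacity
    for i in range(n, 0, -1):
        if dp[i][c] != dp[i - 1][c]:  # fragment i-1 was taken
            selected.append(i - 1)
            c -= frag_lengths[i - 1]
    summary = [0] * n_frames  # binary frame-level summary vector
    for i in selected:
        s, e = fragments[i]
        summary[s:e + 1] = [1] * (e - s + 1)
    return summary

# Example: a 300-frame video with ten 30-frame fragments; at most 45 frames (15%) are selected
print(sum(random_summary(300, [(i * 30, i * 30 + 29) for i in range(10)], seed=0)))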
21.
Estimate random performance
For a given video of a test set:
4) Compare the random summary with each of the N user-generated summaries, obtaining F-Score_1, F-Score_2, ..., F-Score_N
F-Score for Video #1 = max{F-Score_i, i = 1..N} for SumMe, or avg{F-Score_i, i = 1..N} for TVSum
23.
Estimate random performance
For the entire test set of a data split:
Compute the F-Score of each of the M test videos as above, and average them to obtain the F-Score for the test set
Repeat the whole procedure 100 times and average the outcomes, for a robust estimate of the random performance on this split (see the aggregation sketch below)
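A minimal sketch (not the authors' code) of the aggregation just described; fscores_per_video[m] is assumed to hold the F-Scores of the random summary of test video m against each of its user summaries, and the max/avg choice mirrors the established per-dataset protocol.

from statistics import mean

def split_fscore(fscores_per_video, dataset="SumMe"):
    """F-Score of a test set: per-video max (SumMe) or average (TVSum) over users, then averaged over videos."""
    per_video = [max(fs) if dataset == "SumMe" else mean(fs) for fs in fscores_per_video]
    return mean(per_video)

# The random-summarization experiment is repeated 100 times;
# averaging the 100 split-level F-Scores gives the random-performance estimate of the split.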
32.
Estimate average human performance
For a given video of a test set:
For each user i (i = 1, ..., N): compare the user's summary with the summaries of the remaining users, obtaining F-Score_i2, F-Score_i3, ..., F-Score_iN, and combine them into a single score F-Score_i for that user
Average F-Score_1, ..., F-Score_N over the N users to obtain the F-Score for Video #1
For the entire test set of a data split: compute the F-Score of each of the M test videos in this way, and average them to obtain the final F-Score (a code sketch follows below)
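A minimal sketch (not the authors' code) of the per-video human-performance estimate. Here fscore(a, b) stands for the frame-level F-Score between two binary summary vectors, and combining each user's pairwise scores via max (SumMe) / average (TVSum) is an assumption made to mirror the established protocol; the slides do not spell out this combination step.

from statistics import mean

def human_fscore_for_video(user_summaries, fscore, dataset="SumMe"):
    per_user = []
    for i, candidate in enumerate(user_summaries):
        # Evaluate user i's summary against the summaries of all remaining users
        scores = [fscore(candidate, ref) for j, ref in enumerate(user_summaries) if j != i]
        per_user.append(max(scores) if dataset == "SumMe" else mean(scores))
    # Average over the N users; averaging these per-video values over the M test videos
    # gives the final (human) F-Score of the split
    return mean(per_user)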
33.
Updated performance curve
Noticeable variance in the performance of the random and the human summarizer
Different levels of difficulty for the used splits
34.
How to decide on the most suitable measure?
Covariance: a measure of the joint variability of two jointly distributed real-valued random variables X and Y with finite second moments
Pearson Correlation Coefficient: the normalized version of the Covariance; its magnitude indicates the strength of the linear relation (values in [-1, 1]; see the definitions below)
Correlation with the performance of random and human summarizers
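The standard definitions referred to above, added here since the slide's equations are not included in the transcript:

\operatorname{cov}(X, Y) = \mathbb{E}\big[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\big], \qquad \rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y} \in [-1, 1]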
35.
How to decide on the most suitable measure?
Correlation with the performance of random and human summarizers
In terms of performance, the tested methods show a clearly stronger correlation with the random summarizer than with the human summarizer
36.
Proposed approach: Performance over Random (PoR)
Core idea
Estimate the difficulty of a data split by computing the performance of a random summarizer
Exploit this information when using the data split to assess a video summarization algorithm
Main targets
Reduce the impact of the used data splits on the performance evaluation
Increase the representativeness of evaluation outcomes
Enhance the reliability of comparisons based on different data splits
39.
Proposed approach: Performance over Random (PoR)
Computing steps
For a given summarization method and a data split:
1) Compute Ƒ, the performance (F-Score, %) of a random summarizer for this split
2) Compute the method's performance S (F-Score, %, obtained with the established evaluation protocol) on the data split
3) Compute "Performance over Random" as: PoR = (S / Ƒ) · 100
PoR < 100: performance worse than the baseline (random summarizer)
PoR > 100: performance better than the baseline (random summarizer)
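A worked example with hypothetical numbers (none of these values come from the paper; \bar{F} below denotes the random-summarizer score written as Ƒ on the slide):

\mathrm{PoR} = \frac{S}{\bar{F}} \cdot 100, \qquad \text{e.g. } S = 45.0\%,\ \bar{F} = 40.0\% \;\Rightarrow\; \mathrm{PoR} = \frac{45.0}{40.0} \cdot 100 = 112.5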
40.
Experiments
Representativeness of performance evaluation
Considered evaluation approaches:
Estimate performance using F-Score
Estimate performance using Performance over Random (PoR)
Estimate performance using Performance over Human (PoH)
Methods’ performance was examined on:
The large-scale setting of 50 fixed splits
20 fixed split-sets of 5 data splits each
Main focus: the extent to which the methods' performance varies across the different data splits / split-sets
Used measure: Relative Standard Deviation (RSD)
PoH = (S / H) · 100, where H is the average human performance for the split
RSD(x) = STD(x) / Mean(x)
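A minimal sketch (not the authors' code) of how PoR, PoH and RSD can be computed from per-split scores; all numbers below are made up for illustration.

import numpy as np

s      = np.array([42.1, 44.7, 39.8, 45.3, 41.0])  # method F-Scores (%) per split (made-up)
f_rand = np.array([38.0, 41.2, 36.5, 40.1, 37.4])  # random-summarizer F-Scores (%) per split (made-up)
h_user = np.array([54.0, 56.1, 52.3, 55.0, 53.2])  # average human F-Scores (%) per split (made-up)

por = s / f_rand * 100  # Performance over Random, per split
poh = s / h_user * 100  # Performance over Human, per split

def rsd(x):
    # Relative Standard Deviation: STD divided by the mean
    return x.std() / x.mean()

print(rsd(s), rsd(por), rsd(poh))  # compare the variability of the three measures across splits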
42.
Experiments
Representativeness of performance evaluation
Similar RSD values for F-Score and PoH in most cases
Remarkably smaller RSD values for PoR
Reminder: the results need to vary as little as possible!
PoR is more representative of an algorithm's performance
43.
Experiments
Reliability of performance comparisons
Performance comparisons in the literature rely on the values reported in the relevant papers, while the used data splits are completely unknown
But the data splits can affect the evaluation outcomes!
To assess the robustness of each evaluation protocol to such comparisons:
Simulate 20 such comparisons by creating 20 mixed split-sets (a hypothetical sketch follows below)
Rank the methods from best to worst
Generation of mixed split-sets
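The slides do not spell out the exact construction of a mixed split-set; the following is a loudly hypothetical sketch of one plausible reading (each compared method is assigned a randomly chosen split-set out of the fixed ones), not the paper's actual procedure.

import random

def mixed_split_set(methods, fixed_split_sets, seed=None):
    # Pair each method with a randomly chosen split-set, so that the compared
    # methods are evaluated on different data splits (hypothetical reading of "mixed")
    rng = random.Random(seed)
    return {method: rng.choice(fixed_split_sets) for method in methods}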
47.
Experiments
Reliability of performance comparisons
For each method, we studied: i) its overall ranking and ii) the variation of its ranking, when using a) the 20 fixed split-sets and b) the 20 mixed split-sets
Variation was quantified by computing the STD of a method's ranking over the group of split-sets
PoR is much more robust than F-Score
51.
Experiments
Reliability of performance comparisons
Using the same (fixed) split-sets:
Same average ranking for all methods under both evaluation protocols
Using different (mixed) split-sets:
The average ranking may differ, as PoR considers the difficulty of each split-set
The STD of the average ranking differs significantly between F-Score and PoR
Lower STD values for PoR
PoR is more suitable for comparing methods run on different split-sets
52.
Conclusions
Early experiments documented the varying difficulty of the different randomly-created data splits of the established benchmark datasets
Most state-of-the-art works use just a handful of different splits for evaluation
The varying difficulty significantly affects the evaluation results and the reliability of performance comparisons that rely on the reported values
New evaluation protocol: Performance over Random (PoR), which takes into consideration estimates of the level of difficulty of each used data split
Experiments documented the increased robustness of PoR over F-Score and its suitability for comparing methods run on different split-sets
53.
References
1. S. E. F. de Avila, A. da Luz Jr., A. de A. Araújo, M. Cord. 2008. VSUMM: An Approach for Automatic Video Summarization and Quantitative Evaluation. In Proc.
of the 2008 XXI Brazilian Symposium on Computer Graphics and Image Processing. 103–110.
2. N. Ejaz, I. Mehmood, S. W. Baik. 2014. Feature Aggregation Based Visual Attention model for Video Summarization. Computers and Electrical Engineering 40,
3 (2014), 993 – 1005. Special Issue on Image and Video Processing.
3. V. Chasanis, A. Likas, N. Galatsanos. 2008. Efficient Video Shot Summarization Using an Enhanced Spectral Clustering Approach. In Proc. of the Artificial
Neural Networks - ICANN 2008, V. Kurková, R. Neruda, J. Koutník (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 847–856.
4. S. E. F. de Avila, A. P. B. Lopes, A. da Luz Jr., A. de A. Araújo. 2011. VSUMM: A Mechanism Designed to Produce Static Video Summaries and a Novel Evaluation
Method. Pattern Recognition Letters 32, 1 (Jan. 2011), 56–68.
5. J. Almeida, N. J. Leite, R. da S. Torres. 2012. VISON: VIdeo Summarization for ONline Applications. Pattern Recogn. Lett. 33, 4 (March 2012), 397–409.
6. E. J. Y. C. Cahuina, G. C. Chavez. 2013. A New Method for Static Video Summarization Using Local Descriptors and Video Temporal Segmentation. In Proc. of
the 2013 XXVI Conf. on Graphics, Patterns and Images. 226–233.
7. N. Ejaz, T. Bin Tariq, S. W. Baik. 2012. Adaptive Key Frame Extraction for Video Summarization Using an Aggregation Mechanism. Journal of Visual
Communication and Image Representation 23, 7 (Oct. 2012), 1031–1040.
8. H. Jacob, F. L. Pádua, A. Lacerda, A. C. Pereira. 2017. A Video Summarization Approach Based on the Emulation of Bottom-up Mechanisms of Visual Attention.
Journal of Intelligent Information Systems 49, 2 (Oct. 2017), 193–211.
9. K. M. Mahmoud, N. M. Ghanem, M. A. Ismail. 2013. Unsupervised Video Summarization via Dynamic Modeling-Based Hierarchical Clustering. In Proc. of the
12th Int. Conf. on Machine Learning and Applications, Vol. 2. 303–308.
10. B. Gong, W.-L. Chao, K. Grauman, F. Sha. 2014. Diverse Sequential Subset Selection for Supervised Video Summarization. In Advances in Neural Information
Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, K. Q.Weinberger (Eds.). Curran Associates, Inc., 2069–2077.
54.
References (cont.)
11. M. Gygli, H. Grabner, H. Riemenschneider, L. Van Gool. 2014. Creating Summaries from User Videos. In Proc. of the 2014 European Conf. on Computer Vision
(ECCV), D. Fleet, T. Pajdla, B. Schiele, T. Tuytelaars (Eds.). Springer International Publishing, Cham, 505–520.
12. Y. Song, J. Vallmitjana, A. Stent, A. Jaimes. 2015. TVSum: Summarizing Web Videos Using Titles. In Proc. of the 2015 IEEE Conf. on Computer Vision and
Pattern Recognition (CVPR). 5179–5187.
13. E. Rahtu, M. Otani, Y. Nakashima, J. Heikkilä. 2019. Rethinking the Evaluation of Video Summaries. In Proc. of the 2019 IEEE Conf. on Computer Vision and
Pattern Recognition (CVPR).
14. K. Zhang, W.-L. Chao, F. Sha, K. Grauman. 2016. Video Summarization with Long Short-Term Memory. In Proc. of the 2016 European Conf. on Computer
Vision (ECCV), B. Leibe, J. Matas, N. Sebe, M. Welling (Eds.). Springer International Publishing, Cham, 766–782.
15. J. Fajtl, H. S. Sokeh, V. Argyriou, D. Monekosso, P. Remagnino. 2019. Summarizing Videos with Attention. In Proc. of the 2018 Asian Conf. on Computer Vision
(ACCV) Workshops, G. Carneiro, S. You (Eds.). Springer International Publishing, Cham, 39–54.
16. K. Zhou, Y. Qiao, T. Xiang. 2018. Deep Reinforcement Learning for Unsupervised Video Summarization with Diversity-Representativeness Reward. In Proc. of
the 2018 AAAI Conf. on Artificial Intelligence
17. E. Apostolidis, A. I. Metsai, E. Adamantidou, V. Mezaris, I. Patras. 2019. A Stepwise, Label-based Approach for Improving the Adversarial Training in
Unsupervised Video Summarization. In Proc. Of the 1st Int. Workshop on AI for Smart TV Content Production, Access and Delivery (Nice, France) (AI4TV ’19).
Association for Computing Machinery, New York, NY, USA, 17–25.
18. E. Apostolidis, E. Adamantidou, A. I. Metsai, V. Mezaris, I. Patras. 2020. Unsupervised Video Summarization via Attention-Driven Adversarial Learning. In Proc.
of the MultiMedia Modeling 2020, Y. M. Ro, W.-H. Cheng, J. Kim, W.-T. Chu, P. Cui, J.-W. Choi, M.-C. Hu, W. De Neve (Eds.). Springer International Publishing,
Cham, 492–504.
55.
Thank you for your attention!
Questions?
Evlampios Apostolidis, apostolid@iti.gr
Vasileios Mezaris, bmezaris@iti.gr
Code and documentation publicly available at:
https://github.com/e-apostolidis/PoR-Summarization-Measure
This work was supported by the EU's Horizon 2020 research and innovation
programme under grant agreement H2020-780656 ReTV. The work of Ioannis
Patras has been supported by EPSRC under grant No. EP/R026424/1.