Slides for the paper "Performance over Random: A robust evaluation protocol for video summarization methods", authored by E. Apostolidis, E. Adamantidou, A. Metsai, V. Mezaris, and I. Patras, published in the Proceedings of ACM Multimedia 2020 (ACM MM), Seattle, WA, USA, Oct. 2020.
1.
Performance over Random: A Robust Evaluation Protocol for
Video Summarization Methods
E. Apostolidis1,2, E. Adamantidou1, A. I. Metsai1, V. Mezaris1, I. Patras2
1 CERTH-ITI, Thermi - Thessaloniki, Greece
2 School of EECS, Queen Mary University of London, London, UK
28th ACM Int. Conf. on Multimedia
Seattle, WA, USA, October 2020
2.
Outline
What’s the goal of video summarization?
How to evaluate video summarization?
Established evaluation protocol and its weaknesses
Proposed approach: Performance over Random
Experiments
Conclusions
3.
What’s the goal of video summarization?
Video summary: a short visual synopsis that encapsulates the flow of the story and the essential parts of the full-length video
(Figure: the original video is condensed into a summary, presented either as 1. a video storyboard or 2. a video skim)
4.
How to evaluate video summarization?
An evaluation approach along with a benchmark dataset for video summarization was
introduced in [11]
SumMe dataset (https://gyglim.github.io/me/vsum/index.html#benchmark)
25 videos capturing multiple events (e.g. cooking and sports)
video length: 1 to 6 min
annotation: fragment-based video summaries (15-18 per video)
Evaluating video skims
[11] M. Gygli, H. Grabner, H. Riemenschneider, L. Van Gool. 2014. Creating Summaries from User Videos. In Proc. of the 2014 European
Conf. on Computer Vision (ECCV), D. Fleet, T. Pajdla, B. Schiele, T. Tuytelaars (Eds.). Springer International Publishing, Cham, 505–520.
5.
How to evaluate video summarization?
Agreement between an automatically-generated (A) and a user-defined (U) summary is expressed by the F-Score (%), with (P)recision and (R)ecall measuring their temporal overlap (∩)
Precision and Recall are typically computed at the frame level (see the formulas below)
80% of video samples are used for training and the remaining 20% for testing
Typically, the generated summary should not exceed 15% of the video length
Evaluating video skims
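For completeness (the slide's formula graphic is not part of this transcript), these are the frame-level metrics commonly used with this protocol, where |A ∩ U| is the number of frames in the temporal overlap of the two summaries:

P = \frac{|A \cap U|}{|A|}, \qquad R = \frac{|A \cap U|}{|U|}, \qquad F\text{-Score} = \frac{2 \cdot P \cdot R}{P + R} \times 100\ (\%)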
6.
How to evaluate video summarization?
The same protocol was also used to evaluate summarization on another benchmark dataset [12]
TVSum dataset (https://github.com/yalesong/tvsum)
50 videos from 10 categories of the TRECVid MED task
video length: 1 to 11 min
annotation: frame-level importance scores (20 per video)
Evaluating video skims
[12] Y. Song, J. Vallmitjana, A. Stent, A. Jaimes. 2015. TVSum: Summarizing Web Videos Using Titles. In Proc. of the 2015 IEEE Conf. on
Computer Vision and Pattern Recognition (CVPR). 5179–5187.
7.
Established evaluation protocol
Most commonly used benchmark datasets: SumMe and TVSum
Alignment between automatically-created and user-defined summaries quantified by F-Score
Max of the computed values is kept for SumMe; Average of these values is kept for TVSum
Summary length should be less than 15% of the video duration
80% of data is used for training (plus validation) and the remaining 20% for testing
Most works perform evaluations using 5 different randomly-created data splits and report the
average performance
Variations of this setting (1 split, 10 splits, “few” splits, 5-fold cross-validation) also exist
Typical setting in the literature
8.
Studying the established protocol
Setting of the study
Considered aspects
Representativeness of results when evaluation relies on a small set of randomly-created splits
Reliability of performance comparisons that use different data splits for each algorithm
Used algorithms
Supervised dppLSTM [14] and VASNet [15] methods
Unsupervised DR-DSN [16], SUM-GAN-sl [17] and SUM-GAN-AAE [18] methods
First experiment: performance evaluation using a fixed set of 5 randomly-created data splits of SumMe and TVSum
Second experiment: performance evaluation using a fixed set of 50 randomly-created data splits of SumMe and TVSum (split creation sketched below)
Plus: comparison with the reported values in the corresponding papers
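To make the split setting concrete, here is a minimal sketch (not the authors' code) of how K random 80/20 train/test splits over a dataset's videos could be created; the function name and interface are illustrative.

import random

def make_random_splits(video_ids, k=5, train_ratio=0.8, seed=42):
    """Create k random train/test splits with an 80/20 ratio (illustrative sketch)."""
    rng = random.Random(seed)
    splits = []
    for _ in range(k):
        ids = list(video_ids)
        rng.shuffle(ids)
        cut = int(round(train_ratio * len(ids)))
        splits.append({"train": ids[:cut], "test": ids[cut:]})
    return splits

# e.g., SumMe has 25 videos, so each split holds 20 training and 5 test videos
summe_splits = make_random_splits([f"video_{i}" for i in range(1, 26)], k=5)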
10.
Studying the established protocol
Outcomes
Noticeable difference between the evaluation results on 5 and on 50 splits
Differences between 5 and 50 splits are often larger than the differences between methods
The methods' rankings differ between 5 and 50 splits; moreover, they do not match the ranking based on the reported results
Limited representativeness of results when the evaluation relies on a few data splits
Serious lack of reliability of comparisons that rely on a limited number of data splits
(Table: values denote F-Score (%); Rep. is the value reported in the relevant paper; best score in bold, second-best underlined)
12.
Studying the established protocol
Noticeable variability of performance over the set of splits
Variability follows a quite similar pattern for all methods
Hypothesis: different levels of difficulty for the used splits
13.
How to mitigate the observed weaknesses?
Check potential association between the method’s performance and a measure of how
challenging each data split is
Use these data splits and examine the performance of:
Random Summarizer
Average Human Summarizer
Reduce the impact of the used data splits
16.
Estimate random performance
For a given video of a test set:
1) Assign random frame-level importance scores, drawn from a uniform distribution
2) Compute fragment-level importance scores from the frame-level ones
3) Create the summary of the random summarizer by selecting fragments via Knapsack, respecting the summary length budget (a code sketch follows below)
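A minimal sketch (not the authors' code) of the three steps above for one video. It assumes the video is already segmented into fragments given as inclusive (start, end) frame ranges, that a fragment's score is the mean of its frame scores (an assumption; the slide only says "fragment-level importance scores"), and that the budget is 15% of the video length, as in the established protocol.

import random

def random_summary(n_frames, fragments, budget_ratio=0.15, seed=None):
    rng = random.Random(seed)
    # 1) Random frame-level importance scores drawn from a uniform distribution
    frame_scores = [rng.random() for _ in range(n_frames)]
    # 2) Fragment-level importance scores (here: mean score of the frames in each fragment)
    frag_scores = [sum(frame_scores[s:e + 1]) / (e - s + 1) for s, e in fragments]
    frag_lengths = [e - s + 1 for s, e in fragments]
    # 3) Select fragments with a 0/1 Knapsack: maximize total fragment score
    #    subject to a summary length of at most 15% of the video duration
    capacity = int(budget_ratio * n_frames)
    n = len(fragments)
    dp = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, v = frag_lengths[i - 1], frag_scores[i - 1]
        for c in range(capacity + 1):
            dp[i][c] = dp[i - 1][c]
            if w <= c and dp[i - 1][c - w] + v > dp[i][c]:
                dp[i][c] = dp[i - 1][c - w] + v
    selected, c = [], capacity
    for i in range(n, 0, -1):
        if dp[i][c] != dp[i - 1][c]:  # fragment i-1 was taken
            selected.append(i - 1)
            c -= frag_lengths[i - 1]
    summary = [0] * n_frames  # binary frame-level summary vector
    for i in selected:
        s, e = fragments[i]
        summary[s:e + 1] = [1] * (e - s + 1)
    return summary

# Example: a 300-frame video with ten 30-frame fragments; at most 45 frames (15%) are selected
print(sum(random_summary(300, [(i * 30, i * 30 + 29) for i in range(10)], seed=0)))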
21.
Estimate random performance
For a given video of a test set:
4) Compare the random summary with each of the N user-generated summaries, obtaining F-Score_1, F-Score_2, ..., F-Score_N
F-Score for Video #1 = max{F-Score_i, i = 1..N} for SumMe, or avg{F-Score_i, i = 1..N} for TVSum
23.
Estimate random performance
For the entire test set of a data split:
Compute the F-Score of each of the M test videos as above, and average them to obtain the F-Score for the test set
Repeat the whole procedure 100 times and average the outcomes, for a robust estimate of the random performance on this split (see the aggregation sketch below)
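A minimal sketch (not the authors' code) of the aggregation just described; fscores_per_video[m] is assumed to hold the F-Scores of the random summary of test video m against each of its user summaries, and the max/avg choice mirrors the established per-dataset protocol.

from statistics import mean

def split_fscore(fscores_per_video, dataset="SumMe"):
    """F-Score of a test set: per-video max (SumMe) or average (TVSum) over users, then averaged over videos."""
    per_video = [max(fs) if dataset == "SumMe" else mean(fs) for fs in fscores_per_video]
    return mean(per_video)

# The random-summarization experiment is repeated 100 times;
# averaging the 100 split-level F-Scores gives the random-performance estimate of the split.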
32.
Estimate average human performance
For a given video of a test set:
For each user i (i = 1, ..., N): compare the user's summary with the summaries of the remaining users, obtaining F-Score_i2, F-Score_i3, ..., F-Score_iN, and combine them into a single score F-Score_i for that user
Average F-Score_1, ..., F-Score_N over the N users to obtain the F-Score for Video #1
For the entire test set of a data split: compute the F-Score of each of the M test videos in this way, and average them to obtain the final F-Score (a code sketch follows below)
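A minimal sketch (not the authors' code) of the per-video human-performance estimate. Here fscore(a, b) stands for the frame-level F-Score between two binary summary vectors, and combining each user's pairwise scores via max (SumMe) / average (TVSum) is an assumption made to mirror the established protocol; the slides do not spell out this combination step.

from statistics import mean

def human_fscore_for_video(user_summaries, fscore, dataset="SumMe"):
    per_user = []
    for i, candidate in enumerate(user_summaries):
        # Evaluate user i's summary against the summaries of all remaining users
        scores = [fscore(candidate, ref) for j, ref in enumerate(user_summaries) if j != i]
        per_user.append(max(scores) if dataset == "SumMe" else mean(scores))
    # Average over the N users; averaging these per-video values over the M test videos
    # gives the final (human) F-Score of the split
    return mean(per_user)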
33.
Updated performance curve
Noticeable variance in the performance of the random and the human summarizer
Different levels of difficulty for the used splits
34.
How to decide on the most suitable measure?
Covariance: a measure of the joint variability of two jointly distributed real-valued random variables X and Y with finite second moments
Pearson Correlation Coefficient: the normalized version of the Covariance; its magnitude indicates the strength of the linear relation (values in [-1, 1]; see the definitions below)
Correlation with the performance of random and human summarizers
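The standard definitions referred to above, added here since the slide's equations are not included in the transcript:

\operatorname{cov}(X, Y) = \mathbb{E}\big[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\big], \qquad \rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y} \in [-1, 1]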
35.
How to decide on the most suitable measure?
Correlation with the performance of random and human summarizers
In terms of performance, the tested methods show a clearly stronger correlation with the random summarizer than with the human summarizer
36.
Proposed approach: Performance over Random (PoR)
Core idea
Estimate the difficulty of a data split by computing the performance of a random summarizer
Exploit this information when using the data split to assess a video summarization algorithm
Main targets
Reduce the impact of the used data splits on the performance evaluation
Increase the representativeness of evaluation outcomes
Enhance the reliability of comparisons based on different data splits
39.
Proposed approach: Performance over Random (PoR)
Computing steps
For a given summarization method and a data split:
1) Compute Ƒ, the performance (F-Score, %) of a random summarizer for this split
2) Compute the method's performance S (F-Score, %, obtained with the established evaluation protocol) on the data split
3) Compute "Performance over Random" as: PoR = (S / Ƒ) · 100
PoR < 100: performance worse than the baseline (random summarizer)
PoR > 100: performance better than the baseline (random summarizer)
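A worked example with hypothetical numbers (none of these values come from the paper; \bar{F} below denotes the random-summarizer score written as Ƒ on the slide):

\mathrm{PoR} = \frac{S}{\bar{F}} \cdot 100, \qquad \text{e.g. } S = 45.0\%,\ \bar{F} = 40.0\% \;\Rightarrow\; \mathrm{PoR} = \frac{45.0}{40.0} \cdot 100 = 112.5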
40.
Experiments
Representativeness of performance evaluation
Considered evaluation approaches:
Estimate performance using F-Score
Estimate performance using Performance over Random (PoR)
Estimate performance using Performance over Human (PoH)
Methods’ performance was examined on:
The large-scale setting of 50 fixed splits
20 fixed split-sets of 5 data splits each
Main focus: the extent to which the methods' performance varies across the different data splits / split-sets
Used measure: Relative Standard Deviation (RSD)
PoH = (S / H) · 100, where H is the average human performance for the split
RSD(x) = STD(x) / Mean(x)
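A minimal sketch (not the authors' code) of how PoR, PoH and RSD can be computed from per-split scores; all numbers below are made up for illustration.

import numpy as np

s      = np.array([42.1, 44.7, 39.8, 45.3, 41.0])  # method F-Scores (%) per split (made-up)
f_rand = np.array([38.0, 41.2, 36.5, 40.1, 37.4])  # random-summarizer F-Scores (%) per split (made-up)
h_user = np.array([54.0, 56.1, 52.3, 55.0, 53.2])  # average human F-Scores (%) per split (made-up)

por = s / f_rand * 100  # Performance over Random, per split
poh = s / h_user * 100  # Performance over Human, per split

def rsd(x):
    # Relative Standard Deviation: STD divided by the mean
    return x.std() / x.mean()

print(rsd(s), rsd(por), rsd(poh))  # compare the variability of the three measures across splits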
42.
Experiments
Representativeness of performance evaluation
Similar RSD values for F-Score and PoH in most cases
Remarkably smaller RSD values for PoR
Reminder: the results need to vary as little as possible!
PoR is more representative of an algorithm's performance
43.
Experiments
Reliability of performance comparisons
Performance comparisons in the literature rely on the values reported in the relevant papers, while the used data splits are completely unknown
But the data splits can affect the evaluation outcomes!
To assess the robustness of each evaluation protocol to such comparisons:
Simulate 20 such comparisons by creating 20 mixed split-sets (a hypothetical sketch follows below)
Rank the methods from best to worst
Generation of mixed split-sets
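The slides do not spell out the exact construction of a mixed split-set; the following is a loudly hypothetical sketch of one plausible reading (each compared method is assigned a randomly chosen split-set out of the fixed ones), not the paper's actual procedure.

import random

def mixed_split_set(methods, fixed_split_sets, seed=None):
    # Pair each method with a randomly chosen split-set, so that the compared
    # methods are evaluated on different data splits (hypothetical reading of "mixed")
    rng = random.Random(seed)
    return {method: rng.choice(fixed_split_sets) for method in methods}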
47.
Experiments
Reliability of performance comparisons
For each method, we studied: i) its overall ranking and ii) the variation of its ranking, when using a) the 20 fixed split-sets and b) the 20 mixed split-sets
Variation was quantified by computing the STD of a method's ranking over the group of split-sets
PoR is much more robust than F-Score
51.
Experiments
Reliability of performance comparisons
Using the same (fixed) split-sets:
Same average ranking for all methods under both evaluation protocols
Using different (mixed) split-sets:
The average ranking may differ, as PoR considers the difficulty of each split-set
The STD of the average ranking differs significantly between F-Score and PoR
Lower STD values for PoR
PoR is more suitable for comparing methods run on different split-sets
52.
Conclusions
Early experiments documented the varying difficulty of the different randomly-created data splits of the established benchmark datasets
Most state-of-the-art works use just a handful of different splits for evaluation
The varying difficulty significantly affects the evaluation results and the reliability of performance comparisons that rely on the reported values
New evaluation protocol: Performance over Random (PoR), which takes into consideration estimates of the level of difficulty of each used data split
Experiments documented the increased robustness of PoR over F-Score and its suitability for comparing methods run on different split-sets
53.
References
1. S. E. F. de Avila, A. da Luz Jr., A. de A. Araújo, M. Cord. 2008. VSUMM: An Approach for Automatic Video Summarization and Quantitative Evaluation. In Proc.
of the 2008 XXI Brazilian Symposium on Computer Graphics and Image Processing. 103–110.
2. N. Ejaz, I. Mehmood, S. W. Baik. 2014. Feature Aggregation Based Visual Attention model for Video Summarization. Computers and Electrical Engineering 40,
3 (2014), 993 – 1005. Special Issue on Image and Video Processing.
3. V. Chasanis, A. Likas, N. Galatsanos. 2008. Efficient Video Shot Summarization Using an Enhanced Spectral Clustering Approach. In Proc. of the Artificial
Neural Networks - ICANN 2008, V. Kurková, R. Neruda, J. Koutník (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 847–856.
4. S. E. F. de Avila, A. P. B. Lopes, A. da Luz Jr., A. de A. Araújo. 2011. VSUMM: A Mechanism Designed to Produce Static Video Summaries and a Novel Evaluation
Method. Pattern Recognition Letters 32, 1 (Jan. 2011), 56–68.
5. J. Almeida, N. J. Leite, R. da S. Torres. 2012. VISON: VIdeo Summarization for ONline Applications. Pattern Recogn. Lett. 33, 4 (March 2012), 397–409.
6. E. J. Y. C. Cahuina, G. C. Chavez. 2013. A New Method for Static Video Summarization Using Local Descriptors and Video Temporal Segmentation. In Proc. of
the 2013 XXVI Conf. on Graphics, Patterns and Images. 226–233.
7. N. Ejaz, T. Bin Tariq, S. W. Baik. 2012. Adaptive Key Frame Extraction for Video Summarization Using an Aggregation Mechanism. Journal of Visual
Communication and Image Representation 23, 7 (Oct. 2012), 1031–1040.
8. H. Jacob, F. L. Pádua, A. Lacerda, A. C. Pereira. 2017. A Video Summarization Approach Based on the Emulation of Bottom-up Mechanisms of Visual Attention.
Journal of Intelligent Information Systems 49, 2 (Oct. 2017), 193–211.
9. K. M. Mahmoud, N. M. Ghanem, M. A. Ismail. 2013. Unsupervised Video Summarization via Dynamic Modeling-Based Hierarchical Clustering. In Proc. of the
12th Int. Conf. on Machine Learning and Applications, Vol. 2. 303–308.
10. B. Gong, W.-L. Chao, K. Grauman, F. Sha. 2014. Diverse Sequential Subset Selection for Supervised Video Summarization. In Advances in Neural Information
Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, K. Q.Weinberger (Eds.). Curran Associates, Inc., 2069–2077.
54.
References (cont.)
11. M. Gygli, H. Grabner, H. Riemenschneider, L. Van Gool. 2014. Creating Summaries from User Videos. In Proc. of the 2014 European Conf. on Computer Vision
(ECCV), D. Fleet, T. Pajdla, B. Schiele, T. Tuytelaars (Eds.). Springer International Publishing, Cham, 505–520.
12. Y. Song, J. Vallmitjana, A. Stent, A. Jaimes. 2015. TVSum: Summarizing Web Videos Using Titles. In Proc. of the 2015 IEEE Conf. on Computer Vision and
Pattern Recognition (CVPR). 5179–5187.
13. E. Rahtu, M. Otani, Y. Nakashima, J. Heikkilä. 2019. Rethinking the Evaluation of Video Summaries. In Proc. of the 2019 IEEE Conf. on Computer Vision and
Pattern Recognition (CVPR).
14. K. Zhang, W.-L. Chao, F. Sha, K. Grauman. 2016. Video Summarization with Long Short-Term Memory. In Proc. of the 2016 European Conf. on Computer
Vision (ECCV), B. Leibe, J. Matas, N. Sebe, M. Welling (Eds.). Springer International Publishing, Cham, 766–782.
15. J. Fajtl, H. S. Sokeh, V. Argyriou, D. Monekosso, P. Remagnino. 2019. Summarizing Videos with Attention. In Proc. of the 2018 Asian Conf. on Computer Vision
(ACCV) Workshops, G. Carneiro, S. You (Eds.). Springer International Publishing, Cham, 39–54.
16. K. Zhou, Y. Qiao, T. Xiang. 2018. Deep Reinforcement Learning for Unsupervised Video Summarization with Diversity-Representativeness Reward. In Proc. of
the 2018 AAAI Conf. on Artificial Intelligence
17. E. Apostolidis, A. I. Metsai, E. Adamantidou, V. Mezaris, I. Patras. 2019. A Stepwise, Label-based Approach for Improving the Adversarial Training in
Unsupervised Video Summarization. In Proc. Of the 1st Int. Workshop on AI for Smart TV Content Production, Access and Delivery (Nice, France) (AI4TV ’19).
Association for Computing Machinery, New York, NY, USA, 17–25.
18. E. Apostolidis, E. Adamantidou, A. I. Metsai, V. Mezaris, I. Patras. 2020. Unsupervised Video Summarization via Attention-Driven Adversarial Learning. In Proc.
of the MultiMedia Modeling 2020, Y. M. Ro, W.-H. Cheng, J. Kim, W.-T. Chu, P. Cui, J.-W. Choi, M.-C. Hu, W. De Neve (Eds.). Springer International Publishing,
Cham, 492–504.
55.
Thank you for your attention!
Questions?
Evlampios Apostolidis, apostolid@iti.gr
Vasileios Mezaris, bmezaris@iti.gr
Code and documentation publicly available at:
https://github.com/e-apostolidis/PoR-Summarization-Measure
This work was supported by the EU's Horizon 2020 research and innovation
programme under grant agreement H2020-780656 ReTV. The work of Ioannis
Patras has been supported by EPSRC under grant No. EP/R026424/1.