SlideShare a Scribd company logo
1 of 25
Download to read offline
Combining Global and Local Attention with Positional Encoding
for Video Summarization
E. Apostolidis1,2, G. Balaouras1, V. Mezaris1, I. Patras2
1 Information Technologies Institute, CERTH, Thermi - Thessaloniki, Greece
2 School of EECS, Queen Mary University of London, London, UK
23rd IEEE International Symposium
on Multimedia (ISM 2021)
Outline
• Problem statement
• Related work
• Developed approach
• Experiments
• Conclusions
1
Problem statement
2
Video is everywhere!
• Captured by smart devices and
instantly shared online
• Constantly and rapidly increasing
volumes of video content on the Web
Hours of video content uploaded on
YouTube every minute
Image sources: https://www.financialexpress.com/india-news/govt-agencies-adopt-new-age-video-
sharing-apps-like-tiktok/1767354/ (left) & https://www.statista.com/ (right)
Problem statement
3
But how to find what we are looking for in endless collections of video content?
Quickly inspect a video’s
content by checking its
synopsis!
Image source: https://www.voicendata.com/sprint-removes-video-streaming-limits/
Goal of video summarization technologies
4
“Generate a short visual synopsis
that summarizes the video content
by selecting the most informative
and important parts of it”
Video title: “Susan Boyle's First Audition - I Dreamed a
Dream - Britain's Got Talent 2009”
Video source: OVP dataset (video also available at:
https://www.youtube.com/watch?v=deRF9oEbRso)
Static video summary
(made of selected key-frames)
Dynamic video summary
(made of selected key-fragments)
Video content
Key-fragment
Key-frame
Analysis outcomes
This synopsis can be:
• Static, composed of a set representative
video (key-)frames (a.k.a. video storyboard)
• Dynamic, formed of a set of representative
video (key-)fragments (a.k.a. video skim)
Related work (supervised approaches)
• Methods modeling the variable-range temporal dependence among frames, using:
• Structures of Recurrent Neural Networks (RNNs)
• Combinations of LSTMs with external storage or memory layers
• Attention mechanisms combined with classic or seq2seq RNN-based architectures
• Methods modeling the spatiotemporal structure of the video, using:
• 3D-Convolutional Neural Networks
• Combinations of CNNs with convolutional LSTMs, GRUs and optical flow maps
• Methods learning summarization using Generative Adversarial Networks
• Summarizer aims to fool Discriminator when distinguishing machine- from human-generated summary
• Methods modeling frames’ dependence by combining self-attention mechanisms, with:
• Trainable Regressor Networks that estimate frames’ importance
• Fragmentations of the video content in hierarchical key-frame selection strategies
• Multiple representations of the visual content (CNN-based & Inflated 3D ConvNet)
5
Developed approach
Starting point
• VASNet model (Fajtl et al., 2018)
• Seq2seq transformation using a soft
self-attention mechanism
Main weaknesses
• Lack of knowledge about the temporal
position of video frames
• Growing difficulty to estimate frames’
importance as video duration increases
6
J. Fajtl, H. S. Sokeh, V. Argyriou, D. Monekosso, P. Remagnino, “Summarizing Videos with Attention,” in
Asian Conference on Computer Vision 2018 Workshops. Cham: Springer Int. Publishing, 2018, pp. 39-54.
Image source: Fajtl et al., “Summarizing Videos with Attention”,
ACCV 2018 Workshops
Developed approach
7
New network architecture (PGL-SUM)
• Uses a multi-head attention mechanism
to model frames’ dependence
according to the entire frame sequence
• Uses multiple multi-head attention
mechanisms to model short-term
dependencies over smaller video parts
• Enhances these mechanisms by adding
a component that encodes the
temporal position of video frames
Developed approach
7
Global attention
• Uses a multi-head attention mechanism
to model frames’ dependence
according to the entire frame sequence
Developed approach
Global attention
8
Developed approach
Global attention
9
Developed approach
10
Local attention
• Uses multiple multi-head attention
mechanisms to model short-term
dependencies over smaller video parts
Developed approach
11
Feature fusion & importance estimation
• New representation that carries
information about each frame’s global
and local dependencies
• Residual skip connection aims to
facilitate backpropagation
• Regressor produces a set of frame-level
scores that indicate frames’ importance
Developed approach
12
Training time
• Compute MSE between machine and
human frame-level importance scores
Inference time
• Compute fragment-level importance
based on a video segmentation
• Create the summary by selecting the
key-fragments based on a time budget
and by solving the Knapsack problem
Experiments
13
Datasets
• SumMe (https://gyglim.github.io/me/vsum/index.html#benchmark)
• 25 videos capturing multiple events (e.g. cooking and sports)
• Video length: 1 to 6 min
• Annotation: fragment-based video summaries (15-18 per video)
• TVSum (https://github.com/yalesong/tvsum)
• 50 videos from 10 categories of TRECVid MED task
• Video length: 1 to 11 min
• Annotation: frame-level importance scores (20 per video)
Experiments
14
Evaluation approach
• The generated summary should not exceed 15% of the video length
• Agreement between automatically-generated (A) and user-defined (U) summary is
expressed by the F-Score (%), with (P)recision and (R)ecall measuring the temporal
overlap (∩) (|| || means duration)
• 80% of video samples are used for training and the remaining 20% for testing
• Model selection is based on a criterion that focuses on rapid changes in the curve of the
training loss values or (if such changes do not occur) the minimization of the loss value
• Summarization performance is formed by averaging the experimental results on five
randomly-created splits of data
Experiments
15
Implementation details
• Videos were down-sampled to 2 fps
• Feature extraction: pool5 layer of GoogleNet trained on ImageNet (D = 1024)
• # local attention mechanisms = # video segments M: 4
• # global and local attention heads: 8 and 4 respectively
• Learning rate and L2 regularization factor: 5 x 10-5 and 10-5 respectively
• Dropout rate: 0.5
• Network initialization approach: Xavier uniform (gain = √2; biases = 0.1)
• Training: full batch mode; Adam optimizer; 200 epochs
• Hardware: NVIDIA TITAN XP
Experiments
16
Sensitivity analysis
• # video segments (# local attention mech.)
& global-local data fusion approach
SumMe TVSum
# Segm.
Fusion
2 4 8 2 4 8
Addition 49.8 55.6 51.9 61.1 61.7 59.5
Average pooling 51.5 54.1 52.5 61.1 59.7 58.7
Max pooling 51.1 53.9 52.2 61.3 59.6 58.5
Multiplication 46.3 52.1 52.5 47.3 47.6 46.9
SumMe TVSum
Local
Global
2 4 8 2 4 8
2 52.4 52.4 52.8 61.4 61.6 61.4
4 54.9 58.5 49.2 60.9 61.3 60.4
8 55.8 58.8 57.1 60.3 61.1 60.9
16 56.7 57.7 54.9 61.3 60.9 60.1
• # local and global attention heads
Reported values represent F-Score (%)
Experiments
17
Performance comparisons with two attention-based methods
• Evaluation was made under the exact same experimental conditions
SumMe TVSum Average
Rank
Data
splits
F1 Rank F1 Rank
VASNet (Fajtl et al., 2018) 50.0 3 62.5 2 2.5 5 Rand
MSVA (Ghauri et al., 2021) 54.0 2 62.4 3 2.5 5 Rand
PGL-SUM (Ours) 57.1 1 62.7 1 1 5 Rand
J. Fajtl et al., “Summarizing Videos with Attention,” in Asian Conf. on Comp. Vision 2018 Workshops.
Cham: Springer Int. Publishing, 2018, pp. 39–54.
J. Ghauri et al., “Supervised Video Summarization Via Multiple Feature Sets with Parallel Attention,”
in 2021 IEEE Int. Conf. on Multimedia and Expo. CA, USA: IEEE, 2021, pp. 1–6.
F1 denotes F-Score (%)
Experiments
Performance comparisons with SoA supervised methods
SumMe TVSum Avg
Rank
Data
splits
F1 Rank F1 Rank
Random 40.2 19 54.4 16 17.5 -
vsLSTM 37.6 22 54.2 17 19.5 1 Rand
dppLSTM 38.6 21 54.7 15 18 1 Rand
ActionRanking 40.1 20 56.3 14 17 1 Rand
*vsLSTM+Att 43.2 16 - - 16 1 Rand
*dppLSTM+Att 43.8 15 - - 15 1 Rand
H-RNN 42.1 17 57.9 12 14.5 -
*A-AVS 43.9 14 59.4 8 11 5 Rand
*SF-CVS 46.0 9 58.0 11 10 -
SUM-FCN 47.5 7 56.8 13 10 M Rand
HSA-RNN 44.1 13 59.8 7 10 -
SumMe TVSum Avg
Rank
Data
splits
F1 Rank F1 Rank
CRSum 47.3 8 58.0 11 9.5 5 FCV
MAVS 40.3 18 66.8 1 9.5 5 FCV
TTH-RNN 44.3 12 60.2 6 9 -
*M-AVS 44.4 11 61.0 4 7.5 5 Rand
SUM-DeepLab 48.8 5 58.4 10 7.5 M Rand
*DASP 45.5 10 63.6 3 6.5 5 Rand
*SUM-GDA 52.8 3 58.9 9 6 5 FCV
SMLD 47.6 6 61.0 4 5 5 FCV
*H-MAN 51.8 4 60.4 5 4.5 5 FCV
SMN 58.3 1 64.5 2 1.5 1 Rand
PGL-SUM 55.6 2 61.0 4 3 5 Rand
18
Experiments
19
Ablation study
Core components
SumMe TVSum
Global
attention
Local
attention
Positional
encoding
Variant #1 X √ √ 46.9 52.4
Variant #2 √ X √ 46.7 59.9
Variant #3 √ √ X 53.1 61.0
PGL-SUM
(Proposed)
√ √ √ 55.6 61.0
Reported values represent F-Score (%)
Experiments
19
Ablation study
Core components
SumMe TVSum
Global
attention
Local
attention
Positional
encoding
Variant #1 X √ √ 46.9 52.4
Variant #2 √ X √ 46.7 59.9
Variant #3 √ √ X 53.1 61.0
PGL-SUM
(Proposed)
√ √ √ 55.6 61.0
Combing global and local attention
allows to learn a better modeling
of frames’ dependencies
Reported values represent F-Score (%)
Experiments
19
Ablation study
Core components
SumMe TVSum
Global
attention
Local
attention
Positional
encoding
Variant #1 X √ √ 46.9 52.4
Variant #2 √ X √ 46.7 59.9
Variant #3 √ √ X 53.1 61.0
PGL-SUM
(Proposed)
√ √ √ 55.6 61.0
Integrating knowledge about the
frames’ position positively affects
the summarization performance
Reported values represent F-Score (%)
Conclusions
20
• PGL-SUM model for supervised video summarization aims to improve
• modeling of long-range frames’ dependencies
• parallelization ability of the training process
• granularity level at which the temporal dependencies between frames are modeled
• PGL-SUM contains multiple attention mechanisms that encode frames’ position
• A global multi-head attention mechanism models frames’ dependence based on the entire video
• Multiple local multi-head attention mechanisms discover modelings of frames’ dependencies by
focusing on smaller parts of the video
• Experiments on two benchmark datasets (SumMe and TVSum)
• Showed PGL-SUM’s excellence against other methods that rely on self-attention mechanisms
• Demonstrated PGL-SUM’s competitiveness against other SoA supervised summarization approaches
• Documented the positive contribution of combining global and local multi-head attention with
absolute positional encoding
Thank you for your attention!
Questions?
Evlampios Apostolidis, apostolid@iti.gr
Code and documentation publicly available at:
https://github.com/e-apostolidis/PGL-SUM
This work was supported by the EUs Horizon 2020 research and innovation programme under
grant agreement H2020-832921 MIRROR, and by EPSRC under grant No. EP/R026424/1

More Related Content

What's hot

Bachelor-Verteidigung
Bachelor-VerteidigungBachelor-Verteidigung
Bachelor-Verteidigungwruge
 
FastDepth: Fast Monocular Depth Estimation on Embedded Systems
FastDepth: Fast Monocular Depth Estimation on Embedded SystemsFastDepth: Fast Monocular Depth Estimation on Embedded Systems
FastDepth: Fast Monocular Depth Estimation on Embedded Systemsharmonylab
 
정보공학(IE) 방법론.pptx
정보공학(IE) 방법론.pptx정보공학(IE) 방법론.pptx
정보공학(IE) 방법론.pptxSeong-Bok Lee
 
캡스톤 졸작 발표
캡스톤 졸작 발표캡스톤 졸작 발표
캡스톤 졸작 발표Kyuhwan Choi
 
【CVPR 2020 メタサーベイ】Computational Photography
【CVPR 2020 メタサーベイ】Computational Photography【CVPR 2020 メタサーベイ】Computational Photography
【CVPR 2020 メタサーベイ】Computational Photographycvpaper. challenge
 
Verteidigung Masterarbeit "Entwicklung eines E-Learning Programms zur Steiger...
Verteidigung Masterarbeit "Entwicklung eines E-Learning Programms zur Steiger...Verteidigung Masterarbeit "Entwicklung eines E-Learning Programms zur Steiger...
Verteidigung Masterarbeit "Entwicklung eines E-Learning Programms zur Steiger...Daniela Wolf
 
Verteidigung der MasterThesis
Verteidigung der MasterThesisVerteidigung der MasterThesis
Verteidigung der MasterThesischris2newz
 
Dataset for Semantic Urban Scene Understanding
Dataset for Semantic Urban Scene UnderstandingDataset for Semantic Urban Scene Understanding
Dataset for Semantic Urban Scene UnderstandingYosuke Shinya
 
フォトンマッピング入門
フォトンマッピング入門フォトンマッピング入門
フォトンマッピング入門Shuichi Hayashi
 
[PR-325] Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Tran...
[PR-325] Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Tran...[PR-325] Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Tran...
[PR-325] Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Tran...Sunghoon Joo
 
Multi-Modal Summarization
Multi-Modal SummarizationMulti-Modal Summarization
Multi-Modal SummarizationXiachongFeng
 
2조 프로젝트 보고서 김동현
2조 프로젝트 보고서 김동현2조 프로젝트 보고서 김동현
2조 프로젝트 보고서 김동현kdh24
 
Domain Transfer and Adaptation Survey
Domain Transfer and Adaptation SurveyDomain Transfer and Adaptation Survey
Domain Transfer and Adaptation SurveySangwoo Mo
 
Structured Business Process Modeling - Lavacon 2014
Structured Business Process Modeling - Lavacon 2014Structured Business Process Modeling - Lavacon 2014
Structured Business Process Modeling - Lavacon 2014Dr. Jackie Damrau, BPMN
 
졸업프로젝트 어플리케이션 발표자료
졸업프로젝트 어플리케이션 발표자료졸업프로젝트 어플리케이션 발표자료
졸업프로젝트 어플리케이션 발표자료Do Hyun Youn
 
Kolloqium Bachelorarbeit V1
Kolloqium Bachelorarbeit V1Kolloqium Bachelorarbeit V1
Kolloqium Bachelorarbeit V1Nils Meder
 
Präsentation der Bachelorarbeit
Präsentation der BachelorarbeitPräsentation der Bachelorarbeit
Präsentation der Bachelorarbeitalm13
 
Thesis Presentation
Thesis PresentationThesis Presentation
Thesis Presentationnirvdrum
 

What's hot (20)

Bachelor Thesis
Bachelor ThesisBachelor Thesis
Bachelor Thesis
 
Bachelor-Verteidigung
Bachelor-VerteidigungBachelor-Verteidigung
Bachelor-Verteidigung
 
Normal Mapping / Computer Graphics - IK
Normal Mapping / Computer Graphics - IKNormal Mapping / Computer Graphics - IK
Normal Mapping / Computer Graphics - IK
 
FastDepth: Fast Monocular Depth Estimation on Embedded Systems
FastDepth: Fast Monocular Depth Estimation on Embedded SystemsFastDepth: Fast Monocular Depth Estimation on Embedded Systems
FastDepth: Fast Monocular Depth Estimation on Embedded Systems
 
정보공학(IE) 방법론.pptx
정보공학(IE) 방법론.pptx정보공학(IE) 방법론.pptx
정보공학(IE) 방법론.pptx
 
캡스톤 졸작 발표
캡스톤 졸작 발표캡스톤 졸작 발표
캡스톤 졸작 발표
 
【CVPR 2020 メタサーベイ】Computational Photography
【CVPR 2020 メタサーベイ】Computational Photography【CVPR 2020 メタサーベイ】Computational Photography
【CVPR 2020 メタサーベイ】Computational Photography
 
Verteidigung Masterarbeit "Entwicklung eines E-Learning Programms zur Steiger...
Verteidigung Masterarbeit "Entwicklung eines E-Learning Programms zur Steiger...Verteidigung Masterarbeit "Entwicklung eines E-Learning Programms zur Steiger...
Verteidigung Masterarbeit "Entwicklung eines E-Learning Programms zur Steiger...
 
Verteidigung der MasterThesis
Verteidigung der MasterThesisVerteidigung der MasterThesis
Verteidigung der MasterThesis
 
Dataset for Semantic Urban Scene Understanding
Dataset for Semantic Urban Scene UnderstandingDataset for Semantic Urban Scene Understanding
Dataset for Semantic Urban Scene Understanding
 
フォトンマッピング入門
フォトンマッピング入門フォトンマッピング入門
フォトンマッピング入門
 
[PR-325] Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Tran...
[PR-325] Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Tran...[PR-325] Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Tran...
[PR-325] Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Tran...
 
Multi-Modal Summarization
Multi-Modal SummarizationMulti-Modal Summarization
Multi-Modal Summarization
 
2조 프로젝트 보고서 김동현
2조 프로젝트 보고서 김동현2조 프로젝트 보고서 김동현
2조 프로젝트 보고서 김동현
 
Domain Transfer and Adaptation Survey
Domain Transfer and Adaptation SurveyDomain Transfer and Adaptation Survey
Domain Transfer and Adaptation Survey
 
Structured Business Process Modeling - Lavacon 2014
Structured Business Process Modeling - Lavacon 2014Structured Business Process Modeling - Lavacon 2014
Structured Business Process Modeling - Lavacon 2014
 
졸업프로젝트 어플리케이션 발표자료
졸업프로젝트 어플리케이션 발표자료졸업프로젝트 어플리케이션 발표자료
졸업프로젝트 어플리케이션 발표자료
 
Kolloqium Bachelorarbeit V1
Kolloqium Bachelorarbeit V1Kolloqium Bachelorarbeit V1
Kolloqium Bachelorarbeit V1
 
Präsentation der Bachelorarbeit
Präsentation der BachelorarbeitPräsentation der Bachelorarbeit
Präsentation der Bachelorarbeit
 
Thesis Presentation
Thesis PresentationThesis Presentation
Thesis Presentation
 

Similar to PGL SUM Video Summarization

CA-SUM Video Summarization
CA-SUM Video SummarizationCA-SUM Video Summarization
CA-SUM Video SummarizationVasileiosMezaris
 
PoR_evaluation_measure_acm_mm_2020
PoR_evaluation_measure_acm_mm_2020PoR_evaluation_measure_acm_mm_2020
PoR_evaluation_measure_acm_mm_2020VasileiosMezaris
 
ReTV AI4TV Summarization
ReTV AI4TV SummarizationReTV AI4TV Summarization
ReTV AI4TV SummarizationReTV project
 
UVM_Full_Print_n.pptx
UVM_Full_Print_n.pptxUVM_Full_Print_n.pptx
UVM_Full_Print_n.pptxnikitha992646
 
Video Stabilization using Python and open CV
Video Stabilization using Python and open CVVideo Stabilization using Python and open CV
Video Stabilization using Python and open CVIRJET Journal
 
A Connectionist Approach to Dynamic Resource Management for Virtualised Netwo...
A Connectionist Approach to Dynamic Resource Management for Virtualised Netwo...A Connectionist Approach to Dynamic Resource Management for Virtualised Netwo...
A Connectionist Approach to Dynamic Resource Management for Virtualised Netwo...Rashid Mijumbi
 
MediaEval 2016 - UNIFESP Predicting Media Interestingness Task
MediaEval 2016 - UNIFESP Predicting Media Interestingness TaskMediaEval 2016 - UNIFESP Predicting Media Interestingness Task
MediaEval 2016 - UNIFESP Predicting Media Interestingness Taskmultimediaeval
 
IRJET- Storage Optimization of Video Surveillance from CCTV Camera
IRJET- Storage Optimization of Video Surveillance from CCTV CameraIRJET- Storage Optimization of Video Surveillance from CCTV Camera
IRJET- Storage Optimization of Video Surveillance from CCTV CameraIRJET Journal
 
VISUAL ATTENTION BASED KEYFRAMES EXTRACTION AND VIDEO SUMMARIZATION
VISUAL ATTENTION BASED KEYFRAMES EXTRACTION AND VIDEO SUMMARIZATIONVISUAL ATTENTION BASED KEYFRAMES EXTRACTION AND VIDEO SUMMARIZATION
VISUAL ATTENTION BASED KEYFRAMES EXTRACTION AND VIDEO SUMMARIZATIONcscpconf
 
Video smart cropping web application
Video smart cropping web applicationVideo smart cropping web application
Video smart cropping web applicationVasileiosMezaris
 
Supply chain design and operation
Supply chain design and operationSupply chain design and operation
Supply chain design and operationAngelainBay
 
Parking Surveillance Footage Summarization
Parking Surveillance Footage SummarizationParking Surveillance Footage Summarization
Parking Surveillance Footage SummarizationIRJET Journal
 
SPC Implementation - Mark Harrison
SPC Implementation - Mark HarrisonSPC Implementation - Mark Harrison
SPC Implementation - Mark HarrisonMark Harrison
 
Advanced Verification Methodology for Complex System on Chip Verification
Advanced Verification Methodology for Complex System on Chip VerificationAdvanced Verification Methodology for Complex System on Chip Verification
Advanced Verification Methodology for Complex System on Chip VerificationVLSICS Design
 
DIVING PERFORMANCE ASSESSMENT BY MEANS OF VIDEO PROCESSING
DIVING PERFORMANCE ASSESSMENT BY MEANS OF VIDEO PROCESSINGDIVING PERFORMANCE ASSESSMENT BY MEANS OF VIDEO PROCESSING
DIVING PERFORMANCE ASSESSMENT BY MEANS OF VIDEO PROCESSINGcsandit
 
Secure IoT Systems Monitor Framework using Probabilistic Image Encryption
Secure IoT Systems Monitor Framework using Probabilistic Image EncryptionSecure IoT Systems Monitor Framework using Probabilistic Image Encryption
Secure IoT Systems Monitor Framework using Probabilistic Image EncryptionIJAEMSJORNAL
 
Key frame extraction for video summarization using motion activity descriptors
Key frame extraction for video summarization using motion activity descriptorsKey frame extraction for video summarization using motion activity descriptors
Key frame extraction for video summarization using motion activity descriptorseSAT Publishing House
 
Key frame extraction for video summarization using motion activity descriptors
Key frame extraction for video summarization using motion activity descriptorsKey frame extraction for video summarization using motion activity descriptors
Key frame extraction for video summarization using motion activity descriptorseSAT Journals
 

Similar to PGL SUM Video Summarization (20)

CA-SUM Video Summarization
CA-SUM Video SummarizationCA-SUM Video Summarization
CA-SUM Video Summarization
 
Video Thumbnail Selector
Video Thumbnail SelectorVideo Thumbnail Selector
Video Thumbnail Selector
 
PoR_evaluation_measure_acm_mm_2020
PoR_evaluation_measure_acm_mm_2020PoR_evaluation_measure_acm_mm_2020
PoR_evaluation_measure_acm_mm_2020
 
ReTV AI4TV Summarization
ReTV AI4TV SummarizationReTV AI4TV Summarization
ReTV AI4TV Summarization
 
Defense_20140625
Defense_20140625Defense_20140625
Defense_20140625
 
UVM_Full_Print_n.pptx
UVM_Full_Print_n.pptxUVM_Full_Print_n.pptx
UVM_Full_Print_n.pptx
 
Video Stabilization using Python and open CV
Video Stabilization using Python and open CVVideo Stabilization using Python and open CV
Video Stabilization using Python and open CV
 
A Connectionist Approach to Dynamic Resource Management for Virtualised Netwo...
A Connectionist Approach to Dynamic Resource Management for Virtualised Netwo...A Connectionist Approach to Dynamic Resource Management for Virtualised Netwo...
A Connectionist Approach to Dynamic Resource Management for Virtualised Netwo...
 
MediaEval 2016 - UNIFESP Predicting Media Interestingness Task
MediaEval 2016 - UNIFESP Predicting Media Interestingness TaskMediaEval 2016 - UNIFESP Predicting Media Interestingness Task
MediaEval 2016 - UNIFESP Predicting Media Interestingness Task
 
IRJET- Storage Optimization of Video Surveillance from CCTV Camera
IRJET- Storage Optimization of Video Surveillance from CCTV CameraIRJET- Storage Optimization of Video Surveillance from CCTV Camera
IRJET- Storage Optimization of Video Surveillance from CCTV Camera
 
VISUAL ATTENTION BASED KEYFRAMES EXTRACTION AND VIDEO SUMMARIZATION
VISUAL ATTENTION BASED KEYFRAMES EXTRACTION AND VIDEO SUMMARIZATIONVISUAL ATTENTION BASED KEYFRAMES EXTRACTION AND VIDEO SUMMARIZATION
VISUAL ATTENTION BASED KEYFRAMES EXTRACTION AND VIDEO SUMMARIZATION
 
Video smart cropping web application
Video smart cropping web applicationVideo smart cropping web application
Video smart cropping web application
 
Supply chain design and operation
Supply chain design and operationSupply chain design and operation
Supply chain design and operation
 
Parking Surveillance Footage Summarization
Parking Surveillance Footage SummarizationParking Surveillance Footage Summarization
Parking Surveillance Footage Summarization
 
SPC Implementation - Mark Harrison
SPC Implementation - Mark HarrisonSPC Implementation - Mark Harrison
SPC Implementation - Mark Harrison
 
Advanced Verification Methodology for Complex System on Chip Verification
Advanced Verification Methodology for Complex System on Chip VerificationAdvanced Verification Methodology for Complex System on Chip Verification
Advanced Verification Methodology for Complex System on Chip Verification
 
DIVING PERFORMANCE ASSESSMENT BY MEANS OF VIDEO PROCESSING
DIVING PERFORMANCE ASSESSMENT BY MEANS OF VIDEO PROCESSINGDIVING PERFORMANCE ASSESSMENT BY MEANS OF VIDEO PROCESSING
DIVING PERFORMANCE ASSESSMENT BY MEANS OF VIDEO PROCESSING
 
Secure IoT Systems Monitor Framework using Probabilistic Image Encryption
Secure IoT Systems Monitor Framework using Probabilistic Image EncryptionSecure IoT Systems Monitor Framework using Probabilistic Image Encryption
Secure IoT Systems Monitor Framework using Probabilistic Image Encryption
 
Key frame extraction for video summarization using motion activity descriptors
Key frame extraction for video summarization using motion activity descriptorsKey frame extraction for video summarization using motion activity descriptors
Key frame extraction for video summarization using motion activity descriptors
 
Key frame extraction for video summarization using motion activity descriptors
Key frame extraction for video summarization using motion activity descriptorsKey frame extraction for video summarization using motion activity descriptors
Key frame extraction for video summarization using motion activity descriptors
 

More from VasileiosMezaris

Multi-Modal Fusion for Image Manipulation Detection and Localization
Multi-Modal Fusion for Image Manipulation Detection and LocalizationMulti-Modal Fusion for Image Manipulation Detection and Localization
Multi-Modal Fusion for Image Manipulation Detection and LocalizationVasileiosMezaris
 
CERTH-ITI at MediaEval 2023 NewsImages Task
CERTH-ITI at MediaEval 2023 NewsImages TaskCERTH-ITI at MediaEval 2023 NewsImages Task
CERTH-ITI at MediaEval 2023 NewsImages TaskVasileiosMezaris
 
Spatio-Temporal Summarization of 360-degrees Videos
Spatio-Temporal Summarization of 360-degrees VideosSpatio-Temporal Summarization of 360-degrees Videos
Spatio-Temporal Summarization of 360-degrees VideosVasileiosMezaris
 
Masked Feature Modelling for the unsupervised pre-training of a Graph Attenti...
Masked Feature Modelling for the unsupervised pre-training of a Graph Attenti...Masked Feature Modelling for the unsupervised pre-training of a Graph Attenti...
Masked Feature Modelling for the unsupervised pre-training of a Graph Attenti...VasileiosMezaris
 
Cross-modal Networks and Dual Softmax Operation for MediaEval NewsImages 2022
Cross-modal Networks and Dual Softmax Operation for MediaEval NewsImages 2022Cross-modal Networks and Dual Softmax Operation for MediaEval NewsImages 2022
Cross-modal Networks and Dual Softmax Operation for MediaEval NewsImages 2022VasileiosMezaris
 
TAME: Trainable Attention Mechanism for Explanations
TAME: Trainable Attention Mechanism for ExplanationsTAME: Trainable Attention Mechanism for Explanations
TAME: Trainable Attention Mechanism for ExplanationsVasileiosMezaris
 
Combining textual and visual features for Ad-hoc Video Search
Combining textual and visual features for Ad-hoc Video SearchCombining textual and visual features for Ad-hoc Video Search
Combining textual and visual features for Ad-hoc Video SearchVasileiosMezaris
 
Explaining the decisions of image/video classifiers
Explaining the decisions of image/video classifiersExplaining the decisions of image/video classifiers
Explaining the decisions of image/video classifiersVasileiosMezaris
 
Learning visual explanations for DCNN-based image classifiers using an attent...
Learning visual explanations for DCNN-based image classifiers using an attent...Learning visual explanations for DCNN-based image classifiers using an attent...
Learning visual explanations for DCNN-based image classifiers using an attent...VasileiosMezaris
 
Are all combinations equal? Combining textual and visual features with multi...
Are all combinations equal?  Combining textual and visual features with multi...Are all combinations equal?  Combining textual and visual features with multi...
Are all combinations equal? Combining textual and visual features with multi...VasileiosMezaris
 
Hard-Negatives Selection Strategy for Cross-Modal Retrieval
Hard-Negatives Selection Strategy for Cross-Modal RetrievalHard-Negatives Selection Strategy for Cross-Modal Retrieval
Hard-Negatives Selection Strategy for Cross-Modal RetrievalVasileiosMezaris
 
Misinformation on the internet: Video and AI
Misinformation on the internet: Video and AIMisinformation on the internet: Video and AI
Misinformation on the internet: Video and AIVasileiosMezaris
 
GAN-based video summarization
GAN-based video summarizationGAN-based video summarization
GAN-based video summarizationVasileiosMezaris
 
Migration-related video retrieval
Migration-related video retrievalMigration-related video retrieval
Migration-related video retrievalVasileiosMezaris
 
Fractional step discriminant pruning
Fractional step discriminant pruningFractional step discriminant pruning
Fractional step discriminant pruningVasileiosMezaris
 
Video, AI and News: video analysis and verification technologies for supporti...
Video, AI and News: video analysis and verification technologies for supporti...Video, AI and News: video analysis and verification technologies for supporti...
Video, AI and News: video analysis and verification technologies for supporti...VasileiosMezaris
 
Subclass deep neural networks
Subclass deep neural networksSubclass deep neural networks
Subclass deep neural networksVasileiosMezaris
 
Video & AI: capabilities and limitations of AI in detecting video manipulations
Video & AI: capabilities and limitations of AI in detecting video manipulationsVideo & AI: capabilities and limitations of AI in detecting video manipulations
Video & AI: capabilities and limitations of AI in detecting video manipulationsVasileiosMezaris
 

More from VasileiosMezaris (20)

Multi-Modal Fusion for Image Manipulation Detection and Localization
Multi-Modal Fusion for Image Manipulation Detection and LocalizationMulti-Modal Fusion for Image Manipulation Detection and Localization
Multi-Modal Fusion for Image Manipulation Detection and Localization
 
CERTH-ITI at MediaEval 2023 NewsImages Task
CERTH-ITI at MediaEval 2023 NewsImages TaskCERTH-ITI at MediaEval 2023 NewsImages Task
CERTH-ITI at MediaEval 2023 NewsImages Task
 
Spatio-Temporal Summarization of 360-degrees Videos
Spatio-Temporal Summarization of 360-degrees VideosSpatio-Temporal Summarization of 360-degrees Videos
Spatio-Temporal Summarization of 360-degrees Videos
 
Masked Feature Modelling for the unsupervised pre-training of a Graph Attenti...
Masked Feature Modelling for the unsupervised pre-training of a Graph Attenti...Masked Feature Modelling for the unsupervised pre-training of a Graph Attenti...
Masked Feature Modelling for the unsupervised pre-training of a Graph Attenti...
 
Cross-modal Networks and Dual Softmax Operation for MediaEval NewsImages 2022
Cross-modal Networks and Dual Softmax Operation for MediaEval NewsImages 2022Cross-modal Networks and Dual Softmax Operation for MediaEval NewsImages 2022
Cross-modal Networks and Dual Softmax Operation for MediaEval NewsImages 2022
 
TAME: Trainable Attention Mechanism for Explanations
TAME: Trainable Attention Mechanism for ExplanationsTAME: Trainable Attention Mechanism for Explanations
TAME: Trainable Attention Mechanism for Explanations
 
Gated-ViGAT
Gated-ViGATGated-ViGAT
Gated-ViGAT
 
Combining textual and visual features for Ad-hoc Video Search
Combining textual and visual features for Ad-hoc Video SearchCombining textual and visual features for Ad-hoc Video Search
Combining textual and visual features for Ad-hoc Video Search
 
Explaining the decisions of image/video classifiers
Explaining the decisions of image/video classifiersExplaining the decisions of image/video classifiers
Explaining the decisions of image/video classifiers
 
Learning visual explanations for DCNN-based image classifiers using an attent...
Learning visual explanations for DCNN-based image classifiers using an attent...Learning visual explanations for DCNN-based image classifiers using an attent...
Learning visual explanations for DCNN-based image classifiers using an attent...
 
Are all combinations equal? Combining textual and visual features with multi...
Are all combinations equal?  Combining textual and visual features with multi...Are all combinations equal?  Combining textual and visual features with multi...
Are all combinations equal? Combining textual and visual features with multi...
 
Hard-Negatives Selection Strategy for Cross-Modal Retrieval
Hard-Negatives Selection Strategy for Cross-Modal RetrievalHard-Negatives Selection Strategy for Cross-Modal Retrieval
Hard-Negatives Selection Strategy for Cross-Modal Retrieval
 
Misinformation on the internet: Video and AI
Misinformation on the internet: Video and AIMisinformation on the internet: Video and AI
Misinformation on the internet: Video and AI
 
LSTM Structured Pruning
LSTM Structured PruningLSTM Structured Pruning
LSTM Structured Pruning
 
GAN-based video summarization
GAN-based video summarizationGAN-based video summarization
GAN-based video summarization
 
Migration-related video retrieval
Migration-related video retrievalMigration-related video retrieval
Migration-related video retrieval
 
Fractional step discriminant pruning
Fractional step discriminant pruningFractional step discriminant pruning
Fractional step discriminant pruning
 
Video, AI and News: video analysis and verification technologies for supporti...
Video, AI and News: video analysis and verification technologies for supporti...Video, AI and News: video analysis and verification technologies for supporti...
Video, AI and News: video analysis and verification technologies for supporti...
 
Subclass deep neural networks
Subclass deep neural networksSubclass deep neural networks
Subclass deep neural networks
 
Video & AI: capabilities and limitations of AI in detecting video manipulations
Video & AI: capabilities and limitations of AI in detecting video manipulationsVideo & AI: capabilities and limitations of AI in detecting video manipulations
Video & AI: capabilities and limitations of AI in detecting video manipulations
 

Recently uploaded

Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...jana861314
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Patrick Diehl
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxpradhanghanshyam7136
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfSumit Kumar yadav
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real timeSatoshi NAKAHIRA
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡anilsa9823
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 

Recently uploaded (20)

Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptx
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real time
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptx
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 

PGL SUM Video Summarization

  • 1. Combining Global and Local Attention with Positional Encoding for Video Summarization E. Apostolidis1,2, G. Balaouras1, V. Mezaris1, I. Patras2 1 Information Technologies Institute, CERTH, Thermi - Thessaloniki, Greece 2 School of EECS, Queen Mary University of London, London, UK 23rd IEEE International Symposium on Multimedia (ISM 2021)
  • 2. Outline • Problem statement • Related work • Developed approach • Experiments • Conclusions 1
  • 3. Problem statement 2 Video is everywhere! • Captured by smart devices and instantly shared online • Constantly and rapidly increasing volumes of video content on the Web Hours of video content uploaded on YouTube every minute Image sources: https://www.financialexpress.com/india-news/govt-agencies-adopt-new-age-video- sharing-apps-like-tiktok/1767354/ (left) & https://www.statista.com/ (right)
  • 4. Problem statement 3 But how to find what we are looking for in endless collections of video content? Quickly inspect a video’s content by checking its synopsis! Image source: https://www.voicendata.com/sprint-removes-video-streaming-limits/
  • 5. Goal of video summarization technologies 4 “Generate a short visual synopsis that summarizes the video content by selecting the most informative and important parts of it” Video title: “Susan Boyle's First Audition - I Dreamed a Dream - Britain's Got Talent 2009” Video source: OVP dataset (video also available at: https://www.youtube.com/watch?v=deRF9oEbRso) Static video summary (made of selected key-frames) Dynamic video summary (made of selected key-fragments) Video content Key-fragment Key-frame Analysis outcomes This synopsis can be: • Static, composed of a set representative video (key-)frames (a.k.a. video storyboard) • Dynamic, formed of a set of representative video (key-)fragments (a.k.a. video skim)
  • 6. Related work (supervised approaches) • Methods modeling the variable-range temporal dependence among frames, using: • Structures of Recurrent Neural Networks (RNNs) • Combinations of LSTMs with external storage or memory layers • Attention mechanisms combined with classic or seq2seq RNN-based architectures • Methods modeling the spatiotemporal structure of the video, using: • 3D-Convolutional Neural Networks • Combinations of CNNs with convolutional LSTMs, GRUs and optical flow maps • Methods learning summarization using Generative Adversarial Networks • Summarizer aims to fool Discriminator when distinguishing machine- from human-generated summary • Methods modeling frames’ dependence by combining self-attention mechanisms, with: • Trainable Regressor Networks that estimate frames’ importance • Fragmentations of the video content in hierarchical key-frame selection strategies • Multiple representations of the visual content (CNN-based & Inflated 3D ConvNet) 5
  • 7. Developed approach Starting point • VASNet model (Fajtl et al., 2018) • Seq2seq transformation using a soft self-attention mechanism Main weaknesses • Lack of knowledge about the temporal position of video frames • Growing difficulty to estimate frames’ importance as video duration increases 6 J. Fajtl, H. S. Sokeh, V. Argyriou, D. Monekosso, P. Remagnino, “Summarizing Videos with Attention,” in Asian Conference on Computer Vision 2018 Workshops. Cham: Springer Int. Publishing, 2018, pp. 39-54. Image source: Fajtl et al., “Summarizing Videos with Attention”, ACCV 2018 Workshops
  • 8. Developed approach 7 New network architecture (PGL-SUM) • Uses a multi-head attention mechanism to model frames’ dependence according to the entire frame sequence • Uses multiple multi-head attention mechanisms to model short-term dependencies over smaller video parts • Enhances these mechanisms by adding a component that encodes the temporal position of video frames
  • 9. Developed approach 7 Global attention • Uses a multi-head attention mechanism to model frames’ dependence according to the entire frame sequence
  • 12. Developed approach 10 Local attention • Uses multiple multi-head attention mechanisms to model short-term dependencies over smaller video parts
  • 13. Developed approach 11 Feature fusion & importance estimation • New representation that carries information about each frame’s global and local dependencies • Residual skip connection aims to facilitate backpropagation • Regressor produces a set of frame-level scores that indicate frames’ importance
  • 14. Developed approach 12 Training time • Compute MSE between machine and human frame-level importance scores Inference time • Compute fragment-level importance based on a video segmentation • Create the summary by selecting the key-fragments based on a time budget and by solving the Knapsack problem
  • 15. Experiments 13 Datasets • SumMe (https://gyglim.github.io/me/vsum/index.html#benchmark) • 25 videos capturing multiple events (e.g. cooking and sports) • Video length: 1 to 6 min • Annotation: fragment-based video summaries (15-18 per video) • TVSum (https://github.com/yalesong/tvsum) • 50 videos from 10 categories of TRECVid MED task • Video length: 1 to 11 min • Annotation: frame-level importance scores (20 per video)
  • 16. Experiments 14 Evaluation approach • The generated summary should not exceed 15% of the video length • Agreement between automatically-generated (A) and user-defined (U) summary is expressed by the F-Score (%), with (P)recision and (R)ecall measuring the temporal overlap (∩) (|| || means duration) • 80% of video samples are used for training and the remaining 20% for testing • Model selection is based on a criterion that focuses on rapid changes in the curve of the training loss values or (if such changes do not occur) the minimization of the loss value • Summarization performance is formed by averaging the experimental results on five randomly-created splits of data
  • 17. Experiments 15 Implementation details • Videos were down-sampled to 2 fps • Feature extraction: pool5 layer of GoogleNet trained on ImageNet (D = 1024) • # local attention mechanisms = # video segments M: 4 • # global and local attention heads: 8 and 4 respectively • Learning rate and L2 regularization factor: 5 x 10-5 and 10-5 respectively • Dropout rate: 0.5 • Network initialization approach: Xavier uniform (gain = √2; biases = 0.1) • Training: full batch mode; Adam optimizer; 200 epochs • Hardware: NVIDIA TITAN XP
  • 18. Experiments 16 Sensitivity analysis • # video segments (# local attention mech.) & global-local data fusion approach SumMe TVSum # Segm. Fusion 2 4 8 2 4 8 Addition 49.8 55.6 51.9 61.1 61.7 59.5 Average pooling 51.5 54.1 52.5 61.1 59.7 58.7 Max pooling 51.1 53.9 52.2 61.3 59.6 58.5 Multiplication 46.3 52.1 52.5 47.3 47.6 46.9 SumMe TVSum Local Global 2 4 8 2 4 8 2 52.4 52.4 52.8 61.4 61.6 61.4 4 54.9 58.5 49.2 60.9 61.3 60.4 8 55.8 58.8 57.1 60.3 61.1 60.9 16 56.7 57.7 54.9 61.3 60.9 60.1 • # local and global attention heads Reported values represent F-Score (%)
  • 19. Experiments 17 Performance comparisons with two attention-based methods • Evaluation was made under the exact same experimental conditions SumMe TVSum Average Rank Data splits F1 Rank F1 Rank VASNet (Fajtl et al., 2018) 50.0 3 62.5 2 2.5 5 Rand MSVA (Ghauri et al., 2021) 54.0 2 62.4 3 2.5 5 Rand PGL-SUM (Ours) 57.1 1 62.7 1 1 5 Rand J. Fajtl et al., “Summarizing Videos with Attention,” in Asian Conf. on Comp. Vision 2018 Workshops. Cham: Springer Int. Publishing, 2018, pp. 39–54. J. Ghauri et al., “Supervised Video Summarization Via Multiple Feature Sets with Parallel Attention,” in 2021 IEEE Int. Conf. on Multimedia and Expo. CA, USA: IEEE, 2021, pp. 1–6. F1 denotes F-Score (%)
  • 20. Experiments Performance comparisons with SoA supervised methods SumMe TVSum Avg Rank Data splits F1 Rank F1 Rank Random 40.2 19 54.4 16 17.5 - vsLSTM 37.6 22 54.2 17 19.5 1 Rand dppLSTM 38.6 21 54.7 15 18 1 Rand ActionRanking 40.1 20 56.3 14 17 1 Rand *vsLSTM+Att 43.2 16 - - 16 1 Rand *dppLSTM+Att 43.8 15 - - 15 1 Rand H-RNN 42.1 17 57.9 12 14.5 - *A-AVS 43.9 14 59.4 8 11 5 Rand *SF-CVS 46.0 9 58.0 11 10 - SUM-FCN 47.5 7 56.8 13 10 M Rand HSA-RNN 44.1 13 59.8 7 10 - SumMe TVSum Avg Rank Data splits F1 Rank F1 Rank CRSum 47.3 8 58.0 11 9.5 5 FCV MAVS 40.3 18 66.8 1 9.5 5 FCV TTH-RNN 44.3 12 60.2 6 9 - *M-AVS 44.4 11 61.0 4 7.5 5 Rand SUM-DeepLab 48.8 5 58.4 10 7.5 M Rand *DASP 45.5 10 63.6 3 6.5 5 Rand *SUM-GDA 52.8 3 58.9 9 6 5 FCV SMLD 47.6 6 61.0 4 5 5 FCV *H-MAN 51.8 4 60.4 5 4.5 5 FCV SMN 58.3 1 64.5 2 1.5 1 Rand PGL-SUM 55.6 2 61.0 4 3 5 Rand 18
  • 21. Experiments 19 Ablation study Core components SumMe TVSum Global attention Local attention Positional encoding Variant #1 X √ √ 46.9 52.4 Variant #2 √ X √ 46.7 59.9 Variant #3 √ √ X 53.1 61.0 PGL-SUM (Proposed) √ √ √ 55.6 61.0 Reported values represent F-Score (%)
  • 22. Experiments 19 Ablation study Core components SumMe TVSum Global attention Local attention Positional encoding Variant #1 X √ √ 46.9 52.4 Variant #2 √ X √ 46.7 59.9 Variant #3 √ √ X 53.1 61.0 PGL-SUM (Proposed) √ √ √ 55.6 61.0 Combing global and local attention allows to learn a better modeling of frames’ dependencies Reported values represent F-Score (%)
  • 23. Experiments 19 Ablation study Core components SumMe TVSum Global attention Local attention Positional encoding Variant #1 X √ √ 46.9 52.4 Variant #2 √ X √ 46.7 59.9 Variant #3 √ √ X 53.1 61.0 PGL-SUM (Proposed) √ √ √ 55.6 61.0 Integrating knowledge about the frames’ position positively affects the summarization performance Reported values represent F-Score (%)
  • 24. Conclusions 20 • PGL-SUM model for supervised video summarization aims to improve • modeling of long-range frames’ dependencies • parallelization ability of the training process • granularity level at which the temporal dependencies between frames are modeled • PGL-SUM contains multiple attention mechanisms that encode frames’ position • A global multi-head attention mechanism models frames’ dependence based on the entire video • Multiple local multi-head attention mechanisms discover modelings of frames’ dependencies by focusing on smaller parts of the video • Experiments on two benchmark datasets (SumMe and TVSum) • Showed PGL-SUM’s excellence against other methods that rely on self-attention mechanisms • Demonstrated PGL-SUM’s competitiveness against other SoA supervised summarization approaches • Documented the positive contribution of combining global and local multi-head attention with absolute positional encoding
  • 25. Thank you for your attention! Questions? Evlampios Apostolidis, apostolid@iti.gr Code and documentation publicly available at: https://github.com/e-apostolidis/PGL-SUM This work was supported by the EUs Horizon 2020 research and innovation programme under grant agreement H2020-832921 MIRROR, and by EPSRC under grant No. EP/R026424/1