ObjectGraphs: Using Objects and a Graph Convolutional Network
for the Bottom-up Recognition and Explanation of Events in Video
N. Gkalelis, A. Goulas, D. Galanopoulos, V. Mezaris
CERTH-ITI, Thermi - Thessaloniki, Greece
IEEE/CVF Conference on Computer Vision
and Pattern Recognition Workshops,
2nd Int. Workshop on Large Scale Holistic
Video Understanding, June 2021
Introduction
• The recognition of high-level events in unconstrained video is a major research topic in multimedia understanding
• Most approaches are top-down: they use the event label to implicitly focus on the frame regions most related to the event
• Bottom-up approaches exploit discriminant information of semantic objects; they have shown promising performance, e.g., in visual question answering
[Figure: keyframe of the event “Landing a fish” (TRECVID Multimedia Event Detection dataset), with object labels: Fish, Fishing pole, Hand]
ObjectGraphs
• Assume an annotated training set of N videos and C classes
• Keyframe sampling: each video is represented with Q frames
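• OD+CNN: an object detector (OD) and a CNN derive the K objects depicted in the frame with the highest degree of confidence (DoC)
• Object: represented by its object class label, DoC, bounding box (BB), and feature vector $\mathbf{x}_k \in \mathbb{R}^F$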
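The per-frame object representation above can be sketched as a small data structure. This is only an illustration: the class name `DetectedObject`, the helper `top_k_objects`, and all example values are hypothetical and are not taken from the authors' code.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class DetectedObject:
    """One detected object (hypothetical container, not the authors' API)."""
    label: str           # object class label
    doc: float           # degree of confidence (DoC) reported by the detector
    box: tuple           # bounding box (BB) as (x1, y1, x2, y2)
    feature: np.ndarray  # feature vector x_k in R^F


def top_k_objects(detections, k):
    """Keep the k detections with the highest DoC, highest first."""
    return sorted(detections, key=lambda d: d.doc, reverse=True)[:k]


# Made-up detections for a single keyframe (F = 8 here).
rng = np.random.default_rng(0)
detections = [
    DetectedObject("hand", 0.55, (10, 20, 60, 90), rng.standard_normal(8)),
    DetectedObject("fish", 0.91, (40, 60, 200, 180), rng.standard_normal(8)),
    DetectedObject("tree", 0.12, (0, 0, 30, 120), rng.standard_normal(8)),
    DetectedObject("fishing pole", 0.78, (5, 5, 220, 40), rng.standard_normal(8)),
]

frame_objects = top_k_objects(detections, k=3)
# Stack the K feature vectors into X^[0] with shape (K, F).
X0 = np.stack([d.feature for d in frame_objects])
```

Stacking the features row-wise gives the K x F matrix that the graph construction on the following slides operates on.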
ObjectGraphs
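• Construct $\mathbf{S} \in \mathbb{R}^{K \times K}$: the element in the l-th row and k-th column is computed as (Wang & Gupta, ECCV 2018):
  $[\mathbf{S}]_{l,k} = \mathbf{v}_l^T \mathbf{v}_k, \quad \mathbf{v}_l = \mathbf{W}\mathbf{x}_l + \mathbf{b}, \quad \mathbf{v}_k = \mathbf{W}\mathbf{x}_k + \mathbf{b}$
• $\mathbf{W} \in \mathbb{R}^{F \times F}$, $\mathbf{b} \in \mathbb{R}^F$: learnable parameters
• Obtain the adjacency matrix $\mathbf{A} \in \mathbb{R}^{K \times K}$ from $\mathbf{S}$ so that (Yang et al., CVPR 2020):
a) $[\mathbf{A}]_{l,k} \in [0,1]$
b) $\sum_{k} [\mathbf{A}]_{l,k} = 1$ (all edge values from the l-th object are normalized to sum to one)
  $[\mathbf{A}]_{l,k} = \dfrac{[\mathbf{S}]_{l,k}^2}{\sum_{k'=1}^{K} [\mathbf{S}]_{l,k'}^2}$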
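As a minimal NumPy sketch of the two formulas on this slide, assuming the object features are stacked row-wise in a K x F matrix: the function name `adjacency_from_features` and the random inputs are illustrative only, and in the actual network W and b are learned rather than sampled.

```python
import numpy as np


def adjacency_from_features(X, W, b):
    """Build S and the row-normalized adjacency matrix A from object features.

    X: (K, F) matrix whose rows are the object feature vectors x_k.
    W: (F, F) weight matrix, b: (F,) bias (learnable in the real model).
    """
    V = X @ W.T + b   # row k holds v_k = W x_k + b
    S = V @ V.T       # S[l, k] = v_l^T v_k
    S2 = S ** 2       # squaring makes every entry non-negative
    # Normalize each row so that the edges leaving object l sum to one.
    # Assumes no row of S is entirely zero.
    A = S2 / S2.sum(axis=1, keepdims=True)
    return S, A


rng = np.random.default_rng(0)
K, F = 4, 5
X = rng.standard_normal((K, F))
W = rng.standard_normal((F, F))
b = rng.standard_normal(F)

S, A = adjacency_from_features(X, W, b)
```

Because the normalization runs over each row, A is generally not symmetric even though S is, which is what makes the in-degree of a vertex informative later on.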
ObjectGraphs
• M-layer GCN exploits the frame-level object information
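  $\mathbf{X}^{[m]} = \mathrm{ReLU}\big(\mathrm{LN}\big(\mathbf{A}\mathbf{X}^{[m-1]}\mathbf{W}^{[m]}\big)\big), \quad m = 1, \dots, M, \quad \mathbf{X}^{[0]} = [\mathbf{x}_1, \dots, \mathbf{x}_K]^T$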
• AVGPOOL layer derives local feature vector z’ at frame-level
• CNN applied to the entire frame derives a global feature vector z’’
• CONCAT layer: derives z as frame-level feature vector representation
• LSTM: processes the sequence of frame-level feature vectors: $\mathbf{h}_j = \mathrm{LSTM}(\mathbf{z}_j, \mathbf{h}_{j-1}), \quad j = 1, \dots, Q$
• The hidden state vector $\mathbf{h}_Q$ at the last time step is used as the video-level representation
• Stack of FC layers provides a score for each event
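The frame-level part of this pipeline (GCN layers, AVGPOOL, CONCAT) can be sketched in NumPy as follows. Several things here are assumptions: LN is read as layer normalization without learnable scale and shift, the function names are made up for illustration, and the weights are random instead of trained. The LSTM and FC layers are left out to keep the sketch short.

```python
import numpy as np


def layer_norm(H, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learnable gain/bias)."""
    mean = H.mean(axis=-1, keepdims=True)
    var = H.var(axis=-1, keepdims=True)
    return (H - mean) / np.sqrt(var + eps)


def gcn_forward(A, X0, weights):
    """Apply X^[m] = ReLU(LN(A X^[m-1] W^[m])) for every W^[m] in `weights`."""
    X = X0
    for W in weights:
        X = np.maximum(layer_norm(A @ X @ W), 0.0)
    return X


def frame_feature(A, X0, weights, z_global):
    """Return z = CONCAT(z', z'') for one frame."""
    X_M = gcn_forward(A, X0, weights)
    z_local = X_M.mean(axis=0)  # AVGPOOL over the K objects gives z'
    return np.concatenate([z_local, z_global])


rng = np.random.default_rng(0)
K, F, G = 4, 6, 5                       # G: size of the global feature (illustrative)
X0 = rng.standard_normal((K, F))
R = rng.random((K, K))
A = R / R.sum(axis=1, keepdims=True)    # any row-stochastic adjacency matrix
weights = [0.1 * rng.standard_normal((F, F)) for _ in range(2)]  # M = 2 layers
z_global = rng.standard_normal(G)       # stands in for the CNN output z''

z = frame_feature(A, X0, weights, z_global)
```

Repeating this for each of the Q keyframes yields the sequence z_1, ..., z_Q that the LSTM consumes.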
Explanation of event recognition results
• Network parameters are learned with a cross-entropy (CE) loss, using the event labels as target labels
• The parameters of the GCN’s adjacency matrix implicitly learn to amplify the contribution of the objects most relevant to the event!
→ How can the adjacency matrix be used to derive the objects that contributed most to the network’s decision?
Explanation of event recognition results
• Use the weighted in-degree (WiD) of a vertex (used in other domains, e.g., to assess the popularity of a person in social media)
• The WiD of vertex k (corresponding to object k) in adjacency matrix j (corresponding to frame j) can be computed using
  $\gamma_k^j = \sum_{l=1}^{K} [\mathbf{A}^j]_{l,k}, \quad k = 1, \dots, K$
• OD may detect several instances of the same object class in a frame/video
• Average WiD: computed for each object class p at frame- and video-level
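A short sketch of the WiD computation, with made-up numbers: the WiD of object k is the sum of column k of the frame's adjacency matrix. The per-class averaging below simply takes the mean over all instances passed in; treating the video level as pooling the instances of all frames is an assumption, since the slide does not spell out the exact procedure. The function names are hypothetical.

```python
from collections import defaultdict

import numpy as np


def weighted_in_degree(A):
    """gamma_k = sum over l of A[l, k], i.e. the column sums of A."""
    return A.sum(axis=0)


def average_wid_per_class(wids, labels):
    """Mean WiD over all instances that share an object class label."""
    sums, counts = defaultdict(float), defaultdict(int)
    for wid, label in zip(wids, labels):
        sums[label] += float(wid)
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


# Toy adjacency matrix for one frame with K = 3 objects (each row sums to one).
A = np.array([
    [0.5, 0.3, 0.2],
    [0.6, 0.2, 0.2],
    [0.7, 0.1, 0.2],
])
labels = ["fish", "hand", "hand"]  # two instances of the same class

wids = weighted_in_degree(A)
frame_level = average_wid_per_class(wids, labels)
```

In this toy frame every object sends most of its edge weight to the first object, so "fish" receives by far the largest WiD. Since each row of A sums to one, the WiDs of a frame always add up to K.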
Experiments
• YLI-MED: TRECVID-style video dataset, 10 event classes, 1000 training, 823
testing videos
• FCVID: multilabel YouTube video dataset, 239 classes (mostly real-world events),
45611 training, 45612 testing videos
• ObjectGraphs is compared against top-scoring methods in literature
Experimental results
YLI-MED, ACC (%):
  C3D+LSVM                          65.61
  3D-CNN                            72.66
  TSN                               74.12
  ActionVLAD                        76.67
  S2L                               79.46
  ObjectGraphs                      83.60
FCVID, mAP (%):
  ST-VLAD                           77.5
  PivotCorrNN                       77.6
  LiteEval                          80
  AdaFrame                          80.2
  SCSampler                         81
  AR-Net (ResNet backbone)          81.3
  AR-Net (EfficientNet backbone)    84.4
  ObjectGraphs (ResNet backbone)    84.6
• Evaluation results on YLI-MED (ACC) and FCVID (mAP)
• Improves on the state of the art by 0.2 percentage points (FCVID) and 4.14 percentage points (YLI-MED)
• Comparison with the equivalent AR-Net variant (ResNet backbone): +3.3 percentage points
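The reported gains follow directly from the scores listed on this slide. A small check, using only those numbers (the helper name `gain_over_best_other` is illustrative):

```python
# Scores copied from the slide: mAP (%) on FCVID and accuracy (%) on YLI-MED.
fcvid_map = {
    "ST-VLAD": 77.5,
    "PivotCorrNN": 77.6,
    "LiteEval": 80.0,
    "AdaFrame": 80.2,
    "SCSampler": 81.0,
    "AR-Net (ResNet backbone)": 81.3,
    "AR-Net (EfficientNet backbone)": 84.4,
    "ObjectGraphs (ResNet backbone)": 84.6,
}
yli_med_acc = {
    "C3D+LSVM": 65.61,
    "3D-CNN": 72.66,
    "TSN": 74.12,
    "ActionVLAD": 76.67,
    "S2L": 79.46,
    "ObjectGraphs": 83.60,
}


def gain_over_best_other(scores, ours):
    """Difference in percentage points between `ours` and the best other method."""
    best_other = max(v for name, v in scores.items() if name != ours)
    return round(scores[ours] - best_other, 2)


fcvid_gain = gain_over_best_other(fcvid_map, "ObjectGraphs (ResNet backbone)")
yli_gain = gain_over_best_other(yli_med_acc, "ObjectGraphs")
resnet_gain = round(
    fcvid_map["ObjectGraphs (ResNet backbone)"] - fcvid_map["AR-Net (ResNet backbone)"], 2
)
```

The three values come out as 0.2, 4.14, and 3.3 percentage points, matching the bullets above.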
Explanation results
• Correctly recognized “Wedding ceremony” (BBs of most/least significant objects based on WiDs)
• High DoCs (right bar plot): general overview of the scene, but unrelated to the recognized event!
• High WiDs (middle bar plot): frame regions where the network focuses to recognize the event
Explanation results
• Video of “Working on a woodworking project”, misrecognized as “Person attempting a board trick”
• Objects with the highest (video-level) WiDs: “Skate park” and “Skatepark”; the respective regions have the strongest influence on the network’s decision
• Note: the roof of the wood construction closely resembles a skate park (and is detected as such by the OD)!
Thank you for your attention!
Questions?
Nikolaos Gkalelis, gkalelis@iti.gr
Vasileios Mezaris, bmezaris@iti.gr
Code publicly available at:
https://github.com/bmezaris/ObjectGraphs
This work was supported by the EU’s Horizon 2020 research and innovation programme under grant agreements 832921 MIRROR and 951911 AI4Media
