10.
NIST TRECVID Benchmark
An international video search competition that promotes progress in video retrieval research through big data, standardized tasks, independent evaluation, and open innovation.
http://trecvid.nist.gov/
12.
From University-lab to spin-off and your mobile phone
[Plot legend: • = 1000+ others, * = UvA / Euvision / Qualcomm]
Universities win; start-ups win.
Snoek et al., TRECVID 2004-2015
13.
Latest jump due to deep learning
[Chart: mean average precision over time (2006, 2009, 2015) — progress in video recognition]
14.
The more features the better
Typical shallow learning architecture:
− Local feature extraction, e.g. SIFT over a dense sampling grid
− Feature encoding: BoW, sparse coding, Fisher vectors, VLAD
− Feature pooling: average/sum pooling, max pooling
− Classification: linear / non-linear SVM
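The encoding and pooling steps above can be sketched in a few lines. This is a toy bag-of-words encoder: in practice the codebook comes from k-means on training descriptors and the histogram feeds an SVM, but here both descriptors and codebook are random stand-ins.

```python
import numpy as np

def bow_encode(descriptors, codebook):
    """Assign each local descriptor to its nearest codeword and
    sum-pool the assignments into a normalized histogram (BoW)."""
    # Squared distances: (n_descriptors, n_codewords)
    d = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assignments = d.argmin(axis=1)
    hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)  # average pooling over local features

# Toy example: 100 random "SIFT-like" 128-D descriptors, 8-word codebook
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(100, 128))
codebook = rng.normal(size=(8, 128))
h = bow_encode(descriptors, codebook)
```

The resulting fixed-length vector `h` is what a linear or non-linear SVM would consume.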
15.
The deeper the better
Typical deep learning architecture: stacked convolution, non-linearity, and pooling layers with 11×11, 5×5, and 3×3 filters on a 224×224×3 input, followed by two fully connected layers of 4,096 units each (with dropout) and a loss layer.
Krizhevsky et al., NIPS 2012
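The three building blocks named above — convolution, non-linearity, pooling — can be sketched in isolation. This is a toy NumPy forward pass, not Krizhevsky et al.'s actual network:

```python
import numpy as np

def conv2d(x, k):
    """'Valid' 2-D convolution (really cross-correlation, as in deep nets)."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i + kh, j:j + kw] * k).sum()
    return out

def relu(x):
    """Non-linearity: keep positives, zero out negatives."""
    return np.maximum(x, 0.0)

def max_pool(x, s=2):
    """Non-overlapping s x s max pooling."""
    H, W = x.shape
    return x[:H - H % s, :W - W % s].reshape(H // s, s, W // s, s).max(axis=(1, 3))

# Toy 6x6 "image" and a 2x2 diagonal-difference filter
x = np.arange(36, dtype=float).reshape(6, 6)
k = np.array([[-1.0, 0.0], [0.0, 1.0]])
y = max_pool(relu(conv2d(x, k)))
```

A real network stacks many such layers with learned filters and appends fully connected layers; the point here is only the shape of the computation.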
24.
Encode video proposals as 15,000 object scores
Jain et al., CVPR 2015
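The encoding itself is simple to sketch: run pretrained object classifiers on each frame of a proposal and pool the scores into one vector. A minimal sketch with mean pooling (three toy objects standing in for the 15,000 ImageNet classes):

```python
import numpy as np

def encode_video(frame_object_probs):
    """Average per-frame object-classifier probabilities into one
    fixed-length object-score vector for the whole video or proposal."""
    return np.asarray(frame_object_probs, dtype=float).mean(axis=0)

# Toy example: two frames scored against three objects
# (the paper uses 15,000 object classes).
frames = [[0.9, 0.1, 0.0],
          [0.7, 0.3, 0.0]]
video_repr = encode_video(frames)
```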
25.
Actions have object preference; the object–action relation is generic
[Example actions: typing, playing cello, bodyweight squats]
Jain et al., CVPR 2015
26.
Where do objects aid actions the most? We consider three object encodings:
− Whole video
− Outside of the tube only
− Inside of the tube only
27.
Objects aid most close to the action
[Bar chart: accuracy (%) for whole-video, outside-tube, and inside-tube object encodings]
Jain et al., CVPR 2015
28.
Simple convex combination of known classifiers
Objects2action: Translate objects to an action
[Diagram: test video → object representation → object/action affinities, where s(·) = word2vec similarity]
Mikolov et al., NIPS 2013
Jain et al., ICCV 2015
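A minimal sketch of the convex combination, with toy 2-D vectors standing in for word2vec embeddings (the real method uses learned embeddings and thousands of object classifiers):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def objects2action(object_scores, object_embs, action_emb, top_k=2):
    """Score an unseen action as a convex combination of object scores,
    weighted by semantic affinity s(action, object) in embedding space.
    Here s() is cosine similarity on toy vectors standing in for word2vec."""
    affinities = np.array([cosine(e, action_emb) for e in object_embs])
    # Keep only the top-k most related objects; normalize the weights
    # so the result is a convex combination of the object scores.
    idx = np.argsort(affinities)[-top_k:]
    w = np.zeros_like(affinities)
    w[idx] = affinities[idx]
    w = w / w.sum()
    return float(w @ object_scores)

# Toy example: three object classifiers, one unseen action
object_scores = np.array([0.8, 0.1, 0.6])   # object-classifier outputs on a video
object_embs = [np.array([1.0, 0.0]),
               np.array([0.0, 1.0]),
               np.array([0.9, 0.1])]
action_emb = np.array([1.0, 0.0])           # embedding of the unseen action's name
score = objects2action(object_scores, object_embs, action_emb)
```

Because no action-labeled video is needed, this gives zero-shot action recognition: only object classifiers and word embeddings are required at test time.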
30.
Matching sentences to videos
So far we have considered video search from text only; what about text search from video? That is: given a video, can we find the best-matching sentence?
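One simple way to pose this as retrieval: embed the video and the candidate sentences in a shared space and rank sentences by cosine similarity. The vectors below are toy stand-ins; in practice both embeddings are learned.

```python
import numpy as np

def best_sentence(video_emb, sentence_embs):
    """Return the index of the sentence whose embedding is closest
    (by cosine similarity) to the video embedding."""
    sims = [s @ video_emb / (np.linalg.norm(s) * np.linalg.norm(video_emb))
            for s in sentence_embs]
    return int(np.argmax(sims))

# Toy shared 3-D embedding space: one video, three candidate sentences
video = np.array([0.9, 0.1, 0.0])
sentences = [np.array([0.0, 1.0, 0.0]),
             np.array([1.0, 0.2, 0.0]),
             np.array([0.0, 0.0, 1.0])]
best = best_sentence(video, sentences)
```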
35.
Conclusion
Video search by deep learning is powerful, even without examples. The field is progressing rapidly. Precise spatiotemporal video understanding is next.
www.ceessnoek.info