10.
NIST TRECVID Benchmark
An international video search competition that promotes progress in video retrieval research through big data, standardized tasks, independent evaluation, and open innovation.
http://trecvid.nist.gov/
12.
From University-lab to spin-off and your mobile phone
[Plot legend: • = 1000+ others, * = UvA / Euvision / Qualcomm]
Universities win; start-ups win.
Snoek et al., TRECVID 2004-2015
13.
Latest jump due to deep learning
[Chart: mean average precision over time (2006, 2009, 2015) — progress in video recognition]
14.
The more features the better
Typical shallow learning architecture:
− Local feature extraction, e.g. SIFT over a dense sampling grid
− Feature encoding: BoW, sparse coding, Fisher vectors, VLAD
− Feature pooling: average/sum pooling, max pooling
− Classification: linear / non-linear SVM
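The encoding and pooling steps above can be sketched in a few lines. This is a toy bag-of-words encoder: in practice the codebook comes from k-means on training descriptors and the histogram feeds an SVM, but here both descriptors and codebook are random stand-ins.

```python
import numpy as np

def bow_encode(descriptors, codebook):
    """Assign each local descriptor to its nearest codeword and
    sum-pool the assignments into a normalized histogram (BoW)."""
    # Squared distances: (n_descriptors, n_codewords)
    d = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assignments = d.argmin(axis=1)
    hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)  # average pooling over local features

# Toy example: 100 random "SIFT-like" 128-D descriptors, 8-word codebook
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(100, 128))
codebook = rng.normal(size=(8, 128))
h = bow_encode(descriptors, codebook)
```

The resulting fixed-length vector `h` is what a linear or non-linear SVM would consume.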
15.
The deeper the better
Typical deep learning architecture: stacked convolution, non-linearity, and pooling layers with 11×11, 5×5, and 3×3 filters on a 224×224×3 input, followed by two fully connected layers of 4,096 units each (with dropout) and a loss layer.
Krizhevsky et al., NIPS 2012
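The three building blocks named above — convolution, non-linearity, pooling — can be sketched in isolation. This is a toy NumPy forward pass, not Krizhevsky et al.'s actual network:

```python
import numpy as np

def conv2d(x, k):
    """'Valid' 2-D convolution (really cross-correlation, as in deep nets)."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i + kh, j:j + kw] * k).sum()
    return out

def relu(x):
    """Non-linearity: keep positives, zero out negatives."""
    return np.maximum(x, 0.0)

def max_pool(x, s=2):
    """Non-overlapping s x s max pooling."""
    H, W = x.shape
    return x[:H - H % s, :W - W % s].reshape(H // s, s, W // s, s).max(axis=(1, 3))

# Toy 6x6 "image" and a 2x2 diagonal-difference filter
x = np.arange(36, dtype=float).reshape(6, 6)
k = np.array([[-1.0, 0.0], [0.0, 1.0]])
y = max_pool(relu(conv2d(x, k)))
```

A real network stacks many such layers with learned filters and appends fully connected layers; the point here is only the shape of the computation.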
24.
Encode video proposals as 15,000 object scores
Jain et al., CVPR 2015
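The encoding itself is simple to sketch: run pretrained object classifiers on each frame of a proposal and pool the scores into one vector. A minimal sketch with mean pooling (three toy objects standing in for the 15,000 ImageNet classes):

```python
import numpy as np

def encode_video(frame_object_probs):
    """Average per-frame object-classifier probabilities into one
    fixed-length object-score vector for the whole video or proposal."""
    return np.asarray(frame_object_probs, dtype=float).mean(axis=0)

# Toy example: two frames scored against three objects
# (the paper uses 15,000 object classes).
frames = [[0.9, 0.1, 0.0],
          [0.7, 0.3, 0.0]]
video_repr = encode_video(frames)
```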
25.
Actions have object preference; the object–action relation is generic
[Example actions: typing, playing cello, bodyweight squats]
Jain et al., CVPR 2015
26.
Where do objects aid actions the most? We consider three object encodings:
− Whole video
− Outside of the tube only
− Inside of the tube only
27.
Objects aid most close to the action
[Bar chart: accuracy (%) for whole-video, outside-tube, and inside-tube object encodings]
Jain et al., CVPR 2015
28.
Simple convex combination of known classifiers
Objects2action: Translate objects to an action
[Diagram: test video → object representation → object/action affinities, where s(·) = word2vec similarity]
Mikolov et al., NIPS 2013
Jain et al., ICCV 2015
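A minimal sketch of the convex combination, with toy 2-D vectors standing in for word2vec embeddings (the real method uses learned embeddings and thousands of object classifiers):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def objects2action(object_scores, object_embs, action_emb, top_k=2):
    """Score an unseen action as a convex combination of object scores,
    weighted by semantic affinity s(action, object) in embedding space.
    Here s() is cosine similarity on toy vectors standing in for word2vec."""
    affinities = np.array([cosine(e, action_emb) for e in object_embs])
    # Keep only the top-k most related objects; normalize the weights
    # so the result is a convex combination of the object scores.
    idx = np.argsort(affinities)[-top_k:]
    w = np.zeros_like(affinities)
    w[idx] = affinities[idx]
    w = w / w.sum()
    return float(w @ object_scores)

# Toy example: three object classifiers, one unseen action
object_scores = np.array([0.8, 0.1, 0.6])   # object-classifier outputs on a video
object_embs = [np.array([1.0, 0.0]),
               np.array([0.0, 1.0]),
               np.array([0.9, 0.1])]
action_emb = np.array([1.0, 0.0])           # embedding of the unseen action's name
score = objects2action(object_scores, object_embs, action_emb)
```

Because no action-labeled video is needed, this gives zero-shot action recognition: only object classifiers and word embeddings are required at test time.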
30.
Matching sentences to videos
So far we have considered video search from text only; what about text search from video? That is: given a video, can we find the best-matching sentence?
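One simple way to pose this as retrieval: embed the video and the candidate sentences in a shared space and rank sentences by cosine similarity. The vectors below are toy stand-ins; in practice both embeddings are learned.

```python
import numpy as np

def best_sentence(video_emb, sentence_embs):
    """Return the index of the sentence whose embedding is closest
    (by cosine similarity) to the video embedding."""
    sims = [s @ video_emb / (np.linalg.norm(s) * np.linalg.norm(video_emb))
            for s in sentence_embs]
    return int(np.argmax(sims))

# Toy shared 3-D embedding space: one video, three candidate sentences
video = np.array([0.9, 0.1, 0.0])
sentences = [np.array([0.0, 1.0, 0.0]),
             np.array([1.0, 0.2, 0.0]),
             np.array([0.0, 0.0, 1.0])]
best = best_sentence(video, sentences)
```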
35.
Conclusion
Video search by deep learning is powerful, even without examples. The field is progressing rapidly. Precise spatiotemporal video understanding is next.
www.ceessnoek.info