Lecture given on January 28, 2019 to post-graduate students of the Computer Engineering and Media program, at the School of Journalism and Media, Aristotle University of Thessaloniki.
2. Our lab
Multimedia Knowledge and
Social Media Analytics Laboratory
• Part of Information Technologies Institute (ITI) -
Centre for Research and Technology Hellas (CERTH)
• 60+ researchers (20+ post-docs)
• key areas: multimedia, social media, computer vision,
data mining, machine learning
• applications: media, security, culture, environment
• involved in 60+ projects and published 600+ papers
https://mklab.iti.gr/
5. 500 hours of video per minute =
720,000 hours per day, i.e. more than
82 years of video per day!
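The arithmetic behind these figures can be verified in a few lines (the 500 hours/minute rate is the figure quoted on the slide):

```python
hours_per_minute = 500                      # upload rate quoted on the slide
hours_per_day = hours_per_minute * 60 * 24  # minutes/hour * hours/day
years_per_day = hours_per_day / 24 / 365    # hours -> days -> years

print(hours_per_day)            # -> 720000
print(round(years_per_day, 1))  # -> 82.2
```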
6. Pope Francis
Pope Benedict
2007: iPhone release
2008: Android release
2010: iPad release
http://petapixel.com/2013/03/14/a-starry-sea-of-cameras-at-the-unveiling-of-pope-francis/
13. Similarity-based media search
Two main problems
• How to compute the similarity between two
items (in a way that matches my needs)?
• How to search very large collections in
reasonable time (using that similarity function)?
15. What is similar?
• There is a wide variety of definitions and understandings
of what can be considered similar
• Near-duplicate videos: definition by Wu et al. (2007)
• photometric variations: gamma, contrast, brightness, etc.
• editing operations: resize, shift, crop, flip
• insertion of patterns: caption, logo, subtitles, sliding captions, etc.
• re-encoding: video format, compression
• video modifications: frame rate, frame insertion, deletion, swap
X. Wu, A. G. Hauptmann, and C. W. Ngo. Practical elimination of near-duplicates from web video search. In
Proceedings of the 15th ACM international conference on Multimedia, pp. 218-227, 2007
16. Hashing
• Cryptographic or checksum hashing: MD5, SHA1
• Input: bitstream (not just images or videos)
• Output: hash code 128-bit (MD5), 160-bit (SHA1), etc.
• Property: even a minor change in the input leads to a
completely different hash code
https://jenssegers.com/61/perceptual-image-hashes
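A minimal sketch of why checksum hashing cannot match near-duplicates (the input strings are arbitrary examples): changing a single character produces a completely unrelated MD5 digest, so only byte-exact copies collide.

```python
import hashlib

# Two inputs that differ by a single character.
h1 = hashlib.md5(b"near-duplicate video").hexdigest()
h2 = hashlib.md5(b"near-duplicate video!").hexdigest()

print(h1)
print(h2)
# The two 128-bit digests share no structure, so cryptographic
# hashes detect byte-exact copies only, never visual similarity.
```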
18. Perceptual hashing
• Generate a fingerprint that can be used to compare
images using the Hamming Distance
• Example: Average Hashing (aHash)
• Reduce size: resize to 8×8 pixels
• Reduce colour: convert RGB to grayscale
• Compute the average of the 64 grayscale values
• Compute one hash bit per pixel: 1 if the pixel is
above the average, 0 otherwise
→ 64-bit signature
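The aHash steps above can be sketched in a few lines; this is a toy version that assumes the resize-to-8×8 and grayscale-conversion steps have already been applied:

```python
def average_hash(gray):
    """aHash: 64-bit signature from an 8x8 grayscale image,
    given as a flat list of 64 intensity values (0-255)."""
    avg = sum(gray) / len(gray)
    # one bit per pixel: 1 if above the average intensity
    return int("".join("1" if p > avg else "0" for p in gray), 2)

def hamming(h1, h2):
    """Number of differing bits between two signatures."""
    return bin(h1 ^ h2).count("1")

# A bright-left / dark-right test pattern and a uniformly
# brightened copy hash to the same signature.
img = [200] * 32 + [50] * 32
brighter = [min(p + 10, 255) for p in img]
print(hamming(average_hash(img), average_hash(brighter)))  # -> 0
```

Because each bit only records "above or below average", global photometric edits such as brightness changes leave the signature intact, which is exactly what the Hamming-distance comparison exploits.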
20. dHash and pHash
• dHash: Difference Hash
• same steps as aHash
• each hash bit records whether the left pixel is brighter
than its right neighbour
• fewer false positives than aHash
• pHash: Perceptual Hash
• more complicated algorithm
• resize to 32×32
• DCT on the luma (brightness) component
• hash from the top-left 8×8 DCT coefficients, comparing
each to their median value
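The dHash idea can be sketched directly; a toy version, assuming the image has already been resized to 9×8 grayscale (9 columns give 8 left/right comparisons per row, i.e. 64 bits):

```python
def dhash(gray_rows):
    """dHash sketch: gray_rows is a list of 8 rows of 9
    grayscale values; each bit is 1 if the left pixel is
    brighter than its right neighbour."""
    bits = ""
    for row in gray_rows:
        for left, right in zip(row, row[1:]):  # 8 comparisons per row
            bits += "1" if left > right else "0"
    return int(bits, 2)

# A left-to-right gradient: brightness always decreases, so every
# bit is 1; a uniformly darkened copy keeps the same gradient.
grad = [[255 - 20 * c for c in range(9)] for _ in range(8)]
darker = [[max(v - 40, 0) for v in row] for row in grad]
print(dhash(grad) == dhash(darker))  # -> True
```

Only the relative ordering of neighbouring pixels matters, so global brightness and contrast changes do not flip bits, which is why dHash tends to produce fewer false positives than aHash.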
22. Pixel-based similarity doesn’t
match perception
All three variations of the first image are equidistant
from it in terms of L2 pixel distance!
http://cs231n.github.io/classification/
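A toy 1-D example makes the mismatch concrete: shifting a perceptually identical pattern by one pixel costs far more L2 distance than adding heavy per-pixel noise (the pixel values below are arbitrary).

```python
import math

def l2(a, b):
    """Euclidean (L2) distance between two pixel vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

img     = [0, 0, 200, 200, 0, 0]        # a bright bar on a dark background
shifted = [0, 0, 0, 200, 200, 0]        # same bar, shifted one pixel
noisy   = [10, 10, 190, 210, 10, 10]    # heavy noise, bar in place

print(round(l2(img, shifted), 1))  # -> 282.8 (visually identical content)
print(round(l2(img, noisy), 1))    # -> 24.5  (visibly degraded content)
```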
23. Global descriptors
• A single vector that attempts to capture the main
visual properties of an image, e.g. distribution of
colour, spatial layout of brightness, textures, etc.
• Popular choices include:
• GIST – spatial envelope (Oliva & Torralba, 2001)
• Color: Dominant Color, Scalable Color, Color Structure,
Color Layout Descriptor (MPEG-7, 2001)
• Texture: Texture Browsing, Homogeneous Texture, Edge
Histogram (MPEG-7, 2001)
A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation
of the spatial envelope. IJCV, 42(3):145–175, 2001
Text of ISO/IEC 15938-3 Multimedia Content Description Interface, Part 3: Visual.
Final Committee Draft, ISO/IEC JTC1/SC29/WG11, Doc. N4062, Mar. 2001
24. GIST-based near-duplicate search
Douze, M., Jégou, H., Sandhawalia, H., Amsaleg, L., & Schmid, C. (2009, July). Evaluation of gist descriptors for web-
scale image search. In Proceedings of the ACM International Conference on Image and Video Retrieval (p. 19). ACM.
25. Local descriptors
• Basic scheme:
• Detect a set of features (i.e. interest points) in an image
• Extract one descriptor around each feature
• Plenty of options for both parts, e.g.:
• Feature detectors: Canny, Sobel, Harris, FAST, Laplacian
of Gaussian (LoG), Difference of Gaussians (DoG),
Determinant of Hessian (DoH), MSER
• Feature descriptors: SIFT, GLOH, SURF, ORB
• Much higher accuracy at the cost of increased
complexity
26. Scale-Invariant Feature Transforms (SIFT)
Set of descriptors
A single descriptor
(16 histograms of 8 bins
= 128 dimensions)
http://faculty.ucmerced.edu/mhyang/project/iccv13_exemplar/ICCV13_exemplarCut/vlfeat-0.9.14/doc/overview/sift.html
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International journal of computer
vision, 60(2), 91-110.
28. Bag of Visual Words (BoVW)
https://towardsdatascience.com/bag-of-visual-words-in-a-nutshell-9ceea97ce0fb
29. Bag of Visual Words (BoVW)
https://towardsdatascience.com/bag-of-visual-words-in-a-nutshell-9ceea97ce0fb
extract a set of local features from each image
30. Bag of Visual Words (BoVW)
• a representative sample of features is selected
• features are clustered
• the cluster centroids (or medoids) form the
visual codebook
https://towardsdatascience.com/bag-of-visual-words-in-a-nutshell-9ceea97ce0fb
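The codebook-building step above can be sketched with a toy k-means over 2-D "features" (real systems cluster 128-D SIFT descriptors with a much larger k; the farthest-point initialisation here is just a deterministic simplification):

```python
def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    """Element-wise mean of a list of equal-length tuples."""
    return tuple(sum(p[d] for p in pts) / len(pts) for d in range(len(pts[0])))

def init_centroids(features, k):
    """Deterministic farthest-point initialisation."""
    cents = [features[0]]
    while len(cents) < k:
        cents.append(max(features,
                         key=lambda f: min(dist2(f, c) for c in cents)))
    return cents

def build_codebook(features, k, iters=10):
    """Toy k-means: the k centroids act as the visual codebook."""
    cents = init_centroids(features, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for f in features:
            clusters[min(range(k), key=lambda c: dist2(f, cents[c]))].append(f)
        cents = [mean(cl) if cl else cents[i] for i, cl in enumerate(clusters)]
    return cents

def bovw_histogram(features, codebook):
    """Quantise each local feature to its nearest visual word and count."""
    hist = [0] * len(codebook)
    for f in features:
        hist[min(range(len(codebook)), key=lambda c: dist2(f, codebook[c]))] += 1
    return hist

# Six toy 2-D "local features" forming two obvious clusters.
feats = [(0, 0), (0.1, 0.2), (5, 5), (5.2, 4.9), (0.2, 0.1), (4.8, 5.1)]
codebook = build_codebook(feats, k=2)
print(bovw_histogram(feats, codebook))  # -> [3, 3]
```

The histogram over visual words is the BoVW vector of the image: each local feature votes for its nearest centroid.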
31. Bag of Visual Words (BoVW)
https://towardsdatascience.com/bag-of-visual-words-in-a-nutshell-9ceea97ce0fb
32. Indexing and Querying
• tf-idf weighting of visual words
w_td = n_td · log(D / n_t)
(n_td: occurrences of visual word t in image d; D: number of images in the collection; n_t: number of images containing t)
• Inverted file indexing structure for fast search
• Retrieve candidates with at least one common
visual word
• Rank candidates, e.g. based on cosine similarity
of their tf-idf representations
sim(q, p) = (w_q · w_p) / (‖w_q‖ ‖w_p‖)
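A toy version of this weighting and ranking scheme (the three-image collection and the word names are illustrative only):

```python
import math

def tfidf(word_counts, doc_freq, n_docs):
    """w_td = n_td * log(D / n_t) for each visual word t in image d."""
    return {t: n * math.log(n_docs / doc_freq[t])
            for t, n in word_counts.items()}

def cosine(wq, wp):
    """Cosine similarity of two sparse tf-idf vectors (dicts)."""
    dot = sum(wq[t] * wp.get(t, 0.0) for t in wq)
    nq = math.sqrt(sum(v * v for v in wq.values()))
    np_ = math.sqrt(sum(v * v for v in wp.values()))
    return dot / (nq * np_) if nq and np_ else 0.0

# Toy collection of 3 images: word 'a' occurs in all of them,
# word 'b' only in the two images compared below.
doc_freq, n_docs = {"a": 3, "b": 2}, 3
q = tfidf({"a": 2, "b": 1}, doc_freq, n_docs)
p = tfidf({"a": 1, "b": 2}, doc_freq, n_docs)
print(round(cosine(q, p), 3))  # -> 1.0
```

Note how the ubiquitous word 'a' gets zero idf weight (log(3/3) = 0), so the similarity is driven entirely by the rarer word 'b'; this is exactly why tf-idf weighting suppresses uninformative visual words.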
33. BoVW Discussion
• BoVW is a sparse representation: each image is
associated with few visual words (compared to the
whole vocabulary)
• Convenient for indexing and look-up
• Completely misses the spatial layout of visual words (extensions exist to address this)
• Performance depends on:
• size of vocabulary
• dataset where vocabulary was learned
38. From Image to Video Similarity
• A video is a richer representation than a single
image:
• set of images (frames)
• frames and motion
• frames and motion and audio
• For efficiency purposes, we typically simplify or
discard part of the information:
• frames → descriptors → average descriptor
• frames → visual words → bag of frame-words
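The first simplification (frames → descriptors → average) amounts to collapsing per-frame descriptors into one video-level vector by element-wise averaging; a minimal sketch:

```python
def video_descriptor(frame_descs):
    """Collapse per-frame descriptors into a single video-level
    vector by element-wise averaging."""
    n = len(frame_descs)
    return [sum(f[d] for f in frame_descs) / n
            for d in range(len(frame_descs[0]))]

# Three toy 4-D frame descriptors for one video.
frames = [[1.0, 0.0, 0.0, 2.0],
          [0.0, 1.0, 0.0, 2.0],
          [0.0, 0.0, 1.0, 2.0]]
print(video_descriptor(frames))
```

Two videos can then be compared with any vector similarity (e.g. cosine), at the cost of losing all temporal information.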
40. Video indexing calls
/index (HTTP GET request)
Add the provided video to the video index
• url: the URL of the video that is going to be indexed
• async: flag for asynchronous processing
/youtube (HTTP GET request)
Query YouTube API with either a video ID or a provided text query
and add the retrieved videos to the video index
• video_id: video ID to query YouTube API
• text: provided text to query YouTube API
• max: maximum number of videos to be added to the video index
/delete (HTTP DELETE request)
Delete the provided video from the video index
• url: the URL of the video that is going to be deleted
41. Video search calls
/search (HTTP GET request)
Video-level search: retrieve relevant videos by computing the
similarity between entire videos
• url: URL of the query video
• t_sim: similarity threshold
• t_rank: rank threshold
/partial (HTTP GET request)
Shot-level search: retrieve relevant video segments from the indexed
videos in the database
• url: URL of the query video
• v_sim: video similarity threshold
• s_sim: shot similarity threshold
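Calls to the two slides' endpoints can be composed as plain GET URLs; a sketch in which the base host and the threshold defaults are placeholders (neither is specified on the slides):

```python
from urllib.parse import urlencode

# Placeholder host; the actual service location is not given on the slides.
BASE = "http://localhost:8080"

def index_request(video_url, asynchronous=True):
    """Build the /index GET request from the indexing slide."""
    return BASE + "/index?" + urlencode({"url": video_url,
                                         "async": str(asynchronous).lower()})

def search_request(video_url, t_sim=0.7, t_rank=100):
    """Build the /search GET request; the threshold defaults here
    are arbitrary examples, not documented values."""
    return BASE + "/search?" + urlencode({"url": video_url,
                                          "t_sim": t_sim, "t_rank": t_rank})

print(index_request("http://example.org/video.mp4"))
print(search_request("http://example.org/video.mp4"))
```

A typical workflow would be: /index (or /youtube) to populate the database, then /search for video-level matches or /partial for shot-level ones.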
42. Combining CNNs and BoVW
Kordopatis-Zilos, G., Papadopoulos, S., Patras, I., & Kompatsiaris, Y. (2017, January). Near-duplicate video retrieval by
aggregating intermediate CNN layers. In International Conference on Multimedia Modeling (pp. 251-263). Springer
43. An improved setup
Kordopatis-Zilos, G., Papadopoulos, S., Patras, I., & Kompatsiaris, Y. (2017, January). Near-duplicate video retrieval by
aggregating intermediate CNN layers. In International Conference on Multimedia Modeling (pp. 251-263). Springer
44. Learning similarity
Before training
After training
Kordopatis-Zilos, G., Papadopoulos, S., Patras, I., & Kompatsiaris, Y. (2017, October). Near-Duplicate Video Retrieval with
Deep Metric Learning. In 2017 IEEE International Conference on Computer Vision Workshop (ICCVW), (pp. 347-356). IEEE
46. FIVR-200K
a dataset for evaluating NDVR
Kordopatis-Zilos, G., Papadopoulos, S., Patras, I., & Kompatsiaris, I. (2018).
FIVR: Fine-grained Incident Video Retrieval. arXiv preprint arXiv:1809.04094
47. FIVR-200K
• A video dataset to help research on the problem of
Fine-grained Incident Video Retrieval
• Duplicate Scene Videos (DSVs)
• Complementary Scene Videos (CSVs)
• Incident Scene Videos (ISVs)
• 225,960 videos around 4,687 news events from Jan
1st 2013 to Dec 31st 2017
57. Ideas
• Pick one video around one event between 2013
and 2017 and try to find similar versions of it
• Pick one of the event clusters in the Browse
section and try to find some important videos that
cover the event
• Given an event of interest, identify in which sources
it is covered (language, country, type of channel)
• Add videos from a newer event and use them to
perform new searches
59. Papers
• Kordopatis-Zilos, G., Papadopoulos, S., Patras, I., & Kompatsiaris, Y.
(2017, January). Near-duplicate video retrieval by aggregating
intermediate CNN layers. In International Conference on Multimedia
Modeling (pp. 251-263). Springer
• Kordopatis-Zilos, G., Papadopoulos, S., Patras, I., & Kompatsiaris, Y.
(2017, October). Near-Duplicate Video Retrieval with Deep Metric
Learning. In 2017 IEEE International Conference on Computer Vision
Workshop (ICCVW), (pp. 347-356). IEEE
• Kordopatis-Zilos, G., Papadopoulos, S., Patras, I., & Kompatsiaris, I.
(2018). FIVR: Fine-grained Incident Video Retrieval. arXiv preprint
arXiv:1809.04094
60. Acknowledgements
• Giorgos Kordopatis-Zilos / near-duplicate video
retrieval, back-end development, FIVR-200K
collection and annotation
• Lazaros Apostolidis / web front-end development
• Polichronis Charitidis / FIVR-200K annotation
61. Thank you for your attention!
Akis Papadopoulos papadop@iti.gr
@sympap