Lecture given on January 28, 2019 to post-graduate students of the Computer Engineering and Media program, at the School of Journalism and Media, Aristotle University of Thessaloniki.
2. Our lab
Multimedia Knowledge and
Social Media Analytics Laboratory
• Part of Information Technologies Institute (ITI) -
Centre for Research and Technology Hellas (CERTH)
• 60+ researchers (20+ post-docs)
• key areas: multimedia, social media, computer vision,
data mining, machine learning
• applications: media, security, culture, environment
• involved in 60+ projects and published 600+ papers
https://mklab.iti.gr/
5. 500 hours of video per minute =
720,000 hours per day, i.e. more than
82 years of video per day!
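The arithmetic behind these figures can be verified in a few lines (the 500 hours/minute rate is the figure quoted on the slide):

```python
hours_per_minute = 500                      # upload rate quoted on the slide
hours_per_day = hours_per_minute * 60 * 24  # minutes/hour * hours/day
years_per_day = hours_per_day / 24 / 365    # hours -> days -> years

print(hours_per_day)            # -> 720000
print(round(years_per_day, 1))  # -> 82.2
```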
6. Pope Francis
Pope Benedict
2007: iPhone release
2008: Android release
2010: iPad release
http://petapixel.com/2013/03/14/a-starry-sea-of-cameras-at-the-unveiling-of-pope-francis/
13. Similarity-based media search
Two main problems
• How to compute the similarity between two
items (in a way that matches my needs)?
• How to search very large collections in
reasonable time (using that similarity function)?
15. What is similar?
• There is a wide variety of definitions and understandings
of what can be considered similar
• Near-duplicate videos: definition by Wu et al. (2007)
• photometric variations: gamma, contrast, brightness, etc.
• editing operations: resize, shift, crop, flip
• insertion of patterns: caption, logo, subtitles, sliding captions, etc.
• re-encoding: video format, compression
• video modifications: frame rate, frame insertion, deletion, swap
X. Wu, A. G. Hauptmann, and C. W. Ngo. Practical elimination of near-duplicates from web video search. In
Proceedings of the 15th ACM international conference on Multimedia, pp. 218-227, 2007
16. Hashing
• Cryptographic or checksum hashing: MD5, SHA1
• Input: bitstream (not just images or videos)
• Output: hash code 128-bit (MD5), 160-bit (SHA1), etc.
• Property: even a minor change in the input leads to a
completely different hash code
https://jenssegers.com/61/perceptual-image-hashes
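A minimal sketch of why checksum hashing cannot match near-duplicates (the input strings are arbitrary examples): changing a single character produces a completely unrelated MD5 digest, so only byte-exact copies collide.

```python
import hashlib

# Two inputs that differ by a single character.
h1 = hashlib.md5(b"near-duplicate video").hexdigest()
h2 = hashlib.md5(b"near-duplicate video!").hexdigest()

print(h1)
print(h2)
# The two 128-bit digests share no structure, so cryptographic
# hashes detect byte-exact copies only, never visual similarity.
```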
18. Perceptual hashing
• Generate a fingerprint that can be used to compare
images using the Hamming Distance
• Example: Average Hashing (aHash)
• Reduce size: resize to 8×8 pixels
• Reduce colour: convert RGB to grayscale
• Compute the average of the 64 grayscale values
• Compute one hash bit per pixel: 1 if the pixel is
above the average, 0 otherwise
→ 64-bit signature
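The aHash steps above can be sketched in a few lines; this is a toy version that assumes the resize-to-8×8 and grayscale-conversion steps have already been applied:

```python
def average_hash(gray):
    """aHash: 64-bit signature from an 8x8 grayscale image,
    given as a flat list of 64 intensity values (0-255)."""
    avg = sum(gray) / len(gray)
    # one bit per pixel: 1 if above the average intensity
    return int("".join("1" if p > avg else "0" for p in gray), 2)

def hamming(h1, h2):
    """Number of differing bits between two signatures."""
    return bin(h1 ^ h2).count("1")

# A bright-left / dark-right test pattern and a uniformly
# brightened copy hash to the same signature.
img = [200] * 32 + [50] * 32
brighter = [min(p + 10, 255) for p in img]
print(hamming(average_hash(img), average_hash(brighter)))  # -> 0
```

Because each bit only records "above or below average", global photometric edits such as brightness changes leave the signature intact, which is exactly what the Hamming-distance comparison exploits.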
20. dHash and pHash
• dHash: Difference Hash
• same steps as aHash
• each hash bit records whether the left pixel is brighter
than its right neighbour
• fewer false positives than aHash
• pHash: Perceptual Hash
• more complicated algorithm
• resize to 32×32
• DCT on the luma (brightness) component
• hash from the top-left 8×8 DCT coefficients, comparing
each to their median value
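The dHash idea can be sketched directly; a toy version, assuming the image has already been resized to 9×8 grayscale (9 columns give 8 left/right comparisons per row, i.e. 64 bits):

```python
def dhash(gray_rows):
    """dHash sketch: gray_rows is a list of 8 rows of 9
    grayscale values; each bit is 1 if the left pixel is
    brighter than its right neighbour."""
    bits = ""
    for row in gray_rows:
        for left, right in zip(row, row[1:]):  # 8 comparisons per row
            bits += "1" if left > right else "0"
    return int(bits, 2)

# A left-to-right gradient: brightness always decreases, so every
# bit is 1; a uniformly darkened copy keeps the same gradient.
grad = [[255 - 20 * c for c in range(9)] for _ in range(8)]
darker = [[max(v - 40, 0) for v in row] for row in grad]
print(dhash(grad) == dhash(darker))  # -> True
```

Only the relative ordering of neighbouring pixels matters, so global brightness and contrast changes do not flip bits, which is why dHash tends to produce fewer false positives than aHash.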
22. Pixel-based similarity doesn’t
match perception
All three variations of the first image are equidistant
from it in terms of L2 pixel distance!
http://cs231n.github.io/classification/
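A toy 1-D example makes the mismatch concrete: shifting a perceptually identical pattern by one pixel costs far more L2 distance than adding heavy per-pixel noise (the pixel values below are arbitrary).

```python
import math

def l2(a, b):
    """Euclidean (L2) distance between two pixel vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

img     = [0, 0, 200, 200, 0, 0]        # a bright bar on a dark background
shifted = [0, 0, 0, 200, 200, 0]        # same bar, shifted one pixel
noisy   = [10, 10, 190, 210, 10, 10]    # heavy noise, bar in place

print(round(l2(img, shifted), 1))  # -> 282.8 (visually identical content)
print(round(l2(img, noisy), 1))    # -> 24.5  (visibly degraded content)
```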
23. Global descriptors
• A single vector that attempts to capture the main
visual properties of an image, e.g. distribution of
colour, spatial layout of brightness, textures, etc.
• Popular choices include:
• GIST – spatial envelope (Oliva & Torralba, 2001)
• Color: Dominant Color, Scalable Color, Color Structure,
Color Layout Descriptor (MPEG-7, 2001)
• Texture: Texture Browsing, Homogeneous Texture, Edge
Histogram (MPEG-7, 2001)
A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation
of the spatial envelope. IJCV, 42(3):145–175, 2001
Text of ISO/IEC 15938-3 Multimedia Content Description Interface, Part 3: Visual.
Final Committee Draft, ISO/IEC JTC1/SC29/WG11, Doc. N4062, Mar. 2001
24. GIST-based near-duplicate search
Douze, M., Jégou, H., Sandhawalia, H., Amsaleg, L., & Schmid, C. (2009, July). Evaluation of gist descriptors for web-
scale image search. In Proceedings of the ACM International Conference on Image and Video Retrieval (p. 19). ACM.
25. Local descriptors
• Basic scheme:
• Detect a set of features (i.e. interest points) in an image
• Extract one descriptor around each feature
• Plenty of options for both parts, e.g.:
• Feature detectors: Canny, Sobel, Harris, FAST, Laplacian
of Gaussian (LoG), Difference of Gaussians (DoG),
Determinant of Hessian (DoH), MSER
• Feature descriptors: SIFT, GLOH, SURF, ORB
• Much higher accuracy at the cost of increased
complexity
26. Scale-Invariant Feature Transforms (SIFT)
Set of descriptors
A single descriptor
(16 histograms of 8 bins
= 128 dimensions)
http://faculty.ucmerced.edu/mhyang/project/iccv13_exemplar/ICCV13_exemplarCut/vlfeat-0.9.14/doc/overview/sift.html
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International journal of computer
vision, 60(2), 91-110.
28. Bag of Visual Words (BoVW)
https://towardsdatascience.com/bag-of-visual-words-in-a-nutshell-9ceea97ce0fb
29. Bag of Visual Words (BoVW)
https://towardsdatascience.com/bag-of-visual-words-in-a-nutshell-9ceea97ce0fb
extract a set of local features from each image
30. Bag of Visual Words (BoVW)
• a representative sample of features is selected
• features are clustered
• the cluster centroids (or medoids) form the
visual codebook
https://towardsdatascience.com/bag-of-visual-words-in-a-nutshell-9ceea97ce0fb
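The codebook-building step above can be sketched with a toy k-means over 2-D "features" (real systems cluster 128-D SIFT descriptors with a much larger k; the farthest-point initialisation here is just a deterministic simplification):

```python
def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    """Element-wise mean of a list of equal-length tuples."""
    return tuple(sum(p[d] for p in pts) / len(pts) for d in range(len(pts[0])))

def init_centroids(features, k):
    """Deterministic farthest-point initialisation."""
    cents = [features[0]]
    while len(cents) < k:
        cents.append(max(features,
                         key=lambda f: min(dist2(f, c) for c in cents)))
    return cents

def build_codebook(features, k, iters=10):
    """Toy k-means: the k centroids act as the visual codebook."""
    cents = init_centroids(features, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for f in features:
            clusters[min(range(k), key=lambda c: dist2(f, cents[c]))].append(f)
        cents = [mean(cl) if cl else cents[i] for i, cl in enumerate(clusters)]
    return cents

def bovw_histogram(features, codebook):
    """Quantise each local feature to its nearest visual word and count."""
    hist = [0] * len(codebook)
    for f in features:
        hist[min(range(len(codebook)), key=lambda c: dist2(f, codebook[c]))] += 1
    return hist

# Six toy 2-D "local features" forming two obvious clusters.
feats = [(0, 0), (0.1, 0.2), (5, 5), (5.2, 4.9), (0.2, 0.1), (4.8, 5.1)]
codebook = build_codebook(feats, k=2)
print(bovw_histogram(feats, codebook))  # -> [3, 3]
```

The histogram over visual words is the BoVW vector of the image: each local feature votes for its nearest centroid.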
31. Bag of Visual Words (BoVW)
https://towardsdatascience.com/bag-of-visual-words-in-a-nutshell-9ceea97ce0fb
32. Indexing and Querying
• tf-idf weighting of visual words
w_td = n_td · log(D / n_t)
(n_td: occurrences of visual word t in image d; D: number of images in the collection; n_t: number of images containing t)
• Inverted file indexing structure for fast search
• Retrieve candidates with at least one common
visual word
• Rank candidates, e.g. based on cosine similarity
of their tf-idf representations
sim(q, p) = (w_q · w_p) / (‖w_q‖ ‖w_p‖)
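A toy version of this weighting and ranking scheme (the three-image collection and the word names are illustrative only):

```python
import math

def tfidf(word_counts, doc_freq, n_docs):
    """w_td = n_td * log(D / n_t) for each visual word t in image d."""
    return {t: n * math.log(n_docs / doc_freq[t])
            for t, n in word_counts.items()}

def cosine(wq, wp):
    """Cosine similarity of two sparse tf-idf vectors (dicts)."""
    dot = sum(wq[t] * wp.get(t, 0.0) for t in wq)
    nq = math.sqrt(sum(v * v for v in wq.values()))
    np_ = math.sqrt(sum(v * v for v in wp.values()))
    return dot / (nq * np_) if nq and np_ else 0.0

# Toy collection of 3 images: word 'a' occurs in all of them,
# word 'b' only in the two images compared below.
doc_freq, n_docs = {"a": 3, "b": 2}, 3
q = tfidf({"a": 2, "b": 1}, doc_freq, n_docs)
p = tfidf({"a": 1, "b": 2}, doc_freq, n_docs)
print(round(cosine(q, p), 3))  # -> 1.0
```

Note how the ubiquitous word 'a' gets zero idf weight (log(3/3) = 0), so the similarity is driven entirely by the rarer word 'b'; this is exactly why tf-idf weighting suppresses uninformative visual words.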
33. BoVW Discussion
• BoVW is a sparse representation: each image is
associated with few visual words (compared to the
whole vocabulary)
• Convenient for indexing and look-up
• Completely misses the spatial layout of visual words (extensions exist to address this)
• Performance depends on:
• size of vocabulary
• dataset where vocabulary was learned
38. From Image to Video Similarity
• A video is a richer representation than a single
image:
• set of images (frames)
• frames and motion
• frames and motion and audio
• For efficiency purposes, we typically simplify or
discard part of the information:
• frames → descriptors → average descriptor
• frames → visual words → bag of frame-words
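The first simplification (frames → descriptors → average) amounts to collapsing per-frame descriptors into one video-level vector by element-wise averaging; a minimal sketch:

```python
def video_descriptor(frame_descs):
    """Collapse per-frame descriptors into a single video-level
    vector by element-wise averaging."""
    n = len(frame_descs)
    return [sum(f[d] for f in frame_descs) / n
            for d in range(len(frame_descs[0]))]

# Three toy 4-D frame descriptors for one video.
frames = [[1.0, 0.0, 0.0, 2.0],
          [0.0, 1.0, 0.0, 2.0],
          [0.0, 0.0, 1.0, 2.0]]
print(video_descriptor(frames))
```

Two videos can then be compared with any vector similarity (e.g. cosine), at the cost of losing all temporal information.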
40. Video indexing calls
/index (HTTP GET request)
Add the provided video to the video index
• url: the URL of the video that is going to be indexed
• async: flag for asynchronous processing
/youtube (HTTP GET request)
Query YouTube API with either a video ID or a provided text query
and add the retrieved videos to the video index
• video_id: video ID to query YouTube API
• text: provided text to query YouTube API
• max: maximum number of videos to be added to the video index
/delete (HTTP DELETE request)
Delete the provided video from the video index
• url: the URL of the video that is going to be deleted
41. Video search calls
/search (HTTP GET request)
Video-level search: retrieve relevant videos by computing the
similarity between entire videos
• url: URL of the query video
• t_sim: similarity threshold
• t_rank: rank threshold
/partial (HTTP GET request)
Shot-level search: retrieve relevant video segments from the indexed
videos in the database
• url: URL of the query video
• v_sim: video similarity threshold
• s_sim: shot similarity threshold
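Calls to the two slides' endpoints can be composed as plain GET URLs; a sketch in which the base host and the threshold defaults are placeholders (neither is specified on the slides):

```python
from urllib.parse import urlencode

# Placeholder host; the actual service location is not given on the slides.
BASE = "http://localhost:8080"

def index_request(video_url, asynchronous=True):
    """Build the /index GET request from the indexing slide."""
    return BASE + "/index?" + urlencode({"url": video_url,
                                         "async": str(asynchronous).lower()})

def search_request(video_url, t_sim=0.7, t_rank=100):
    """Build the /search GET request; the threshold defaults here
    are arbitrary examples, not documented values."""
    return BASE + "/search?" + urlencode({"url": video_url,
                                          "t_sim": t_sim, "t_rank": t_rank})

print(index_request("http://example.org/video.mp4"))
print(search_request("http://example.org/video.mp4"))
```

A typical workflow would be: /index (or /youtube) to populate the database, then /search for video-level matches or /partial for shot-level ones.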
42. Combining CNNs and BoVW
Kordopatis-Zilos, G., Papadopoulos, S., Patras, I., & Kompatsiaris, Y. (2017, January). Near-duplicate video retrieval by
aggregating intermediate CNN layers. In International Conference on Multimedia Modeling (pp. 251-263). Springer
43. An improved setup
Kordopatis-Zilos, G., Papadopoulos, S., Patras, I., & Kompatsiaris, Y. (2017, January). Near-duplicate video retrieval by
aggregating intermediate CNN layers. In International Conference on Multimedia Modeling (pp. 251-263). Springer
44. Learning similarity
Before training
After training
Kordopatis-Zilos, G., Papadopoulos, S., Patras, I., & Kompatsiaris, Y. (2017, October). Near-Duplicate Video Retrieval with
Deep Metric Learning. In 2017 IEEE International Conference on Computer Vision Workshop (ICCVW), (pp. 347-356). IEEE
46. FIVR-200K
a dataset for evaluating NDVR
Kordopatis-Zilos, G., Papadopoulos, S., Patras, I., & Kompatsiaris, I. (2018).
FIVR: Fine-grained Incident Video Retrieval. arXiv preprint arXiv:1809.04094
47. FIVR-200K
• A video dataset to help research on the problem of
Fine-grained Incident Video Retrieval
• Duplicate Scene Videos (DSVs)
• Complementary Scene Videos (CSVs)
• Incident Scene Videos (ISVs)
• 225,960 videos around 4,687 news events from Jan
1st 2013 to Dec 31st 2017
57. Ideas
• Pick one video around one event between 2013
and 2017 and try to find similar versions of it
• Pick one of the event clusters in the Browse
section and try to find some important videos that
cover the event
• Given an event of interest, identify in which sources
it is covered (language, country, type of channel)
• Add videos from a newer event and use them to
perform new searches
59. Papers
• Kordopatis-Zilos, G., Papadopoulos, S., Patras, I., & Kompatsiaris, Y.
(2017, January). Near-duplicate video retrieval by aggregating
intermediate CNN layers. In International Conference on Multimedia
Modeling (pp. 251-263). Springer
• Kordopatis-Zilos, G., Papadopoulos, S., Patras, I., & Kompatsiaris, Y.
(2017, October). Near-Duplicate Video Retrieval with Deep Metric
Learning. In 2017 IEEE International Conference on Computer Vision
Workshop (ICCVW), (pp. 347-356). IEEE
• Kordopatis-Zilos, G., Papadopoulos, S., Patras, I., & Kompatsiaris, I.
(2018). FIVR: Fine-grained Incident Video Retrieval. arXiv preprint
arXiv:1809.04094
60. Acknowledgements
• Giorgos Kordopatis-Zilos / near-duplicate video
retrieval, back-end development, FIVR-200K
collection and annotation
• Lazaros Apostolidis / web front-end development
• Polichronis Charitidis / FIVR-200K annotation
61. Thank you for your attention!
Akis Papadopoulos papadop@iti.gr
@sympap