Cross-modal learning has attracted significant interest recently, and many applications, such as image-text retrieval, cross-modal video search, and video captioning, have been proposed. In this work, we deal with the cross-modal video retrieval problem. State-of-the-art approaches are based on deep network architectures and rely on mining hard-negative samples during training to optimize the network's parameters. Starting from a state-of-the-art cross-modal architecture that uses the improved marginal ranking loss, we propose a simple hard-negative mining strategy that identifies which training samples are true hard-negatives and which, although presently treated as hard-negatives, are likely not negative samples at all and should not be treated as such. Additionally, to take full advantage of network models trained with different hard-negative mining design choices, we examine model combination strategies, and we design a hybrid one that effectively combines large numbers of trained models.
The MIRROR project has received funding from the European Union's Horizon 2020 research and innovation action program under grant agreement № 832921.
Hard-Negatives or Non-Negatives?
A Hard-Negative Selection Strategy for
Cross-Modal Retrieval Using the Improved
Marginal Ranking Loss
Damianos Galanopoulos, Vasileios Mezaris
2nd International Workshop on Video Retrieval Methods and Their Limits @
ICCV 2021 conference, 16 Oct. 2021
Introduction
● Cross-modal learning has gained a lot of interest
● The improved marginal ranking loss is extensively used
● State-of-the-art approaches rely on hard-negative samples during training
● We aim at identifying actual hard-negative samples
○ We focus on samples that are semantically close to the anchor and should
not be considered as negatives
● We examine different strategies for efficient combination of multiple trained
models
Problem statement
Sample A is the anchor video-caption sample.
Which one of B or C should be considered as a hard-negative sample?
Typical approaches will select B (the nearest-to-anchor sample), but this is actually a positive one!
C is clearly a negative sample and should be used as the hard-negative.
Baseline
● We utilized the attention-based dual encoding network of [1]
● The improved marginal ranking loss is used to train the network
[1] D. Galanopoulos, V. Mezaris, "Attention Mechanisms, Signal Encodings and Fusion Strategies for Improved Ad-hoc Video Search
with Dual Encoding Networks", Proc. ACM Int. Conf. on Multimedia Retrieval (ICMR 2020), Dublin, Ireland, October 2020.
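As a sketch, the improved marginal ranking loss used here, a max-hinge loss that penalizes only the hardest negative per anchor within the batch, could look as follows; the margin value and the batch-similarity layout are illustrative assumptions, not taken from [1]:

```python
import numpy as np

def improved_marginal_ranking_loss(sim, margin=0.2):
    """Max-hinge ranking loss over a batch similarity matrix.

    sim[i, j] is the similarity between video v_i and caption c_j;
    the diagonal holds the positive (matching) pairs.
    """
    n = sim.shape[0]
    pos = np.diag(sim)                                       # s(v_i, c_i)
    # hinge costs of every negative against the anchor's positive pair
    cost_c = np.clip(margin + sim - pos[:, None], 0, None)   # wrong-caption negatives
    cost_v = np.clip(margin + sim - pos[None, :], 0, None)   # wrong-video negatives
    mask = np.eye(n, dtype=bool)
    cost_c[mask] = 0.0                                       # positives carry no cost
    cost_v[mask] = 0.0
    # "improved" variant: only the hardest negative per anchor contributes
    return cost_c.max(axis=1).sum() + cost_v.max(axis=0).sum()
```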
Baseline
● The combination of multiple models boosts performance
● As in [1], 24 different models are combined, obtained by varying:
○ Attention mechanism in the textual or visual stream
○ Two textual encodings (BERT and W2V+BERT)
○ Two optimizers (Adam and RMSprop)
○ Three learning rates
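The four design choices above multiply out to 2 × 2 × 2 × 3 = 24 configurations, which can be enumerated as a simple grid. A minimal sketch follows; the concrete learning-rate values are placeholders, not taken from [1]:

```python
from itertools import product

attention_stream = ["textual", "visual"]
text_encodings = ["BERT", "W2V+BERT"]
optimizers = ["Adam", "RMSprop"]
learning_rates = [1e-4, 5e-5, 1e-5]  # placeholder values, not from the paper

# every combination of the four design choices yields one model to train
configs = [
    {"attention": a, "text_enc": t, "optimizer": o, "lr": lr}
    for a, t, o, lr in product(attention_stream, text_encodings,
                               optimizers, learning_rates)
]
print(len(configs))  # 2 * 2 * 2 * 3 = 24 models
```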
Hard-negative mining
● We designed an offline-online strategy to exclude potentially-positive samples
● At the offline stage, we estimate a threshold p for the similarity of samples, so
that only samples with similarity < p will be treated as hard-negative candidates
○ Randomly split the training dataset into batches (as done for training)
○ In each batch, the cosine similarity between all possible caption pairs
is calculated
○ Within the entire set of calculated similarities for all batches, we make the
assumption that x% (e.g. 1%) of them indicate very similar samples (thus, one
could not be treated as a hard-negative for the other)
○ The similarity threshold p for which x% of the similarities are higher than p is
identified
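The offline stage can be sketched as follows, assuming captions are already embedded as vectors; the batch size and function name are illustrative, not from [1]:

```python
import numpy as np

def estimate_threshold(caption_embeddings, batch_size=128, x_percent=1.0, seed=0):
    """Offline stage: estimate the threshold p such that x% of all
    within-batch caption-caption cosine similarities exceed p."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(caption_embeddings))  # random split into batches
    sims = []
    for start in range(0, len(order), batch_size):
        batch = caption_embeddings[order[start:start + batch_size]]
        # L2-normalize so the dot product equals the cosine similarity
        batch = batch / np.linalg.norm(batch, axis=1, keepdims=True)
        pairwise = batch @ batch.T
        iu = np.triu_indices(len(batch), k=1)  # all distinct caption pairs
        sims.append(pairwise[iu])
    sims = np.concatenate(sims)
    # p is the (100 - x)-th percentile: exactly x% of similarities lie above it
    return np.percentile(sims, 100.0 - x_percent)
```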
Hard-negative mining
● At the online stage (during training) we enforce the threshold value p
○ In every batch, for an anchor (vi,ci), every sample (vj,cj) (within the batch)
with caption similarity > p is not considered as a negative at all
○ Every other sample is labeled as negative
○ Out of this subset of samples, the negative one with the highest similarity to
the anchor is selected as the hard-negative sample
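The online selection step can be sketched as below, operating on a matrix of caption-caption similarities within a batch; the function name and matrix layout are assumptions for illustration:

```python
import numpy as np

def select_hard_negative(anchor_idx, caption_sims, p):
    """Online stage: pick the hard-negative for one anchor within a batch.

    caption_sims[i, j] is the similarity between captions c_i and c_j.
    Samples too similar to the anchor (> p) are excluded from the negatives
    entirely; among the remaining negatives, the most similar one is chosen.
    """
    sims = caption_sims[anchor_idx].astype(float).copy()
    sims[anchor_idx] = -np.inf   # the anchor itself is not a candidate
    sims[sims > p] = -np.inf     # potential positives: not negatives at all
    if np.all(np.isinf(sims)):
        return None              # no valid negative left in this batch
    return int(np.argmax(sims))  # hardest remaining negative
```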
Fusion strategies
● Following the proposed hard-negative mining strategy for different plausible
assumptions about x% (e.g. 1%, 2%), and thus different p values, the number of
available models can quickly be increased
● We study the combination of multiple trained models (late fusion)
● For a given query, every trained model produces a ranked list of the most
relevant videos
● Three different strategies for combining them are examined:
○ AVG
○ MAX
○ Hybrid
Fusion strategies
AVG
● We assume that every model is a well-performing one, and we treat them all equally
● We average the rankings for a given video
MAX
● We assume that not all our models have very good recall
● But, we assume that at least the samples they place at the very top of their
ranking lists are true positives
● Thus, if a video appears very high in the ranking list generated by at least one
model, we trust this video to be a good answer to the query.
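A minimal sketch of the AVG and MAX strategies over per-model rank lists; the rank-matrix layout (models × videos, lower rank = more relevant) is an assumption for illustration:

```python
import numpy as np

def avg_fusion(rank_lists):
    """rank_lists[q, v] = rank of video v by model q (lower = more relevant).
    AVG: trust all models equally and average the ranks."""
    scores = rank_lists.mean(axis=0)
    return np.argsort(scores)  # fused ordering, best video first

def max_fusion(rank_lists):
    """MAX: trust each model's very top picks; a video is scored by the
    best (lowest) rank any single model gave it."""
    scores = rank_lists.min(axis=0)
    return np.argsort(scores)
```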
Fusion strategies
Hybrid
● Neither of the previous two assumptions seems perfectly plausible
● In our Hybrid strategy, for a retrieved video, we select the Q’ ranking lists where
the video is ranked the highest among the Q in total ranking lists
● These top-Q’ rankings are averaged, to calculate the final ranking for this video
● All retrieved videos are re-ordered according to their final ranking
So, if at least Q’ models bring a video high in their ranking lists, we trust this to be a
good answer to the query. Special cases:
● If Q’=Q, the Hybrid approach is the same as the AVG
● If Q’=1, the Hybrid approach is the same as the MAX
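The Hybrid rule above can be sketched in a few lines; again the rank-matrix layout (Q models × N videos, lower rank = more relevant) is an illustrative assumption:

```python
import numpy as np

def hybrid_fusion(rank_lists, q_prime):
    """Hybrid: for each video, average only its q_prime best ranks among
    the Q models. q_prime == Q reduces to AVG; q_prime == 1 reduces to MAX."""
    best_first = np.sort(rank_lists, axis=0)       # per video, best ranks on top
    scores = best_first[:q_prime].mean(axis=0)     # average the top-Q' ranks
    return np.argsort(scores)                      # fused ordering, best first
```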
Experimental results
● Training datasets:
○ MSR-VTT, TGIF, ActivityNet Captions and Vatex
● Evaluation datasets:
○ V3C1 evaluated on TRECVID AVS 2019 and 2020 queries
● Evaluation metric:
○ Mean extended inferred average precision (MXinfAP)
● Keyframe representation:
○ ResNet-152 trained on ImageNet 11K
Experimental results
● Results (MXinfAP) for the combination of multiple models under different setups.
● Comparison between the baseline hard-negative mining strategy and the
proposed one with x=1% and x=2%
● The last row shows the results when all models from every mining strategy are
combined
Experimental results
● Results on the AVS19 and AVS20 datasets for the Hybrid fusion strategy and
different values of Q’
Conclusion
● New strategy for hard-negative mining to improve the performance of a
cross-modal video retrieval network
● We focused on excluding positive samples from being wrongfully utilized as
hard-negatives
● We proposed a hybrid strategy for model combination to take advantage of
the high number of trained models
● The new hard-negative mining strategy gives small improvements
● In combination with the Hybrid fusion strategy, the performance is further
boosted