"Cross-Year Multi-Modal Image Retrieval Using Siamese Networks" by Margarita Khokhlova, Research Scientist (Post-Doc) at LIRIS
Abstract: The Alegoria project aims to create content-based image retrieval (CBIR) tools to help end-users access the large volumes of recently digitized archive images of French territories. The difficulty is that many photographic materials are sparsely annotated, or not annotated at all, which makes it hard to link them to modern photographic images of the same territory. In this talk, I am going to present a new custom Siamese architecture for a cross-time multi-modal aerial image retrieval scenario and talk about single-shot and contrastive learning approaches.
Speaker biography: Margarita Khokhlova is a postdoc researcher at the IGN Saint-Mande affiliated with LIRIS Lyon. Her primary area of expertise is computer vision. She is currently working on deep learning-based methods for unsupervised multi-modal image description and retrieval. She obtained a Ph.D. degree from the University of Burgundy in 2018, where her dissertation was dedicated to automatic gait analysis using 3D active sensors. She also holds two separate master's degrees. The first is a joint degree in computer vision from the University of Lyon, France and NTNU Norway. The second is in business management administration from the University of Burgundy Dijon. Her research interests include computer vision, deep learning, and data analysis.
1. CROSS-YEAR MULTI-MODAL IMAGE RETRIEVAL USING SIAMESE NETWORKS
M. Khokhlova (1,2), V. Gouet-Brunet (1), N. Abadie (1), L. Chen (2), ICIP 2020.
(1) IGN and (2) LIRIS
WiMLDS Paris meetup
Presenter: Margarita Khokhlova, somewhere in the world, September 10, 2020.
2. The research problem
The Alegoria project aims to facilitate the promotion of institutional iconographic collections describing the French territory in various periods, from the interwar years to the present day.
Some examples from the project's vast archive resources.
Archive resources:
▪ Fonds Archives nationales (Fonds LAPIE (1955-1965), Cartothèque de la Reconstruction (1948-1976)), aerial views: https://www.siv.archives-nationales.culture.gouv.fr/siv/IR/FRAN_IR_050605
▪ Fonds Musée Nicéphore Niépce (Fonds de l'entreprise CIM (1949-1974), Fonds Bouquet (1914-1918), etc.), postcards, aerial views.
Figure: aerial and ground-view images taken at different angles.
3. Digitized archive data:
Aerial and ground-view images, with and without corresponding metadata.
Multi-modal cross-temporal data covering all of France:
BD TOPO and BD ORTHO from IGN (multiple versions starting from 2004) [1].
▪ Aerial photographs (100% vertical)
▪ Manually annotated industrial and natural objects in vector form
Many are openly available.
Project benchmarks and data
1. IGN geo data: https://geoservices.ign.fr/
4. FR-0419
A multi-modal cross-temporal vertical-image database.
Research question: can we retrieve the same geozone across time?
Annotations: semantic objects in the image, the department.
Ground truth sources: matching geozones from the aerial images and semantic maps from IGN (BD TOPO/ORTHO databases), used to create the multi-modal dataset FR-0419.
5. Research scope of this work
Goal:
To perform multi-modal retrieval of aerial images representing the same geographic location across time:
▪ How do modern cross-view and standard image descriptors handle this task?
▪ How can we use the semantic data to improve the search results?
▪ Which modality is more important for the across-time search?
Figure: 2004-2019 changes.
6. Multi-modal cross-time database
3 selected departments in the East of France:

#   department           high-res BD ORTHO images   2000x2000-pixel patches (1 sq. km)
1   Moselle              327                        6000
2   Bas-Rhin             248                        4430
3   Meurthe-et-Moselle   291                        5855

Figure: coverage of Moselle by BD ORTHO.
7. Visual and Semantic data
Figure: 2004 image and its semantic label; 2019 image and its semantic label.
10. Baselines
Classical:
Model: ResNet50 [2]
Backbone: ResNet50
Pooling: max pooling
Image resolution: 512x512
Pre-trained on: ImageNet
Final descriptor size (single modality): 2048

Cross-view image retrieval:
Model: GeM [3]
Backbone: ResNet101
Pooling: GeM layer (a sketch is shown below)
Image resolution: 1024x1024
Pre-trained on: Oxford5k, Paris6k, ROxford5k, RParis6k
Final descriptor size (single modality): 2048

2. He, Kaiming, et al. "Identity mappings in deep residual networks." European Conference on Computer Vision. Springer, Cham, 2016.
3. Radenović, Filip, Giorgos Tolias, and Ondřej Chum. "Fine-tuning CNN image retrieval with no human annotation." IEEE Transactions on Pattern Analysis and Machine Intelligence 41.7 (2018): 1655-1668.
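Since the GeM baseline [3] appears throughout the comparisons, here is a minimal PyTorch sketch of the generalized-mean pooling layer it introduces; this is an illustrative reimplementation, not the authors' code, and the default exponent p=3 is a common choice rather than a value stated in the talk.

```python
# Minimal sketch of GeM (generalized-mean) pooling, following [3].
# As p -> infinity GeM approaches max pooling; with p = 1 it is average pooling.
import torch
import torch.nn as nn

class GeM(nn.Module):
    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.ones(1) * p)  # learnable pooling exponent
        self.eps = eps

    def forward(self, x):
        # x: (B, C, H, W) feature map -> (B, C) global descriptor
        x = x.clamp(min=self.eps).pow(self.p)
        return x.mean(dim=(-2, -1)).pow(1.0 / self.p)
```

In retrieval pipelines this layer typically replaces the backbone's final average pooling, so the resulting 2048-dim descriptor matches the sizes listed above.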
11. Multi-modal information usage
Data fusion of:
▪ Images
▪ Semantic labels
Fusion strategies (see the sketch below):
▪ Early fusion (concatenation): D2048 + D2048 -> D4096 -> KNN
▪ Late fusion: D2048 -> KNN and D2048 -> KNN -> similarity-based ranking
▪ Fusion by a convolutional layer: fused D2048 -> KNN
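To make the fusion variants concrete, here is a hedged NumPy/scikit-learn sketch of the early- and late-fusion retrieval paths; the array names and database size are hypothetical placeholders, and the convolutional-layer fusion is omitted because it requires training.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_distances

# Hypothetical per-modality descriptors for a database of 1000 patches.
img_desc = np.random.rand(1000, 2048)   # image descriptors (D2048)
sem_desc = np.random.rand(1000, 2048)   # semantic-mask descriptors (D2048)

# Early fusion: concatenate modalities into D4096, then a single KNN search.
early = np.concatenate([img_desc, sem_desc], axis=1)
knn_early = NearestNeighbors(n_neighbors=5, metric="cosine").fit(early)

def late_fusion_query(q_img, q_sem, k=5):
    # Late fusion: sum the per-modality cosine distances over the whole
    # database, then re-rank (a simple similarity-based ranking stand-in).
    # q_img, q_sem: 2D arrays of shape (n_queries, 2048).
    dist = cosine_distances(q_img, img_desc) + cosine_distances(q_sem, sem_desc)
    return np.argsort(dist, axis=1)[:, :k]
```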
12. Baseline results
Trends:
▪ Descriptors computed on semantic masks give better results than descriptors computed on natural images.
▪ Multi-modal data gives better accuracy than either single modality.
▪ Late fusion gives the best results for both baselines.
▪ The pre-trained off-the-shelf ResNet50 outperforms the more sophisticated GeM on our (low-resolution) data.
Table: mAP@5 of the off-the-shelf GeM descriptor.
Table: mAP@5 of the off-the-shelf ResNet50 descriptor.
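For reference, a hedged sketch of how the mAP@5 metric used throughout these tables could be computed; the function name and data layout are illustrative assumptions, not the project's evaluation code.

```python
# mAP@5: mean over queries of the average precision within the top-5 results.
import numpy as np

def map_at_5(retrieved, relevant):
    """retrieved: per-query lists of top-5 result ids; relevant: per-query sets."""
    aps = []
    for top5, rel in zip(retrieved, relevant):
        hits, precisions = 0, []
        for rank, item in enumerate(top5[:5], start=1):
            if item in rel:
                hits += 1
                precisions.append(hits / rank)
        aps.append(np.mean(precisions) if precisions else 0.0)
    return float(np.mean(aps))
```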
13. Method to improve the baseline and fine-tune on our data
Related research problems:
Single-shot learning: the case when very few training examples (or just a single one) are available to train the model:
▪ Face recognition
▪ Object recognition [4,5]
▪ Omniglot symbol recognition dataset [6]
Common solutions:
Siamese and triplet networks with:
▪ Cross-entropy loss
▪ Contrastive loss
▪ Triplet loss
4. Vinyals, Oriol, et al. "Matching networks for one shot learning." Advances in Neural Information Processing Systems. 2016.
5. Qiao, Siyuan, et al. "Few-shot image recognition by predicting parameters from activations." CoRR, abs/1706.03466 (2017).
6. Lake, Brenden, et al. "One shot learning of simple visual concepts." Proceedings of the Annual Meeting of the Cognitive Science Society. Vol. 33. No. 33. 2011.
14. Siamese architecture for multi-modal data
Definitions:
X1, S1; X2, S2 - input aerial images and their corresponding semantic maps: matching pairs and non-matching pairs.
Y - ground truth correspondence label.
DR - resulting descriptor.
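A minimal PyTorch sketch to make these definitions concrete: a shared branch embeds each (X, S) pair into the descriptor DR, and a small head predicts the correspondence probability trained against Y. The concatenation fusion and layer sizes here are illustrative assumptions, not the exact architecture of the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

class SiameseBranch(nn.Module):
    def __init__(self, desc_size=128):
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V1")
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.fc = nn.Linear(2 * 2048, desc_size)   # DR: resulting descriptor

    def forward(self, x, s):
        # Encode the aerial image X and its 3-channel semantic map S with the
        # shared backbone, then fuse by concatenation (an assumed fusion scheme).
        fx = torch.flatten(self.features(x), 1)
        fs = torch.flatten(self.features(s), 1)
        return nn.functional.normalize(self.fc(torch.cat([fx, fs], 1)), dim=1)

class SiameseMatcher(nn.Module):
    def __init__(self, desc_size=128):
        super().__init__()
        self.branch = SiameseBranch(desc_size)      # weights shared across pairs
        self.out = nn.Linear(desc_size, 1)

    def forward(self, x1, s1, x2, s2):
        d1, d2 = self.branch(x1, s1), self.branch(x2, s2)
        # Matching probability, trained with BCE against the label Y.
        return torch.sigmoid(self.out(torch.abs(d1 - d2))).squeeze(1)
```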
15. Loss function & training
A simple binary cross-entropy loss:
$\mathcal{L}(Y, p) = -\bigl(Y \log p + (1 - Y)\log(1 - p)\bigr)$   (3)
where $p$ is the predicted matching probability for a pair and $Y$ is the ground truth correspondence label.
Training data:
Training set: Moselle
Validation: Bas-Rhin
Test: Meurthe-et-Moselle
Cross-validation: swapping the departments.
Hard mining: every 5 epochs, based on the wrong matches returned by the KNN algorithm (see the sketch below).
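A hedged sketch of the hard-mining step just described: every 5 epochs, a KNN search runs over the current descriptors, and queries whose nearest neighbor is a wrong match are kept as hard examples. The function and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mine_hard_examples(descriptors, zone_ids):
    # n_neighbors=2 because each query's own descriptor comes back first.
    knn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(descriptors)
    _, idx = knn.kneighbors(descriptors)
    wrong = zone_ids[idx[:, 1]] != zone_ids   # nearest neighbor is a wrong geozone
    return np.flatnonzero(wrong)              # indices to resample as hard pairs
```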
16. Experiments
Hyperparameters:
Descriptor size: 128, 256, 512
Training: 100 epochs
Optimizer: Adam, lr = 8e-04 with decay
Batch size: 12 (6 random pairs and 6 hard pairs)
Distance: L1 and L2
KNN distance metric: cosine and Euclidean
Comparison with another method:
SimCLR* unsupervised descriptor learning [7]: NT-Xent loss and contrastive learning.
7. Chen, Ting, et al. "A simple framework for contrastive learning of visual representations." arXiv preprint arXiv:2002.05709 (2020).
17. Results
Fine-tuning the Siamese network gives a ~20% improvement over the best baseline:

department     best baseline   partition    mAP@5   cross-validation   mAP@5
Moselle        0.76            training     0.96    validation         0.96
Bas-Rhin       0.75            validation   0.93    testing            0.91
M-et-Moselle   0.84            testing      0.97    training           0.97

Table: mAP@5.
Architecture diagram: ResNet50 backbone, followed by 3 convolutional layers and an FC layer producing the descriptor.
18. Comparison with the NT-Xent Loss

method                     R      training: Moselle   validation: Bas-Rhin   testing: M-et-Moselle
Contrastive NT-Xent loss   128    0.77                0.62                   0.52
Contrastive NT-Xent loss   2048   0.83                0.73                   0.61
Ours                       128    0.96                0.93                   0.97
Ours                       2048   0.61                0.71                   0.58

Table: mAP@5 per partition.
Parameters:
Total epochs: 100
Batch size: 26 pairs
Image resolution: 256
NT-Xent τ (temperature): 100
Augmentation: color jitter and random rotations
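For comparison, a compact PyTorch sketch of the NT-Xent (normalized temperature-scaled cross-entropy) loss used by SimCLR [7]; tau mirrors the temperature parameter listed above. This is an illustrative reimplementation, not SimCLR's code.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=100.0):
    # z1, z2: (N, d) embeddings of two views of the same N samples.
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=1)       # 2N L2-normalized rows
    sim = z @ z.t() / tau                             # scaled cosine similarities
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))        # exclude self-similarity
    # The positive for row i is its other view: i+n for i < n, i-n otherwise.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```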
19. Ablation & error analysis
Final descriptor size:
Erroneous matches returned by the KNN algorithm. Most errors concern forested zones
R map@5 training:
Moselle
map@5 validation
Bas-Rhin
map@5 testing:
M et Moselle
128 0.96 0.93 0.97
256 0.96 0.92 0.97
512* 0.86 0.70 0.80
Table: map@5 with different size of a descriptor, in the brackets the latest results, *not stable training
19
20. Ablation & parameters
Color:
Grayscale semantic data vs RGB
Normalization:
Batchnormalisation in all the added layers vs none
Activation:
Tanh activation in all the added layers
Losses:
Focal loss vs BCE loss.
Image size:
256 & 512, however, the second one allowed only very small batches and leaded to
an unstable training. 20
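A hedged sketch of the binary focal loss compared against BCE in this ablation; gamma down-weights easy examples, and the alpha balancing term is omitted for brevity. This is the standard formulation, not necessarily the exact variant used in the experiments.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(pred, target, gamma=2.0):
    # pred: predicted matching probabilities in (0, 1); target: labels Y.
    bce = F.binary_cross_entropy(pred, target, reduction="none")
    p_t = pred * target + (1 - pred) * (1 - target)   # prob. of the true class
    return ((1.0 - p_t) ** gamma * bce).mean()        # down-weight easy pairs
```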
21. Conclusion
▪ A novel approach for learning from multi-modal data, using fusion to fine-tune any CNN-based image descriptor, so any backbone can be used.
▪ The resulting descriptor is powerful enough to distinguish between semantically close images and is robust against landscape changes over time:
just 128 values in a single descriptor;
mAP@5 averaged over the test and validation sets is 0.94.
▪ A new multi-modal dataset extracted from the BD TOPO/ORTHO IGN data: a unique and rich source of information for geo-exploration.
22. Questions & Answers
Thank you for your attention!
To find out more, please check out the publication:
1. Margarita Khokhlova, Valérie Gouet-Brunet, Nathalie Abadie and Liming Chen. "Recherche multimodale d'images aériennes multi-date à l'aide d'un réseau siamois" (multi-modal retrieval of multi-date aerial images using a Siamese network), RFIAP 2020.
2. https://github.com/margokhokhlova/siamese_net
Or contact me: margarita.khokhlova@ign.fr