"Cross-Year Multi-Modal Image Retrieval Using Siamese Networks" by Margarita Khokhlova, Research Scientist (Post-Doc) at LIRIS
Abstract: The Alegoria project aims to create content-based image retrieval (CBIR) tools to help end-users access the large volumes of recently digitized archive images of French territories. The difficulty is that many photographic materials are sparsely annotated, or not annotated at all, which makes it hard to link them to modern photographic images of the same territory. In this talk, I am going to present a new custom Siamese architecture for a cross-time multi-modal aerial image retrieval scenario and talk about single-shot and contrastive learning approaches.
Speaker biography: Margarita Khokhlova is a postdoc researcher at the IGN Saint-Mande affiliated with LIRIS Lyon. Her primary area of expertise is computer vision. She is currently working on deep learning-based methods for unsupervised multi-modal image description and retrieval. She obtained a Ph.D. degree from the University of Burgundy in 2018, where her dissertation was dedicated to automatic gait analysis using 3D active sensors. She also holds two separate master's degrees. The first is a joint degree in computer vision from the University of Lyon, France and NTNU Norway. The second is in business management administration from the University of Burgundy Dijon. Her research interests include computer vision, deep learning, and data analysis.
1. CROSS-YEAR MULTI-MODAL IMAGE RETRIEVAL USING SIAMESE NETWORKS
M. Khokhlova (1,2), V. Gouet-Brunet (1), N. Abadie (1), L. Chen (2), ICIP 2020.
(1) IGN and (2) LIRIS
WiMLDS Paris meetup
Presenter: Margarita Khokhlova, somewhere in the world, September 10, 2020.
2. The research problem
The Alegoria project aims to facilitate the promotion of institutional iconographic collections describing the French territory in various periods, from the interwar years to the present day.
Some examples from the project's vast archive resources.
Archive resources:
▪ Fonds Archives nationales (Fonds LAPIE (1955-1965), Cartothèque de la Reconstruction (1948-1976)), aerial views: https://www.siv.archives-nationales.culture.gouv.fr/siv/IR/FRAN_IR_050605
▪ Fonds Musée Nicéphore Niépce (Fonds de l'entreprise CIM (1949-1974), Fonds Bouquet (1914-1918), etc.), postcards, aerial views.
Figure: aerial and ground-view images taken at different angles.
3. Digitized archive data:
Aerial and ground-view images, with and without corresponding metadata.
Multi-modal cross-temporal data covering all of France:
BD TOPO and BD ORTHO from IGN (multiple versions starting from 2004) [1].
▪ Aerial photographs (100% vertical)
▪ Manually annotated industrial and natural objects in vector form
Many are openly available.
Project benchmarks and data
1. IGN geo data: https://geoservices.ign.fr/
4. FR-0419
A multi-modal cross-temporal vertical-image database.
Research question: can we retrieve the same geozone across time?
Annotations: semantic objects in the image, the department.
Ground truth sources: matching geozones from the aerial images and semantic maps from IGN (BD TOPO/ORTHO databases), used to create the multi-modal dataset FR-0419.
5. Research scope of this work
Goal:
To perform multi-modal retrieval of aerial images representing the same geographic location across time:
▪ How do modern cross-view and standard image descriptors handle this task?
▪ How can we use the semantic data to improve the search results?
▪ Which modality is more important for the across-time search?
Figure: 2004-2019 changes.
6. Multi-modal cross-time database
3 selected departments in the East of France:

#   department           high-res BD ORTHO images   2000x2000-pixel patches (1 sq. km)
1   Moselle              327                        6000
2   Bas-Rhin             248                        4430
3   Meurthe-et-Moselle   291                        5855

Figure: coverage of Moselle by BD ORTHO.
7. Visual and Semantic data
Figure: 2004 image and its semantic label; 2019 image and its semantic label.
10. Baselines
Classical:
Model: ResNet50 [2]
Backbone: ResNet50
Pooling: max pooling
Image resolution: 512x512
Pre-trained on: ImageNet
Final descriptor size (single modality): 2048

Cross-view image retrieval:
Model: GeM [3]
Backbone: ResNet101
Pooling: GeM layer (a sketch is shown below)
Image resolution: 1024x1024
Pre-trained on: Oxford5k, Paris6k, ROxford5k, RParis6k
Final descriptor size (single modality): 2048

2. He, Kaiming, et al. "Identity mappings in deep residual networks." European Conference on Computer Vision. Springer, Cham, 2016.
3. Radenović, Filip, Giorgos Tolias, and Ondřej Chum. "Fine-tuning CNN image retrieval with no human annotation." IEEE Transactions on Pattern Analysis and Machine Intelligence 41.7 (2018): 1655-1668.
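Since the GeM baseline [3] appears throughout the comparisons, here is a minimal PyTorch sketch of the generalized-mean pooling layer it introduces; this is an illustrative reimplementation, not the authors' code, and the default exponent p=3 is a common choice rather than a value stated in the talk.

```python
# Minimal sketch of GeM (generalized-mean) pooling, following [3].
# As p -> infinity GeM approaches max pooling; with p = 1 it is average pooling.
import torch
import torch.nn as nn

class GeM(nn.Module):
    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.ones(1) * p)  # learnable pooling exponent
        self.eps = eps

    def forward(self, x):
        # x: (B, C, H, W) feature map -> (B, C) global descriptor
        x = x.clamp(min=self.eps).pow(self.p)
        return x.mean(dim=(-2, -1)).pow(1.0 / self.p)
```

In retrieval pipelines this layer typically replaces the backbone's final average pooling, so the resulting 2048-dim descriptor matches the sizes listed above.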
11. Multi-modal information usage
Data fusion of:
▪ Images
▪ Semantic labels
Fusion strategies (see the sketch below):
▪ Early fusion (concatenation): D2048 + D2048 -> D4096 -> KNN
▪ Late fusion: D2048 -> KNN and D2048 -> KNN -> similarity-based ranking
▪ Fusion by a convolutional layer: fused D2048 -> KNN
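To make the fusion variants concrete, here is a hedged NumPy/scikit-learn sketch of the early- and late-fusion retrieval paths; the array names and database size are hypothetical placeholders, and the convolutional-layer fusion is omitted because it requires training.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_distances

# Hypothetical per-modality descriptors for a database of 1000 patches.
img_desc = np.random.rand(1000, 2048)   # image descriptors (D2048)
sem_desc = np.random.rand(1000, 2048)   # semantic-mask descriptors (D2048)

# Early fusion: concatenate modalities into D4096, then a single KNN search.
early = np.concatenate([img_desc, sem_desc], axis=1)
knn_early = NearestNeighbors(n_neighbors=5, metric="cosine").fit(early)

def late_fusion_query(q_img, q_sem, k=5):
    # Late fusion: sum the per-modality cosine distances over the whole
    # database, then re-rank (a simple similarity-based ranking stand-in).
    # q_img, q_sem: 2D arrays of shape (n_queries, 2048).
    dist = cosine_distances(q_img, img_desc) + cosine_distances(q_sem, sem_desc)
    return np.argsort(dist, axis=1)[:, :k]
```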
12. Baseline results
Trends:
▪ Descriptors computed on semantic masks give better results than descriptors computed on natural images.
▪ Multi-modal data gives better accuracy than either single modality.
▪ Late fusion gives the best results for both baselines.
▪ The pre-trained off-the-shelf ResNet50 outperforms the more sophisticated GeM on our (low-resolution) data.
Table: mAP@5 of the off-the-shelf GeM descriptor.
Table: mAP@5 of the off-the-shelf ResNet50 descriptor.
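For reference, a hedged sketch of how the mAP@5 metric used throughout these tables could be computed; the function name and data layout are illustrative assumptions, not the project's evaluation code.

```python
# mAP@5: mean over queries of the average precision within the top-5 results.
import numpy as np

def map_at_5(retrieved, relevant):
    """retrieved: per-query lists of top-5 result ids; relevant: per-query sets."""
    aps = []
    for top5, rel in zip(retrieved, relevant):
        hits, precisions = 0, []
        for rank, item in enumerate(top5[:5], start=1):
            if item in rel:
                hits += 1
                precisions.append(hits / rank)
        aps.append(np.mean(precisions) if precisions else 0.0)
    return float(np.mean(aps))
```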
13. Method to improve the baseline and fine-tune on our data
Related research problems:
Single-shot learning: the case when very few training examples (or just a single one) are available to train the model:
▪ Face recognition
▪ Object recognition [4,5]
▪ Omniglot symbol recognition dataset [6]
Common solutions:
Siamese and triplet networks with:
▪ Cross-entropy loss
▪ Contrastive loss
▪ Triplet loss
4. Vinyals, Oriol, et al. "Matching networks for one shot learning." Advances in Neural Information Processing Systems. 2016.
5. Qiao, Siyuan, et al. "Few-shot image recognition by predicting parameters from activations." CoRR, abs/1706.03466 (2017).
6. Lake, Brenden, et al. "One shot learning of simple visual concepts." Proceedings of the Annual Meeting of the Cognitive Science Society. Vol. 33. No. 33. 2011.
14. Siamese architecture for multi-modal data
Definitions:
X1, S1; X2, S2 - input aerial images and their corresponding semantic maps: matching pairs and non-matching pairs.
Y - ground truth correspondence label.
DR - resulting descriptor.
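A minimal PyTorch sketch to make these definitions concrete: a shared branch embeds each (X, S) pair into the descriptor DR, and a small head predicts the correspondence probability trained against Y. The concatenation fusion and layer sizes here are illustrative assumptions, not the exact architecture of the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

class SiameseBranch(nn.Module):
    def __init__(self, desc_size=128):
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V1")
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.fc = nn.Linear(2 * 2048, desc_size)   # DR: resulting descriptor

    def forward(self, x, s):
        # Encode the aerial image X and its 3-channel semantic map S with the
        # shared backbone, then fuse by concatenation (an assumed fusion scheme).
        fx = torch.flatten(self.features(x), 1)
        fs = torch.flatten(self.features(s), 1)
        return nn.functional.normalize(self.fc(torch.cat([fx, fs], 1)), dim=1)

class SiameseMatcher(nn.Module):
    def __init__(self, desc_size=128):
        super().__init__()
        self.branch = SiameseBranch(desc_size)      # weights shared across pairs
        self.out = nn.Linear(desc_size, 1)

    def forward(self, x1, s1, x2, s2):
        d1, d2 = self.branch(x1, s1), self.branch(x2, s2)
        # Matching probability, trained with BCE against the label Y.
        return torch.sigmoid(self.out(torch.abs(d1 - d2))).squeeze(1)
```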
15. Loss function & training
A simple binary cross-entropy loss:
$\mathcal{L}(Y, p) = -\bigl(Y \log p + (1 - Y)\log(1 - p)\bigr)$   (3)
where $p$ is the predicted matching probability for a pair and $Y$ is the ground truth correspondence label.
Training data:
Training set: Moselle
Validation: Bas-Rhin
Test: Meurthe-et-Moselle
Cross-validation: swapping the departments.
Hard mining: every 5 epochs, based on the wrong matches returned by the KNN algorithm (see the sketch below).
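A hedged sketch of the hard-mining step just described: every 5 epochs, a KNN search runs over the current descriptors, and queries whose nearest neighbor is a wrong match are kept as hard examples. The function and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mine_hard_examples(descriptors, zone_ids):
    # n_neighbors=2 because each query's own descriptor comes back first.
    knn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(descriptors)
    _, idx = knn.kneighbors(descriptors)
    wrong = zone_ids[idx[:, 1]] != zone_ids   # nearest neighbor is a wrong geozone
    return np.flatnonzero(wrong)              # indices to resample as hard pairs
```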
16. Experiments
Hyperparameters:
Descriptor size: 128, 256, 512
Training: 100 epochs
Optimizer: Adam, lr = 8e-04 with decay
Batch size: 12 (6 random pairs and 6 hard pairs)
Distance: L1 and L2
KNN distance metric: cosine and Euclidean
Comparison with another method:
SimCLR* unsupervised descriptor learning [7]: NT-Xent loss and contrastive learning.
7. Chen, Ting, et al. "A simple framework for contrastive learning of visual representations." arXiv preprint arXiv:2002.05709 (2020).
17. Results
Fine-tuning the Siamese network gives a ~20% improvement over the best baseline:

department     best baseline   partition    mAP@5   cross-validation   mAP@5
Moselle        0.76            training     0.96    validation         0.96
Bas-Rhin       0.75            validation   0.93    testing            0.91
M-et-Moselle   0.84            testing      0.97    training           0.97

Table: mAP@5.
Architecture diagram: ResNet50 backbone, followed by 3 convolutional layers and an FC layer producing the descriptor.
18. Comparison with the NT-Xent Loss

method                     R      training: Moselle   validation: Bas-Rhin   testing: M-et-Moselle
Contrastive NT-Xent loss   128    0.77                0.62                   0.52
Contrastive NT-Xent loss   2048   0.83                0.73                   0.61
Ours                       128    0.96                0.93                   0.97
Ours                       2048   0.61                0.71                   0.58

Table: mAP@5 per partition.
Parameters:
Total epochs: 100
Batch size: 26 pairs
Image resolution: 256
NT-Xent τ (temperature): 100
Augmentation: color jitter and random rotations
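For comparison, a compact PyTorch sketch of the NT-Xent (normalized temperature-scaled cross-entropy) loss used by SimCLR [7]; tau mirrors the temperature parameter listed above. This is an illustrative reimplementation, not SimCLR's code.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=100.0):
    # z1, z2: (N, d) embeddings of two views of the same N samples.
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=1)       # 2N L2-normalized rows
    sim = z @ z.t() / tau                             # scaled cosine similarities
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))        # exclude self-similarity
    # The positive for row i is its other view: i+n for i < n, i-n otherwise.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```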
19. Ablation & error analysis
Final descriptor size:
Erroneous matches returned by the KNN algorithm. Most errors concern forested zones
R map@5 training:
Moselle
map@5 validation
Bas-Rhin
map@5 testing:
M et Moselle
128 0.96 0.93 0.97
256 0.96 0.92 0.97
512* 0.86 0.70 0.80
Table: map@5 with different size of a descriptor, in the brackets the latest results, *not stable training
19
20. Ablation & parameters
Color:
Grayscale semantic data vs RGB
Normalization:
Batchnormalisation in all the added layers vs none
Activation:
Tanh activation in all the added layers
Losses:
Focal loss vs BCE loss.
Image size:
256 & 512, however, the second one allowed only very small batches and leaded to
an unstable training. 20
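A hedged sketch of the binary focal loss compared against BCE in this ablation; gamma down-weights easy examples, and the alpha balancing term is omitted for brevity. This is the standard formulation, not necessarily the exact variant used in the experiments.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(pred, target, gamma=2.0):
    # pred: predicted matching probabilities in (0, 1); target: labels Y.
    bce = F.binary_cross_entropy(pred, target, reduction="none")
    p_t = pred * target + (1 - pred) * (1 - target)   # prob. of the true class
    return ((1.0 - p_t) ** gamma * bce).mean()        # down-weight easy pairs
```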
21. Conclusion
▪ A novel approach for learning from multi-modal data, using fusion to fine-tune any CNN-based image descriptor, so any backbone can be used.
▪ The resulting descriptor is powerful enough to distinguish between semantically close images and is robust against landscape changes over time:
just 128 values in a single descriptor;
mAP@5 averaged over the test and validation sets is 0.94.
▪ A new multi-modal dataset extracted from the BD TOPO/ORTHO IGN data: a unique and rich source of information for geo-exploration.
22. Questions & Answers
Thank you for your attention!
To find out more, please check out the publication:
1. Margarita Khokhlova, Valérie Gouet-Brunet, Nathalie Abadie and Liming Chen. "Recherche multimodale d'images aériennes multi-date à l'aide d'un réseau siamois" (multi-modal retrieval of multi-date aerial images using a Siamese network), RFIAP 2020.
2. https://github.com/margokhokhlova/siamese_net
Or contact me: margarita.khokhlova@ign.fr