Humans are extremely good at perceiving and comparing visual patterns. This allows us to recognize previously seen or visited places and, in consequence, to localize ourselves. Technology-aided localization has reached robust and highly accurate performance levels outdoors, but is still very limited and cumbersome inside buildings. At the same time, more and more large buildings, in which localization and navigation are difficult and confusing, are being digitized. The availability of these huge datasets with many images from indoor scenes, whose capture positions are known, lays the groundwork for visual localization, similar to how humans localize.
We propose to use convolutional neural networks (CNNs) as global image descriptors. They map an input image to a lower-dimensional manifold, in which the Euclidean distance between two feature vectors reflects the spatial proximity and visual similarity of their associated images. To evaluate the suitability of various approaches, we propose generating a high-quality ground truth from the geometric correspondences of the dataset images that are embedded in a 3D model of the building. Based on this ground truth, we compare a hand-crafted content-based image retrieval (CBIR) pipeline with off-the-shelf and specialized CNNs. Additionally, we train an off-the-shelf CNN with a multi-dimensional contrastive loss in the fashion of deep metric learning (DML). Finally, we propose a novel way to train CNNs with DML using a continuous ground truth. We show that using CNNs as global image descriptors outperforms classic CBIR pipelines and still has untapped potential.
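To make the descriptor-learning idea concrete, the sketch below shows the standard pairwise contrastive loss (similar pairs are pulled together, dissimilar pairs pushed apart up to a margin). This is a minimal NumPy illustration of the classic binary-label formulation, not the paper's multi-dimensional or continuous-ground-truth variant; the function name, margin value, and label convention are assumptions for illustration.

```python
import numpy as np

def contrastive_loss(f1, f2, y, margin=1.0):
    """Classic contrastive loss on a pair of descriptor vectors.

    f1, f2 : feature vectors produced by the CNN for two images
    y      : 1.0 if the images depict nearby/similar places, 0.0 otherwise
    margin : dissimilar pairs closer than this distance are penalized
    """
    d = np.linalg.norm(f1 - f2)  # Euclidean distance in descriptor space
    # Similar pairs: penalize any distance; dissimilar pairs: penalize
    # only if they fall inside the margin.
    return y * d**2 + (1.0 - y) * max(0.0, margin - d)**2

# A similar pair that is far apart incurs a loss; a dissimilar pair
# beyond the margin incurs none.
a = np.zeros(4)
b = np.array([2.0, 0.0, 0.0, 0.0])
loss_similar = contrastive_loss(a, b, y=1.0)    # 2.0**2 = 4.0
loss_dissimilar = contrastive_loss(a, b, y=0.0) # d > margin, so 0.0
```

A continuous ground truth, as proposed in the paper, would replace the binary label `y` with a graded similarity derived from the geometric correspondences, but that formulation is not reproduced here.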