
Geolocation in Cartographic Maps Using Street-Level Images

  • Mengjie Zhou

Student thesis: Doctoral Thesis, Doctor of Philosophy (PhD)

Abstract

This thesis studies the geolocation problem, aiming to localise street-level images in cartographic maps without requiring GPS priors. Although GPS signals are cheap and readily available, their accuracy can degrade in challenging environments such as urban canyons, motivating us to rely on visual information for more accurate geolocation. Humans often observe their surroundings and read cartographic maps to locate themselves and navigate in unfamiliar places. Our task mimics this wayfinding ability and offers high scalability and robustness by exploiting the compact, semantic representations that maps provide. Specifically, we determine the geographic location where a street-level image was taken by querying it against a large-scale reference database of geo-tagged maps. This involves using a neural network to learn an embedded space in which co-located images and maps have higher feature similarity than those from different locations, and then localising with descriptors generated in this embedded space.
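The retrieval step described above can be illustrated with a minimal sketch. It assumes that image and map descriptors have already been produced by some network, and it uses cosine similarity as the matching score; the similarity measure, the function names, and the toy descriptors are all illustrative assumptions rather than details taken from the thesis.

```python
import math


def l2_normalise(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine_similarity(a, b):
    """Cosine similarity between two descriptors of equal length."""
    a, b = l2_normalise(a), l2_normalise(b)
    return sum(x * y for x, y in zip(a, b))


def rank_locations(query, reference):
    """Return reference location ids, most similar to the query first."""
    scores = {loc: cosine_similarity(query, desc) for loc, desc in reference.items()}
    return sorted(scores, key=scores.get, reverse=True)


def top_k_recall(queries, reference, k=1):
    """Fraction of queries whose true location is among the k best matches."""
    hits = sum(
        1 for desc, true_loc in queries
        if true_loc in rank_locations(desc, reference)[:k]
    )
    return hits / len(queries)


# Hypothetical map descriptors for three geo-tagged locations.
reference = {"A": [1.0, 0.0, 0.0], "B": [0.0, 1.0, 0.0], "C": [0.0, 0.0, 1.0]}
# Hypothetical image descriptors paired with their true locations.
queries = [([0.9, 0.1, 0.0], "A"), ([0.1, 0.8, 0.2], "B"), ([0.6, 0.0, 0.5], "C")]

print(top_k_recall(queries, reference, k=1))
print(top_k_recall(queries, reference, k=2))
```

In this toy data the third query is closer to location A than to its true location C, so it counts as a miss at Top-1 but as a hit at Top-2. This is the sense in which the Top-1 recall rates quoted in the abstract are measured.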
Cartographic maps can be processed into various forms, including 2D tiles and 2.5D models. The 2D map tiles are rendered images that depict small local regions, providing dense contextual information for geolocation. We build a Siamese-like network architecture with a metric learning loss to learn an embedded space where descriptors of images and map tiles from the same location exhibit high similarity. When used in single-image-based geolocation, the network produces discriminative descriptors for images and maps, achieving Top-1 recall rates between 42% and 56%. Furthermore, we apply a polar transform to the map tiles and incorporate a spatial-aware feature aggregation module in the network. This improvement mitigates the domain gap between images and maps, resulting in performance gains of 15% to 20%. The 2.5D map model is an untextured 3D model constructed by augmenting a 2D map with height, offering explicit geometric information for geolocation. We develop a triplet-like architecture with a multi-modality fusion module to learn an image-map embedded space. By capturing complementary information, our method generates representative location embeddings from images and multi-modal maps, yielding Top-1 recall rates between 60% and 82%.
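As a rough illustration of what a polar transform of a map tile involves, the sketch below resamples a square tile so that output rows correspond to distance from the tile centre and output columns to compass direction. The abstract does not specify the sampling scheme, so the conventions here (angle zero pointing up, angles increasing clockwise, nearest-neighbour sampling, radius limited to the inscribed circle) and the function name `polar_transform` are assumptions made for the example.

```python
import math


def polar_transform(tile, out_height, out_width):
    """Resample a square tile (list of rows) onto a radius-by-angle grid.

    Row i of the output samples a circle of radius proportional to i around
    the tile centre; column j samples direction 2*pi*j/out_width, measured
    clockwise from "up". Nearest-neighbour sampling is used for simplicity.
    """
    size = len(tile)
    centre = (size - 1) / 2.0
    max_radius = centre  # stay inside the inscribed circle of the tile
    output = []
    for i in range(out_height):
        radius = max_radius * i / (out_height - 1) if out_height > 1 else 0.0
        row = []
        for j in range(out_width):
            theta = 2.0 * math.pi * j / out_width
            x = centre + radius * math.sin(theta)
            y = centre - radius * math.cos(theta)
            # Clamp to valid pixel indices after rounding.
            xi = min(max(int(round(x)), 0), size - 1)
            yi = min(max(int(round(y)), 0), size - 1)
            row.append(tile[yi][xi])
        output.append(row)
    return output


# Toy 5x5 "tile" whose pixel value encodes its own position (row * 5 + column).
tile = [[row * 5 + col for col in range(5)] for row in range(5)]
polar = polar_transform(tile, out_height=3, out_width=4)
for row in polar:
    print(row)
```

With four angular samples, the outermost row picks up the pixels directly above, right of, below, and left of the centre. This shows how the transform lays the surroundings of a location out by viewing direction, which is closer to how a street-level image sees them.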
A single descriptor is not sufficiently discriminative for reliable geolocation in large cities with many repetitive scenes. It is the pattern of descriptors along routes that enables greater discrimination. Within a retrieval-based localisation framework, we adopt brute-force searching on route candidates defined over discrete locations with a connectivity graph. This framework can integrate single-modal or multi-modal descriptors to achieve 2DoF (Degrees of Freedom) pose estimation, i.e., geographic coordinates. After moving around 50 metres, our multi-modal method yields a Top-1 recall rate of over 75%, which is more than 10% higher than the single-modal method. For 3DoF pose estimation, we introduce a sequential Monte Carlo localisation framework, which uses a particle filter with 2D location and yaw in the state and comparisons between embedded features from images and maps as the likelihood. This approach achieves estimations with RMSE values below 5 metres and 10 degrees. Moreover, we propose learning sequential descriptors with a transformer-based network end to end, resulting in Top-1 recall gains of 6% to 13%.
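The idea that a pattern of descriptors along a route is more discriminative than a single descriptor can be sketched as a brute-force search over routes in a connectivity graph. The scoring rule (sum of dot products along the route), the function names, and the toy graph below are assumptions for illustration only; the abstract does not state how route candidates are scored.

```python
def dot(a, b):
    """Dot product of two descriptors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def enumerate_routes(graph, length):
    """List every walk of `length` locations that follows edges of `graph`."""
    routes = [[node] for node in graph]
    for _ in range(length - 1):
        routes = [route + [nxt] for route in routes for nxt in graph[route[-1]]]
    return routes


def localise_route(image_descs, map_descs, graph):
    """Return the route whose map descriptors best match the image sequence."""
    best_route, best_score = None, float("-inf")
    for route in enumerate_routes(graph, len(image_descs)):
        score = sum(dot(q, map_descs[loc]) for q, loc in zip(image_descs, route))
        if score > best_score:
            best_route, best_score = route, score
    return best_route, best_score


# Hypothetical street with four locations; A and C look identical on the map.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
map_descs = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 0.0], "D": [0.6, 0.8]}

# A single image cannot separate A from C, but two consecutive images can.
image_descs = [[1.0, 0.0], [0.6, 0.8]]
route, score = localise_route(image_descs, map_descs, graph)
print(route, score)
```

Here the first image alone matches A and C equally well, while the second image resolves the ambiguity in favour of the route from C to D. Exhaustive enumeration grows quickly with route length and graph degree, so this sketch is only practical for short routes on small graphs.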
Date of Award: 8 Aug 2024
Original language: English
Awarding Institution
  • University of Bristol
Supervisor: Andrew Calway
