Unified and Real-Time Image Geo-Localization via Fine-Grained Overlap Estimation
Ze Song; Xudong Kang; Xiaohui Wei; Shutao Li; Haibo Liu
IEEE Transactions on Image Processing, vol. 33, pp. 5060-5072, published 2024-09-09. DOI: 10.1109/TIP.2024.3453008. Available: https://ieeexplore.ieee.org/document/10670055/
Abstract
Image geo-localization aims to locate a query image from a source platform (e.g., drones, street vehicles) by matching it with geo-tagged reference images from target platforms (e.g., different satellites). Achieving cross-modal or cross-view real-time (>30 fps) image localization with guaranteed accuracy in a unified framework remains a challenge due to the large differences in modality and viewpoint between the two platforms. To address this problem, a novel fine-grained overlap estimation based image geo-localization method is proposed in this paper, the core of which is to estimate the salient yet subtle overlapping regions in image pairs to ensure correct matching. Specifically, the high-level semantic features of the input images are extracted by a deep convolutional neural network. Then, a novel overlap scanning module (OSM) is presented to mine the long-range spatial and channel dependencies of the semantic features in various subspaces, thereby identifying fine-grained overlapping regions. Finally, we adopt the triplet ranking loss to guide the optimization of the proposed network so that matching regions are pulled as close as possible and the most mismatched regions are pushed as far apart as possible. To demonstrate the effectiveness of the proposed fine-grained overlap estimation network (FOENet), comprehensive experiments are conducted on three cross-view benchmarks and one cross-modal benchmark. FOENet yields better performance on various metrics, and the recall accuracy at top 1 (R@1) is significantly improved, with a maximum improvement of 70.6%. In addition, the proposed model runs fast on a single RTX 6000, reaching real-time inference speed on all datasets, with the fastest being 82.3 FPS.
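The abstract describes training with a triplet ranking loss that pulls each query toward its matching reference and pushes it away from the most mismatched candidates. The paper's exact formulation (and how it operates on the fine-grained overlapping regions) is not given here, so the snippet below is only a minimal sketch of a common hard-negative triplet ranking loss on global descriptors; the function name, the margin value, and the in-batch hardest-negative mining strategy are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def triplet_ranking_loss(query_feats, ref_feats, margin=0.2):
    """Sketch of a hard-negative triplet ranking loss (assumed formulation).

    query_feats: (B, D) descriptors of source-platform images (e.g., drone views).
    ref_feats:   (B, D) descriptors of geo-tagged references, aligned so that
                 ref_feats[i] is the positive match for query_feats[i].
    All other references in the batch serve as negatives; the hardest
    (closest) one is penalized, echoing the goal of keeping matching regions
    close and the most mismatched regions far apart.
    """
    q = F.normalize(query_feats, dim=1)
    r = F.normalize(ref_feats, dim=1)
    dist = torch.cdist(q, r)                      # (B, B) pairwise L2 distances
    pos = dist.diag()                             # distance to the true match
    # Exclude the positive pair before selecting the hardest in-batch negative.
    masked = dist + torch.eye(dist.size(0), device=dist.device) * 1e6
    hardest_neg = masked.min(dim=1).values
    return F.relu(pos - hardest_neg + margin).mean()
```

In practice such a loss is applied to the descriptors produced after the overlap estimation stage, with the margin and mining strategy tuned per dataset; the values above are placeholders.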