{"title":"CDM-Net: A Framework for Cross-View Geo-Localization With Multimodal Data","authors":"Xin Zhou;Xuerong Yang;Yanchun Zhang","doi":"10.1109/TGRS.2025.3594544","DOIUrl":null,"url":null,"abstract":"Cross-view geo-localization (CVGL) task aims to match images of the same object captured from various platforms, such as drones and satellites. The primary challenge of CVGL is that the flight altitude and shooting angle of drones can lead to changes in the visual appearance of the same target building. Most existing methods regard CVGL task as an image retrieval problem based on classification networks. Nevertheless, the multimodal information correspondence of drone-satellite views and the structural features of target buildings are not fully explored. In this article, we propose a novel CVGL framework based on multimodal information, named conditional diffusion matching network (CDM-Net). The framework consists of three stages: preprocessing, drone-view image synthesis, and structural feature matching. Specifically, the first stage employs perspective transformation (PT) to convert the drone’s oblique view into a vertical view and describes the remote sensing (RS) information of the drone images. In the second stage, we use a conditional diffusion (CD) model to synthesize a high-precision vertical view by combining the converted drone-view image with RS textual information as input. In the final stage, we utilize a structural feature matching network to extract robust point-line features of the target building, subsequently outputting the retrieval list. Experiments on two widely used public benchmarks, University-1652 and SUES-200, showed that the proposed method achieved competitive results. Additionally, our method also outperformed existing methods in cross-modal dataset generalization. 
The datasets of this study are available at <uri>https://github.com/cver6/CDM-Net</uri>","PeriodicalId":13213,"journal":{"name":"IEEE Transactions on Geoscience and Remote Sensing","volume":"63 ","pages":"1-16"},"PeriodicalIF":8.6000,"publicationDate":"2025-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Geoscience and Remote Sensing","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/11105551/","RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
The cross-view geo-localization (CVGL) task aims to match images of the same object captured from different platforms, such as drones and satellites. The primary challenge of CVGL is that a drone's flight altitude and shooting angle can change the visual appearance of the same target building. Most existing methods treat CVGL as an image retrieval problem based on classification networks. However, the multimodal correspondence between drone and satellite views and the structural features of target buildings are not fully exploited. In this article, we propose a novel CVGL framework based on multimodal information, named the conditional diffusion matching network (CDM-Net). The framework consists of three stages: preprocessing, drone-view image synthesis, and structural feature matching. Specifically, the first stage employs perspective transformation (PT) to convert the drone's oblique view into a vertical view and describes the remote sensing (RS) information of the drone images. In the second stage, we use a conditional diffusion (CD) model to synthesize a high-precision vertical view, taking the converted drone-view image together with RS textual information as input. In the final stage, we use a structural feature matching network to extract robust point-line features of the target building and output the retrieval list. Experiments on two widely used public benchmarks, University-1652 and SUES-200, show that the proposed method achieves competitive results. Our method also outperforms existing methods in cross-modal dataset generalization. The datasets of this study are available at https://github.com/cver6/CDM-Net
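To make the first stage concrete, the perspective transformation (PT) step amounts to mapping oblique-view pixel coordinates into the vertical (nadir) view through a 3x3 homography. The sketch below is illustrative only and is not the authors' implementation: the matrices `I` and `H` are hypothetical examples, and in practice the homography would be estimated from the drone's pose or from matched control points.

```python
# Minimal sketch of the perspective transformation (PT) idea: mapping a
# pixel coordinate through a 3x3 homography in homogeneous coordinates.
# The matrices below are illustrative placeholders, not values from the paper.

def apply_homography(H, x, y):
    """Map pixel (x, y) through homography H (3x3 nested list).

    Computes [xs, ys, w] = H @ [x, y, 1] and returns the dehomogenized
    point (xs / w, ys / w).
    """
    xs = H[0][0] * x + H[0][1] * y + H[0][2]
    ys = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return xs / w, ys / w

# The identity homography leaves points unchanged.
I = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(apply_homography(I, 10.0, 20.0))  # (10.0, 20.0)

# A projective term in the last row makes the scale depend on y,
# mimicking the foreshortening an oblique camera introduces.
H = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.001, 1.0]]
print(apply_homography(H, 100.0, 200.0))
```

Warping a whole image this way (rather than single points) is typically done with a library routine such as OpenCV's `cv2.warpPerspective`; the per-point form above just shows the underlying arithmetic.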
Journal Introduction:
IEEE Transactions on Geoscience and Remote Sensing (TGRS) is a monthly publication that focuses on the theory, concepts, and techniques of science and engineering as applied to sensing the land, oceans, atmosphere, and space; and the processing, interpretation, and dissemination of this information.