{"title":"CDM-Net: A Framework for Cross-View Geo-Localization With Multimodal Data","authors":"Xin Zhou;Xuerong Yang;Yanchun Zhang","doi":"10.1109/TGRS.2025.3594544","DOIUrl":null,"url":null,"abstract":"Cross-view geo-localization (CVGL) task aims to match images of the same object captured from various platforms, such as drones and satellites. The primary challenge of CVGL is that the flight altitude and shooting angle of drones can lead to changes in the visual appearance of the same target building. Most existing methods regard CVGL task as an image retrieval problem based on classification networks. Nevertheless, the multimodal information correspondence of drone-satellite views and the structural features of target buildings are not fully explored. In this article, we propose a novel CVGL framework based on multimodal information, named conditional diffusion matching network (CDM-Net). The framework consists of three stages: preprocessing, drone-view image synthesis, and structural feature matching. Specifically, the first stage employs perspective transformation (PT) to convert the drone’s oblique view into a vertical view and describes the remote sensing (RS) information of the drone images. In the second stage, we use a conditional diffusion (CD) model to synthesize a high-precision vertical view by combining the converted drone-view image with RS textual information as input. In the final stage, we utilize a structural feature matching network to extract robust point-line features of the target building, subsequently outputting the retrieval list. Experiments on two widely used public benchmarks, University-1652 and SUES-200, showed that the proposed method achieved competitive results. Additionally, our method also outperformed existing methods in cross-modal dataset generalization. 
The datasets of this study are available at <uri>https://github.com/cver6/CDM-Net</uri>","PeriodicalId":13213,"journal":{"name":"IEEE Transactions on Geoscience and Remote Sensing","volume":"63 ","pages":"1-16"},"PeriodicalIF":8.6000,"publicationDate":"2025-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Geoscience and Remote Sensing","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/11105551/","RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
The cross-view geo-localization (CVGL) task aims to match images of the same object captured from different platforms, such as drones and satellites. The primary challenge of CVGL is that a drone's flight altitude and shooting angle can change the visual appearance of the same target building. Most existing methods treat CVGL as an image retrieval problem based on classification networks. However, the multimodal correspondence between drone and satellite views and the structural features of target buildings are not fully exploited. In this article, we propose a novel CVGL framework based on multimodal information, named the conditional diffusion matching network (CDM-Net). The framework consists of three stages: preprocessing, drone-view image synthesis, and structural feature matching. Specifically, the first stage employs perspective transformation (PT) to convert the drone's oblique view into a vertical view and describes the remote sensing (RS) information of the drone images. In the second stage, we use a conditional diffusion (CD) model to synthesize a high-precision vertical view, taking the converted drone-view image together with RS textual information as input. In the final stage, we use a structural feature matching network to extract robust point-line features of the target building and output the retrieval list. Experiments on two widely used public benchmarks, University-1652 and SUES-200, show that the proposed method achieves competitive results. Our method also outperforms existing methods in cross-modal dataset generalization. The datasets of this study are available at https://github.com/cver6/CDM-Net
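To make the first stage concrete, the perspective transformation (PT) step amounts to mapping oblique-view pixel coordinates into the vertical (nadir) view through a 3x3 homography. The sketch below is illustrative only and is not the authors' implementation: the matrices `I` and `H` are hypothetical examples, and in practice the homography would be estimated from the drone's pose or from matched control points.

```python
# Minimal sketch of the perspective transformation (PT) idea: mapping a
# pixel coordinate through a 3x3 homography in homogeneous coordinates.
# The matrices below are illustrative placeholders, not values from the paper.

def apply_homography(H, x, y):
    """Map pixel (x, y) through homography H (3x3 nested list).

    Computes [xs, ys, w] = H @ [x, y, 1] and returns the dehomogenized
    point (xs / w, ys / w).
    """
    xs = H[0][0] * x + H[0][1] * y + H[0][2]
    ys = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return xs / w, ys / w

# The identity homography leaves points unchanged.
I = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(apply_homography(I, 10.0, 20.0))  # (10.0, 20.0)

# A projective term in the last row makes the scale depend on y,
# mimicking the foreshortening an oblique camera introduces.
H = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.001, 1.0]]
print(apply_homography(H, 100.0, 200.0))
```

Warping a whole image this way (rather than single points) is typically done with a library routine such as OpenCV's `cv2.warpPerspective`; the per-point form above just shows the underlying arithmetic.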
Journal Introduction:
IEEE Transactions on Geoscience and Remote Sensing (TGRS) is a monthly publication that focuses on the theory, concepts, and techniques of science and engineering as applied to sensing the land, oceans, atmosphere, and space; and the processing, interpretation, and dissemination of this information.