{"title":"A Transformer and Visual Foundation Model-Based Method for Cross-View Remote Sensing Image Retrieval","authors":"Changjiang Yin, Qin Ye, Junqi Luo","doi":"10.5194/isprs-archives-xlviii-1-2024-821-2024","DOIUrl":null,"url":null,"abstract":"Abstract. Retrieving UAV images that lack POS information with georeferenced satellite orthoimagery is challenging due to the differences in angles of views. Most existing methods rely on deep neural networks with a large number of parameters, leading to substantial time and financial investments in network training. Consequently, these methods may not be well-suited for downstream tasks that have high timeliness requirements. In this work, we propose a cross-view remote sensing image retrieval method based on transformer and visual foundation model. We investigated the potential of visual foundation model for extracting common features from cross-view images. Training is only conducted on a small, self-designed retrieval head, alleviating the burden of network training. Specifically, we designed a CVV module to optimize the features extracted from the visual foundation model, making these features more adept for cross-view image retrieval tasks. And we designed an MLP head to achieve similarity discrimination. The method is verified on a publicly available dataset containing multiple scenes. Our method shows excellent results in terms of both efficiency and accuracy on 15 sub-datasets (10 or 50 scene categories) derived from the public dataset, which holds practical value in engineering applications with streamlined scene categories and constrained computational resources. Furthermore, we initiated a comprehensive discussion and conducted ablation experiments on the network design to validate its efficacy. Additionally, we analyzed the presence of overfitting within the network and deliberated on the limitations of our study, proposing potential avenues for future enhancements.\n","PeriodicalId":505918,"journal":{"name":"The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences","volume":" 19","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5194/isprs-archives-xlviii-1-2024-821-2024","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Retrieving UAV images that lack POS (position and orientation system) information against georeferenced satellite orthoimagery is challenging because of the large differences in viewing angle. Most existing methods rely on deep neural networks with a large number of parameters, so network training demands substantial time and financial investment, which makes these methods ill-suited to downstream tasks with tight timeliness requirements. In this work, we propose a cross-view remote sensing image retrieval method based on a transformer and a visual foundation model. We investigate the potential of visual foundation models for extracting common features from cross-view images; training is confined to a small, self-designed retrieval head, which alleviates the burden of network training. Specifically, we design a CVV module that refines the features extracted by the visual foundation model so that they are better suited to cross-view image retrieval, and an MLP head that performs similarity discrimination. The method is verified on a publicly available dataset containing multiple scenes. It achieves excellent efficiency and accuracy on 15 sub-datasets (with 10 or 50 scene categories) derived from that dataset, which gives it practical value for engineering applications with streamlined scene categories and constrained computational resources. Furthermore, we present a comprehensive discussion and ablation experiments on the network design to validate its efficacy, analyze overfitting within the network, and discuss the limitations of our study together with potential avenues for future improvement.
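The paper itself provides no code; the following is a minimal PyTorch sketch of the pipeline the abstract describes, under stated assumptions: a frozen backbone stands in for the visual foundation model, the internal design of the CVV module is not given and is approximated here by a lightweight projection block, and all module names, dimensions, and the pairwise training objective are illustrative rather than taken from the paper.

```python
# Hypothetical sketch: a frozen backbone (stand-in for the visual foundation
# model) extracts features from a UAV query and a satellite reference; only a
# small head (a CVV-like refinement block plus an MLP) is trained to score
# cross-view similarity. Names and dimensions are assumptions, not the paper's.
import torch
import torch.nn as nn

class RetrievalHead(nn.Module):
    """Small trainable head: refines frozen backbone features and scores pairs."""
    def __init__(self, feat_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        # Placeholder for the paper's CVV module: a lightweight projection that
        # adapts foundation-model features to the cross-view retrieval task.
        self.cvv = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # MLP head for similarity discrimination on concatenated pair features.
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 1),  # logit: same scene vs. different scene
        )

    def forward(self, uav_feat: torch.Tensor, sat_feat: torch.Tensor) -> torch.Tensor:
        u = self.cvv(uav_feat)
        s = self.cvv(sat_feat)
        return self.mlp(torch.cat([u, s], dim=-1)).squeeze(-1)

def extract_frozen_features(backbone: nn.Module, images: torch.Tensor) -> torch.Tensor:
    """Run the frozen backbone without tracking gradients."""
    backbone.eval()
    with torch.no_grad():
        return backbone(images)

if __name__ == "__main__":
    # Stand-in backbone with a ViT-B/16-like feature dimension; in practice this
    # would be a pretrained visual foundation model kept frozen.
    backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(768))
    head = RetrievalHead(feat_dim=768)

    uav = torch.randn(4, 3, 224, 224)        # UAV query images
    sat = torch.randn(4, 3, 224, 224)        # candidate satellite orthoimages
    labels = torch.tensor([1., 0., 1., 0.])  # 1 = same scene, 0 = different

    uav_feat = extract_frozen_features(backbone, uav)
    sat_feat = extract_frozen_features(backbone, sat)

    # Only the head's parameters are optimized, mirroring the abstract's claim
    # that training is restricted to a small retrieval head.
    optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)
    logits = head(uav_feat, sat_feat)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
    loss.backward()
    optimizer.step()
    print(f"pair logits: {logits.detach().tolist()}, loss: {loss.item():.4f}")
```

At retrieval time, one would score a UAV query against every candidate satellite tile with the trained head and rank candidates by the resulting similarity logits; since the backbone is frozen, candidate features can be precomputed once, which is the main source of the efficiency the abstract claims.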