Dansheng Yao, Mengqi Zhu, Hehua Zhu, Wuqiang Cai, Long Zhou
Title: Integrating synthetic datasets with CLIP semantic insights for single image localization advancements
Journal: ISPRS Journal of Photogrammetry and Remote Sensing, Volume 218, Pages 198-213
Publication date: 2024-11-06
Publication type: Journal Article
DOI: 10.1016/j.isprsjprs.2024.10.027
URL: https://www.sciencedirect.com/science/article/pii/S0924271624004040
Impact factor: 10.6 (JCR Q1, Geography, Physical)
Citations: 0
Abstract
Accurate localization of pedestrians and mobile robots is critical for navigation, emergency response, and autonomous driving. Traditional localization methods, such as those based on satellite signals, often prove ineffective in certain environments, and acquiring sufficient positional data can be challenging. Single image localization techniques have been developed to address these issues. However, current deep learning frameworks for single image localization that rely on domain adaptation fail to effectively utilize the semantically rich high-level features obtained from large-scale pretraining. This paper introduces a novel framework that leverages the Contrastive Language-Image Pre-training (CLIP) model and prompts to enhance feature extraction and domain adaptation through semantic information. The proposed framework generates an integrated score map from scene-specific prompts to guide feature extraction and employs adversarial components to facilitate domain adaptation. Furthermore, a reslink component is incorporated to mitigate the precision loss in high-level features compared to the original data. Experimental results demonstrate that the use of prompts reduces localization errors by 26.4% in indoor environments and 24.3% in outdoor settings. The model achieves localization errors as low as 0.75 m and 8.09 degrees indoors, and 4.56 m and 7.68 degrees outdoors. Analysis of prompts from labeled datasets confirms the model's capability to effectively interpret scene information. The weights of the integrated score map enhance the model's transparency, thereby improving interpretability. This study underscores the efficacy of integrating semantic information into image localization tasks.
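The abstract does not say how the integrated score map is computed, so the following is only a rough sketch of one plausible reading, not the authors' implementation. It assumes that each scene-specific prompt yields a cosine-similarity map between its text embedding and the image patch features, and that the per-prompt maps are merged with softmax weights. The function names, the array layout, the `prompt_logits` parameter, and the random stand-in features are all hypothetical; in practice the features would come from a CLIP image and text encoder.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def integrated_score_map(patch_feats, prompt_embeds, prompt_logits):
    """Merge per-prompt similarity maps into one weighted score map.

    patch_feats:   (H, W, D) image patch features (assumed layout)
    prompt_embeds: (K, D) text embeddings, one row per prompt
    prompt_logits: (K,) unnormalized prompt weights (hypothetical parameter)
    Returns the (H, W) score map and the (K,) softmax weights.
    """
    f = l2_normalize(patch_feats)
    t = l2_normalize(prompt_embeds)
    per_prompt = f @ t.T  # (H, W, K) cosine similarities
    # Numerically stable softmax over the prompt logits.
    weights = np.exp(prompt_logits - np.max(prompt_logits))
    weights = weights / weights.sum()
    return per_prompt @ weights, weights


# Random stand-ins for encoder outputs (assumption: 7x7 patches, 512-d).
rng = np.random.default_rng(0)
patch_feats = rng.normal(size=(7, 7, 512))
prompt_embeds = rng.normal(size=(3, 512))
prompt_logits = np.zeros(3)  # equal weighting before any training

score_map, weights = integrated_score_map(
    patch_feats, prompt_embeds, prompt_logits
)

# One simple way to "guide" feature extraction: rescale each patch
# feature by its score (an assumption, not taken from the paper).
guided_feats = patch_feats * score_map[..., None]
```

Under this reading, the softmax weights show how much each prompt contributes to the final map, which is one way the weights could support the interpretability claim made in the abstract.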
About the journal:
The ISPRS Journal of Photogrammetry and Remote Sensing (P&RS) serves as the official journal of the International Society for Photogrammetry and Remote Sensing (ISPRS). It acts as a platform for scientists and professionals worldwide who are involved in various disciplines that utilize photogrammetry, remote sensing, spatial information systems, computer vision, and related fields. The journal aims to facilitate communication and dissemination of advancements in these disciplines, while also acting as a comprehensive source of reference and archive.
P&RS endeavors to publish high-quality, peer-reviewed research papers that are preferably original and have not been published before. These papers can cover scientific/research, technological development, or application/practical aspects. Additionally, the journal welcomes papers that are based on presentations from ISPRS meetings, as long as they are considered significant contributions to the aforementioned fields.
In particular, P&RS encourages the submission of papers that are of broad scientific interest, showcase innovative applications (especially in emerging fields), have an interdisciplinary focus, discuss topics that have received limited attention in P&RS or related journals, or explore new directions in scientific or professional realms. It is preferred that theoretical papers include practical applications, while papers focusing on systems and applications should include a theoretical background.