Xiaogang Song, Hailong Yang, Junjie Tang, Xiaochang Li, Xiaofeng Lu, Xinhong Hei
Digital Signal Processing, Volume 167, Article 105458. Published 2025-07-08. DOI: 10.1016/j.dsp.2025.105458
Impact factor: 2.9. JCR: Q2 (Engineering, Electrical & Electronic). Citations: 0. Open access: no.
https://www.sciencedirect.com/science/article/pii/S1051200425004804
A multi-level composite attention-guided network for indoor visual localization
Accurate and robust camera pose estimation is essential for autonomous navigation and path planning in unmanned systems. To improve the localization accuracy in complex indoor scenes and mitigate information loss during feature extraction, we propose a multi-level composite attention-guided scene coordinate regression method. The proposed model predicts the mapping between 2D pixel points and 3D scene coordinates from a single RGB image. First, we introduce a Multi-level Feature Fusion Module (MFF), which employs global pooling and parallel branches to consolidate multi-level features, enhancing discrimination in repetitive structures and low-texture regions. Next, we design an Embedded Attention Module (EAM) to dynamically fuse multi-level features through parallel channel and spatial attention mechanisms, preserving edge details and suppressing noise. Finally, a differentiable random sample consensus algorithm is used to achieve robust fitting of pose parameters. Evaluation and analysis on common indoor public datasets demonstrate that the proposed method significantly improves localization performance. Additionally, extensive ablation evaluations confirm the effectiveness of the proposed Embedded Attention Module and Multi-level Feature Fusion Module in enhancing localization accuracy.
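The final stage described in the abstract, robust pose fitting with a differentiable random sample consensus, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a pinhole camera, a sigmoid soft inlier count as the hypothesis score, and a softmax over scores as the selection distribution, which is one common formulation of differentiable RANSAC. All function names, the `threshold`, `softness`, and `temperature` parameters, and the toy data are hypothetical.

```python
import math

def project(point, rotation, translation, focal, center):
    """Project a 3D scene coordinate into the image with a pinhole camera."""
    # Camera-frame coordinates: X_cam = R * X_scene + t
    cam = [sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
           for r in range(3)]
    # Perspective division, then focal length and principal point.
    return (focal * cam[0] / cam[2] + center[0],
            focal * cam[1] / cam[2] + center[1])

def soft_inlier_score(pixels, scene_coords, pose, focal, center,
                      threshold=10.0, softness=0.5):
    """Sum of sigmoid-soft inlier votes over all 2D-3D correspondences.

    threshold (pixels) and softness are illustrative values, not from the paper.
    """
    rotation, translation = pose
    score = 0.0
    for (u, v), point in zip(pixels, scene_coords):
        pu, pv = project(point, rotation, translation, focal, center)
        error = math.hypot(pu - u, pv - v)  # reprojection error in pixels
        # Clamp the sigmoid argument so math.exp cannot overflow.
        x = max(-60.0, min(60.0, softness * (threshold - error)))
        score += 1.0 / (1.0 + math.exp(-x))
    return score

def selection_probabilities(scores, temperature=1.0):
    """Softmax over hypothesis scores, shifted by the maximum for stability."""
    peak = max(scores)
    weights = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Toy setup (hypothetical): identity rotation, assumed intrinsics.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
FOCAL, CENTER = 500.0, (320.0, 240.0)

# Scene coordinates that the regression network would predict per pixel.
scene_coords = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 1.0, 3.0), (-1.0, -1.0, 4.0)]
true_pose = (IDENTITY, [0.0, 0.0, 0.0])
pixels = [project(p, *true_pose, FOCAL, CENTER) for p in scene_coords]

# Two pose hypotheses: the true pose and one shifted by 0.5 units along x.
hypotheses = [true_pose, (IDENTITY, [0.5, 0.0, 0.0])]
scores = [soft_inlier_score(pixels, scene_coords, h, FOCAL, CENTER)
          for h in hypotheses]
probabilities = selection_probabilities(scores)
print(scores, probabilities)
```

Because both the score and the selection distribution are smooth functions of the predicted scene coordinates, a training loss defined on the selected pose can in principle be propagated back to the coordinate regression network, which is the motivation for replacing a hard inlier count and argmax selection.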
About the journal:
Digital Signal Processing: A Review Journal is one of the oldest and most established journals in the field of signal processing, yet it aims to be the most innovative. The Journal invites top-quality research articles at the frontiers of research in all aspects of signal processing. Our objective is to provide a platform for the publication of ground-breaking research in signal processing with both academic and industrial appeal.
The journal has a special emphasis on statistical signal processing methodology such as Bayesian signal processing, and encourages articles on emerging applications of signal processing such as:
• big data
• machine learning
• internet of things
• information security
• systems biology and computational biology
• financial time series analysis
• autonomous vehicles
• quantum computing
• neuromorphic engineering
• human-computer interaction and intelligent user interfaces
• environmental signal processing
• geophysical signal processing including seismic signal processing
• chemioinformatics and bioinformatics
• audio, visual and performance arts
• disaster management and prevention
• renewable energy