CoHAtNet: An integrated convolutional-transformer architecture with hybrid self-attention for end-to-end camera localization
Hussein Hasan, Miguel Angel Garcia, Hatem Rashwan, Domenec Puig
Image and Vision Computing, Volume 162, Article 105674, published 2025-07-26. DOI: 10.1016/j.imavis.2025.105674
Citations: 0
Abstract
Camera localization refers to the process of automatically determining the position and orientation of a camera within its 3D environment from the images it captures. Traditional camera localization methods often rely on Convolutional Neural Networks, which are effective at extracting local visual features but struggle to capture long-range dependencies critical for accurate localization. In contrast, Transformer-based approaches effectively model global contextual relationships, although they often lack precision in fine-grained spatial representations. To bridge this gap, we introduce CoHAtNet, a novel Convolutional Hybrid-Attention Network that tightly integrates convolutional and self-attention mechanisms.
Unlike previous hybrid models that stack convolutional and attention layers separately, CoHAtNet embeds local features extracted via Mobile Inverted Bottleneck Convolution blocks directly into the Value component of the self-attention mechanism of Transformers. This yields a hybrid self-attention block capable of dynamically capturing both local spatial detail and global semantic context within a single attention layer. Additionally, CoHAtNet enables modality-level fusion by processing RGB and depth data jointly in a unified pipeline, allowing the model to leverage complementary appearance and geometric cues throughout.
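To make the hybrid-attention idea concrete, below is a minimal PyTorch sketch of one way such a block could be wired: a simplified MBConv branch produces local features that are supplied as the Value tokens of multi-head self-attention, while the Queries and Keys come from the unmodified tokens. The layer sizes, the reduced MBConv layout (expand, depthwise, project), and the token reshaping are illustrative assumptions, not the authors' exact CoHAtNet implementation; for RGB-D fusion, the fused feature map would be passed in as the input x.

# Illustrative sketch only: layer sizes, the simplified MBConv layout and the
# choice to inject convolutional features via the Value path are assumptions,
# not the exact CoHAtNet implementation.
import torch
import torch.nn as nn


class MBConv(nn.Module):
    # Simplified Mobile Inverted Bottleneck Convolution:
    # expand -> depthwise -> project, with a residual connection.
    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):  # x: (B, C, H, W)
        return x + self.block(x)


class HybridSelfAttention(nn.Module):
    # Self-attention whose Value tokens carry MBConv-extracted local features,
    # so a single attention layer mixes local spatial detail and global context.
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.local = MBConv(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, C, H, W), e.g. a fused RGB-D feature map
        b, c, h, w = x.shape
        local = self.local(x)                            # local spatial detail
        tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C) global tokens
        local_tokens = local.flatten(2).transpose(1, 2)  # (B, H*W, C) local tokens
        # Queries and Keys come from the plain tokens; Values are the
        # convolution-enriched tokens.
        out, _ = self.attn(query=tokens, key=tokens, value=local_tokens)
        return out.transpose(1, 2).reshape(b, c, h, w)


# Example usage with hypothetical shapes:
# block = HybridSelfAttention(channels=64)
# feats = torch.randn(2, 64, 32, 32)
# out = block(feats)   # same shape as the input: (2, 64, 32, 32)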
Extensive evaluations have been conducted on two widely used camera localization datasets: 7-Scenes (RGB-D) and Cambridge Landmarks (RGB). Experimental results show that CoHAtNet achieves state-of-the-art performance in both translation and orientation accuracy. These results highlight the effectiveness of our hybrid design in challenging indoor and outdoor environments, making CoHAtNet a strong candidate for end-to-end camera localization tasks.
About the journal:
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to deepen understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.