DSFormer: A Dual-Scale Cross-Learning Transformer for Visual Place Recognition

Impact Factor 5.3 · CAS Tier 2 (Computer Science) · JCR Q2 (Robotics)
Haiyang Jiang;Songhao Piao;Chao Gao;Lei Yu;Liguo Chen
{"title":"DSFormer: A Dual-Scale Cross-Learning Transformer for Visual Place Recognition","authors":"Haiyang Jiang;Songhao Piao;Chao Gao;Lei Yu;Liguo Chen","doi":"10.1109/LRA.2025.3604761","DOIUrl":null,"url":null,"abstract":"Visual Place Recognition (VPR) is crucial for robust mobile robot localization, yet it faces significant challenges in maintaining reliable performance under varying environmental conditions and viewpoints. To address this, we propose a novel framework that integrates Dual-Scale-Former (DSFormer), a Transformer-based cross-learning module, with an innovative block clustering strategy. DSFormer enhances feature representation by enabling bidirectional information transfer between dual-scale features extracted from the final two CNN layers, capturing both semantic richness and spatial details through self-attention for long-range dependencies within each scale and shared cross-attention for cross-scale learning. Complementing this, our block clustering strategy repartitions the widely used San Francisco eXtra Large (SF-XL) training dataset from multiple distinct perspectives, optimizing data organization to further bolster robustness against viewpoint variations. Together, these innovations not only yield a robust global embedding adaptable to environmental changes but also reduce the required training data volume by approximately 30% compared to previous partitioning methods. Comprehensive experiments demonstrate that our approach achieves state-of-the-art performance across most benchmark datasets, surpassing advanced reranking methods like DELG, Patch-NetVLAD, TransVPR, and R2Former as a global retrieval solution using 512-dim global descriptors, while significantly improving computational efficiency.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"10 10","pages":"10799-10806"},"PeriodicalIF":5.3000,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Robotics and Automation Letters","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11146630/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ROBOTICS","Score":null,"Total":0}
引用次数: 0

Abstract

Visual Place Recognition (VPR) is crucial for robust mobile robot localization, yet it faces significant challenges in maintaining reliable performance under varying environmental conditions and viewpoints. To address this, we propose a novel framework that integrates Dual-Scale-Former (DSFormer), a Transformer-based cross-learning module, with an innovative block clustering strategy. DSFormer enhances feature representation by enabling bidirectional information transfer between dual-scale features extracted from the final two CNN layers, capturing both semantic richness and spatial details through self-attention for long-range dependencies within each scale and shared cross-attention for cross-scale learning. Complementing this, our block clustering strategy repartitions the widely used San Francisco eXtra Large (SF-XL) training dataset from multiple distinct perspectives, optimizing data organization to further bolster robustness against viewpoint variations. Together, these innovations not only yield a robust global embedding adaptable to environmental changes but also reduce the required training data volume by approximately 30% compared to previous partitioning methods. Comprehensive experiments demonstrate that our approach achieves state-of-the-art performance across most benchmark datasets, surpassing advanced reranking methods like DELG, Patch-NetVLAD, TransVPR, and R2Former as a global retrieval solution using 512-dim global descriptors, while significantly improving computational efficiency.
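
The abstract describes the cross-learning mechanism only at a high level: self-attention within each scale, plus a shared cross-attention that transfers information in both directions between features from the last two CNN stages. The snippet below is a minimal, hypothetical PyTorch sketch of that idea; all module names, dimensions, and the residual/normalization layout are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a dual-scale cross-learning block (illustrative, not the paper's code).
import torch
import torch.nn as nn


class DualScaleCrossBlock(nn.Module):
    """Per-scale self-attention plus a shared cross-attention that
    exchanges information between two token sequences (hypothetical layout)."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # One self-attention module per scale: long-range dependencies within a scale.
        self.self_attn_fine = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn_coarse = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # A single cross-attention module shared by both directions: cross-scale learning.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_fine = nn.LayerNorm(dim)
        self.norm_coarse = nn.LayerNorm(dim)

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor):
        # fine, coarse: (B, N_tokens, dim) tokens from the last two CNN stages,
        # flattened and projected to a common channel width beforehand.
        fine = fine + self.self_attn_fine(fine, fine, fine, need_weights=False)[0]
        coarse = coarse + self.self_attn_coarse(coarse, coarse, coarse, need_weights=False)[0]
        fine_n, coarse_n = self.norm_fine(fine), self.norm_coarse(coarse)
        # Bidirectional transfer: each scale queries the other through the shared module.
        fine = fine + self.cross_attn(fine_n, coarse_n, coarse_n, need_weights=False)[0]
        coarse = coarse + self.cross_attn(coarse_n, fine_n, fine_n, need_weights=False)[0]
        return fine, coarse


# Example: 14x14 and 7x7 feature maps flattened into token sequences.
block = DualScaleCrossBlock(dim=256)
f, c = torch.randn(2, 196, 256), torch.randn(2, 49, 256)
f_out, c_out = block(f, c)
print(f_out.shape, c_out.shape)  # torch.Size([2, 196, 256]) torch.Size([2, 49, 256])
```

Sharing one cross-attention module across both directions (rather than using two separate ones) is one plausible reading of "shared cross-attention"; the paper should be consulted for the exact weight-sharing scheme and how the two scales are fused into the final 512-dim global descriptor.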
Source Journal

IEEE Robotics and Automation Letters (Computer Science: Computer Science Applications)
CiteScore: 9.60
Self-citation rate: 15.40%
Articles published: 1428
Aims and Scope: The scope of this journal is to publish peer-reviewed articles that provide a timely and concise account of innovative research ideas and application results, reporting significant theoretical findings and application case studies in areas of robotics and automation.