EDTformer: An Efficient Decoder Transformer for Visual Place Recognition

IF 11.1 | Region 1: Engineering and Technology | Q1: ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Transactions on Circuits and Systems for Video Technology. Pub Date: 2025-04-09. DOI: 10.1109/TCSVT.2025.3559084

Tong Jin;Feng Lu;Shuyu Hu;Chun Yuan;Yunpeng Liu

Volume 35, issue 9, pages 8835-8848. Not open access. Source: https://ieeexplore.ieee.org/document/10960340/

Citations: 0

Abstract

Visual place recognition (VPR) aims to determine the general geographical location of a query image by retrieving visually similar images from a large geo-tagged database. To obtain a global representation for each place image, most approaches aggregate deep features extracted from a backbone using prominent architectures (e.g., CNNs, MLPs, pooling layers, and transformer encoders), and pay little attention to the transformer decoder. However, we argue that the decoder's strong capability to capture contextual dependencies and generate accurate features holds considerable potential for the VPR task. To this end, we propose an Efficient Decoder Transformer (EDTformer) for feature aggregation, which consists of several stacked simplified decoder blocks followed by two linear layers and directly produces robust and discriminative global representations. Specifically, we formulate the deep features as the keys and values, and a set of learnable parameters as the queries. EDTformer can thus fully exploit the contextual information within the deep features, then gradually decode and aggregate the effective features into the learnable queries to output the global representation. Moreover, to provide more powerful deep features for EDTformer and further improve robustness, we use the foundation model DINOv2 as the backbone and propose a Low-rank Parallel Adaptation (LoPA) method to enhance its performance in VPR; LoPA progressively refines the intermediate features of the backbone in a memory- and parameter-efficient way. As a result, our method not only outperforms single-stage VPR methods on multiple benchmark datasets, but also outperforms two-stage VPR methods, which add a costly re-ranking step. Code will be available at https://github.com/Tong-Jin01/EDTformer.
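The query/key/value formulation described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes single-head cross-attention with no learned projections, no layer normalization, and no feed-forward sublayer, it omits the two final linear layers, and it uses random numbers in place of DINOv2 patch features. All function and variable names are hypothetical. It only shows how a fixed set of queries can attend over patch features (used as both keys and values), how the result is flattened into a global descriptor, and how retrieval by descriptor similarity then works.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, features):
    """Queries (Q x d) attend over features (N x d); features act as keys and values."""
    dim = len(queries[0])
    keys_t = [list(col) for col in zip(*features)]  # d x N
    scores = matmul(queries, keys_t)                # Q x N
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(weights, features)                # Q x d


def aggregate(features, queries, num_blocks=2):
    """Stack simplified blocks (assumption: attention plus residual only),
    then flatten the queries into one L2-normalized global descriptor."""
    for _ in range(num_blocks):
        attended = cross_attention(queries, features)
        queries = [[q + a for q, a in zip(q_row, a_row)]
                   for q_row, a_row in zip(queries, attended)]
    flat = [v for row in queries for v in row]
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0
    return [v / norm for v in flat]


def retrieve(query_desc, database_descs):
    """Return the index of the database descriptor with the highest cosine similarity."""
    sims = [sum(q * d for q, d in zip(query_desc, desc)) for desc in database_descs]
    return max(range(len(sims)), key=sims.__getitem__)


# Toy data: random stand-ins for learned queries and backbone patch features.
random.seed(0)
DIM, NUM_PATCHES, NUM_QUERIES = 8, 16, 4
learned_queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
database_feats = [[[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]
                  for _ in range(5)]
database_descs = [aggregate(feats, learned_queries) for feats in database_feats]

# Query with the same features as database image 2, so it should be retrieved.
query_desc = aggregate(database_feats[2], learned_queries)
best = retrieve(query_desc, database_descs)
print(best, len(query_desc))
```

In this sketch the descriptor length is the number of queries times the feature dimension, so the number of queries controls the descriptor size independently of how many patches the backbone emits. A real model would train the queries and add the learned projections omitted here.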


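Source journal

IEEE Transactions on Circuits and Systems for Video Technology (Engineering and Technology: Engineering, Electrical & Electronic)

CiteScore: 13.80

Self-citation rate: 27.40%

Publication count: 660

Review time: 5 months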

About the journal: The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.