DJIST: Decoupled joint image and sequence training framework for sequential visual place recognition
Shanshan Wan, Lai Kang, Yingmei Wei, Tianrui Shen, Haixuan Wang, Chao Zuo
Neurocomputing, Volume 658, Article 131622 (30 September 2025). DOI: 10.1016/j.neucom.2025.131622
Traditional image-to-image (im2im) visual place recognition (VPR) matches a single query image against stored geo-tagged database images. In real-time robotic and autonomous applications, a continuous stream of frames naturally gives rise to the sequence-to-sequence (seq2seq) VPR problem, which is simpler in principle yet remains challenging because labeled sequential data is far scarcer than labeled individual images. A recent work addressed this by optimizing a unified network for both seq2seq and im2im tasks, but the resulting sequential descriptors depend heavily on the individual descriptors trained on the im2im task. This paper proposes a decoupled joint image and sequence training (DJIST) framework that uses a frozen backbone and two independent sequential branches: one branch is supervised by both im2im and seq2seq losses, the other solely by the seq2seq loss. In the former branch, the feature-reduction procedures that generate individual descriptors and sequential descriptors are further separated. An attention separation loss between the two branches forces them to focus on different parts of the images, producing more informative sequential descriptors. For a fair comparison, we retrain various existing seq2seq methods with the same backbone under two types of joint training strategies. Extensive experimental results demonstrate that DJIST outperforms its original counterpart JIST by 3.9% to 18.8% across four benchmark test cases and achieves state-of-the-art Recall@1 scores against the retrained baselines on three key benchmarks, with robust cross-dataset generalization, negligible degradation under dimensionality reduction, and superior robustness to varying test-time sequence lengths. Code will be available at https://github.com/shuimushan/DJIST.
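To make the described layout concrete, below is a minimal PyTorch sketch of the two-branch design from the abstract: a frozen backbone feeds two independent sequential branches, the jointly supervised branch keeps a separate reduction head for individual (im2im) descriptors, and an attention separation loss penalizes overlap between the two branches' attention. All module names, dimensions, the concatenation of the branch outputs, and the frame-level form of the separation loss are illustrative assumptions, not the authors' implementation; the linked repository is authoritative.

```python
# Hypothetical sketch of the DJIST two-branch layout described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SequentialBranch(nn.Module):
    """One sequential branch: per-frame reduction + attention-weighted temporal pooling."""

    def __init__(self, feat_dim: int = 768, desc_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(feat_dim, desc_dim)  # frame-level feature reduction
        self.attn = nn.Linear(desc_dim, 1)         # per-frame attention scores

    def forward(self, feats: torch.Tensor):
        # feats: (B, L, feat_dim) frozen-backbone features for an L-frame sequence
        x = self.proj(feats)                        # (B, L, desc_dim)
        w = torch.softmax(self.attn(x), dim=1)      # (B, L, 1) attention distribution
        seq_desc = (w * x).sum(dim=1)               # (B, desc_dim) sequential descriptor
        return F.normalize(seq_desc, dim=-1), w


class DJISTSketch(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int = 768, desc_dim: int = 256):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():        # frozen backbone, never updated
            p.requires_grad_(False)
        self.branch_joint = SequentialBranch(feat_dim, desc_dim)  # im2im + seq2seq losses
        self.branch_seq = SequentialBranch(feat_dim, desc_dim)    # seq2seq loss only
        # Separate reduction head for individual (im2im) descriptors, decoupled
        # from the sequential reduction inside branch_joint.
        self.im_head = nn.Linear(feat_dim, desc_dim)

    def forward(self, frames: torch.Tensor):
        # frames: (B, L, C, H, W); the backbone is assumed to map each image
        # to a flat feat_dim-dimensional feature vector.
        B, L = frames.shape[:2]
        with torch.no_grad():
            feats = self.backbone(frames.flatten(0, 1)).view(B, L, -1)
        seq_a, attn_a = self.branch_joint(feats)
        seq_b, attn_b = self.branch_seq(feats)
        im_desc = F.normalize(self.im_head(feats), dim=-1)  # per-frame im2im descriptors
        # Final sequential descriptor: concatenating the two branches (an assumption).
        seq_desc = torch.cat([seq_a, seq_b], dim=-1)
        return seq_desc, im_desc, attn_a, attn_b


def attention_separation_loss(attn_a, attn_b, eps: float = 1e-8):
    # Penalize overlap between the two branches' attention distributions so they
    # attend to different content. The paper separates attention over image parts;
    # for brevity this sketch uses frame-level attention instead.
    overlap = (attn_a * attn_b).sum(dim=1).squeeze(-1)  # (B,), in [0, 1]
    return -torch.log(1.0 - overlap.clamp(max=1.0 - eps)).mean()


if __name__ == "__main__":
    # Toy check with a dummy backbone on random 5-frame sequences.
    dummy = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 768))
    model = DJISTSketch(dummy)
    seq_desc, im_desc, wa, wb = model(torch.randn(2, 5, 3, 64, 64))
    print(seq_desc.shape, attention_separation_loss(wa, wb).item())
```

In this sketch the decoupling shows up in two places: the im2im head never shares its reduction weights with the sequential aggregation, and the second branch receives no im2im gradient at all, matching the supervision split the abstract describes.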
Journal overview:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice, and applications are the essential topics covered.