DJIST: Decoupled joint image and sequence training framework for sequential visual place recognition
Shanshan Wan, Lai Kang, Yingmei Wei, Tianrui Shen, Haixuan Wang, Chao Zuo
Neurocomputing, Volume 658, Article 131622 (30 September 2025). DOI: 10.1016/j.neucom.2025.131622
Traditional image-to-image (im2im) visual place recognition (VPR) matches a single query image against stored geo-tagged database images. In real-time robotic and autonomous applications, a continuous stream of frames naturally gives rise to the sequence-to-sequence (seq2seq) VPR problem, which is simpler in principle yet remains challenging because labeled sequential data is far scarcer than labeled individual images. A recent work addressed this by optimizing a unified network for both seq2seq and im2im tasks, but the resulting sequential descriptors depend heavily on the individual descriptors trained on the im2im task. This paper proposes a decoupled joint image and sequence training (DJIST) framework that uses a frozen backbone and two independent sequential branches: one branch is supervised by both im2im and seq2seq losses, the other solely by the seq2seq loss. In the former branch, the feature-reduction procedures that generate individual descriptors and sequential descriptors are further separated. An attention separation loss between the two branches forces them to focus on different parts of the images, producing more informative sequential descriptors. For a fair comparison, we retrain various existing seq2seq methods with the same backbone under two types of joint training strategies. Extensive experimental results demonstrate that DJIST outperforms its original counterpart JIST by 3.9% to 18.8% across four benchmark test cases and achieves state-of-the-art Recall@1 scores against the retrained baselines on three key benchmarks, with robust cross-dataset generalization, negligible degradation under dimensionality reduction, and superior robustness to varying test-time sequence lengths. Code will be available at https://github.com/shuimushan/DJIST.
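To make the described layout concrete, below is a minimal PyTorch sketch of the two-branch design from the abstract: a frozen backbone feeds two independent sequential branches, the jointly supervised branch keeps a separate reduction head for individual (im2im) descriptors, and an attention separation loss penalizes overlap between the two branches' attention. All module names, dimensions, the concatenation of the branch outputs, and the frame-level form of the separation loss are illustrative assumptions, not the authors' implementation; the linked repository is authoritative.

```python
# Hypothetical sketch of the DJIST two-branch layout described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SequentialBranch(nn.Module):
    """One sequential branch: per-frame reduction + attention-weighted temporal pooling."""

    def __init__(self, feat_dim: int = 768, desc_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(feat_dim, desc_dim)  # frame-level feature reduction
        self.attn = nn.Linear(desc_dim, 1)         # per-frame attention scores

    def forward(self, feats: torch.Tensor):
        # feats: (B, L, feat_dim) frozen-backbone features for an L-frame sequence
        x = self.proj(feats)                        # (B, L, desc_dim)
        w = torch.softmax(self.attn(x), dim=1)      # (B, L, 1) attention distribution
        seq_desc = (w * x).sum(dim=1)               # (B, desc_dim) sequential descriptor
        return F.normalize(seq_desc, dim=-1), w


class DJISTSketch(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int = 768, desc_dim: int = 256):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():        # frozen backbone, never updated
            p.requires_grad_(False)
        self.branch_joint = SequentialBranch(feat_dim, desc_dim)  # im2im + seq2seq losses
        self.branch_seq = SequentialBranch(feat_dim, desc_dim)    # seq2seq loss only
        # Separate reduction head for individual (im2im) descriptors, decoupled
        # from the sequential reduction inside branch_joint.
        self.im_head = nn.Linear(feat_dim, desc_dim)

    def forward(self, frames: torch.Tensor):
        # frames: (B, L, C, H, W); the backbone is assumed to map each image
        # to a flat feat_dim-dimensional feature vector.
        B, L = frames.shape[:2]
        with torch.no_grad():
            feats = self.backbone(frames.flatten(0, 1)).view(B, L, -1)
        seq_a, attn_a = self.branch_joint(feats)
        seq_b, attn_b = self.branch_seq(feats)
        im_desc = F.normalize(self.im_head(feats), dim=-1)  # per-frame im2im descriptors
        # Final sequential descriptor: concatenating the two branches (an assumption).
        seq_desc = torch.cat([seq_a, seq_b], dim=-1)
        return seq_desc, im_desc, attn_a, attn_b


def attention_separation_loss(attn_a, attn_b, eps: float = 1e-8):
    # Penalize overlap between the two branches' attention distributions so they
    # attend to different content. The paper separates attention over image parts;
    # for brevity this sketch uses frame-level attention instead.
    overlap = (attn_a * attn_b).sum(dim=1).squeeze(-1)  # (B,), in [0, 1]
    return -torch.log(1.0 - overlap.clamp(max=1.0 - eps)).mean()


if __name__ == "__main__":
    # Toy check with a dummy backbone on random 5-frame sequences.
    dummy = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 768))
    model = DJISTSketch(dummy)
    seq_desc, im_desc, wa, wb = model(torch.randn(2, 5, 3, 64, 64))
    print(seq_desc.shape, attention_separation_loss(wa, wb).item())
```

In this sketch the decoupling shows up in two places: the im2im head never shares its reduction weights with the sequential aggregation, and the second branch receives no im2im gradient at all, matching the supervision split the abstract describes.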
Journal overview:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice, and applications are the essential topics covered.