{"title":"连续视觉语言导航的记忆-观察协同系统","authors":"Ting Yu;Yifei Wu;Qiongjie Cui;Qingming Huang;Jun Yu","doi":"10.1109/TMM.2025.3586105","DOIUrl":null,"url":null,"abstract":"Navigating in continuous environments with vision-language cues presents critical challenges, particularly in the accuracy of waypoint prediction and the quality of navigation decision-making. Traditional methods, which predominantly rely on spatial data from depth images or straightforward RGB-depth integrations, frequently encounter difficulties in environments where waypoints share similar spatial characteristics, leading to erroneous navigational outcomes. Additionally, the capacity for effective navigation decisions is often hindered by the inadequacies of traditional topological maps and the issue of uneven data sampling. In response, this paper introduces a robust memory-observation synergistic vision-language navigation framework to substantially enhance the navigation capabilities of agents operating in continuous environments. We present an advanced observation-driven waypoint predictor that effectively utilizes spatial data and integrates aligned visual and textual cues to significantly improve the accuracy of waypoint predictions within complex real-world scenarios. Additionally, we develop a strategic memory-observation planning approach that leverages memory panoramic environmental data and detailed current observation information, enabling more informed and precise navigation decisions. Our framework sets new performance benchmarks on the VLN-CE dataset, achieving a 60.25% Success Rate (SR) and a 50.89% Path Length Score (SPL) on the R2R-CE dataset’s unseen validation splits. Furthermore, when adapted to a discrete environment, our model also shows exceptional performance on the R2R dataset, achieving a 74% SR and a 64% SPL on the unseen validation split.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6690-6704"},"PeriodicalIF":9.7000,"publicationDate":"2025-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"MossVLN: Memory-Observation Synergistic System for Continuous Vision-Language Navigation\",\"authors\":\"Ting Yu;Yifei Wu;Qiongjie Cui;Qingming Huang;Jun Yu\",\"doi\":\"10.1109/TMM.2025.3586105\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Navigating in continuous environments with vision-language cues presents critical challenges, particularly in the accuracy of waypoint prediction and the quality of navigation decision-making. Traditional methods, which predominantly rely on spatial data from depth images or straightforward RGB-depth integrations, frequently encounter difficulties in environments where waypoints share similar spatial characteristics, leading to erroneous navigational outcomes. Additionally, the capacity for effective navigation decisions is often hindered by the inadequacies of traditional topological maps and the issue of uneven data sampling. In response, this paper introduces a robust memory-observation synergistic vision-language navigation framework to substantially enhance the navigation capabilities of agents operating in continuous environments. We present an advanced observation-driven waypoint predictor that effectively utilizes spatial data and integrates aligned visual and textual cues to significantly improve the accuracy of waypoint predictions within complex real-world scenarios. 
Additionally, we develop a strategic memory-observation planning approach that leverages memory panoramic environmental data and detailed current observation information, enabling more informed and precise navigation decisions. Our framework sets new performance benchmarks on the VLN-CE dataset, achieving a 60.25% Success Rate (SR) and a 50.89% Path Length Score (SPL) on the R2R-CE dataset’s unseen validation splits. Furthermore, when adapted to a discrete environment, our model also shows exceptional performance on the R2R dataset, achieving a 74% SR and a 64% SPL on the unseen validation split.\",\"PeriodicalId\":13273,\"journal\":{\"name\":\"IEEE Transactions on Multimedia\",\"volume\":\"27 \",\"pages\":\"6690-6704\"},\"PeriodicalIF\":9.7000,\"publicationDate\":\"2025-07-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Multimedia\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11071855/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11071855/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
MossVLN: Memory-Observation Synergistic System for Continuous Vision-Language Navigation
Navigating continuous environments with vision-language cues presents critical challenges, particularly in the accuracy of waypoint prediction and the quality of navigation decision-making. Traditional methods, which rely predominantly on spatial data from depth images or straightforward RGB-depth integration, frequently falter in environments where waypoints share similar spatial characteristics, leading to erroneous navigation outcomes. Moreover, effective navigation decisions are often hindered by the inadequacies of traditional topological maps and by uneven data sampling. In response, this paper introduces a robust memory-observation synergistic vision-language navigation framework that substantially enhances the navigation capabilities of agents operating in continuous environments. We present an observation-driven waypoint predictor that effectively exploits spatial data and integrates aligned visual and textual cues, significantly improving the accuracy of waypoint prediction in complex real-world scenarios. We further develop a memory-observation planning strategy that combines memorized panoramic environmental data with detailed current observations, enabling more informed and precise navigation decisions. Our framework sets new performance benchmarks on VLN-CE, achieving a 60.25% Success Rate (SR) and a 50.89% Success weighted by Path Length (SPL) on the R2R-CE unseen validation split. Furthermore, when adapted to a discrete environment, our model also performs strongly on the R2R dataset, achieving a 74% SR and a 64% SPL on the unseen validation split.
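For reference, the SR and SPL figures quoted above follow the standard VLN evaluation protocol (Anderson et al., 2018) rather than any implementation detail specific to this paper. A minimal Python sketch of both metrics, using hypothetical episode values:

```python
# Standard VLN evaluation metrics. The episode values below are
# illustrative placeholders, not results from the paper.

def success_rate(successes):
    """SR: fraction of episodes where the agent stops within the
    success threshold (typically 3 m) of the goal."""
    return sum(successes) / len(successes)

def spl(successes, shortest_dists, path_lengths):
    """SPL: Success weighted by (inverse normalized) Path Length.
    Averages S_i * l_i / max(p_i, l_i) over episodes, where l_i is
    the shortest-path distance to the goal and p_i the agent's
    actual path length."""
    total = 0.0
    for s, l, p in zip(successes, shortest_dists, path_lengths):
        total += s * l / max(p, l)
    return total / len(successes)

# Hypothetical three-episode evaluation:
successes      = [1, 0, 1]         # binary success indicators S_i
shortest_dists = [8.0, 6.5, 10.0]  # geodesic distances l_i (meters)
path_lengths   = [9.5, 7.0, 14.0]  # agent path lengths p_i (meters)

print(f"SR  = {success_rate(successes):.2%}")                       # 66.67%
print(f"SPL = {spl(successes, shortest_dists, path_lengths):.2%}")  # 51.88%
```

Because SPL discounts successful episodes by how much the agent's path exceeds the shortest path, it is always at most the SR, which matches the reported pairs (60.25% SR vs. 50.89% SPL; 74% SR vs. 64% SPL).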
Journal Introduction:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.