{"title":"连续视觉语言导航的记忆-观察协同系统","authors":"Ting Yu;Yifei Wu;Qiongjie Cui;Qingming Huang;Jun Yu","doi":"10.1109/TMM.2025.3586105","DOIUrl":null,"url":null,"abstract":"Navigating in continuous environments with vision-language cues presents critical challenges, particularly in the accuracy of waypoint prediction and the quality of navigation decision-making. Traditional methods, which predominantly rely on spatial data from depth images or straightforward RGB-depth integrations, frequently encounter difficulties in environments where waypoints share similar spatial characteristics, leading to erroneous navigational outcomes. Additionally, the capacity for effective navigation decisions is often hindered by the inadequacies of traditional topological maps and the issue of uneven data sampling. In response, this paper introduces a robust memory-observation synergistic vision-language navigation framework to substantially enhance the navigation capabilities of agents operating in continuous environments. We present an advanced observation-driven waypoint predictor that effectively utilizes spatial data and integrates aligned visual and textual cues to significantly improve the accuracy of waypoint predictions within complex real-world scenarios. Additionally, we develop a strategic memory-observation planning approach that leverages memory panoramic environmental data and detailed current observation information, enabling more informed and precise navigation decisions. Our framework sets new performance benchmarks on the VLN-CE dataset, achieving a 60.25% Success Rate (SR) and a 50.89% Path Length Score (SPL) on the R2R-CE dataset’s unseen validation splits. Furthermore, when adapted to a discrete environment, our model also shows exceptional performance on the R2R dataset, achieving a 74% SR and a 64% SPL on the unseen validation split.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6690-6704"},"PeriodicalIF":9.7000,"publicationDate":"2025-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"MossVLN: Memory-Observation Synergistic System for Continuous Vision-Language Navigation\",\"authors\":\"Ting Yu;Yifei Wu;Qiongjie Cui;Qingming Huang;Jun Yu\",\"doi\":\"10.1109/TMM.2025.3586105\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Navigating in continuous environments with vision-language cues presents critical challenges, particularly in the accuracy of waypoint prediction and the quality of navigation decision-making. Traditional methods, which predominantly rely on spatial data from depth images or straightforward RGB-depth integrations, frequently encounter difficulties in environments where waypoints share similar spatial characteristics, leading to erroneous navigational outcomes. Additionally, the capacity for effective navigation decisions is often hindered by the inadequacies of traditional topological maps and the issue of uneven data sampling. In response, this paper introduces a robust memory-observation synergistic vision-language navigation framework to substantially enhance the navigation capabilities of agents operating in continuous environments. We present an advanced observation-driven waypoint predictor that effectively utilizes spatial data and integrates aligned visual and textual cues to significantly improve the accuracy of waypoint predictions within complex real-world scenarios. 
Additionally, we develop a strategic memory-observation planning approach that leverages memory panoramic environmental data and detailed current observation information, enabling more informed and precise navigation decisions. Our framework sets new performance benchmarks on the VLN-CE dataset, achieving a 60.25% Success Rate (SR) and a 50.89% Path Length Score (SPL) on the R2R-CE dataset’s unseen validation splits. Furthermore, when adapted to a discrete environment, our model also shows exceptional performance on the R2R dataset, achieving a 74% SR and a 64% SPL on the unseen validation split.\",\"PeriodicalId\":13273,\"journal\":{\"name\":\"IEEE Transactions on Multimedia\",\"volume\":\"27 \",\"pages\":\"6690-6704\"},\"PeriodicalIF\":9.7000,\"publicationDate\":\"2025-07-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Multimedia\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11071855/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11071855/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
MossVLN: Memory-Observation Synergistic System for Continuous Vision-Language Navigation
Navigating continuous environments with vision-language cues presents critical challenges, particularly in the accuracy of waypoint prediction and the quality of navigation decision-making. Traditional methods, which rely predominantly on spatial data from depth images or straightforward RGB-depth integration, frequently falter in environments where waypoints share similar spatial characteristics, leading to erroneous navigation outcomes. Moreover, effective navigation decisions are often hindered by the inadequacies of traditional topological maps and by uneven data sampling. In response, this paper introduces a robust memory-observation synergistic vision-language navigation framework that substantially enhances the navigation capabilities of agents operating in continuous environments. We present an observation-driven waypoint predictor that effectively exploits spatial data and integrates aligned visual and textual cues, significantly improving the accuracy of waypoint prediction in complex real-world scenarios. We further develop a memory-observation planning strategy that combines memorized panoramic environmental data with detailed current observations, enabling more informed and precise navigation decisions. Our framework sets new performance benchmarks on VLN-CE, achieving a 60.25% Success Rate (SR) and a 50.89% Success weighted by Path Length (SPL) on the R2R-CE unseen validation split. Furthermore, when adapted to a discrete environment, our model also performs strongly on the R2R dataset, achieving a 74% SR and a 64% SPL on the unseen validation split.
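For reference, the SR and SPL figures quoted above follow the standard VLN evaluation protocol (Anderson et al., 2018) rather than any implementation detail specific to this paper. A minimal Python sketch of both metrics, using hypothetical episode values:

```python
# Standard VLN evaluation metrics. The episode values below are
# illustrative placeholders, not results from the paper.

def success_rate(successes):
    """SR: fraction of episodes where the agent stops within the
    success threshold (typically 3 m) of the goal."""
    return sum(successes) / len(successes)

def spl(successes, shortest_dists, path_lengths):
    """SPL: Success weighted by (inverse normalized) Path Length.
    Averages S_i * l_i / max(p_i, l_i) over episodes, where l_i is
    the shortest-path distance to the goal and p_i the agent's
    actual path length."""
    total = 0.0
    for s, l, p in zip(successes, shortest_dists, path_lengths):
        total += s * l / max(p, l)
    return total / len(successes)

# Hypothetical three-episode evaluation:
successes      = [1, 0, 1]         # binary success indicators S_i
shortest_dists = [8.0, 6.5, 10.0]  # geodesic distances l_i (meters)
path_lengths   = [9.5, 7.0, 14.0]  # agent path lengths p_i (meters)

print(f"SR  = {success_rate(successes):.2%}")                       # 66.67%
print(f"SPL = {spl(successes, shortest_dists, path_lengths):.2%}")  # 51.88%
```

Because SPL discounts successful episodes by how much the agent's path exceeds the shortest path, it is always at most the SR, which matches the reported pairs (60.25% SR vs. 50.89% SPL; 74% SR vs. 64% SPL).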
Journal Introduction:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.