ESceme：利用外显场景记忆进行视觉语言导航

IF 11.6 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision Pub Date : 2024-07-26 DOI:10.1007/s11263-024-02159-8

Qi Zheng, Daqing Liu, Chaoyue Wang, Jing Zhang, Dadong Wang, Dacheng Tao

{"title":"ESceme：利用外显场景记忆进行视觉语言导航","authors":"Qi Zheng, Daqing Liu, Chaoyue Wang, Jing Zhang, Dadong Wang, Dacheng Tao","doi":"10.1007/s11263-024-02159-8","DOIUrl":null,"url":null,"abstract":"<p>Vision-and-language navigation (VLN) simulates a visual agent that follows natural-language navigation instructions in real-world scenes. Existing approaches have made enormous progress in navigation in new environments, such as beam search, pre-exploration, and dynamic or hierarchical history encoding. To balance generalization and efficiency, we resort to memorizing visited scenarios apart from the ongoing route while navigating. In this work, we introduce a mechanism of Episodic Scene memory (ESceme) for VLN that wakes an agent’s memories of past visits when it enters the current scene. The episodic scene memory allows the agent to envision a bigger picture of the next prediction. This way, the agent learns to utilize dynamically updated information instead of merely adapting to the current observations. We provide a simple yet effective implementation of ESceme by enhancing the accessible views at each location and progressively completing the memory while navigating. We verify the superiority of ESceme on short-horizon (R2R), long-horizon (R4R), and vision-and-dialog (CVDN) VLN tasks. Our ESceme also wins first place on the CVDN leaderboard. Code is available: https://github.com/qizhust/esceme.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"56 1","pages":""},"PeriodicalIF":11.6000,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ESceme: Vision-and-Language Navigation with Episodic Scene Memory\",\"authors\":\"Qi Zheng, Daqing Liu, Chaoyue Wang, Jing Zhang, Dadong Wang, Dacheng Tao\",\"doi\":\"10.1007/s11263-024-02159-8\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Vision-and-language navigation (VLN) simulates a visual agent that follows natural-language navigation instructions in real-world scenes. Existing approaches have made enormous progress in navigation in new environments, such as beam search, pre-exploration, and dynamic or hierarchical history encoding. To balance generalization and efficiency, we resort to memorizing visited scenarios apart from the ongoing route while navigating. In this work, we introduce a mechanism of Episodic Scene memory (ESceme) for VLN that wakes an agent’s memories of past visits when it enters the current scene. The episodic scene memory allows the agent to envision a bigger picture of the next prediction. This way, the agent learns to utilize dynamically updated information instead of merely adapting to the current observations. We provide a simple yet effective implementation of ESceme by enhancing the accessible views at each location and progressively completing the memory while navigating. We verify the superiority of ESceme on short-horizon (R2R), long-horizon (R4R), and vision-and-dialog (CVDN) VLN tasks. Our ESceme also wins first place on the CVDN leaderboard. Code is available: https://github.com/qizhust/esceme.</p>\",\"PeriodicalId\":13752,\"journal\":{\"name\":\"International Journal of Computer Vision\",\"volume\":\"56 1\",\"pages\":\"\"},\"PeriodicalIF\":11.6000,\"publicationDate\":\"2024-07-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Journal of Computer Vision\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1007/s11263-024-02159-8\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Computer Vision","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s11263-024-02159-8","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

视觉语言导航（VLN）模拟的是在真实世界场景中遵循自然语言导航指令的视觉代理。现有方法在新环境导航方面取得了巨大进步，如光束搜索、预探索和动态或分层历史编码。为了在通用性和效率之间取得平衡，我们在导航过程中除了记忆当前路线外，还需要记忆访问过的场景。在这项工作中，我们为 VLN 引入了一种外显场景记忆机制（ESceme），它能在代理进入当前场景时唤醒其对过去访问的记忆。外显场景记忆允许代理设想下一个预测的大画面。这样，代理就能学会利用动态更新的信息，而不仅仅是适应当前的观察结果。我们提供了一种简单而有效的 ESceme 实现方法，即增强每个位置的可访问视图，并在导航时逐步完成记忆。我们验证了 ESceme 在短视距 (R2R)、长视距 (R4R) 和视觉与对话 (CVDN) VLN 任务中的优越性。我们的 ESceme 还赢得了 CVDN 排行榜第一名。代码见：https://github.com/qizhust/esceme。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

ESceme: Vision-and-Language Navigation with Episodic Scene Memory

查看原文本刊更多论文

ESceme: Vision-and-Language Navigation with Episodic Scene Memory

Vision-and-language navigation (VLN) simulates a visual agent that follows natural-language navigation instructions in real-world scenes. Existing approaches have made enormous progress in navigation in new environments, such as beam search, pre-exploration, and dynamic or hierarchical history encoding. To balance generalization and efficiency, we resort to memorizing visited scenarios apart from the ongoing route while navigating. In this work, we introduce a mechanism of Episodic Scene memory (ESceme) for VLN that wakes an agent’s memories of past visits when it enters the current scene. The episodic scene memory allows the agent to envision a bigger picture of the next prediction. This way, the agent learns to utilize dynamically updated information instead of merely adapting to the current observations. We provide a simple yet effective implementation of ESceme by enhancing the accessible views at each location and progressively completing the memory while navigating. We verify the superiority of ESceme on short-horizon (R2R), long-horizon (R4R), and vision-and-dialog (CVDN) VLN tasks. Our ESceme also wins first place on the CVDN leaderboard. Code is available: https://github.com/qizhust/esceme.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

International Journal of Computer Vision 工程技术-计算机：人工智能

CiteScore

29.80

自引率

2.10%

发文量

163

审稿时长

6 months

期刊介绍： The International Journal of Computer Vision (IJCV) serves as a platform for sharing new research findings in the rapidly growing field of computer vision. It publishes 12 issues annually and presents high-quality, original contributions to the science and engineering of computer vision. The journal encompasses various types of articles to cater to different research outputs. Regular articles, which span up to 25 journal pages, focus on significant technical advancements that are of broad interest to the field. These articles showcase substantial progress in computer vision. Short articles, limited to 10 pages, offer a swift publication path for novel research outcomes. They provide a quicker means for sharing new findings with the computer vision community. Survey articles, comprising up to 30 pages, offer critical evaluations of the current state of the art in computer vision or offer tutorial presentations of relevant topics. These articles provide comprehensive and insightful overviews of specific subject areas. In addition to technical articles, the journal also includes book reviews, position papers, and editorials by prominent scientific figures. These contributions serve to complement the technical content and provide valuable perspectives. The journal encourages authors to include supplementary material online, such as images, video sequences, data sets, and software. This additional material enhances the understanding and reproducibility of the published research. Overall, the International Journal of Computer Vision is a comprehensive publication that caters to researchers in this rapidly growing field. It covers a range of article types, offers additional online resources, and facilitates the dissemination of impactful research.