Enhancing Scene Understanding for Vision-and-Language Navigation by Knowledge Awareness

Impact Factor 4.6 · CAS Tier 2 (Computer Science) · JCR Q2 (Robotics)
Fang Gao, Jingfeng Tang, Jiabao Wang, Shaodong Li, Jun Yu
{"title":"Enhancing Scene Understanding for Vision-and-Language Navigation by Knowledge Awareness","authors":"Fang Gao;Jingfeng Tang;Jiabao Wang;Shaodong Li;Jun Yu","doi":"10.1109/LRA.2024.3483042","DOIUrl":null,"url":null,"abstract":"Vision-and-Language Navigation (VLN) has garnered widespread attention and research interest due to its potential applications in real-world scenarios. Despite significant progress in the VLN field in recent years, limitations persist. Many agents struggle to make accurate decisions when faced with similar candidate views during navigation, relying solely on the overall features of these views. This challenge primarily arises from the lack of common-sense knowledge about room layouts. Recognizing that room knowledge can establish relationships between rooms and objects in the environment, we construct room layout knowledge described in natural language by leveraging BLIP-2, including relationships between rooms and individual objects, relationships between objects, attributes of individual objects (such as color), and room types, thus providing comprehensive room layout information to the agent. We propose a Knowledge-Enhanced Scene Understanding (KESU) model to augment the agent's understanding of the environment by leveraging room layout knowledge. The Instruction Augmentation Module (IA) and the Knowledge History Fusion Module (KHF) in KESU respectively provide room layout knowledge for instructions and vision-history features, thereby enhancing the agent's navigation abilities. To more effectively integrate knowledge information with instruction features, we introduce Dynamic Residual Fusion (DRF) in the IA module. Finally, we conduct extensive experiments on the R2R, REVERIE, and SOON datasets, demonstrating the effectiveness of the proposed approach.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"9 12","pages":"10874-10881"},"PeriodicalIF":4.6000,"publicationDate":"2024-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Robotics and Automation Letters","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10720886/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ROBOTICS","Score":null,"Total":0}
Citations: 0

Abstract

Vision-and-Language Navigation (VLN) has garnered widespread attention and research interest due to its potential applications in real-world scenarios. Despite significant progress in the VLN field in recent years, limitations persist. Many agents rely solely on the overall features of candidate views and therefore struggle to make accurate decisions when the views they must choose between are similar. This challenge primarily arises from a lack of common-sense knowledge about room layouts. Recognizing that room knowledge can establish relationships between rooms and objects in the environment, we leverage BLIP-2 to construct room layout knowledge described in natural language, covering relationships between rooms and individual objects, relationships between objects, attributes of individual objects (such as color), and room types, thereby providing the agent with comprehensive room layout information. We propose a Knowledge-Enhanced Scene Understanding (KESU) model that uses this knowledge to augment the agent's understanding of the environment. The Instruction Augmentation (IA) module and the Knowledge History Fusion (KHF) module in KESU inject room layout knowledge into the instruction features and the vision-history features, respectively, thereby enhancing the agent's navigation ability. To integrate knowledge information with instruction features more effectively, we introduce Dynamic Residual Fusion (DRF) in the IA module. Finally, extensive experiments on the R2R, REVERIE, and SOON datasets demonstrate the effectiveness of the proposed approach.
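
The abstract states that room-layout knowledge is generated with BLIP-2 as natural-language descriptions spanning four types of information, but it does not publish the prompts or checkpoint used. The sketch below shows one plausible way to query a public BLIP-2 checkpoint once per knowledge type; the model name, prompt wording, and the describe_view helper are illustrative assumptions, not the authors' code.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# "Salesforce/blip2-flan-t5-xl" is one public BLIP-2 checkpoint;
# the paper does not say which variant it used.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xl", torch_dtype=dtype
).to(device)

# One illustrative prompt per knowledge type named in the abstract.
PROMPTS = {
    "room_type": "Question: What type of room is this? Answer:",
    "room_objects": "Question: What objects are in this room? Answer:",
    "object_relations": "Question: How are the objects arranged relative to each other? Answer:",
    "object_attributes": "Question: What colors are the main objects in this room? Answer:",
}

def describe_view(image_path: str) -> dict[str, str]:
    """Generate natural-language room-layout knowledge for one candidate view."""
    image = Image.open(image_path).convert("RGB")
    knowledge = {}
    for name, prompt in PROMPTS.items():
        inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, dtype)
        output_ids = model.generate(**inputs, max_new_tokens=30)
        knowledge[name] = processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
    return knowledge
```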
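The abstract introduces Dynamic Residual Fusion (DRF) for combining knowledge with instruction features but gives no equations. One common reading of a "dynamic residual" is a learned per-token gate on the residual branch; the module below is a minimal sketch under that assumption, not the paper's actual definition.

```python
import torch
import torch.nn as nn

class DynamicResidualFusion(nn.Module):
    """Gated residual fusion of instruction and knowledge features.

    Hypothetical reconstruction: the abstract only names DRF, so the
    per-token sigmoid gate here is an assumed formulation.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, instr: torch.Tensor, knowledge: torch.Tensor) -> torch.Tensor:
        # instr, knowledge: (batch, seq_len, dim), aligned token-for-token
        g = self.gate(torch.cat([instr, knowledge], dim=-1))  # dynamic weight in (0, 1)
        return instr + g * knowledge  # residual branch scaled per token and channel

# Quick shape check with random features.
drf = DynamicResidualFusion(dim=768)
fused = drf(torch.randn(2, 40, 768), torch.randn(2, 40, 768))  # -> (2, 40, 768)
```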
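Likewise, the Knowledge History Fusion (KHF) module is described only as supplying room-layout knowledge to the vision-history features. A minimal sketch, assuming a standard cross-attention mechanism from history tokens to knowledge tokens with a residual connection:

```python
import torch
import torch.nn as nn

class KnowledgeHistoryFusion(nn.Module):
    """Cross-attention from vision-history tokens to knowledge tokens.

    Assumed formulation: cross-attention plus a residual connection is
    one standard way to realize what the abstract describes; the paper
    does not confirm this design.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, history: torch.Tensor, knowledge: torch.Tensor) -> torch.Tensor:
        # history:   (batch, T, dim) vision-history tokens
        # knowledge: (batch, K, dim) encoded room-layout knowledge tokens
        attended, _ = self.attn(query=history, key=knowledge, value=knowledge)
        return self.norm(history + attended)  # knowledge-aware history features
```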
Source journal: IEEE Robotics and Automation Letters (Computer Science: Computer Science Applications)
CiteScore: 9.60
Self-citation rate: 15.40%
Articles per year: 1428
Journal description: The scope of this journal is to publish peer-reviewed articles that provide a timely and concise account of innovative research ideas and application results, reporting significant theoretical findings and application case studies in areas of robotics and automation.