VLAI: Exploration and Exploitation based on Visual-Language Aligned Information for Robotic Object Goal Navigation

IF 4.2, Zone 3 (Computer Science), JCR Q2 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

Image and Vision Computing, Volume 151, Article 105259. Pub Date: 2024-09-12. DOI: 10.1016/j.imavis.2024.105259

Haonan Luo, Yijie Zeng, Li Yang, Kexun Chen, Zhixuan Shen, Fengmao Lv

Citations: 0

Abstract

Object Goal Navigation (ObjectNav) is the task in which an agent must navigate to an instance of a specified object category in an unseen environment, using visual observations, within a limited number of time steps. This task plays a significant role in locating specific items in indoor spaces more efficiently, in assisting individuals with various tasks, and in supporting people with disabilities. Efficient ObjectNav in unfamiliar environments requires global perception and an understanding of the spatial and semantic regularities of the environment layout. In this work, we propose an explicit-prediction method called VLAI that uses visual-language alignment information to guide the agent's exploration. Previous navigation methods based on frontier potential prediction or egocentric map completion leverage only visual observations to construct semantic maps, and thus fail to help the agent develop better global perception. Specifically, when predicting long-term goals, we retrieve previously saved visual observations to obtain visual information around the frontiers, based on their positions on the incrementally built, incomplete semantic map. We then apply our Chat Describer to this visual information to obtain detailed descriptions of the frontier objects. The Chat Describer, a novel automatic-questioning approach deployed in the visual-to-language setting, is composed of a Large Language Model (LLM) and a visual-to-language model (VLM) that has visual question-answering functionality. In addition, we compute the semantic similarity between the target object category and the frontier object categories. Finally, by combining the semantic similarity with the frontier descriptions, the agent can predict long-term goals more accurately. Our experiments on the Gibson and HM3D datasets show that VLAI yields significantly better results than earlier methods. The code is released at https://github.com/31539lab/VLAI.
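The long-term goal selection described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (the repository above is the reference for that). Frontier detection here is the generic rule "a free cell adjacent to an unknown cell", the Chat Describer (LLM plus VLM) is left out entirely, and the category embeddings are made-up toy vectors standing in for whatever similarity measure VLAI actually uses. The names `find_frontiers`, `score_frontier`, `select_long_term_goal`, and `EMBEDDINGS` are hypothetical.

```python
import math

# Cell labels for a toy 2D map (assumed encoding, not taken from the paper).
UNKNOWN, FREE, OCCUPIED = -1, 0, 1


def find_frontiers(grid):
    """Return (row, col) of free cells that touch an unknown cell (4-neighborhood)."""
    frontiers = []
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != FREE:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == UNKNOWN:
                    frontiers.append((r, c))
                    break
    return frontiers


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy category embeddings: hypothetical values, for illustration only.
EMBEDDINGS = {
    "bed": (0.9, 0.1, 0.0),
    "nightstand": (0.8, 0.2, 0.1),
    "sink": (0.0, 0.9, 0.3),
    "toilet": (0.1, 0.8, 0.4),
}


def score_frontier(target, nearby_categories):
    """Highest similarity between the target and any category seen near a frontier."""
    if not nearby_categories:
        return 0.0
    return max(cosine(EMBEDDINGS[target], EMBEDDINGS[c]) for c in nearby_categories)


def select_long_term_goal(target, frontier_objects):
    """Pick the frontier whose nearby categories best match the target.

    frontier_objects maps a frontier cell to the categories observed around it.
    """
    return max(frontier_objects, key=lambda f: score_frontier(target, frontier_objects[f]))


if __name__ == "__main__":
    grid = [
        [1, 1, 1, 1, 1],
        [1, 0, 0, -1, -1],
        [1, 0, 1, 1, 1],
        [1, 0, 0, 0, -1],
        [1, 1, 1, 1, 1],
    ]
    frontiers = find_frontiers(grid)
    # Categories assumed to have been observed near each frontier.
    frontier_objects = dict(zip(frontiers, (["sink"], ["nightstand"])))
    print(frontiers, select_long_term_goal("bed", frontier_objects))
```

With this toy data, a `bed` target selects the frontier at (3, 3), near the nightstand, while a `toilet` target would select (1, 2), near the sink. A fuller version would also fold the textual frontier descriptions into the score, which this sketch does not attempt.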

Source journal

Image and Vision Computing (Engineering and Technology; Engineering: Electrical and Electronic)

CiteScore: 8.50

Self-citation rate: 8.50%

Number of articles published: 143

Review time: 7.8 months

Journal description: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.