{"title":"VLAI: Exploration and Exploitation based on Visual-Language Aligned Information for Robotic Object Goal Navigation","authors":"Haonan Luo, Yijie Zeng, Li Yang, Kexun Chen, Zhixuan Shen, Fengmao Lv","doi":"10.1016/j.imavis.2024.105259","DOIUrl":null,"url":null,"abstract":"<div><p>Object Goal Navigation(ObjectNav) is the task that an agent need navigate to an instance of a specific category in an unseen environment through visual observations within limited time steps. This work plays a significant role in enhancing the efficiency of locating specific items in indoor spaces and assisting individuals in completing various tasks, as well as providing support for people with disabilities. To achieve efficient ObjectNav in unfamiliar environments, global perception capabilities, understanding the regularities of space and semantics in the environment layout are significant. In this work, we propose an explicit-prediction method called VLAI that utilizes visual-language alignment information to guide the agent's exploration, unlike previous navigation methods based on frontier potential prediction or egocentric map completion, which only leverage visual observations to construct semantic maps, thus failing to help the agent develop a better global perception. Specifically, when predicting long-term goals, we retrieve previously saved visual observations to obtain visual information around the frontiers based on their position on the incrementally built incomplete semantic map. Then, we apply our designed Chat Describer to this visual information to obtain detailed frontier object descriptions. The Chat Describer, a novel automatic-questioning approach deployed in Visual-to-Language, is composed of Large Language Model(LLM) and the visual-to-language model(VLM), which has visual question-answering functionality. In addition, we also obtain the semantic similarity of target object and frontier object categories. Ultimately, by combining the semantic similarity and the boundary descriptions, the agent can predict the long-term goals more accurately. Our experiments on the Gibson and HM3D datasets reveal that our VLAI approach yields significantly better results compared to earlier methods. The code is released at</p><p><span><span><span>https://github.com/31539lab/VLAI</span></span><svg><path></path></svg></span>.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"151 ","pages":"Article 105259"},"PeriodicalIF":4.2000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885624003640","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Object Goal Navigation (ObjectNav) is the task in which an agent must navigate to an instance of a specific object category in an unseen environment, using only visual observations and within a limited number of time steps. This capability plays a significant role in locating specific items in indoor spaces efficiently, assisting individuals in completing various tasks, and providing support for people with disabilities. Achieving efficient ObjectNav in unfamiliar environments requires global perception and an understanding of the spatial and semantic regularities of the environment layout. In this work, we propose an explicit-prediction method, VLAI, that uses visual-language alignment information to guide the agent's exploration. This contrasts with previous navigation methods based on frontier potential prediction or egocentric map completion, which leverage only visual observations to construct semantic maps and therefore fail to help the agent develop global perception. Specifically, when predicting long-term goals, we retrieve previously saved visual observations to obtain visual information around the frontiers, based on their positions on the incrementally built, incomplete semantic map. We then apply our Chat Describer to this visual information to obtain detailed descriptions of frontier objects. The Chat Describer, a novel automatic-questioning approach to visual-to-language conversion, combines a Large Language Model (LLM) with a visual-to-language model (VLM) that provides visual question-answering functionality. In addition, we compute the semantic similarity between the target object category and the frontier object categories. Finally, by combining the semantic similarity with the frontier descriptions, the agent can predict long-term goals more accurately. Experiments on the Gibson and HM3D datasets show that VLAI yields significantly better results than earlier methods. The code is released at https://github.com/31539lab/VLAI.
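To make the long-term goal selection step concrete, the following Python sketch (not the authors' released code) illustrates one way the pieces described in the abstract could fit together: an LLM-driven questioning loop queries a VLM about the saved observation at each frontier, and category-level semantic similarity is combined with description similarity to rank frontiers. The names llm_ask, vlm_answer, embed, and the weight alpha are hypothetical stand-ins introduced for illustration, not part of VLAI.

# Illustrative sketch (not the authors' implementation) of scoring frontiers
# for long-term goal selection. `llm_ask`, `vlm_answer`, and `embed` are
# hypothetical stand-ins for an LLM question generator, a VLM visual
# question-answerer, and a text encoder.
from dataclasses import dataclass
from typing import Callable, List, Optional
import math


@dataclass
class Frontier:
    position: tuple                # (x, y) cell on the incomplete semantic map
    observation: object            # saved RGB observation covering this frontier
    object_categories: List[str]   # object categories detected near the frontier


def cosine(u: List[float], v: List[float]) -> float:
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)


def describe_frontier(obs, llm_ask: Callable, vlm_answer: Callable, rounds: int = 3) -> str:
    # Chat-Describer-style loop: the LLM proposes questions about the saved
    # observation, the VLM answers them, and the dialogue is collected into a
    # textual description of the frontier.
    dialogue = []
    for _ in range(rounds):
        question = llm_ask(dialogue)
        answer = vlm_answer(obs, question)
        dialogue.append((question, answer))
    return " ".join(f"{q} {a}" for q, a in dialogue)


def score_frontiers(target: str, frontiers: List[Frontier],
                    llm_ask, vlm_answer, embed, alpha: float = 0.5) -> Optional[Frontier]:
    # Combine category-level semantic similarity with description similarity
    # (the weight `alpha` is an assumption) and return the best-scoring frontier.
    target_vec = embed(target)
    best, best_score = None, -1.0
    for f in frontiers:
        cat_sim = max((cosine(embed(c), target_vec) for c in f.object_categories), default=0.0)
        desc = describe_frontier(f.observation, llm_ask, vlm_answer)
        desc_sim = cosine(embed(desc), target_vec)
        score = alpha * cat_sim + (1 - alpha) * desc_sim
        if score > best_score:
            best, best_score = f, score
    return best


# Toy usage with trivial stand-ins (a real system would plug in an actual LLM,
# VLM, and sentence encoder here):
# best = score_frontiers("chair", frontiers,
#                        llm_ask=lambda dialogue: "What objects are visible?",
#                        vlm_answer=lambda obs, q: "a sofa and a table",
#                        embed=lambda text: [float(len(text)), 1.0])

The sketch only shows the control flow; in the released code the stand-in callables would correspond to the actual LLM, VLM, and text-encoder components described in the paper.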
Journal description:
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.