Learning hierarchical scene graph and contrastive learning for object goal navigation

IF 7.2 · Zone 1 (Computer Science) · JCR Q1: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Knowledge-Based Systems Pub Date : 2025-04-30 DOI:10.1016/j.knosys.2025.113532

Jian Luo, Jian Zhang, Bo Cai, Yaoxiang Yu, Aihua Ke

Knowledge-Based Systems, Volume 319, Article 113532. Journal Article. https://www.sciencedirect.com/science/article/pii/S0950705125005787

Citations: 0

Abstract

Object goal navigation (ObjNav) requires an agent to locate a given target object within a complex, dynamic scene. To accomplish the task, the agent must understand the scene well, make executable decisions in fewer steps, avoid collisions, and navigate to the target. Efficient environmental perception and scene-graph-inspired path planning are therefore important for ObjNav. In this paper, we present hierarchical scene graph (HSG) contrastive learning, a framework with three components: (1) a multimodal graph mixer that aligns visual and textual information using an open-vocabulary detector with GLIP; it acts as an “eagle eye” that perceives target-related frontiers and suppresses irrelevant information; (2) a graph constructor that takes observed RGB-D images and incrementally builds a hierarchical scene graph; it acts as the “brain” that memorizes the common scene layout; and (3) an action-control contrastive learning module that takes the graph's contextual relationships as input to predict optimal actions toward the target; it serves as the “limbs” of the agent, coordinating and correcting incorrect movements. Experiments on Gibson, HM3D, MP3D, and ProcTHOR show that navigation plans from the HSG framework achieve significantly higher success rates than existing map-based methods, indicating that navigation using commonsense knowledge from language models is feasible and leads to efficient semantic exploration. Code is available at https://github.com/luosword/HSG4VN.
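To make the idea of incrementally building a hierarchical scene graph more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the two-level scene, room, object hierarchy, the class names (`ObjectNode`, `RoomNode`, `HierarchicalSceneGraph`), and the "keep the most confident detection per label" merge rule are all hypothetical choices made for this example. The abstract does not specify the graph's levels or update rule; see the linked repository for the actual method.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ObjectNode:
    """A detected object (hypothetical fields, for illustration only)."""
    label: str
    position: Tuple[float, float, float]  # assumed world coordinates
    confidence: float                     # assumed detector score in [0, 1]


@dataclass
class RoomNode:
    """A mid-level node grouping the objects observed in one region."""
    name: str
    objects: Dict[str, ObjectNode] = field(default_factory=dict)


@dataclass
class HierarchicalSceneGraph:
    """Toy hierarchy: scene -> rooms -> objects (an assumed structure)."""
    rooms: Dict[str, RoomNode] = field(default_factory=dict)

    def update(self, room_name: str, detections: List[ObjectNode]) -> None:
        """Merge one frame's detections into the graph incrementally."""
        room = self.rooms.setdefault(room_name, RoomNode(room_name))
        for det in detections:
            known = room.objects.get(det.label)
            # Assumed merge rule: keep the most confident observation.
            if known is None or det.confidence > known.confidence:
                room.objects[det.label] = det

    def rooms_containing(self, label: str) -> List[str]:
        """Return the names of rooms where `label` has been observed."""
        return [r.name for r in self.rooms.values() if label in r.objects]


# Usage: feed per-frame detections as the agent explores.
graph = HierarchicalSceneGraph()
graph.update("kitchen", [ObjectNode("sink", (1.0, 0.0, 2.0), 0.9)])
graph.update("kitchen", [ObjectNode("sink", (1.1, 0.0, 2.0), 0.6),
                         ObjectNode("fridge", (3.0, 0.0, 1.0), 0.8)])
graph.update("bedroom", [ObjectNode("bed", (7.0, 0.0, 4.0), 0.95)])
print(graph.rooms_containing("fridge"))
```

In this sketch, a query such as `rooms_containing` plays the role of the memorized layout: once a room has been linked to an object category, a planner could prioritize that room when the same category is requested as a goal.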




Source journal

Knowledge-Based Systems (Engineering & Technology: Computer Science, Artificial Intelligence)

CiteScore: 14.80

Self-citation rate: 12.50%

Articles published: 1245

Review time: 7.8 months

Journal description: Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems built on knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, provide balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.