Learning hierarchical scene graph and contrastive learning for object goal navigation

Jian Luo, Jian Zhang, Bo Cai, Yaoxiang Yu, Aihua Ke

Knowledge-Based Systems, Volume 319, Article 113532 (published 2025-04-30)
DOI: 10.1016/j.knosys.2025.113532
Article: https://www.sciencedirect.com/science/article/pii/S0950705125005787
Code: https://github.com/luosword/HSG4VN
Citations: 0
Abstract
The task of object goal navigation (ObjNav) requires an agent to locate a given target object within a complex, dynamic scene. To accomplish the task, the agent must understand the scene well, make executable decisions in fewer steps, avoid collisions, and successfully navigate to the target. Efficient environmental perception and scene-graph-inspired path planning are therefore important for the ObjNav task. In this paper, we present hierarchical scene graph (HSG) contrastive learning, which consists of (1) a multimodal graph mixer that aligns visual and textual information using the open-vocabulary detector GLIP; it can be regarded as an “eagle eye” that perceives target-related frontiers and suppresses irrelevant information; (2) a graph constructor that takes observed RGB-D images and incrementally builds a hierarchical scene graph; it acts as the “brain” that memorizes the common scene layout; and (3) an action-control contrastive learning module that takes the graph’s contextual relationships as input to predict optimal actions toward the target; it is treated as the “limbs” of the agent, coordinating movement and correcting incorrect motions. On the ObjNav task, experiments on Gibson, HM3D, MP3D, and ProcTHOR demonstrate that navigation plans from the HSG framework achieve significantly higher success rates than existing map-based methods, indicating the feasibility of navigating with commonsense knowledge from language models, which leads to efficient semantic exploration. Code is available at https://github.com/luosword/HSG4VN.
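The abstract describes the graph constructor only at a high level: it incrementally folds per-frame detections into a hierarchical scene graph that the agent later queries. As an illustration only, a minimal two-level graph (rooms → observed objects) updated incrementally might look like the sketch below; the class, method names, and graph layers here are hypothetical and do not come from the paper.

```python
from collections import defaultdict


class HierarchicalSceneGraph:
    """Toy two-level scene graph: rooms -> detected objects.

    Hypothetical sketch; the paper's actual graph hierarchy and
    update rules are not specified in the abstract.
    """

    def __init__(self):
        # room label -> set of object labels observed in that room
        self.rooms = defaultdict(set)
        # object label -> list of (x, y) positions where it was seen
        self.positions = defaultdict(list)

    def add_observation(self, room, detections):
        """Incrementally merge one frame's detections into the graph.

        detections: iterable of (object_label, (x, y)) pairs, e.g. as
        produced by an open-vocabulary detector such as GLIP.
        """
        for label, pos in detections:
            self.rooms[room].add(label)
            self.positions[label].append(pos)

    def candidate_rooms(self, target):
        """Rooms where the target object has already been observed."""
        return [r for r, objs in self.rooms.items() if target in objs]


# Build the graph incrementally from three simulated frames.
graph = HierarchicalSceneGraph()
graph.add_observation("kitchen", [("mug", (1.0, 2.0)), ("sink", (1.5, 0.5))])
graph.add_observation("living_room", [("sofa", (4.0, 3.0))])
graph.add_observation("kitchen", [("mug", (1.1, 2.1))])
```

A planner could then query `candidate_rooms("mug")` to bias exploration toward rooms where the target class was previously seen, which is the kind of memorized-layout lookup the abstract attributes to the “brain” component.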
Journal overview:
Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems built with knowledge-based and other artificial-intelligence techniques. The journal aims to support human prediction and decision-making through data science and computational techniques, to provide balanced coverage of theory and practical study, and to encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.