Memory-MambaNav: Enhancing object-goal navigation through integration of spatial–temporal scanning with state space models

IF 4.2 · CAS Tier 3 (Computer Science) · JCR Q2, Computer Science, Artificial Intelligence
Leyuan Sun, Yusuke Yoshiyasu
DOI: 10.1016/j.imavis.2025.105522
Journal: Image and Vision Computing, Volume 158, Article 105522
Publication date: 2025-03-27
Publication type: Journal Article
Full text: https://www.sciencedirect.com/science/article/pii/S0262885625001106
Citations: 0

Abstract

Memory-MambaNav: Enhancing object-goal navigation through integration of spatial–temporal scanning with state space models
Object-goal Navigation (ObjectNav) involves locating a specified target object using a textual command combined with semantic understanding in an unknown environment. This requires the embodied agent to have advanced spatial and temporal comprehension about environment during navigation. While earlier approaches focus on spatial modeling, they either do not utilize episodic temporal memory (e.g., keeping track of explored and unexplored spaces) or are computationally prohibitive, as long-horizon memory knowledge is resource-intensive in both storage and training. To address this issue, this paper introduces the Memory-MambaNav model, which employs multiple Mamba-based layers for refined spatial–temporal modeling. Leveraging the Mamba architecture, known for its global receptive field and linear complexity, Memory-MambaNav can efficiently extract and process memory knowledge from accumulated historical observations. To enhance spatial modeling, we introduce the Memory Spatial Difference State Space Model (MSD-SSM) to address the limitations of previous CNN and Transformer-based models in terms of receptive field and computational demand. For temporal modeling, the proposed Memory Temporal Serialization SSM (MTS-SSM) leverages Mamba’s selective scanning capabilities in a cross-temporal manner, enhancing the model’s temporal understanding and interaction with bi-temporal features. We also integrate memory-aggregated egocentric obstacle-awareness embeddings (MEOE) and memory-based fine-grained rewards into our end-to-end policy training, which improve obstacle understanding and accelerate convergence by fully utilizing memory knowledge. Our experiments on the AI2-Thor dataset confirm the benefits and superior performance of proposed Memory-MambaNav, demonstrating Mamba’s potential in ObjectNav, particularly in long-horizon trajectories. All demonstration videos referenced in this paper can be viewed on the webpage (https://sunleyuan.github.io/Memory-MambaNav).
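The abstract's efficiency claims rest on Mamba's selective scan: a state-space recurrence whose discretization step depends on the input, giving a global receptive field at linear cost in sequence length. The sketch below is a minimal, illustrative selective scan in plain NumPy, not the paper's MSD-SSM/MTS-SSM implementation; all names, shapes, and the fixed `B`/`C` projections are assumptions made for brevity.

```python
import numpy as np

def selective_scan(x, A, B_proj, C_proj, dt_proj):
    """Minimal 1-D selective state-space scan (Mamba-style), illustrative only.

    x       : (T, D) input sequence, T steps, D channels
    A       : (D, N) state matrix (negative entries, so the state decays)
    B_proj  : (D, N) input projection, used here as a fixed B for simplicity
    C_proj  : (D, N) read-out projection, used here as a fixed C
    dt_proj : (D,)   per-channel base step size
    """
    T, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                      # hidden state, one row per channel
    y = np.zeros((T, D))
    for t in range(T):                        # linear in T: one pass, O(1) state
        # Input-dependent step size: this is the "selective" part of the scan.
        dt = np.log1p(np.exp(dt_proj + x[t]))           # softplus, shape (D,)
        A_bar = np.exp(dt[:, None] * A)                 # discretized A, (D, N)
        B_bar = dt[:, None] * B_proj                    # discretized B, (D, N)
        h = A_bar * h + B_bar * x[t][:, None]           # h_t = A̅ h_{t-1} + B̅ x_t
        y[t] = (h * C_proj).sum(axis=1)                 # y_t = C h_t
    return y

# Hypothetical usage with random parameters:
rng = np.random.default_rng(0)
T, D, N = 8, 4, 3
x = rng.standard_normal((T, D))
A = -np.abs(rng.standard_normal((D, N)))      # negative real parts for stability
y = selective_scan(x, A,
                   rng.standard_normal((D, N)),
                   rng.standard_normal((D, N)),
                   rng.standard_normal(D))
```

Because the recurrence carries a fixed-size state `h` forward step by step, memory of the whole history is compressed into O(D·N) numbers, which is the property the paper exploits for long-horizon episodic memory without Transformer-style quadratic attention.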
Source journal
Image and Vision Computing (Engineering & Technology – Electrical & Electronic Engineering)
CiteScore: 8.50
Self-citation rate: 8.50%
Articles per year: 143
Review time: 7.8 months
Aims and scope: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.