Memory-MambaNav: Enhancing object-goal navigation through integration of spatial–temporal scanning with state space models
Authors: Leyuan Sun, Yusuke Yoshiyasu
Journal: Image and Vision Computing, Volume 158, Article 105522
Publication date: 2025-03-27
DOI: 10.1016/j.imavis.2025.105522
URL: https://www.sciencedirect.com/science/article/pii/S0262885625001106
Object-goal Navigation (ObjectNav) involves locating a specified target object in an unknown environment, using a textual command combined with semantic understanding. This requires the embodied agent to have advanced spatial and temporal comprehension of the environment during navigation. While earlier approaches focus on spatial modeling, they either do not utilize episodic temporal memory (e.g., keeping track of explored and unexplored spaces) or are computationally prohibitive, as long-horizon memory knowledge is resource-intensive in both storage and training. To address this issue, this paper introduces the Memory-MambaNav model, which employs multiple Mamba-based layers for refined spatial–temporal modeling. Leveraging the Mamba architecture, known for its global receptive field and linear complexity, Memory-MambaNav can efficiently extract and process memory knowledge from accumulated historical observations. To enhance spatial modeling, we introduce the Memory Spatial Difference State Space Model (MSD-SSM), which addresses the limitations of previous CNN- and Transformer-based models in terms of receptive field and computational demand. For temporal modeling, the proposed Memory Temporal Serialization SSM (MTS-SSM) leverages Mamba’s selective scanning capabilities in a cross-temporal manner, enhancing the model’s temporal understanding and interaction with bi-temporal features. We also integrate memory-aggregated egocentric obstacle-awareness embeddings (MEOE) and memory-based fine-grained rewards into our end-to-end policy training, which improve obstacle understanding and accelerate convergence by fully utilizing memory knowledge. Our experiments on the AI2-Thor dataset confirm the benefits and superior performance of the proposed Memory-MambaNav, demonstrating Mamba’s potential in ObjectNav, particularly in long-horizon trajectories. All demonstration videos referenced in this paper can be viewed on the webpage (https://sunleyuan.github.io/Memory-MambaNav).
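The abstract attributes Memory-MambaNav's efficiency to Mamba's linear complexity and selective scanning. As a rough illustration of that general idea only, the sketch below runs a one-dimensional state space recurrence over a sequence in a single pass, with a step size that depends on the current input. It is not the paper's MSD-SSM or MTS-SSM: the function name `selective_scan`, the scalar parameters `a`, `b`, `c`, and the step-size parameters `w_dt` and `bias_dt` are all hypothetical and chosen for readability.

```python
import math


def softplus(z):
    """Smooth positive function, used here to keep the step size > 0."""
    return math.log1p(math.exp(z))


def selective_scan(xs, a=-1.0, b=1.0, c=1.0, w_dt=1.0, bias_dt=0.0):
    """Toy 1-D selective state space scan (illustrative assumption only).

    Continuous model: h'(t) = a*h(t) + b*x(t), y(t) = c*h(t), with a != 0.
    Each input picks its own step size dt, which is the "selective" part.
    The loop touches each element once, so the cost is linear in len(xs).
    """
    h = 0.0  # hidden state carried across the whole sequence
    ys = []
    for x in xs:
        dt = softplus(w_dt * x + bias_dt)  # input-dependent step size
        a_bar = math.exp(a * dt)           # zero-order-hold discretization of a
        b_bar = (a_bar - 1.0) / a * b      # zero-order-hold discretization of b
        h = a_bar * h + b_bar * x          # recurrent state update
        ys.append(c * h)                   # readout
    return ys


if __name__ == "__main__":
    outputs = selective_scan([1.0] * 5)
    print(outputs)
```

Because the hidden state summarizes all earlier inputs, every output can depend on the full history without pairwise comparisons between time steps, which is the property the abstract contrasts with Transformer-based models.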
About the journal:
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.