StereoMamba: Real-Time and Robust Intraoperative Stereo Disparity Estimation via Long-Range Spatial Dependencies

Impact Factor: 5.3 · CAS Zone 2 (Computer Science) · JCR Q2 (Robotics)
Xu Wang, Jialang Xu, Shuai Zhang, Baoru Huang, Danail Stoyanov, Evangelos B. Mazomenos
Journal: IEEE Robotics and Automation Letters, vol. 10, no. 10, pp. 10682-10689
DOI: 10.1109/LRA.2025.3604749
Publication date: 2025-09-01
URL: https://ieeexplore.ieee.org/document/11146458/
Citations: 0

Abstract

Stereo disparity estimation is crucial for obtaining depth information in robot-assisted minimally invasive surgery (RAMIS). While current deep learning methods have made significant advancements, challenges remain in achieving an optimal balance between accuracy, robustness, and inference speed. To address these challenges, we propose the StereoMamba architecture, which is specifically designed for stereo disparity estimation in RAMIS. Our approach is based on a novel Feature Extraction Mamba (FE-Mamba) module, which enhances long-range spatial dependencies both within and across stereo images. To effectively integrate multi-scale features from FE-Mamba, we then introduce a novel Multidimensional Feature Fusion (MFF) module. Experiments against the state-of-the-art on the ex-vivo SCARED benchmark demonstrate that StereoMamba achieves superior performance on EPE of 2.64 px and depth MAE of 2.55 mm, the second-best performance on Bad2 of 41.49% and Bad3 of 26.99%, while maintaining an inference speed of 21.28 FPS for a pair of high-resolution images (1280 × 1024), striking the optimum balance between accuracy, robustness, and efficiency. Furthermore, by comparing synthesized right images, generated from warping left images using the generated disparity maps, with the actual right image, StereoMamba achieves the best average SSIM (0.8970) and PSNR (16.0761), exhibiting strong zero-shot generalization on the in-vivo RIS2017 and StereoMIS datasets.
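The abstract evaluates with standard disparity metrics (EPE, Bad2/Bad3) and with a disparity-warping check: the left image is warped by the predicted disparity to synthesize the right view, which is then compared against the real right image. The paper does not give its implementation; the following is a minimal NumPy sketch of those ideas, with hypothetical function names, a nearest-neighbour warp, and no occlusion handling, assuming rectified stereo where a left-image point at column x appears at column x − d in the right image:

```python
import numpy as np

def epe(pred, gt, mask=None):
    """Mean end-point error |pred - gt| in pixels over valid pixels."""
    if mask is None:
        mask = np.isfinite(gt)
    return float(np.abs(pred - gt)[mask].mean())

def bad_n(pred, gt, n, mask=None):
    """Percentage of valid pixels whose disparity error exceeds n px."""
    if mask is None:
        mask = np.isfinite(gt)
    return float(100.0 * (np.abs(pred - gt)[mask] > n).mean())

def warp_left_to_right(left, disp):
    """Synthesize the right view by sampling left at x + d (rectified
    stereo, nearest-neighbour rounding, edge-clamped, no occlusions)."""
    h, w = disp.shape
    xs = np.arange(w)[None, :] + np.rint(disp).astype(int)  # source cols
    xs = np.clip(xs, 0, w - 1)
    rows = np.arange(h)[:, None]
    return left[rows, xs]
```

A synthesized right view from `warp_left_to_right` can then be scored against the captured right image with SSIM/PSNR (e.g. via `skimage.metrics`) to assess disparity quality without ground truth, as in the paper's zero-shot experiments.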
Source journal: IEEE Robotics and Automation Letters (Computer Science: Computer Science Applications)
CiteScore: 9.60
Self-citation rate: 15.40%
Articles per year: 1428
Journal scope: The scope of this journal is to publish peer-reviewed articles that provide a timely and concise account of innovative research ideas and application results, reporting significant theoretical findings and application case studies in areas of robotics and automation.