MambaMIM: Pre-training Mamba with state space token interpolation and its application to medical image segmentation

IF 10.7 · CAS Tier 1 (Medicine) · Q1 Computer Science, Artificial Intelligence
Fenghe Tang, Bingkun Nian, Yingtai Li, Zihang Jiang, Jie Yang, Wei Liu, S. Kevin Zhou
{"title":"MambaMIM: Pre-training Mamba with state space token interpolation and its application to medical image segmentation","authors":"Fenghe Tang ,&nbsp;Bingkun Nian ,&nbsp;Yingtai Li ,&nbsp;Zihang Jiang ,&nbsp;Jie Yang ,&nbsp;Wei Liu ,&nbsp;S. Kevin Zhou","doi":"10.1016/j.media.2025.103606","DOIUrl":null,"url":null,"abstract":"<div><div>Recently, the state space model Mamba has demonstrated efficient long-sequence modeling capabilities, particularly for addressing long-sequence visual tasks in 3D medical imaging. However, existing generative self-supervised learning methods have not yet fully unleashed Mamba’s potential for handling long-range dependencies because they overlook the inherent causal properties of state space sequences in masked modeling. To address this challenge, we propose a general-purpose pre-training framework called MambaMIM, a masked image modeling method based on a novel <strong>TOKen-Interpolation</strong> strategy (TOKI) for the selective structure state space sequence, which learns causal relationships of state space within the masked sequence. Further, MambaMIM introduces a bottom-up 3D hybrid masking strategy to maintain a <strong>masking consistency</strong> across different architectures and can be used on any single or hybrid Mamba architecture to enhance its multi-scale and long-range representation capability. We pre-train MambaMIM on a large-scale dataset of 6.8K CT scans and evaluate its performance across eight public medical segmentation benchmarks. Extensive downstream experiments reveal the feasibility and advancement of using Mamba for medical image pre-training. In particular, when we apply the MambaMIM to a customized architecture that hybridizes MedNeXt and Vision Mamba, we consistently obtain the state-of-the-art segmentation performance. The code is available at: <span><span>https://github.com/FengheTan9/MambaMIM</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":18328,"journal":{"name":"Medical image analysis","volume":"103 ","pages":"Article 103606"},"PeriodicalIF":10.7000,"publicationDate":"2025-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Medical image analysis","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1361841525001537","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Recently, the state space model Mamba has demonstrated efficient long-sequence modeling capabilities, particularly for addressing long-sequence visual tasks in 3D medical imaging. However, existing generative self-supervised learning methods have not yet fully unleashed Mamba's potential for handling long-range dependencies because they overlook the inherent causal properties of state space sequences in masked modeling. To address this challenge, we propose a general-purpose pre-training framework called MambaMIM, a masked image modeling method based on a novel TOKen-Interpolation strategy (TOKI) for selective structured state space sequences, which learns the causal relationships of the state space within the masked sequence. Further, MambaMIM introduces a bottom-up 3D hybrid masking strategy that maintains masking consistency across different architectures and can be used on any single or hybrid Mamba architecture to enhance its multi-scale and long-range representation capability. We pre-train MambaMIM on a large-scale dataset of 6.8K CT scans and evaluate its performance across eight public medical segmentation benchmarks. Extensive downstream experiments demonstrate the feasibility and advantages of using Mamba for medical image pre-training. In particular, when we apply MambaMIM to a customized architecture that hybridizes MedNeXt and Vision Mamba, we consistently obtain state-of-the-art segmentation performance. The code is available at: https://github.com/FengheTan9/MambaMIM.
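The abstract names the TOKen-Interpolation (TOKI) idea without spelling out its formulation, so the following is a minimal, hypothetical sketch of one plausible reading: masked positions are filled by linearly interpolating between the nearest visible tokens, so that a causal left-to-right state-space scan sees a gap-free sequence. The function name toki_interpolate, the 1D token layout, and the linear weighting are all illustrative assumptions, not the paper's actual method; MambaMIM itself operates on 3D CT patch tokens with a hybrid masking scheme.

```python
import torch

def toki_interpolate(vis_feat: torch.Tensor, vis_pos: torch.Tensor, seq_len: int) -> torch.Tensor:
    """Hypothetical token interpolation: rebuild a full-length sequence by
    linearly interpolating masked positions between their nearest visible
    neighbors, keeping the causal scan order of the state space intact.

    vis_feat: (V, D) features of the visible (unmasked) tokens
    vis_pos:  (V,)   sorted indices of those tokens within [0, seq_len)
    """
    V, D = vis_feat.shape
    out = torch.zeros(seq_len, D, dtype=vis_feat.dtype)
    out[vis_pos] = vis_feat
    # Fill each masked gap between consecutive visible tokens.
    for i in range(V - 1):
        a, b = int(vis_pos[i]), int(vis_pos[i + 1])
        for p in range(a + 1, b):
            w = (p - a) / (b - a)                      # weight toward the right neighbor
            out[p] = (1.0 - w) * vis_feat[i] + w * vis_feat[i + 1]
    out[: int(vis_pos[0])] = vis_feat[0]               # clamp positions before the first visible token
    out[int(vis_pos[-1]) + 1 :] = vis_feat[-1]         # clamp positions after the last one
    return out

# Toy usage: 4 visible tokens out of a 16-token sequence, 32-dim features.
full = toki_interpolate(torch.randn(4, 32), torch.tensor([1, 5, 9, 14]), seq_len=16)
print(full.shape)  # torch.Size([16, 32])
```

Under this reading, the interpolated sequence would then be fed to a Mamba block and trained with a reconstruction loss on the masked voxels, in the usual masked-image-modeling fashion.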
Source journal: Medical Image Analysis (Engineering & Technology: Biomedical Engineering)
CiteScore: 22.10
Self-citation rate: 6.40%
Articles published: 309
Review time: 6.6 months
Journal introduction: Medical Image Analysis serves as a platform for sharing new research findings in the realm of medical and biological image analysis, with a focus on applications of computer vision, virtual reality, and robotics to biomedical imaging challenges. The journal prioritizes the publication of high-quality, original papers contributing to the fundamental science of processing, analyzing, and utilizing medical and biological images. It welcomes approaches utilizing biomedical image datasets across all spatial scales, from molecular/cellular imaging to tissue/organ imaging.