MiM: Mask in Mask Self-Supervised Pre-Training for 3D Medical Image Analysis

Jiaxin Zhuang, Linshan Wu, Qiong Wang, Peng Fei, Varut Vardhanabhuti, Lin Luo, Hao Chen
{"title":"MiM: Mask in Mask Self-Supervised Pre-Training for 3D Medical Image Analysis","authors":"Jiaxin Zhuang;Linshan Wu;Qiong Wang;Peng Fei;Varut Vardhanabhuti;Lin Luo;Hao Chen","doi":"10.1109/TMI.2025.3564382","DOIUrl":null,"url":null,"abstract":"The Vision Transformer (ViT) has demonstrated remarkable performance in Self-Supervised Learning (SSL) for 3D medical image analysis. Masked AutoEncoder (MAE) for feature pre-training can further unleash the potential of ViT on various medical vision tasks. However, due to large spatial sizes with much higher dimensions of 3D medical images, the lack of hierarchical design for MAE may hinder the performance of downstream tasks. In this paper, we propose a novel Mask in Mask (MiM) pre-training framework for 3D medical images, which aims to advance MAE by learning discriminative representation from hierarchical visual tokens across varying scales. We introduce multiple levels of granularity for masked inputs from the volume, which are then reconstructed simultaneously ranging at both fine and coarse levels. Additionally, a cross-level alignment mechanism is applied to adjacent level volumes to enforce anatomical similarity hierarchically. Furthermore, we adopt a hybrid backbone to enhance the hierarchical representation learning efficiently during the pre-training. MiM was pre-trained on a large scale of available 3D volumetric images, i.e., Computed Tomography (CT) images containing various body parts. Extensive experiments on twelve public datasets demonstrate the superiority of MiM over other SSL methods in organ/tumor segmentation and disease classification. We further scale up the MiM to large pre-training datasets with more than 10k volumes, showing that large-scale pre-training can further enhance the performance of downstream tasks. Code is available at <uri>https://github.com/JiaxinZhuang/MiM</uri>","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 9","pages":"3727-3740"},"PeriodicalIF":0.0000,"publicationDate":"2025-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on medical imaging","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10977020/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

The Vision Transformer (ViT) has demonstrated remarkable performance in Self-Supervised Learning (SSL) for 3D medical image analysis. Masked AutoEncoder (MAE) pre-training can further unleash the potential of ViT on various medical vision tasks. However, because 3D medical images have large spatial sizes and higher dimensionality, the lack of a hierarchical design in MAE may hinder performance on downstream tasks. In this paper, we propose a novel Mask in Mask (MiM) pre-training framework for 3D medical images, which advances MAE by learning discriminative representations from hierarchical visual tokens across varying scales. We introduce multiple levels of granularity for masked inputs from the volume, which are then reconstructed simultaneously at both fine and coarse levels. Additionally, a cross-level alignment mechanism is applied to adjacent-level volumes to enforce anatomical similarity hierarchically. Furthermore, we adopt a hybrid backbone to learn hierarchical representations efficiently during pre-training. MiM was pre-trained on a large collection of available 3D volumetric images, i.e., Computed Tomography (CT) scans covering various body parts. Extensive experiments on twelve public datasets demonstrate the superiority of MiM over other SSL methods in organ/tumor segmentation and disease classification. We further scale up MiM to large pre-training datasets with more than 10k volumes, showing that large-scale pre-training can further enhance downstream performance. Code is available at https://github.com/JiaxinZhuang/MiM.
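To make the hierarchical scheme concrete, the sketch below illustrates the two ideas the abstract describes: masked reconstruction at multiple granularities and a cross-level alignment term between adjacent levels. This is a minimal PyTorch sketch, not the authors' implementation; the patch sizes, mask ratios, linear encoder/decoder stubs, and cosine-based alignment objective are all illustrative assumptions (the official code is at https://github.com/JiaxinZhuang/MiM).

```python
# Illustrative sketch of two-level masked reconstruction with a
# cross-level alignment term. All hyperparameters and module choices
# here are assumptions for exposition, not the paper's settings.
import torch
import torch.nn.functional as F

def patchify(vol: torch.Tensor, p: int) -> torch.Tensor:
    """Split a (C, D, H, W) volume into (N, C*p^3) non-overlapping 3D patches."""
    c, d, h, w = vol.shape
    x = vol.reshape(c, d // p, p, h // p, p, w // p, p)
    return x.permute(1, 3, 5, 0, 2, 4, 6).reshape(-1, c * p ** 3)

def random_mask(n: int, ratio: float) -> torch.Tensor:
    """Boolean mask hiding `ratio` of n tokens."""
    m = torch.zeros(n, dtype=torch.bool)
    m[torch.randperm(n)[: int(n * ratio)]] = True
    return m

vol = torch.randn(1, 96, 96, 96)              # toy single-channel CT volume

# Coarse and fine tokenizations of the same volume (two granularity levels).
coarse = patchify(vol, p=32)                  # 3^3 = 27 coarse tokens
fine = patchify(vol, p=16)                    # 6^3 = 216 fine tokens
m_coarse = random_mask(coarse.size(0), 0.60)  # mask ratios are assumed values
m_fine = random_mask(fine.size(0), 0.75)

# Stand-in encoder/decoder pairs: the paper uses a hybrid backbone; linear
# layers keep this sketch short and runnable.
enc_c = torch.nn.Linear(coarse.size(1), 128)
enc_f = torch.nn.Linear(fine.size(1), 128)
dec_c = torch.nn.Linear(128, coarse.size(1))
dec_f = torch.nn.Linear(128, fine.size(1))

z_c, z_f = enc_c(coarse), enc_f(fine)

# Reconstruction at both granularities, scored only on masked tokens.
loss_rec = F.mse_loss(dec_c(z_c)[m_coarse], coarse[m_coarse]) \
         + F.mse_loss(dec_f(z_f)[m_fine], fine[m_fine])

# Cross-level alignment between adjacent levels: pool each level's tokens
# into a volume-level descriptor and pull the descriptors together
# (cosine similarity here; the paper's exact alignment loss may differ).
loss_align = 1 - F.cosine_similarity(z_c.mean(0), z_f.mean(0), dim=0)

loss = loss_rec + loss_align
```

A real pipeline would replace the linear stubs with the paper's hybrid backbone and feed only visible tokens to the encoder, as in standard MAE training; the sketch encodes all tokens so that it stays self-contained.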