MiM: Mask in Mask Self-Supervised Pre-Training for 3D Medical Image Analysis

Jiaxin Zhuang, Linshan Wu, Qiong Wang, Peng Fei, Varut Vardhanabhuti, Lin Luo, Hao Chen
{"title":"MiM: Mask in Mask Self-Supervised Pre-Training for 3D Medical Image Analysis","authors":"Jiaxin Zhuang;Linshan Wu;Qiong Wang;Peng Fei;Varut Vardhanabhuti;Lin Luo;Hao Chen","doi":"10.1109/TMI.2025.3564382","DOIUrl":null,"url":null,"abstract":"The Vision Transformer (ViT) has demonstrated remarkable performance in Self-Supervised Learning (SSL) for 3D medical image analysis. Masked AutoEncoder (MAE) for feature pre-training can further unleash the potential of ViT on various medical vision tasks. However, due to large spatial sizes with much higher dimensions of 3D medical images, the lack of hierarchical design for MAE may hinder the performance of downstream tasks. In this paper, we propose a novel Mask in Mask (MiM) pre-training framework for 3D medical images, which aims to advance MAE by learning discriminative representation from hierarchical visual tokens across varying scales. We introduce multiple levels of granularity for masked inputs from the volume, which are then reconstructed simultaneously ranging at both fine and coarse levels. Additionally, a cross-level alignment mechanism is applied to adjacent level volumes to enforce anatomical similarity hierarchically. Furthermore, we adopt a hybrid backbone to enhance the hierarchical representation learning efficiently during the pre-training. MiM was pre-trained on a large scale of available 3D volumetric images, i.e., Computed Tomography (CT) images containing various body parts. Extensive experiments on twelve public datasets demonstrate the superiority of MiM over other SSL methods in organ/tumor segmentation and disease classification. We further scale up the MiM to large pre-training datasets with more than 10k volumes, showing that large-scale pre-training can further enhance the performance of downstream tasks. Code is available at <uri>https://github.com/JiaxinZhuang/MiM</uri>","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 9","pages":"3727-3740"},"PeriodicalIF":0.0000,"publicationDate":"2025-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on medical imaging","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10977020/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

The Vision Transformer (ViT) has demonstrated remarkable performance in Self-Supervised Learning (SSL) for 3D medical image analysis. Masked AutoEncoder (MAE) pre-training can further unleash the potential of ViT on various medical vision tasks. However, because 3D medical images have large spatial sizes and higher dimensionality, the lack of a hierarchical design in MAE may hinder performance on downstream tasks. In this paper, we propose a novel Mask in Mask (MiM) pre-training framework for 3D medical images, which advances MAE by learning discriminative representations from hierarchical visual tokens across varying scales. We introduce multiple levels of granularity for masked inputs from the volume, which are then reconstructed simultaneously at both fine and coarse levels. Additionally, a cross-level alignment mechanism is applied to adjacent-level volumes to enforce anatomical similarity hierarchically. Furthermore, we adopt a hybrid backbone to learn hierarchical representations efficiently during pre-training. MiM was pre-trained on a large collection of available 3D volumetric images, i.e., Computed Tomography (CT) scans covering various body parts. Extensive experiments on twelve public datasets demonstrate the superiority of MiM over other SSL methods in organ/tumor segmentation and disease classification. We further scale up MiM to large pre-training datasets with more than 10k volumes, showing that large-scale pre-training can further enhance downstream performance. Code is available at https://github.com/JiaxinZhuang/MiM.
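To make the hierarchical scheme concrete, the sketch below illustrates the two ideas the abstract describes: masked reconstruction at multiple granularities and a cross-level alignment term between adjacent levels. This is a minimal PyTorch sketch, not the authors' implementation; the patch sizes, mask ratios, linear encoder/decoder stubs, and cosine-based alignment objective are all illustrative assumptions (the official code is at https://github.com/JiaxinZhuang/MiM).

```python
# Illustrative sketch of two-level masked reconstruction with a
# cross-level alignment term. All hyperparameters and module choices
# here are assumptions for exposition, not the paper's settings.
import torch
import torch.nn.functional as F

def patchify(vol: torch.Tensor, p: int) -> torch.Tensor:
    """Split a (C, D, H, W) volume into (N, C*p^3) non-overlapping 3D patches."""
    c, d, h, w = vol.shape
    x = vol.reshape(c, d // p, p, h // p, p, w // p, p)
    return x.permute(1, 3, 5, 0, 2, 4, 6).reshape(-1, c * p ** 3)

def random_mask(n: int, ratio: float) -> torch.Tensor:
    """Boolean mask hiding `ratio` of n tokens."""
    m = torch.zeros(n, dtype=torch.bool)
    m[torch.randperm(n)[: int(n * ratio)]] = True
    return m

vol = torch.randn(1, 96, 96, 96)              # toy single-channel CT volume

# Coarse and fine tokenizations of the same volume (two granularity levels).
coarse = patchify(vol, p=32)                  # 3^3 = 27 coarse tokens
fine = patchify(vol, p=16)                    # 6^3 = 216 fine tokens
m_coarse = random_mask(coarse.size(0), 0.60)  # mask ratios are assumed values
m_fine = random_mask(fine.size(0), 0.75)

# Stand-in encoder/decoder pairs: the paper uses a hybrid backbone; linear
# layers keep this sketch short and runnable.
enc_c = torch.nn.Linear(coarse.size(1), 128)
enc_f = torch.nn.Linear(fine.size(1), 128)
dec_c = torch.nn.Linear(128, coarse.size(1))
dec_f = torch.nn.Linear(128, fine.size(1))

z_c, z_f = enc_c(coarse), enc_f(fine)

# Reconstruction at both granularities, scored only on masked tokens.
loss_rec = F.mse_loss(dec_c(z_c)[m_coarse], coarse[m_coarse]) \
         + F.mse_loss(dec_f(z_f)[m_fine], fine[m_fine])

# Cross-level alignment between adjacent levels: pool each level's tokens
# into a volume-level descriptor and pull the descriptors together
# (cosine similarity here; the paper's exact alignment loss may differ).
loss_align = 1 - F.cosine_similarity(z_c.mean(0), z_f.mean(0), dim=0)

loss = loss_rec + loss_align
```

A real pipeline would replace the linear stubs with the paper's hybrid backbone and feed only visible tokens to the encoder, as in standard MAE training; the sketch encodes all tokens so that it stays self-contained.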