Hi-End-MAE: Hierarchical encoder-driven masked autoencoders are stronger vision learners for medical image segmentation

IF 11.8 | Zone 1 (Medicine) | JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Medical Image Analysis | Pub Date: 2025-09-12 | DOI: 10.1016/j.media.2025.103770

Fenghe Tang, Qingsong Yao, Wenxin Ma, Chenxu Wu, Zihang Jiang, S. Kevin Zhou

Medical Image Analysis, volume 107, Article 103770. Journal Article. https://www.sciencedirect.com/science/article/pii/S1361841525003160

Citations: 0

Abstract

Medical image segmentation remains a formidable challenge due to label scarcity. Pre-training a Vision Transformer (ViT) through masked image modeling (MIM) on large-scale unlabeled medical datasets presents a promising solution, providing both computational efficiency and model generalization for various downstream tasks. However, current ViT-based MIM pre-training frameworks predominantly emphasize local aggregation representations in output layers and fail to exploit the rich representations across different ViT layers, which better capture the fine-grained semantic information needed for more precise medical downstream tasks. To fill this gap, we present Hierarchical Encoder-driven MAE (Hi-End-MAE), a simple yet effective ViT-based pre-training solution that centers on two key innovations: (1) encoder-driven reconstruction, which encourages the encoder to learn more informative features to guide the reconstruction of masked patches; and (2) hierarchical dense decoding, which implements a hierarchical decoding structure to capture rich representations across different layers. We pre-train Hi-End-MAE on a large-scale dataset of 10K CT scans and evaluate its performance across nine public medical image segmentation benchmarks. Extensive experiments demonstrate that Hi-End-MAE achieves superior transfer learning capabilities across various downstream tasks, revealing the potential of ViT in medical imaging applications. The code is available at: https://github.com/FengheTan9/Hi-End-MAE.
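For readers unfamiliar with masked image modeling, the generic idea the abstract builds on can be sketched as follows. This is a minimal illustration of MAE-style random patch masking and a reconstruction loss computed on masked patches only; it is not the Hi-End-MAE implementation, and it does not model encoder-driven reconstruction or hierarchical dense decoding. The function names `random_patch_mask` and `masked_mse`, the 0.75 mask ratio, and the toy patch values are all assumptions made for illustration.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=None):
    """Split patch indices into (visible, masked) index lists.

    Hypothetical helper: a generic MIM-style random mask, not the
    masking code used by the paper.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)  # random permutation of patch indices
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(order[:num_masked])    # hidden from the encoder
    visible = sorted(order[num_masked:])   # the only patches the encoder sees
    return visible, masked


def masked_mse(target, prediction, masked):
    """Mean squared error averaged over the masked patches only."""
    if not masked:
        raise ValueError("at least one masked patch is required")
    total, count = 0.0, 0
    for i in masked:
        for t, p in zip(target[i], prediction[i]):
            total += (t - p) ** 2
            count += 1
    return total / count


if __name__ == "__main__":
    # Toy example: 16 patches with 2 values each (assumed sizes).
    visible, masked = random_patch_mask(16, 0.75, seed=0)
    target = [[1.0, 1.0] for _ in range(16)]
    prediction = [[0.0, 0.0] for _ in range(16)]
    print(len(visible), len(masked))
    print(masked_mse(target, prediction, masked))
```

In a real pre-training pipeline the encoder would process only the visible patches and a decoder would predict the masked ones; the loss above shows only why unmasked patches do not contribute to the training signal.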




Source journal

Medical Image Analysis (Engineering & Technology: Engineering, Biomedical)

CiteScore: 22.10

Self-citation rate: 6.40%

Publication volume: 309

Review time: 6.6 months

Journal description: Medical Image Analysis serves as a platform for sharing new research findings in the realm of medical and biological image analysis, with a focus on applications of computer vision, virtual reality, and robotics to biomedical imaging challenges. The journal prioritizes the publication of high-quality, original papers contributing to the fundamental science of processing, analyzing, and utilizing medical and biological images. It welcomes approaches utilizing biomedical image datasets across all spatial scales, from molecular/cellular imaging to tissue/organ imaging.