Improving Representation of High-Frequency Components for Medical Visual Foundation Models

Yuetan Chu, Yilan Zhang, Zhongyi Han, Changchun Yang, Longxi Zhou, Gongning Luo, Chao Huang, Xin Gao
{"title":"Improving Representation of High-Frequency Components for Medical Visual Foundation Models","authors":"Yuetan Chu;Yilan Zhang;Zhongyi Han;Changchun Yang;Longxi Zhou;Gongning Luo;Chao Huang;Xin Gao","doi":"10.1109/TMI.2025.3559402","DOIUrl":null,"url":null,"abstract":"Foundation models have attracted significant attention for their impressive generalizability across diverse downstream tasks. However, they are demonstrated to exhibit great limitations in representing high-frequency components and fine-grained details. In many medical imaging tasks, precise representation of such information is crucial due to the inherently intricate anatomical structures, sub-visual features, and complex boundaries involved. Consequently, the limited representation of prevalent foundation models can result in considerable performance degradation or even failure in these tasks. To address these challenges, we propose a novel pretraining strategy for both 2D images and 3D volumes, named Frequency-advanced Representation Autoencoder (Frepa). Through high-frequency masking and low-frequency perturbation combined with embedding consistency learning, Frepa encourages the encoder to effectively represent and preserve high-frequency components in the image embeddings. Additionally, we introduce an innovative histogram-equalized image masking strategy, extending the Masked Autoencoder approach beyond ViT to other architectures such as Swin-Transformer and convolutional networks. We develop Frepa across nine medical modalities and validate it on 32 downstream tasks for both 2D images and 3D volumes. Without fine-tuning, Frepa can outperform other self-supervised pretraining methods and, in some cases, even surpasses task-specific foundation models. This improvement is particularly significant for tasks involving fine-grained details, such as achieving up to a +15% increase in dice score for retina vessel segmentation and a +8% increase in IoU for lung tumor detection. Further experiment quantitatively reveals that Frepa enables superior high-frequency representations and preservation in the embeddings, underscoring its potential for developing more generalized and universal medical image foundation models.","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 8","pages":"3196-3209"},"PeriodicalIF":0.0000,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on medical imaging","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10960415/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Foundation models have attracted significant attention for their impressive generalizability across diverse downstream tasks. However, they have been shown to exhibit substantial limitations in representing high-frequency components and fine-grained details. In many medical imaging tasks, precise representation of such information is crucial due to the inherently intricate anatomical structures, sub-visual features, and complex boundaries involved. Consequently, the limited representation capacity of prevalent foundation models can result in considerable performance degradation, or even failure, in these tasks. To address these challenges, we propose a novel pretraining strategy for both 2D images and 3D volumes, named Frequency-advanced Representation Autoencoder (Frepa). Through high-frequency masking and low-frequency perturbation combined with embedding consistency learning, Frepa encourages the encoder to effectively represent and preserve high-frequency components in the image embeddings. Additionally, we introduce an innovative histogram-equalized image masking strategy, extending the Masked Autoencoder approach beyond ViT to other architectures such as Swin-Transformer and convolutional networks. We develop Frepa across nine medical modalities and validate it on 32 downstream tasks for both 2D images and 3D volumes. Without fine-tuning, Frepa can outperform other self-supervised pretraining methods and, in some cases, even surpass task-specific foundation models. This improvement is particularly significant for tasks involving fine-grained details, such as achieving up to a +15% increase in Dice score for retinal vessel segmentation and a +8% increase in IoU for lung tumor detection. Further experiments quantitatively reveal that Frepa enables superior high-frequency representation and preservation in the embeddings, underscoring its potential for developing more generalized and universal medical image foundation models.
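The abstract's two frequency-domain augmentations, and the consistency objective that ties them together, can be made concrete with a short sketch. The Python code below shows one plausible form of high-frequency masking (retaining only a low-frequency disk of the spectrum), low-frequency perturbation (randomly rescaling low-frequency amplitudes while leaving high frequencies intact), and an embedding-consistency loss between original and augmented views. This is a minimal sketch, not the paper's implementation: the cutoff radius, perturbation scale, cosine-similarity loss, and encoder interface are all illustrative assumptions.

```python
# Sketch of frequency-domain augmentations plus embedding consistency,
# loosely following the abstract's description of Frepa. All hyper-parameters
# (radius, scale) and the encoder interface are illustrative placeholders.
import torch
import torch.nn.functional as F


def _radial_mask(h: int, w: int, radius: float) -> torch.Tensor:
    """Boolean (H, W) mask that is True inside a centered low-frequency disk."""
    ys = torch.arange(h).view(-1, 1) - h / 2
    xs = torch.arange(w).view(1, -1) - w / 2
    return (ys ** 2 + xs ** 2).sqrt() <= radius


def high_frequency_mask(img: torch.Tensor, radius: float = 16.0) -> torch.Tensor:
    """Zero out all frequencies outside the low-frequency disk (B, C, H, W)."""
    spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    keep = _radial_mask(img.shape[-2], img.shape[-1], radius).to(img.device)
    spec = spec * keep  # high frequencies are masked away
    return torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1))).real


def low_frequency_perturb(img: torch.Tensor, radius: float = 16.0,
                          scale: float = 0.3) -> torch.Tensor:
    """Randomly rescale low-frequency amplitudes; keep high frequencies intact."""
    spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    low = _radial_mask(img.shape[-2], img.shape[-1], radius).to(img.device)
    jitter = 1.0 + scale * (2 * torch.rand_like(spec.real) - 1)
    spec = torch.where(low, spec * jitter, spec)
    return torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1))).real


def consistency_loss(encoder, img: torch.Tensor, aug: torch.Tensor) -> torch.Tensor:
    """Encourage the encoder to embed original and augmented views alike."""
    z_img, z_aug = encoder(img), encoder(aug)  # assumed to return (B, D)
    return 1 - F.cosine_similarity(z_img, z_aug, dim=-1).mean()
```

During pretraining, both augmented views would be passed through the same encoder alongside the original image, so that embeddings of the filtered and perturbed views stay aligned with the unfiltered one; this is what pushes high-frequency information into the embedding rather than letting the encoder discard it.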
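The histogram-equalized masking strategy can likewise be sketched. One plausible reading, under the assumption that masked patches are filled with a histogram-equalized copy of the image rather than dropped outright, is shown below: because the input stays spatially dense, architectures without a token-dropping mechanism, such as Swin-Transformer and convolutional networks, can still be pretrained in the Masked-Autoencoder style. The patch size and mask ratio here are illustrative placeholders, not the paper's settings.

```python
# Sketch of a histogram-equalized masking scheme for a single-channel uint8
# image. The specific fill rule (equalized patches) is an assumption about the
# method; patch size and mask ratio are illustrative.
import numpy as np


def equalize_hist(img: np.ndarray) -> np.ndarray:
    """Classic histogram equalization for a single-channel uint8 image."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum().astype(np.float64)
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())  # normalize CDF to [0, 1]
    return (cdf[img] * 255).astype(np.uint8)


def histogram_equalized_masking(img: np.ndarray, patch: int = 16,
                                ratio: float = 0.75, rng=None) -> np.ndarray:
    """Replace a random subset of patches with their equalized counterpart."""
    if rng is None:
        rng = np.random.default_rng()
    eq = equalize_hist(img)
    out = img.copy()
    h, w = img.shape
    # Iterate over full patches only; each patch is masked with probability `ratio`.
    for y in range(0, h - h % patch, patch):
        for x in range(0, w - w % patch, patch):
            if rng.random() < ratio:
                out[y:y + patch, x:x + patch] = eq[y:y + patch, x:x + patch]
    return out
```

Under this reading, the reconstruction target stays the original image, so the encoder must learn to recover true intensities within the masked regions rather than merely inpaint missing pixels, while the dense input remains valid for convolutional feature extraction.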