基于混合监督的胸部x射线基础模型密度预测。

IF 9.8 1区医学 Q1 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

IEEE Transactions on Medical Imaging Pub Date : 2025-07-16 DOI:10.1109/tmi.2025.3589928

Fuying Wang,Lequan Yu

{"title":"基于混合监督的胸部x射线基础模型密度预测。","authors":"Fuying Wang,Lequan Yu","doi":"10.1109/tmi.2025.3589928","DOIUrl":null,"url":null,"abstract":"Foundation models have significantly revolutionized the field of chest X-ray diagnosis with their ability to transfer across various diseases and tasks. However, previous works have predominantly utilized self-supervised learning from medical image-text pairs, which falls short in dense medical prediction tasks due to their sole reliance on such coarse pair supervision, thereby limiting their applicability to detailed diagnostics. In this paper, we introduce a Dense Chest X-ray Foundation Model (DCXFM), which utilizes mixed supervision types (i.e., text, label, and segmentation masks) to significantly enhance the scalability of foundation models across various medical tasks. Our model involves two training stages: we first employ a novel self-distilled multimodal pretraining paradigm to exploit text and label supervision, along with local-to-global self-distillation and soft cross-modal contrastive alignment strategies to enhance localization capabilities. Subsequently, we introduce an efficient cost aggregation module, comprising spatial and class aggregation mechanisms, to further advance dense prediction tasks with densely annotated datasets. Comprehensive evaluations on three tasks (phrase grounding, zero-shot semantic segmentation, and zero-shot classification) demonstrate DCXFM's superior performance over other state-of-the-art medical image-text pretraining models. Remarkably, DCXFM exhibits powerful zero-shot capabilities across various datasets in phrase grounding and zero-shot semantic segmentation, underscoring its superior generalization in dense prediction tasks.","PeriodicalId":13418,"journal":{"name":"IEEE Transactions on Medical Imaging","volume":"2 1","pages":""},"PeriodicalIF":9.8000,"publicationDate":"2025-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Scaling Chest X-ray Foundation Models from Mixed Supervisions for Dense Prediction.\",\"authors\":\"Fuying Wang,Lequan Yu\",\"doi\":\"10.1109/tmi.2025.3589928\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Foundation models have significantly revolutionized the field of chest X-ray diagnosis with their ability to transfer across various diseases and tasks. However, previous works have predominantly utilized self-supervised learning from medical image-text pairs, which falls short in dense medical prediction tasks due to their sole reliance on such coarse pair supervision, thereby limiting their applicability to detailed diagnostics. In this paper, we introduce a Dense Chest X-ray Foundation Model (DCXFM), which utilizes mixed supervision types (i.e., text, label, and segmentation masks) to significantly enhance the scalability of foundation models across various medical tasks. Our model involves two training stages: we first employ a novel self-distilled multimodal pretraining paradigm to exploit text and label supervision, along with local-to-global self-distillation and soft cross-modal contrastive alignment strategies to enhance localization capabilities. Subsequently, we introduce an efficient cost aggregation module, comprising spatial and class aggregation mechanisms, to further advance dense prediction tasks with densely annotated datasets. Comprehensive evaluations on three tasks (phrase grounding, zero-shot semantic segmentation, and zero-shot classification) demonstrate DCXFM's superior performance over other state-of-the-art medical image-text pretraining models. Remarkably, DCXFM exhibits powerful zero-shot capabilities across various datasets in phrase grounding and zero-shot semantic segmentation, underscoring its superior generalization in dense prediction tasks.\",\"PeriodicalId\":13418,\"journal\":{\"name\":\"IEEE Transactions on Medical Imaging\",\"volume\":\"2 1\",\"pages\":\"\"},\"PeriodicalIF\":9.8000,\"publicationDate\":\"2025-07-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Medical Imaging\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://doi.org/10.1109/tmi.2025.3589928\",\"RegionNum\":1,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Medical Imaging","FirstCategoryId":"5","ListUrlMain":"https://doi.org/10.1109/tmi.2025.3589928","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}

引用次数: 0

摘要

基础模型具有跨越各种疾病和任务的能力，已经显著地改变了胸部x线诊断领域。然而，以前的工作主要是利用医学图像-文本对的自监督学习，由于它们完全依赖于这种粗对监督，因此在密集的医学预测任务中存在不足，从而限制了它们对详细诊断的适用性。在本文中，我们引入了密集胸部x射线基础模型（DCXFM），该模型利用混合监督类型（即文本、标签和分割掩码）来显著增强基础模型在各种医疗任务中的可扩展性。我们的模型包括两个训练阶段：我们首先采用一种新的自蒸馏多模态预训练范式来利用文本和标签监督，以及局部到全局的自蒸馏和软跨模态对比对齐策略来增强定位能力。随后，我们引入了一个高效的成本聚合模块，包括空间和类聚合机制，以进一步推进密集注释数据集的密集预测任务。对三个任务（短语基础、零样本语义分割和零样本分类）的综合评估表明，DCXFM优于其他最先进的医学图像-文本预训练模型。值得注意的是，DCXFM在短语基础和语义分割方面展示了强大的跨各种数据集的零采样能力，强调了其在密集预测任务中的优越泛化能力。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Scaling Chest X-ray Foundation Models from Mixed Supervisions for Dense Prediction.

Foundation models have significantly revolutionized the field of chest X-ray diagnosis with their ability to transfer across various diseases and tasks. However, previous works have predominantly utilized self-supervised learning from medical image-text pairs, which falls short in dense medical prediction tasks due to their sole reliance on such coarse pair supervision, thereby limiting their applicability to detailed diagnostics. In this paper, we introduce a Dense Chest X-ray Foundation Model (DCXFM), which utilizes mixed supervision types (i.e., text, label, and segmentation masks) to significantly enhance the scalability of foundation models across various medical tasks. Our model involves two training stages: we first employ a novel self-distilled multimodal pretraining paradigm to exploit text and label supervision, along with local-to-global self-distillation and soft cross-modal contrastive alignment strategies to enhance localization capabilities. Subsequently, we introduce an efficient cost aggregation module, comprising spatial and class aggregation mechanisms, to further advance dense prediction tasks with densely annotated datasets. Comprehensive evaluations on three tasks (phrase grounding, zero-shot semantic segmentation, and zero-shot classification) demonstrate DCXFM's superior performance over other state-of-the-art medical image-text pretraining models. Remarkably, DCXFM exhibits powerful zero-shot capabilities across various datasets in phrase grounding and zero-shot semantic segmentation, underscoring its superior generalization in dense prediction tasks.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE Transactions on Medical Imaging 医学-成像科学与照相技术

CiteScore

21.80

自引率

5.70%

发文量

637

审稿时长

5.6 months

期刊介绍： The IEEE Transactions on Medical Imaging (T-MI) is a journal that welcomes the submission of manuscripts focusing on various aspects of medical imaging. The journal encourages the exploration of body structure, morphology, and function through different imaging techniques, including ultrasound, X-rays, magnetic resonance, radionuclides, microwaves, and optical methods. It also promotes contributions related to cell and molecular imaging, as well as all forms of microscopy. T-MI publishes original research papers that cover a wide range of topics, including but not limited to novel acquisition techniques, medical image processing and analysis, visualization and performance, pattern recognition, machine learning, and other related methods. The journal particularly encourages highly technical studies that offer new perspectives. By emphasizing the unification of medicine, biology, and imaging, T-MI seeks to bridge the gap between instrumentation, hardware, software, mathematics, physics, biology, and medicine by introducing new analysis methods. While the journal welcomes strong application papers that describe novel methods, it directs papers that focus solely on important applications using medically adopted or well-established methods without significant innovation in methodology to other journals. T-MI is indexed in Pubmed® and Medline®, which are products of the United States National Library of Medicine.