{"title":"基于混合监督的胸部x射线基础模型密度预测。","authors":"Fuying Wang,Lequan Yu","doi":"10.1109/tmi.2025.3589928","DOIUrl":null,"url":null,"abstract":"Foundation models have significantly revolutionized the field of chest X-ray diagnosis with their ability to transfer across various diseases and tasks. However, previous works have predominantly utilized self-supervised learning from medical image-text pairs, which falls short in dense medical prediction tasks due to their sole reliance on such coarse pair supervision, thereby limiting their applicability to detailed diagnostics. In this paper, we introduce a Dense Chest X-ray Foundation Model (DCXFM), which utilizes mixed supervision types (i.e., text, label, and segmentation masks) to significantly enhance the scalability of foundation models across various medical tasks. Our model involves two training stages: we first employ a novel self-distilled multimodal pretraining paradigm to exploit text and label supervision, along with local-to-global self-distillation and soft cross-modal contrastive alignment strategies to enhance localization capabilities. Subsequently, we introduce an efficient cost aggregation module, comprising spatial and class aggregation mechanisms, to further advance dense prediction tasks with densely annotated datasets. Comprehensive evaluations on three tasks (phrase grounding, zero-shot semantic segmentation, and zero-shot classification) demonstrate DCXFM's superior performance over other state-of-the-art medical image-text pretraining models. Remarkably, DCXFM exhibits powerful zero-shot capabilities across various datasets in phrase grounding and zero-shot semantic segmentation, underscoring its superior generalization in dense prediction tasks.","PeriodicalId":13418,"journal":{"name":"IEEE Transactions on Medical Imaging","volume":"2 1","pages":""},"PeriodicalIF":9.8000,"publicationDate":"2025-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Scaling Chest X-ray Foundation Models from Mixed Supervisions for Dense Prediction.\",\"authors\":\"Fuying Wang,Lequan Yu\",\"doi\":\"10.1109/tmi.2025.3589928\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Foundation models have significantly revolutionized the field of chest X-ray diagnosis with their ability to transfer across various diseases and tasks. However, previous works have predominantly utilized self-supervised learning from medical image-text pairs, which falls short in dense medical prediction tasks due to their sole reliance on such coarse pair supervision, thereby limiting their applicability to detailed diagnostics. In this paper, we introduce a Dense Chest X-ray Foundation Model (DCXFM), which utilizes mixed supervision types (i.e., text, label, and segmentation masks) to significantly enhance the scalability of foundation models across various medical tasks. Our model involves two training stages: we first employ a novel self-distilled multimodal pretraining paradigm to exploit text and label supervision, along with local-to-global self-distillation and soft cross-modal contrastive alignment strategies to enhance localization capabilities. Subsequently, we introduce an efficient cost aggregation module, comprising spatial and class aggregation mechanisms, to further advance dense prediction tasks with densely annotated datasets. Comprehensive evaluations on three tasks (phrase grounding, zero-shot semantic segmentation, and zero-shot classification) demonstrate DCXFM's superior performance over other state-of-the-art medical image-text pretraining models. Remarkably, DCXFM exhibits powerful zero-shot capabilities across various datasets in phrase grounding and zero-shot semantic segmentation, underscoring its superior generalization in dense prediction tasks.\",\"PeriodicalId\":13418,\"journal\":{\"name\":\"IEEE Transactions on Medical Imaging\",\"volume\":\"2 1\",\"pages\":\"\"},\"PeriodicalIF\":9.8000,\"publicationDate\":\"2025-07-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Medical Imaging\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://doi.org/10.1109/tmi.2025.3589928\",\"RegionNum\":1,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Medical Imaging","FirstCategoryId":"5","ListUrlMain":"https://doi.org/10.1109/tmi.2025.3589928","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
Scaling Chest X-ray Foundation Models from Mixed Supervisions for Dense Prediction.
Foundation models have significantly revolutionized the field of chest X-ray diagnosis with their ability to transfer across various diseases and tasks. However, previous works have predominantly utilized self-supervised learning from medical image-text pairs, which falls short in dense medical prediction tasks due to their sole reliance on such coarse pair supervision, thereby limiting their applicability to detailed diagnostics. In this paper, we introduce a Dense Chest X-ray Foundation Model (DCXFM), which utilizes mixed supervision types (i.e., text, label, and segmentation masks) to significantly enhance the scalability of foundation models across various medical tasks. Our model involves two training stages: we first employ a novel self-distilled multimodal pretraining paradigm to exploit text and label supervision, along with local-to-global self-distillation and soft cross-modal contrastive alignment strategies to enhance localization capabilities. Subsequently, we introduce an efficient cost aggregation module, comprising spatial and class aggregation mechanisms, to further advance dense prediction tasks with densely annotated datasets. Comprehensive evaluations on three tasks (phrase grounding, zero-shot semantic segmentation, and zero-shot classification) demonstrate DCXFM's superior performance over other state-of-the-art medical image-text pretraining models. Remarkably, DCXFM exhibits powerful zero-shot capabilities across various datasets in phrase grounding and zero-shot semantic segmentation, underscoring its superior generalization in dense prediction tasks.
期刊介绍:
The IEEE Transactions on Medical Imaging (T-MI) is a journal that welcomes the submission of manuscripts focusing on various aspects of medical imaging. The journal encourages the exploration of body structure, morphology, and function through different imaging techniques, including ultrasound, X-rays, magnetic resonance, radionuclides, microwaves, and optical methods. It also promotes contributions related to cell and molecular imaging, as well as all forms of microscopy.
T-MI publishes original research papers that cover a wide range of topics, including but not limited to novel acquisition techniques, medical image processing and analysis, visualization and performance, pattern recognition, machine learning, and other related methods. The journal particularly encourages highly technical studies that offer new perspectives. By emphasizing the unification of medicine, biology, and imaging, T-MI seeks to bridge the gap between instrumentation, hardware, software, mathematics, physics, biology, and medicine by introducing new analysis methods.
While the journal welcomes strong application papers that describe novel methods, it directs papers that focus solely on important applications using medically adopted or well-established methods without significant innovation in methodology to other journals. T-MI is indexed in Pubmed® and Medline®, which are products of the United States National Library of Medicine.