Leveraging Image-text Pairs for Generalized Category Discovery in Medical Image Classification.

Wei Feng, Bingjie Wang, Zhonghua Wang, Sijin Zhou, Zongyuan Ge
IEEE Transactions on Medical Imaging (early access). DOI: 10.1109/TMI.2026.3689859. Publication date: 2026-05-04.

Abstract

Generalized category discovery (GCD) aims to identify both known medical categories and previously unseen ones in unlabeled data by transferring knowledge from labeled datasets that contain only known categories, a capability crucial for disease understanding and precision medicine. Many methods have been proposed that significantly improve GCD performance on medical images. However, most existing methods discover new categories from the image modality alone, ignoring the useful information in the large amount of disease-related textual data. In this paper, we propose M3GCD (Medical Multi-Modal Generalized Category Discovery), which exploits image-text pairs to jointly recognize known classes and discover novel categories in medical images. To address the varying contribution of different modalities across samples, we develop a Dynamic Expert Fusion module that automatically learns sample-specific modality weights, and further design a Local Experts Balancing mechanism to preserve the discriminative power of each individual modality. By integrating global and local perspectives, our framework adaptively balances modality contributions and enhances multi-modal robustness. To enable the discovery of novel categories during training, we further propose a Category Diffusion module grounded in the Metropolis-Hastings framework. This module adaptively merges and splits categories, allowing the model to simultaneously recognize known classes and uncover previously unseen categories, without requiring any prior knowledge of the unknown categories. Extensive experiments on two public multi-modal datasets (MIMIC-CXR and PatchGastric), together with a private multi-modal fundus dataset, MM-Retina, demonstrate that our method consistently improves clustering performance on both known and unknown categories compared with existing approaches.
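To make the two named mechanisms concrete, the following is a minimal sketch, not the paper's actual implementation: a softmax gating function standing in for Dynamic Expert Fusion (producing sample-specific weights over image and text features), and a Metropolis-Hastings acceptance rule of the kind a Category Diffusion merge/split step could use. All function names, shapes, and the scoring interface are hypothetical assumptions for illustration.

```python
import math
import random

import numpy as np


def softmax(x):
    # Numerically stable softmax along the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def dynamic_fusion(img_feat, txt_feat, w_gate):
    """Fuse image/text features with sample-specific modality weights.

    img_feat, txt_feat: (batch, dim) feature arrays.
    w_gate: (2 * dim, 2) gating matrix (hypothetical learned parameter).
    Returns the fused features and the per-sample modality weights.
    """
    feats = np.stack([img_feat, txt_feat], axis=1)               # (batch, 2, dim)
    logits = np.concatenate([img_feat, txt_feat], axis=-1) @ w_gate  # (batch, 2)
    weights = softmax(logits)                                    # rows sum to 1
    fused = (weights[:, :, None] * feats).sum(axis=1)            # (batch, dim)
    return fused, weights


def mh_accept(score_current, score_proposed, temperature=1.0, rng=random.random):
    """Metropolis-Hastings acceptance for a proposed cluster merge/split.

    Accepts the proposal with probability min(1, exp((s' - s) / T)); better
    partitions are always accepted, worse ones occasionally, which lets the
    number of categories grow or shrink without a fixed prior.
    """
    ratio = math.exp(min(0.0, (score_proposed - score_current) / temperature))
    return rng() <= ratio
```

A usage pattern under these assumptions: at each training step, `dynamic_fusion` produces the fused representation used for clustering, while `mh_accept` decides whether a candidate re-partition (scored by some clustering objective) replaces the current one.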
