Leveraging Image-text Pairs for Generalized Category Discovery in Medical Image Classification

Wei Feng, Bingjie Wang, Zhonghua Wang, Sijin Zhou, Zongyuan Ge

IEEE Transactions on Medical Imaging. Published 4 May 2026. DOI: 10.1109/TMI.2026.3689859
Citations: 0
Abstract
Generalized category discovery (GCD) aims to identify known medical categories and discover unknown new categories from unlabeled data by transferring knowledge from labeled datasets that contain only known categories, a capability that is crucial for disease understanding and precision medicine. Many methods have been proposed and have significantly improved GCD performance on medical images. However, most existing methods discover new categories from the image modality alone, ignoring the useful information in the large amount of disease-related textual data. In this paper, we propose M³GCD (Medical Multi-Modal Generalized Category Discovery), which exploits image-text pairs to jointly recognize known classes and discover novel categories in medical images. To address the varying contribution of different modalities across samples, we develop a Dynamic Expert Fusion module that automatically learns sample-specific modality weights, and further design a Local Experts Balancing mechanism to preserve the discriminative power of individual modalities. By integrating global and local perspectives, our framework adaptively balances modality contributions and enhances multi-modal robustness. To enable the discovery of novel unknown categories during training, we further propose a Category Diffusion module grounded in the Metropolis-Hastings framework. This module adaptively merges and splits categories, allowing the model to simultaneously recognize known classes and uncover previously unseen categories, without requiring any prior knowledge of the unknown categories. Extensive experiments on two public multi-modal datasets (MIMIC-CXR and PatchGastric) and a private multi-modal fundus dataset, MM-Retina, demonstrate that our method consistently improves clustering performance on both known and unknown categories compared with existing approaches.
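The abstract does not specify the Dynamic Expert Fusion architecture, but the core idea it names, a gating network that produces sample-specific weights over the image and text modalities, can be sketched as follows. All names, shapes, and the single-layer gate here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax along the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dynamic_fusion(img_feat, txt_feat, W_gate):
    """Fuse image and text features with sample-specific weights.

    A tiny gating network scores each modality per sample; a softmax
    turns the two scores into convex fusion weights, so samples where
    the text is more informative lean on the text expert, and vice versa.
    """
    gate_in = np.concatenate([img_feat, txt_feat], axis=-1)  # (N, 2D)
    weights = softmax(gate_in @ W_gate)                      # (N, 2)
    fused = weights[:, :1] * img_feat + weights[:, 1:] * txt_feat
    return fused, weights

rng = np.random.default_rng(0)
N, D = 4, 8
img, txt = rng.normal(size=(N, D)), rng.normal(size=(N, D))
W = rng.normal(size=(2 * D, 2)) * 0.1        # gate parameters (learned in practice)
fused, w = dynamic_fusion(img, txt, W)
```

In a real model the gate would be trained end-to-end with the rest of the network, and the paper's Local Experts Balancing mechanism would additionally regularize each modality's own head.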
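Likewise, the Metropolis-Hastings grounding of the Category Diffusion module is only named, not detailed. The generic mechanism, proposing cluster merges and splits and accepting them with the Metropolis-Hastings rule so the number of categories can grow or shrink without being fixed in advance, is sketched below on a toy 1-D clustering problem; the score function, proposal scheme, and constants are hypothetical stand-ins, not the paper's formulation:

```python
import math
import random

def mh_accept(score_new, score_old, temperature=1.0):
    """Metropolis-Hastings acceptance: always accept a better partition,
    accept a worse one with probability exp(delta / T)."""
    delta = (score_new - score_old) / temperature
    return delta >= 0 or random.random() < math.exp(delta)

def score(partition, penalty=0.5):
    """Toy partition score: low within-cluster spread, with a per-cluster
    penalty so the number of categories stays adaptive."""
    spread = 0.0
    for cluster in partition:
        mu = sum(cluster) / len(cluster)
        spread += sum((x - mu) ** 2 for x in cluster)
    return -spread - penalty * len(partition)

def propose(partition):
    """Randomly merge two clusters or split one cluster in half."""
    p = [list(c) for c in partition]
    if len(p) > 1 and random.random() < 0.5:
        i, j = random.sample(range(len(p)), 2)            # merge i and j
        merged = p[i] + p[j]
        p = [c for k, c in enumerate(p) if k not in (i, j)]
        p.append(merged)
    else:
        i = random.randrange(len(p))                      # split cluster i
        if len(p[i]) > 1:
            random.shuffle(p[i])
            half = len(p[i]) // 2
            p.append(p[i][half:])
            p[i] = p[i][:half]
    return p

def category_diffusion(points, steps=300, seed=0):
    """Run the merge/split chain, starting from one big cluster."""
    random.seed(seed)
    current = [list(points)]
    s_cur = score(current)
    for _ in range(steps):
        candidate = propose(current)
        s_new = score(candidate)
        if mh_accept(s_new, s_cur):
            current, s_cur = candidate, s_new
    return current

# Two well-separated groups; the chain can discover the second "category".
data = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
clusters = category_diffusion(data)
```

In the paper's setting the partition would live over learned multi-modal features rather than scalars, and the score would come from the model, but the accept/reject skeleton is the standard Metropolis-Hastings one.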