Jian Zhang, Kaihao He, Zunlei Feng, Shuifa Sun, Xiaoyan Sun, Zhenming Yuan, Jun Yu
Pattern Recognition, volume 167, Article 111715. Published 2025-05-02. DOI: 10.1016/j.patcog.2025.111715. Journal impact factor: 7.6 (JCR Q1, Computer Science, Artificial Intelligence). https://www.sciencedirect.com/science/article/pii/S0031320325003759
Self-adaptive image-text fusion for medical image classification
Multimodal classification that uses both medical images and text reports advances computer-aided disease diagnosis, but its performance depends heavily on the quality of image-text fusion. Because of the semantic gap and the weak correlation between image and text, current image-text fusion approaches do not achieve satisfactory results. We propose a self-adaptive image-text fusion approach to multimodal medical image classification. For semantic alignment, we learn a mapping from image to text that mitigates the inter-modality semantic gap. For feature alignment, we estimate a binary correlation mask with a Jensen–Shannon Divergence (JSD) loss to retrieve the image and text features that are strongly correlated. We then propose a parameter-free feature fusion method based on a Simplified-Attention mechanism, which queries image features using text features and concatenates the results to achieve computationally efficient feature fusion. Finally, we fuse all the image and text features for medical image classification. Experimental results on three datasets show that the proposed approach outperforms a group of state-of-the-art methods and demonstrates superior medical interpretability.
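The abstract names two building blocks without giving their exact form: a Jensen–Shannon divergence and a parameter-free attention in which text features query image features before concatenation. The sketch below is one plausible reading, not the authors' implementation. It assumes scaled dot-product attention with no learned projection matrices, with image features acting as both keys and values. The function names (`js_divergence`, `simplified_attention`, `fuse`) and the toy feature vectors are hypothetical. The binary correlation mask and the image-to-text mapping are not modeled here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def simplified_attention(text_feats, image_feats):
    """Each text vector queries the image vectors with no learned weights.

    Assumed form (not taken from the paper): scaled dot-product attention
    where the image features serve as both keys and values.
    """
    dim = len(image_feats[0])
    attended = []
    for query in text_feats:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in image_feats]
        weights = softmax(scores)
        # Weighted sum of image vectors, one output dimension at a time.
        attended.append([sum(w * v[d] for w, v in zip(weights, image_feats))
                         for d in range(dim)])
    return attended


def fuse(text_feats, image_feats):
    """Concatenate each text vector with its attended image summary."""
    attended = simplified_attention(text_feats, image_feats)
    return [t + a for t, a in zip(text_feats, attended)]


if __name__ == "__main__":
    # Toy features: 2 text tokens and 3 image regions, all 2-dimensional.
    text = [[1.0, 0.0], [0.0, 1.0]]
    image = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
    fused = fuse(text, image)
    print(len(fused), len(fused[0]))
    print(js_divergence([1.0, 0.0], [0.0, 1.0]))
```

Because the attention step has no trainable parameters, its cost is only the dot products and the softmax, which is consistent with the abstract's claim of computationally efficient fusion. In this sketch each fused vector has twice the feature dimension: the original text vector followed by its image summary.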
About the journal:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.