Knowledge fusion in deep learning-based medical vision-language models: A review
Dexuan Xu, Yanyuan Chen, Zhongyan Chai, Yifan Xiao, Yandong Yan, Weiping Ding, Hanpin Wang, Zhi Jin, Wenpin Jiao, Weihua Yue, Hang Li, Yu Huang
Information Fusion, volume 125, Article 103455. Published 2025-07-07. Impact factor: 15.5 (JCR Q1, Computer Science, Artificial Intelligence).
DOI: 10.1016/j.inffus.2025.103455
https://www.sciencedirect.com/science/article/pii/S1566253525005287
Citation count: 0
Abstract
Deep learning-based medical vision-language models automatically extract image features and fuse them with text information, which has driven the rapid development of multimodal medical artificial intelligence. However, the complexity of the medical domain requires such models to have deep professional knowledge. Knowledge fusion therefore offers a new approach to medical vision-language tasks. Unlike existing reviews, this paper systematically organizes knowledge fusion methods in medical vision-language models from two perspectives: the stage at which knowledge is fused and the task-oriented fusion strategy, and provides a new theoretical framework for research in the field. First, we introduce the classification of medical knowledge and its applicable scenarios in detail. We then systematically discuss deep learning-based knowledge fusion algorithms and summarize the four knowledge fusion stages in medical vision-language models: data construction, pretraining, feature representation, and inference. In addition, we comprehensively analyze the specific knowledge fusion strategies used in five types of medical vision-language tasks (medical report generation, medical visual question answering, medical language-guided segmentation, medical multimodal pretraining, and multimodal large language models) and summarize in detail the evaluation methods based on knowledge fusion. Finally, we outline future research directions, including enhanced interpretability, mixture-of-experts models, and knowledge editing, aiming to provide researchers with references of both theoretical value and practical significance.
About the journal:
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses as well as those demonstrating their application to real-world problems will be welcome.