Knowledge fusion in deep learning-based medical vision-language models: A review

IF 15.5 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Information Fusion | Pub Date: 2025-07-07 | DOI: 10.1016/j.inffus.2025.103455

Dexuan Xu, Yanyuan Chen, Zhongyan Chai, Yifan Xiao, Yandong Yan, Weiping Ding, Hanpin Wang, Zhi Jin, Wenpin Jiao, Weihua Yue, Hang Li, Yu Huang

Information Fusion, volume 125, Article 103455. https://www.sciencedirect.com/science/article/pii/S1566253525005287

Citations: 0

Abstract

Deep learning-based medical vision-language models automatically extract image features and fuse them with text, which has accelerated the development of multimodal medical artificial intelligence. However, the complexity of the medical domain requires models to draw on deep professional knowledge, and knowledge fusion offers a new approach to medical vision-language tasks. Unlike existing reviews, this paper systematically organizes the knowledge fusion methods used in medical vision-language models from two perspectives: the stage at which knowledge is fused and the task-oriented fusion strategy, providing a new theoretical framework for research in the field. We first introduce the categories of medical knowledge and the scenarios to which each applies. We then discuss deep learning-based knowledge fusion algorithms and summarize the four stages at which knowledge can be fused in a medical vision-language model: data construction, pretraining, feature representation, and inference. We further analyze the specific knowledge fusion strategies used in five types of medical vision-language tasks (medical report generation, medical visual question answering, medical language-guided segmentation, medical multimodal pretraining, and multimodal large language models), and summarize the knowledge fusion-based evaluation methods. Finally, we outline future research directions, including enhanced interpretability, mixture-of-experts models, and knowledge editing, with the aim of giving researchers a reference of both theoretical and practical value.

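Source journal: Information Fusion (Engineering and Technology, Computer Science: Theory and Methods)

CiteScore: 33.20
Self-citation rate: 4.30%
Articles published: 161
Review time: 7.9 months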

About the journal: Information Fusion serves as a central platform for showcasing advances in multi-sensor, multi-source, multi-process information fusion and for fostering collaboration among the diverse disciplines that drive its progress. It is the leading outlet for research and development in this field, focusing on architectures, algorithms, and applications. Papers presenting fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.