Dynamic feature fusion guiding and multimodal large language model refining for medical image report generation
Authors: Pu Han, Xiong Li, Shenqi Jing, Jianxiang Wei
Journal: Expert Systems with Applications, volume 299, Article 130082 (Impact Factor 7.5; JCR Q1, Computer Science, Artificial Intelligence)
Publication type: Journal Article
Published: 2025-10-23
DOI: 10.1016/j.eswa.2025.130082
URL: https://www.sciencedirect.com/science/article/pii/S095741742503698X
Medical image report generation is the automatic generation of text descriptions for specific medical images. In recent years, growing demand for medical imaging from both patients and healthcare institutions has significantly increased radiologists' workloads. At the same time, shortages in medical resources and diagnostic capabilities have raised the risk of diagnostic delays and misinterpretations in medical imaging. To ease the burden on medical professionals and support accurate diagnoses, automated medical report generation has attracted a growing number of researchers, and systems that combine deep learning methods with general Large Language Models (LLMs) have been developed. However, existing methods are limited in how effectively they integrate visual and textual data, and they ignore the fact that the contributions of different modalities to diagnostic results vary across cases. These approaches also fail to address the lack of specialized medical knowledge in general LLMs. This paper introduces the Dynamic Feature Fusion Guiding and Multimodal Large Language Model Refining (DFFG-MLLMR) framework, which addresses these limitations through two key components: (1) the DFFG module dynamically adjusts the contributions of visual and textual features based on their diagnostic relevance, ensuring optimal feature utilization for report generation; (2) the MLLMR module integrates visual retrieval methods with fine-tuned LLMs to generate comprehensive and accurate medical reports. Our method outperforms the baseline methods on both benchmark datasets. On the IU-Xray dataset, DFFG-MLLMR achieves a BLEU-4 of 0.191 and a CIDEr of 0.574, exceeding the best conventional approach, Token-Mixer. On the MIMIC-CXR dataset, it achieves a BLEU-4 of 0.132 and a CIDEr of 0.289, improving upon Token-Mixer by 0.008 and 0.126, respectively. Experiments on public datasets demonstrate the superiority of DFFG-MLLMR, showing significant improvements in cross-modal feature fusion performance and enhanced diagnostic quality in automated reports. Furthermore, ablation studies confirm that the DFFG and MLLMR modules contribute complementary improvements, collectively enhancing the accuracy and clinical reliability of reports. The code is available at https://github.com/BearLiX/DFFG-MLLMR.
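The abstract does not specify how the DFFG module computes its weighting, so the following is only a minimal sketch of one common way to let the balance between two modalities vary per case: a scalar sigmoid gate computed from the concatenated features. It is not the authors' implementation; the function names, the `gate_weights` and `gate_bias` parameters, and the choice of a single scalar gate are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Logistic function mapping a real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    visual: Sequence[float],
    textual: Sequence[float],
    gate_weights: Sequence[float],
    gate_bias: float = 0.0,
) -> List[float]:
    """Blend two same-length feature vectors with one per-sample gate.

    The gate is computed from the concatenated features, so the mix of
    modalities changes from case to case. All parameters here are
    hypothetical stand-ins for learned values.
    """
    if len(visual) != len(textual):
        raise ValueError("visual and textual features must have equal length")
    joint = list(visual) + list(textual)
    if len(gate_weights) != len(joint):
        raise ValueError("gate_weights must match the concatenated length")
    # Linear score over both modalities, squashed to a weight in (0, 1).
    score = sum(w * x for w, x in zip(gate_weights, joint)) + gate_bias
    g = sigmoid(score)  # weight given to the visual modality
    # Convex combination: g for visual features, (1 - g) for textual ones.
    return [g * v + (1.0 - g) * t for v, t in zip(visual, textual)]
```

With all-zero gate weights and zero bias the gate is 0.5 and the result is an even mix of the two vectors; a strongly positive score pushes the output toward the visual features, and a strongly negative one toward the textual features.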
About the journal:
Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.