Lejun Gong, Jiaming Yang, Shengyuan Han, Yimu Ji
MedBLIP: A multimodal method of medical question-answering based on fine-tuning large language model
Computerized Medical Imaging and Graphics, volume 124, Article 102581 (published 2025-05-31)
DOI: 10.1016/j.compmedimag.2025.102581
https://www.sciencedirect.com/science/article/pii/S0895611125000904
Impact factor: 5.4; JCR: Q1 (Engineering, Biomedical)
Citations: 0
Abstract
Medical visual question answering is crucial for interpreting medical images that contain clinically relevant information. This study proposes MedBLIP (Medical Treatment Bootstrapping Language-Image Pretraining), a method for visual language generation tasks on chest X-rays. Building on the BLIP-2 model, the method combines an image encoder with a large language model and generates medical question-answering text while keeping the image encoder frozen. First, chest X-ray images are preprocessed, and an image sample generation algorithm is used to augment the doctor-patient question-answering text data, increasing data diversity. Then, a multi-layer convolutional image feature extractor is introduced to better capture the feature representation of medical images. For fine-tuning the large language generation model, a new unfreezing strategy is proposed: different proportions of the fully connected layer weights are unfrozen so that the model adapts to medical-domain data. The image feature extractor extracts key features from images and supplies the model with rich visual information, while the text feature extractor captures the essential requirements of the user's question. Working together, the two extractors let the model integrate medical images and user inquiries more effectively, producing more accurate and relevant output. Experimental results show that unfreezing 31.25% of the fully connected layer weights significantly improves model performance, with ROUGE-L reaching 66.12%, providing a more accurate and efficient answer generation solution for the medical field. The method has potential applications in medical language generation tasks.
Although the proposed model cannot yet replace human radiologists, it can play a valuable role in improving diagnostic efficiency, assisting decision-making, and supporting medical research. With continued technological advances, the model's performance is expected to improve further and its value in medical applications to grow. The implementation is available at https://github.com/JiminFohill/MedicalChat.git.
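The ROUGE-L figure reported in the abstract is a longest-common-subsequence (LCS) overlap score between a generated answer and a reference answer. The abstract does not say which ROUGE-L variant or tokenizer was used, so the following is only a minimal sketch under assumptions: lowercase whitespace tokenization, sentence-level scoring, and an F-measure with `beta = 1`. The function names and the example sentences are illustrative and are not taken from the authors' repository.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)  # DP row for the previous token of `a`
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str, beta: float = 1.0) -> float:
    """Sentence-level ROUGE-L F-measure in [0, 1].

    Assumptions (not stated in the abstract): lowercase whitespace
    tokenization and beta = 1, i.e. precision and recall weighted equally.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    b2 = beta * beta
    return (1 + b2) * precision * recall / (recall + b2 * precision)


if __name__ == "__main__":
    # Hypothetical generated answer and reference answer.
    generated = "no acute cardiopulmonary abnormality is seen"
    reference = "no acute cardiopulmonary abnormality"
    print(round(rouge_l(generated, reference), 4))  # 0.8
```

In the example, the LCS covers all four reference tokens (recall 1.0) but only four of the six generated tokens (precision 0.667), which gives an F-measure of 0.8. A corpus-level score such as the one quoted above would typically be an average of per-sample scores, expressed as a percentage.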
About the journal:
The purpose of the journal Computerized Medical Imaging and Graphics is to act as a source for the exchange of research results concerning algorithmic advances, development, and application of digital imaging in disease detection, diagnosis, intervention, prevention, precision medicine, and population health. Included in the journal will be articles on novel computerized imaging or visualization techniques, including artificial intelligence and machine learning, augmented reality for surgical planning and guidance, big biomedical data visualization, computer-aided diagnosis, computerized-robotic surgery, image-guided therapy, imaging scanning and reconstruction, mobile and tele-imaging, radiomics, and imaging integration and modeling with other information relevant to digital health. The types of biomedical imaging include: magnetic resonance, computed tomography, ultrasound, nuclear medicine, X-ray, microwave, optical and multi-photon microscopy, video and sensory imaging, and the convergence of biomedical images with other non-imaging datasets.