Intelligent analysis of chest X-ray based on multi-modal instruction tuning

Junjie Yao, Junhao Wang, Zhenxiang Xiao, Xinlin Hao, Xi Jiang

Meta-Radiology, Volume 3, Issue 3, Article 100172. Published 2025-08-19.
DOI: 10.1016/j.metrad.2025.100172
Citations: 0
Abstract
Chest X-ray plays a crucial role in the screening and diagnosis of chest diseases. However, owing to the complexity of pathological manifestations and the limits of radiologists' experience, the accuracy and efficiency of chest disease diagnosis still need improvement. In recent years, deep learning has made significant progress in chest X-ray image analysis, yet existing methods rely mainly on unimodal visual information and overlook the prior knowledge about disease categories embedded in medical text, making it difficult to fully capture the deep semantics of chest X-ray images. To address these challenges, and inspired by the Instruction-ViT model, we adopt instruction tuning to integrate medical textual information into the fine-tuning of the visual model. In addition, a contrastive learning loss is employed to align textual and visual features, enhancing the model's capacity to understand and differentiate complex pathological patterns. Experimental results demonstrate that the model integrating medical text information outperforms unimodal models across various evaluation metrics, confirming that, with instruction tuning, the model can effectively exploit medical text as prior knowledge to improve the performance of visual models in chest disease diagnosis. Finally, an interpretability analysis of the model's decision-making process reveals that the regions the model attends to correspond closely to the radiographic manifestations of the respective diseases, demonstrating a degree of interpretability.
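The abstract describes the approach only at a high level. As a rough illustration, the sketch below shows one common way the two ingredients it names could be realized in PyTorch: disease-description (instruction) tokens fused into the input sequence of a ViT-style encoder, in the spirit of Instruction-ViT, plus a symmetric CLIP-style contrastive loss aligning the visual and text representations. All class and function names, dimensions, the backbone, and the temperature value are illustrative assumptions; the paper does not specify its implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InstructionTunedViT(nn.Module):
    """Sketch: a ViT-style encoder whose input sequence is augmented with
    instruction tokens embedded from disease-category descriptions
    (hypothetical architecture, not the authors' exact model)."""

    def __init__(self, dim=768, num_classes=14, depth=4):
        super().__init__()
        self.patch_embed = nn.Linear(16 * 16 * 3, dim)  # flattened 16x16 RGB patches
        self.text_proj = nn.Linear(dim, dim)            # map text features into the token space
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, patches, text_feats):
        # patches: (B, N, 768) flattened image patches
        # text_feats: (B, T, 768) disease-description embeddings from a text encoder
        # (positional embeddings omitted for brevity)
        b, n, _ = patches.shape
        tokens = torch.cat([
            self.cls_token.expand(b, -1, -1),   # [CLS] token
            self.patch_embed(patches),          # visual tokens
            self.text_proj(text_feats),         # instruction (text) tokens
        ], dim=1)
        x = self.encoder(tokens)
        img_repr = x[:, 0]                      # fused [CLS] representation
        txt_repr = x[:, 1 + n:].mean(dim=1)     # pooled instruction-token representation
        return self.classifier(img_repr), img_repr, txt_repr


def contrastive_alignment_loss(img_repr, txt_repr, temperature=0.07):
    """Symmetric InfoNCE loss pulling matched image/text pairs together."""
    img = F.normalize(img_repr, dim=-1)         # cosine similarity via L2-normalized features
    txt = F.normalize(txt_repr, dim=-1)
    logits = img @ txt.t() / temperature        # (B, B) similarity matrix; matches on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

In a fine-tuning loop of this shape, the classification cross-entropy and the alignment term would be combined, e.g. `loss = ce + lam * contrastive_alignment_loss(img_repr, txt_repr)`; the weighting and temperature here are generic defaults rather than values reported in the paper.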