Intelligent analysis of chest X-ray based on multi-modal instruction tuning

Junjie Yao, Junhao Wang, Zhenxiang Xiao, Xinlin Hao, Xi Jiang

Meta-Radiology, Volume 3, Issue 3, Article 100172. Published 2025-08-19.
DOI: 10.1016/j.metrad.2025.100172
Citations: 0
Abstract
Chest X-ray plays a crucial role in the screening and diagnosis of chest diseases. However, owing to the complexity of pathological manifestations and the limits of radiologists' experience, the accuracy and efficiency of chest disease diagnosis still need improvement. In recent years, deep learning has made significant progress in chest X-ray image analysis, yet existing methods rely mainly on unimodal visual information and overlook the prior knowledge about disease categories embedded in medical text, making it difficult to fully capture the deep semantics of chest X-ray images. To address these challenges, and inspired by the Instruction-ViT model, we adopt instruction tuning to integrate medical textual information into the fine-tuning of the visual model. In addition, a contrastive learning loss is employed to align textual and visual features, enhancing the model's capacity to understand and differentiate complex pathological patterns. Experimental results demonstrate that the model integrating medical text information outperforms unimodal models across various evaluation metrics, confirming that, with instruction tuning, the model can effectively exploit medical text as prior knowledge to improve the performance of visual models in chest disease diagnosis. Finally, an interpretability analysis of the model's decision-making process reveals that the regions the model attends to correspond closely to the radiographic manifestations of the respective diseases, demonstrating a degree of interpretability.
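The abstract describes the approach only at a high level. As a rough illustration, the sketch below shows one common way the two ingredients it names could be realized in PyTorch: disease-description (instruction) tokens fused into the input sequence of a ViT-style encoder, in the spirit of Instruction-ViT, plus a symmetric CLIP-style contrastive loss aligning the visual and text representations. All class and function names, dimensions, the backbone, and the temperature value are illustrative assumptions; the paper does not specify its implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InstructionTunedViT(nn.Module):
    """Sketch: a ViT-style encoder whose input sequence is augmented with
    instruction tokens embedded from disease-category descriptions
    (hypothetical architecture, not the authors' exact model)."""

    def __init__(self, dim=768, num_classes=14, depth=4):
        super().__init__()
        self.patch_embed = nn.Linear(16 * 16 * 3, dim)  # flattened 16x16 RGB patches
        self.text_proj = nn.Linear(dim, dim)            # map text features into the token space
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, patches, text_feats):
        # patches: (B, N, 768) flattened image patches
        # text_feats: (B, T, 768) disease-description embeddings from a text encoder
        # (positional embeddings omitted for brevity)
        b, n, _ = patches.shape
        tokens = torch.cat([
            self.cls_token.expand(b, -1, -1),   # [CLS] token
            self.patch_embed(patches),          # visual tokens
            self.text_proj(text_feats),         # instruction (text) tokens
        ], dim=1)
        x = self.encoder(tokens)
        img_repr = x[:, 0]                      # fused [CLS] representation
        txt_repr = x[:, 1 + n:].mean(dim=1)     # pooled instruction-token representation
        return self.classifier(img_repr), img_repr, txt_repr


def contrastive_alignment_loss(img_repr, txt_repr, temperature=0.07):
    """Symmetric InfoNCE loss pulling matched image/text pairs together."""
    img = F.normalize(img_repr, dim=-1)         # cosine similarity via L2-normalized features
    txt = F.normalize(txt_repr, dim=-1)
    logits = img @ txt.t() / temperature        # (B, B) similarity matrix; matches on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

In a fine-tuning loop of this shape, the classification cross-entropy and the alignment term would be combined, e.g. `loss = ce + lam * contrastive_alignment_loss(img_repr, txt_repr)`; the weighting and temperature here are generic defaults rather than values reported in the paper.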