VICCA: Visual interpretation and comprehension of chest X-ray anomalies in generated report without human feedback

IF 4.9

Machine learning with applications. Pub Date: 2025-06-18. DOI: 10.1016/j.mlwa.2025.100684

Sayeh Gholipour Picha, Dawood Al Chanti, Alice Caplier

Volume 21, Article 100684. Journal Article. https://www.sciencedirect.com/science/article/pii/S2666827025000672

Citations: 0

Abstract

As artificial intelligence (AI) becomes increasingly central to healthcare, the demand for explainable and trustworthy models is paramount. Current report generation systems for chest X-rays (CXR) often lack mechanisms for validating outputs without expert oversight, raising concerns about reliability and interpretability. To address these challenges, we propose a novel multimodal framework designed to enhance the semantic alignment between text and image context and the localization accuracy of pathologies within images and reports for AI-generated medical reports. Our framework integrates two key modules: a Phrase Grounding Model, which identifies and localizes pathologies in CXR images based on textual prompts, and a Text-to-Image Diffusion Module, which generates synthetic CXR images from prompts while preserving anatomical fidelity. By comparing features between the original and generated images, we introduce a dual-scoring system: one score quantifies localization accuracy, while the other evaluates semantic consistency between text and image features. Our approach significantly outperforms existing methods in pathology localization, achieving an 8% improvement in Intersection over Union score. It also surpasses state-of-the-art methods in CXR text-to-image generation, with a 1% gain in similarity metrics. Additionally, the integration of phrase grounding with diffusion models, coupled with the dual-scoring evaluation system, provides a robust mechanism for validating report quality, paving the way for more reliable and transparent AI in medical imaging.
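The abstract names two quantities without defining them: an Intersection over Union (IoU) score for localization and a similarity measure between features. The following is a minimal sketch of the standard definitions of both, not the authors' implementation. The function names `box_iou` and `cosine_similarity`, the `(x1, y1, x2, y2)` box format, and the choice of cosine similarity as the feature-comparison measure are all assumptions made for illustration.

```python
import math


def box_iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2
    (an assumed format, not taken from the paper).
    """
    # Corners of the overlapping rectangle, if any.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes give zero intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length feature vectors.

    Used here as a stand-in for a feature-level consistency score;
    the paper's actual similarity metric may differ.
    """
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    predicted = (0, 0, 2, 2)      # hypothetical grounded box
    reference = (1, 1, 3, 3)      # hypothetical reference box
    print(box_iou(predicted, reference))

    original_features = [1.0, 0.0, 2.0]    # hypothetical feature vectors
    generated_features = [1.0, 0.0, 2.0]
    print(cosine_similarity(original_features, generated_features))
```

In this sketch, a higher IoU means the localized region overlaps the reference region more closely, and a cosine similarity near 1 means the two feature vectors point in nearly the same direction.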




Source journal

Machine learning with applications (Management Science and Operations Research, Artificial Intelligence, Computer Science Applications)

Self-citation rate: 0.00%

Review time: 98 days