Segmentation-enhanced Medical Visual Question Answering with mask-prompt alignment using contrastive learning and multitask object grounding

Impact Factor 8.0 · CAS Zone 2 (Computer Science) · JCR Q1 · Automation & Control Systems
Qishen Chen, Huahu Xu, Wenxuan He, Xingyuan Chen, Minjie Bian, Honghao Gao
{"title":"Segmentation-enhanced Medical Visual Question Answering with mask-prompt alignment using contrastive learning and multitask object grounding","authors":"Qishen Chen ,&nbsp;Huahu Xu ,&nbsp;Wenxuan He ,&nbsp;Xingyuan Chen ,&nbsp;Minjie Bian ,&nbsp;Honghao Gao","doi":"10.1016/j.engappai.2025.112866","DOIUrl":null,"url":null,"abstract":"<div><div>Medical Visual Question Answering (MedVQA) aims to provide clinical suggestions by analyzing medical images in response to textual queries. However, existing methods struggle to accurately identify anatomical structures and pathological abnormalities, leading to unreliable predictions. Many deep learning-based approaches also lack interpretability, making their diagnostic reasoning opaque. To address these challenges, this paper proposes Mask-Prompt Aligned Visual Question Answering (MPA-VQA), a two-stage framework that integrates segmentation information into the MedVQA process. First, a segmentation model is trained to detect key structures within medical images. To mitigate the issue of limited segmentation annotations, this paper introduces an improved CutMix-based data augmentation strategy. Second, segmentation masks are used to generate prompts, which are incorporated into the question-answering process for the first time to enhance interpretability. Third, to improve the alignment between image, mask, and prompt representations, this paper proposes a dual-granularity mask-prompt alignment (MPA) method. At the image level, MPA employs contrastive learning to encourage global consistency, while at the object level, it leverages multi-task object grounding to enhance localization accuracy. A mask-guided attention mechanism is also introduced to ensure the model focuses on clinically relevant image regions. Finally, the proposed MPA-VQA is validated on the SLAKE and MedVQA-GI datasets, demonstrating state-of-the-art performance. Notably, MPA-VQA improves location-related question accuracy by 6.37% on MedVQA-GI. MPA-VQA is also a plug-and-play framework that can be seamlessly integrated into existing MedVQA architectures without requiring major modifications.</div></div>","PeriodicalId":50523,"journal":{"name":"Engineering Applications of Artificial Intelligence","volume":"163 ","pages":"Article 112866"},"PeriodicalIF":8.0000,"publicationDate":"2025-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Engineering Applications of Artificial Intelligence","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0952197625028970","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Medical Visual Question Answering (MedVQA) aims to provide clinical suggestions by analyzing medical images in response to textual queries. However, existing methods struggle to accurately identify anatomical structures and pathological abnormalities, leading to unreliable predictions. Many deep learning-based approaches also lack interpretability, making their diagnostic reasoning opaque. To address these challenges, this paper proposes Mask-Prompt Aligned Visual Question Answering (MPA-VQA), a two-stage framework that integrates segmentation information into the MedVQA process. First, a segmentation model is trained to detect key structures within medical images. To mitigate the issue of limited segmentation annotations, this paper introduces an improved CutMix-based data augmentation strategy. Second, segmentation masks are used to generate prompts, which are incorporated into the question-answering process for the first time to enhance interpretability. Third, to improve the alignment between image, mask, and prompt representations, this paper proposes a dual-granularity mask-prompt alignment (MPA) method. At the image level, MPA employs contrastive learning to encourage global consistency, while at the object level, it leverages multi-task object grounding to enhance localization accuracy. A mask-guided attention mechanism is also introduced to ensure the model focuses on clinically relevant image regions. Finally, the proposed MPA-VQA is validated on the SLAKE and MedVQA-GI datasets, demonstrating state-of-the-art performance. Notably, MPA-VQA improves location-related question accuracy by 6.37% on MedVQA-GI. MPA-VQA is also a plug-and-play framework that can be seamlessly integrated into existing MedVQA architectures without requiring major modifications.
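The abstract describes two of MPA-VQA's core components only at a high level: an image-level contrastive objective that aligns mask and prompt representations, and a mask-guided attention mechanism that focuses the model on segmented regions. The sketch below is a minimal, hypothetical PyTorch illustration of how such components are commonly realized (an InfoNCE-style loss between pooled mask features and prompt embeddings, plus mask-weighted spatial pooling); the class names, dimensions, and pooling scheme are assumptions for illustration and are not the authors' implementation.

```python
# Hypothetical sketch of image-level mask-prompt contrastive alignment and
# mask-guided attention, loosely following the abstract's description.
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskPromptContrastive(nn.Module):
    """InfoNCE-style alignment between pooled mask features and prompt embeddings."""

    def __init__(self, mask_dim: int = 256, prompt_dim: int = 768,
                 proj_dim: int = 128, temperature: float = 0.07):
        super().__init__()
        self.mask_proj = nn.Linear(mask_dim, proj_dim)      # project mask features
        self.prompt_proj = nn.Linear(prompt_dim, proj_dim)  # project prompt (text) features
        self.temperature = temperature

    def forward(self, mask_feats: torch.Tensor, prompt_feats: torch.Tensor) -> torch.Tensor:
        # mask_feats:   (B, mask_dim)   pooled segmentation-mask representation per image
        # prompt_feats: (B, prompt_dim) embedding of the mask-derived prompt
        z_m = F.normalize(self.mask_proj(mask_feats), dim=-1)
        z_p = F.normalize(self.prompt_proj(prompt_feats), dim=-1)
        logits = z_m @ z_p.t() / self.temperature           # (B, B) similarity matrix
        targets = torch.arange(z_m.size(0), device=z_m.device)
        # Symmetric contrastive loss: matched mask/prompt pairs sit on the diagonal.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))


def mask_guided_attention(image_feats: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # image_feats: (B, C, H, W) backbone feature map; mask: (B, 1, H, W) soft segmentation mask.
    # Re-weights spatial features so pooling concentrates on segmented regions.
    weights = mask / (mask.sum(dim=(2, 3), keepdim=True) + 1e-6)
    return (image_feats * weights).sum(dim=(2, 3))           # (B, C) mask-pooled feature


if __name__ == "__main__":
    loss_fn = MaskPromptContrastive()
    loss = loss_fn(torch.randn(4, 256), torch.randn(4, 768))
    pooled = mask_guided_attention(torch.randn(4, 256, 16, 16), torch.rand(4, 1, 16, 16))
    print(loss.item(), pooled.shape)
```

In this reading, the contrastive term encourages global consistency between the mask and the prompt generated from it, while the mask-weighted pooling biases the downstream answer head toward clinically relevant regions; the paper's object-level multi-task grounding component is not sketched here.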
Source journal
Engineering Applications of Artificial Intelligence
Category: Engineering & Technology - Engineering: Electronic & Electrical
CiteScore: 9.60
Self-citation rate: 10.00%
Articles published annually: 505
Average review time: 68 days
Journal description: Artificial Intelligence (AI) is pivotal in driving the fourth industrial revolution, witnessing remarkable advancements across various machine learning methodologies. AI techniques have become indispensable tools for practicing engineers, enabling them to tackle previously insurmountable challenges. Engineering Applications of Artificial Intelligence serves as a global platform for the swift dissemination of research elucidating the practical application of AI methods across all engineering disciplines. Submitted papers are expected to present novel aspects of AI utilized in real-world engineering applications, validated using publicly available datasets to ensure the replicability of research outcomes. Join us in exploring the transformative potential of AI in engineering.