Segmentation-enhanced Medical Visual Question Answering with mask-prompt alignment using contrastive learning and multitask object grounding

Qishen Chen, Huahu Xu, Wenxuan He, Xingyuan Chen, Minjie Bian, Honghao Gao

Engineering Applications of Artificial Intelligence, Volume 163, Article 112866
DOI: 10.1016/j.engappai.2025.112866
Published: 2025-10-22
URL: https://www.sciencedirect.com/science/article/pii/S0952197625028970
Citations: 0
Abstract
Medical Visual Question Answering (MedVQA) aims to provide clinical suggestions by analyzing medical images in response to textual queries. However, existing methods struggle to accurately identify anatomical structures and pathological abnormalities, leading to unreliable predictions. Many deep learning-based approaches also lack interpretability, making their diagnostic reasoning opaque. To address these challenges, this paper proposes Mask-Prompt Aligned Visual Question Answering (MPA-VQA), a two-stage framework that integrates segmentation information into the MedVQA process. First, a segmentation model is trained to detect key structures within medical images. To mitigate the issue of limited segmentation annotations, this paper introduces an improved CutMix-based data augmentation strategy. Second, segmentation masks are used to generate prompts, which are incorporated into the question-answering process for the first time to enhance interpretability. Third, to improve the alignment between image, mask, and prompt representations, this paper proposes a dual-granularity mask-prompt alignment (MPA) method. At the image level, MPA employs contrastive learning to encourage global consistency, while at the object level, it leverages multi-task object grounding to enhance localization accuracy. A mask-guided attention mechanism is also introduced to ensure the model focuses on clinically relevant image regions. Finally, the proposed MPA-VQA is validated on the SLAKE and MedVQA-GI datasets, demonstrating state-of-the-art performance. Notably, MPA-VQA improves location-related question accuracy by 6.37% on MedVQA-GI. MPA-VQA is also a plug-and-play framework that can be seamlessly integrated into existing MedVQA architectures without requiring major modifications.
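The abstract does not detail the paper's improved CutMix variant. For reference, the baseline CutMix technique it builds on can be sketched as follows; in the segmentation setting, the same random rectangle must be pasted into both the image and its annotation mask. Function and parameter names here are illustrative, not taken from the paper.

```python
import numpy as np

def cutmix(img_a, img_b, mask_a, mask_b, alpha=1.0, rng=None):
    """Baseline CutMix for an image/segmentation-mask pair (sketch).

    Pastes a random rectangle from (img_b, mask_b) into (img_a, mask_a).
    The paper's improved variant is not specified in the abstract; this
    shows only the standard technique it extends.
    """
    rng = rng or np.random.default_rng()
    h, w = img_a.shape[:2]
    lam = rng.beta(alpha, alpha)           # mixing ratio for the two samples
    cut_ratio = np.sqrt(1.0 - lam)         # side length ratio of the cut box
    cut_h, cut_w = int(h * cut_ratio), int(w * cut_ratio)
    cy, cx = rng.integers(h), rng.integers(w)  # random box centre
    y1, y2 = np.clip(cy - cut_h // 2, 0, h), np.clip(cy + cut_h // 2, 0, h)
    x1, x2 = np.clip(cx - cut_w // 2, 0, w), np.clip(cx + cut_w // 2, 0, w)
    out_img, out_mask = img_a.copy(), mask_a.copy()
    out_img[y1:y2, x1:x2] = img_b[y1:y2, x1:x2]
    out_mask[y1:y2, x1:x2] = mask_b[y1:y2, x1:x2]  # masks mixed identically
    return out_img, out_mask
```

Mixing the masks with exactly the same box as the images keeps pixel-level labels consistent, which is what makes CutMix usable when segmentation annotations are scarce.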
Journal Introduction:
Artificial Intelligence (AI) is pivotal in driving the fourth industrial revolution, with remarkable advancements across machine learning methodologies. AI techniques have become indispensable tools for practicing engineers, enabling them to tackle previously insurmountable challenges. Engineering Applications of Artificial Intelligence serves as a global platform for the swift dissemination of research elucidating the practical application of AI methods across all engineering disciplines. Submitted papers are expected to present novel aspects of AI utilized in real-world engineering applications, validated using publicly available datasets to ensure the replicability of research outcomes.