{"title":"Cross-Modal Knowledge Diffusion-Based Generation for Difference-Aware Medical VQA","authors":"Qika Lin;Kai He;Yifan Zhu;Fangzhi Xu;Erik Cambria;Mengling Feng","doi":"10.1109/TIP.2025.3558446","DOIUrl":null,"url":null,"abstract":"Multimodal medical applications have garnered considerable attention due to their potential to offer comprehensive and robust support for medical assistance. Specifically, within this domain, difference-aware medical Visual Question Answering (VQA) has emerged as a topic of increasing interest that enables the recognition of changes in physical conditions over time when compared to previous states and provides customized suggestions accordingly. However, it is challenging because samples usually exhibit characteristics of complexity, diversity, and inherent noise. Besides, there is a need for multimodal knowledge understanding of the medical domain. The difference-aware setting requiring image comparison further intensifies these situations. To this end, we propose a cross-Modal knowlEdge diffusioN-baseD gEneration netwoRk (MENDER), where the diffusion mechanism with multi-step denoising and knowledge injection from global to local level are employed to tackle the aforementioned challenges, respectively. The diffusion process is to gradually generate answers with the sequence input of questions, random noises for the answer masks and virtual vision prompts of images. The strategy of answer nosing and knowledge cascading is specifically tailored for this task and is implemented during forward and reverse diffusion processes. Moreover, the visual and structure knowledge injection are proposed to learn virtual vision prompts to guide the diffusion process, where the former is realized using a pre-trained medical image-text network and the latter is modeled with spatial and semantic graph structures processed by the heterogeneous graph Transformer models. Experiment results demonstrate the effectiveness of MENDER for difference-aware medical VQA. Furthermore, it also exhibits notable performance in the low-resource setting and conventional medical VQA tasks.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"2421-2434"},"PeriodicalIF":0.0000,"publicationDate":"2025-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10964089/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Multimodal medical applications have garnered considerable attention due to their potential to offer comprehensive and robust support for medical assistance. Within this domain, difference-aware medical Visual Question Answering (VQA) has emerged as a topic of growing interest: it recognizes changes in a patient's physical condition over time relative to a previous state and provides customized suggestions accordingly. The task is challenging because samples typically exhibit complexity, diversity, and inherent noise, and it further demands multimodal understanding of medical-domain knowledge. The difference-aware setting, which requires comparing images, intensifies these challenges. To this end, we propose a cross-Modal knowlEdge diffusioN-baseD gEneration netwoRk (MENDER), in which a diffusion mechanism with multi-step denoising and knowledge injection from the global to the local level are employed to tackle these challenges, respectively. The diffusion process gradually generates answers from a sequence input comprising question tokens, random noise in the answer masks, and virtual vision prompts derived from the images. A strategy of answer noising and knowledge cascading, tailored specifically to this task, is applied during the forward and reverse diffusion processes. Moreover, visual and structural knowledge injection are proposed to learn the virtual vision prompts that guide the diffusion process: the former is realized with a pre-trained medical image-text network, and the latter is modeled with spatial and semantic graph structures processed by heterogeneous graph Transformer models. Experimental results demonstrate the effectiveness of MENDER for difference-aware medical VQA. Furthermore, it also performs notably well in low-resource settings and on conventional medical VQA tasks.
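
To make the generation procedure concrete, below is a minimal, hypothetical sketch of the diffusion loop the abstract describes: question tokens and virtual vision prompts condition a multi-step denoising pass that recovers the masked answer. It assumes a standard forward noising q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I) over answer embeddings; every module name, dimension, and schedule here is an illustrative assumption, not the authors' released implementation.

```python
# Hypothetical sketch of diffusion-based answer generation with vision prompts.
# Names, sizes, and the linear noise schedule are assumptions for illustration.
import torch
import torch.nn as nn

class DenoisingAnswerGenerator(nn.Module):
    def __init__(self, vocab_size=30522, d_model=256, n_steps=100):
        super().__init__()
        self.n_steps = n_steps
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.step_emb = nn.Embedding(n_steps, d_model)
        # One encoder jointly attends over question tokens, vision prompts,
        # and the (noisy) answer slots.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.to_logits = nn.Linear(d_model, vocab_size)
        # Linear schedule: alpha_bar[t] shrinks the signal as t grows.
        self.register_buffer("alpha_bar", torch.linspace(1.0, 0.01, n_steps))

    def denoise(self, q_ids, vision_prompts, noisy_answer, t):
        """One reverse step: predict clean answer embeddings from noisy ones."""
        q = self.token_emb(q_ids)                          # (B, Lq, D)
        step = self.step_emb(t)[:, None, :]                # (B, 1, D)
        seq = torch.cat([q, vision_prompts, noisy_answer + step], dim=1)
        hidden = self.encoder(seq)
        return hidden[:, -noisy_answer.size(1):, :]        # answer slots only

    @torch.no_grad()
    def generate(self, q_ids, vision_prompts, answer_len=8):
        """Reverse diffusion: start from pure noise in the answer slots and
        iteratively denoise, conditioned on question + vision prompts."""
        B, D = q_ids.size(0), self.token_emb.embedding_dim
        x = torch.randn(B, answer_len, D)
        for step in reversed(range(self.n_steps)):
            t = torch.full((B,), step, dtype=torch.long)
            x0_hat = self.denoise(q_ids, vision_prompts, x, t)
            if step > 0:  # re-noise the estimate to the previous step's level
                a = self.alpha_bar[step - 1]
                x = a.sqrt() * x0_hat + (1 - a).sqrt() * torch.randn_like(x0_hat)
            else:
                x = x0_hat
        return self.to_logits(x).argmax(dim=-1)            # decode token ids

# Usage with dummy inputs: one question plus two vision prompts, standing in
# for pooled features of the "before" and "after" images being compared.
model = DenoisingAnswerGenerator()
q_ids = torch.randint(0, 30522, (1, 12))
vision_prompts = torch.randn(1, 2, 256)
answer_ids = model.generate(q_ids, vision_prompts)
```

In this reading, the "virtual vision prompts" are simply extra conditioning tokens in the denoiser's input sequence; the paper's visual and structural knowledge injection would replace the random `vision_prompts` above with embeddings produced by the pre-trained medical image-text network and the heterogeneous graph Transformer.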