{"title":"Cross-Modal Knowledge Diffusion-Based Generation for Difference-Aware Medical VQA","authors":"Qika Lin;Kai He;Yifan Zhu;Fangzhi Xu;Erik Cambria;Mengling Feng","doi":"10.1109/TIP.2025.3558446","DOIUrl":null,"url":null,"abstract":"Multimodal medical applications have garnered considerable attention due to their potential to offer comprehensive and robust support for medical assistance. Specifically, within this domain, difference-aware medical Visual Question Answering (VQA) has emerged as a topic of increasing interest that enables the recognition of changes in physical conditions over time when compared to previous states and provides customized suggestions accordingly. However, it is challenging because samples usually exhibit characteristics of complexity, diversity, and inherent noise. Besides, there is a need for multimodal knowledge understanding of the medical domain. The difference-aware setting requiring image comparison further intensifies these situations. To this end, we propose a cross-Modal knowlEdge diffusioN-baseD gEneration netwoRk (MENDER), where the diffusion mechanism with multi-step denoising and knowledge injection from global to local level are employed to tackle the aforementioned challenges, respectively. The diffusion process is to gradually generate answers with the sequence input of questions, random noises for the answer masks and virtual vision prompts of images. The strategy of answer nosing and knowledge cascading is specifically tailored for this task and is implemented during forward and reverse diffusion processes. Moreover, the visual and structure knowledge injection are proposed to learn virtual vision prompts to guide the diffusion process, where the former is realized using a pre-trained medical image-text network and the latter is modeled with spatial and semantic graph structures processed by the heterogeneous graph Transformer models. Experiment results demonstrate the effectiveness of MENDER for difference-aware medical VQA. Furthermore, it also exhibits notable performance in the low-resource setting and conventional medical VQA tasks.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"2421-2434"},"PeriodicalIF":0.0000,"publicationDate":"2025-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10964089/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Multimodal medical applications have garnered considerable attention due to their potential to offer comprehensive and robust support for medical assistance. Within this domain, difference-aware medical Visual Question Answering (VQA) has emerged as a topic of growing interest: it recognizes changes in a patient's physical condition over time relative to a previous state and provides customized suggestions accordingly. The task is challenging because samples typically exhibit complexity, diversity, and inherent noise, and it further demands multimodal understanding of medical-domain knowledge. The difference-aware setting, which requires comparing images, intensifies these challenges. To this end, we propose a cross-Modal knowlEdge diffusioN-baseD gEneration netwoRk (MENDER), in which a diffusion mechanism with multi-step denoising and knowledge injection from the global to the local level are employed to tackle these challenges, respectively. The diffusion process gradually generates answers from a sequence input comprising question tokens, random noise in the answer masks, and virtual vision prompts derived from the images. A strategy of answer noising and knowledge cascading, tailored specifically to this task, is applied during the forward and reverse diffusion processes. Moreover, visual and structural knowledge injection are proposed to learn the virtual vision prompts that guide the diffusion process: the former is realized with a pre-trained medical image-text network, and the latter is modeled with spatial and semantic graph structures processed by heterogeneous graph Transformer models. Experimental results demonstrate the effectiveness of MENDER for difference-aware medical VQA. Furthermore, it also performs notably well in low-resource settings and on conventional medical VQA tasks.
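
To make the generation procedure concrete, below is a minimal, hypothetical sketch of the diffusion loop the abstract describes: question tokens and virtual vision prompts condition a multi-step denoising pass that recovers the masked answer. It assumes a standard forward noising q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I) over answer embeddings; every module name, dimension, and schedule here is an illustrative assumption, not the authors' released implementation.

```python
# Hypothetical sketch of diffusion-based answer generation with vision prompts.
# Names, sizes, and the linear noise schedule are assumptions for illustration.
import torch
import torch.nn as nn

class DenoisingAnswerGenerator(nn.Module):
    def __init__(self, vocab_size=30522, d_model=256, n_steps=100):
        super().__init__()
        self.n_steps = n_steps
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.step_emb = nn.Embedding(n_steps, d_model)
        # One encoder jointly attends over question tokens, vision prompts,
        # and the (noisy) answer slots.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.to_logits = nn.Linear(d_model, vocab_size)
        # Linear schedule: alpha_bar[t] shrinks the signal as t grows.
        self.register_buffer("alpha_bar", torch.linspace(1.0, 0.01, n_steps))

    def denoise(self, q_ids, vision_prompts, noisy_answer, t):
        """One reverse step: predict clean answer embeddings from noisy ones."""
        q = self.token_emb(q_ids)                          # (B, Lq, D)
        step = self.step_emb(t)[:, None, :]                # (B, 1, D)
        seq = torch.cat([q, vision_prompts, noisy_answer + step], dim=1)
        hidden = self.encoder(seq)
        return hidden[:, -noisy_answer.size(1):, :]        # answer slots only

    @torch.no_grad()
    def generate(self, q_ids, vision_prompts, answer_len=8):
        """Reverse diffusion: start from pure noise in the answer slots and
        iteratively denoise, conditioned on question + vision prompts."""
        B, D = q_ids.size(0), self.token_emb.embedding_dim
        x = torch.randn(B, answer_len, D)
        for step in reversed(range(self.n_steps)):
            t = torch.full((B,), step, dtype=torch.long)
            x0_hat = self.denoise(q_ids, vision_prompts, x, t)
            if step > 0:  # re-noise the estimate to the previous step's level
                a = self.alpha_bar[step - 1]
                x = a.sqrt() * x0_hat + (1 - a).sqrt() * torch.randn_like(x0_hat)
            else:
                x = x0_hat
        return self.to_logits(x).argmax(dim=-1)            # decode token ids

# Usage with dummy inputs: one question plus two vision prompts, standing in
# for pooled features of the "before" and "after" images being compared.
model = DenoisingAnswerGenerator()
q_ids = torch.randint(0, 30522, (1, 12))
vision_prompts = torch.randn(1, 2, 256)
answer_ids = model.generate(q_ids, vision_prompts)
```

In this reading, the "virtual vision prompts" are simply extra conditioning tokens in the denoiser's input sequence; the paper's visual and structural knowledge injection would replace the random `vision_prompts` above with embeddings produced by the pre-trained medical image-text network and the heterogeneous graph Transformer.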