{"title":"Counterfactual Dual-Bias VQA: A Multimodality Debias Learning for Robust Visual Question Answering.","authors":"Boyue Wang,Xiaoqian Ju,Junbin Gao,Xiaoyan Li,Yongli Hu,Baocai Yin","doi":"10.1109/tnnls.2025.3562085","DOIUrl":null,"url":null,"abstract":"Visual question answering (VQA) models often face two language bias challenges. First, they tend to rely solely on the question to predict the answer, often overlooking relevant information in the accompanying images. Second, even when considering the question, they may focus only on the wh-words, neglecting other crucial keywords that could enhance interpretability and the question sensitivity. Existing debiasing methods attempt to address this by training a bias model using question-only inputs to enhance the robustness of the target VQA model. However, this approach may not fully capture the language bias present. In this article, we propose a multimodality counterfactual dual-bias model to mitigate the linguistic bias issue in target VQA models. Our approach involves designing a shared-parameterized dual-bias model that incorporates both visual and question counterfactual samples as inputs. By doing so, we aim to fully model language biases, with visual and question counterfactual samples, respectively, emphasizing important objects and keywords to relevant the answers. To ensure that our dual-bias model behaves similarly to an ordinary model, we freeze the parameters of the target VQA model, meanwhile using the cross-entropy and Kullback-Leibler (KL) divergence as the loss function to train the dual-bias model. Subsequently, to mitigate language bias in the target VQA model, we freeze the parameters of the dual-bias model to generate pseudo-labels and then incorporate a margin loss to re-train the target VQA model. Experimental results on the VQA-CP datasets demonstrate the superior effectiveness of our proposed counterfactual dual-bias model. Additionally, we conduct an analysis of the unsatisfactory performance on the VQA v2 dataset. The origin code of the proposed model is available at https://github.com/Arrow2022jv/MCD.","PeriodicalId":13303,"journal":{"name":"IEEE transactions on neural networks and learning systems","volume":"21 1","pages":""},"PeriodicalIF":10.2000,"publicationDate":"2025-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on neural networks and learning systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1109/tnnls.2025.3562085","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Visual question answering (VQA) models often face two language bias challenges. First, they tend to rely solely on the question to predict the answer, often overlooking relevant information in the accompanying images. Second, even when considering the question, they may focus only on the wh-words, neglecting other crucial keywords that could enhance interpretability and question sensitivity. Existing debiasing methods attempt to address this by training a bias model on question-only inputs to enhance the robustness of the target VQA model. However, this approach may not fully capture the language bias present. In this article, we propose a multimodality counterfactual dual-bias model to mitigate the linguistic bias issue in target VQA models. Our approach involves designing a shared-parameterized dual-bias model that takes both visual and question counterfactual samples as inputs. By doing so, we aim to fully model language biases, with the visual and question counterfactual samples emphasizing, respectively, the important objects and keywords relevant to the answers. To ensure that our dual-bias model behaves similarly to an ordinary model, we freeze the parameters of the target VQA model while using cross-entropy and Kullback-Leibler (KL) divergence as the loss functions to train the dual-bias model. Subsequently, to mitigate language bias in the target VQA model, we freeze the parameters of the dual-bias model to generate pseudo-labels and then incorporate a margin loss to re-train the target VQA model. Experimental results on the VQA-CP datasets demonstrate the effectiveness of our proposed counterfactual dual-bias model. Additionally, we analyze its unsatisfactory performance on the VQA v2 dataset. The source code of the proposed model is available at https://github.com/Arrow2022jv/MCD.
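To make the two-stage procedure concrete, here is a minimal PyTorch sketch reconstructed from the abstract alone. The `SimpleVQA` stand-in architecture, the feature dimensions, the way counterfactual features are produced, and the exact margin-loss form are all illustrative assumptions rather than the authors' implementation; consult https://github.com/Arrow2022jv/MCD for the official code.

```python
# Two-stage dual-bias debiasing sketch based on the abstract. All names,
# shapes, and loss forms below are illustrative assumptions, not the
# authors' code (official implementation: https://github.com/Arrow2022jv/MCD).
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_ANSWERS, FEAT_DIM = 1000, 512

class SimpleVQA(nn.Module):
    """Stand-in VQA head: fuses image and question features, predicts answers."""
    def __init__(self):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * FEAT_DIM, FEAT_DIM), nn.ReLU())
        self.cls = nn.Linear(FEAT_DIM, NUM_ANSWERS)
    def forward(self, v, q):
        return self.cls(self.fuse(torch.cat([v, q], dim=-1)))

target = SimpleVQA()     # pretrained target VQA model (assumed given)
dual_bias = SimpleVQA()  # shared-parameterized dual-bias model

# Toy batch: original features and counterfactual (e.g., masked) features.
B = 8
v, q = torch.randn(B, FEAT_DIM), torch.randn(B, FEAT_DIM)
v_cf, q_cf = torch.randn(B, FEAT_DIM), torch.randn(B, FEAT_DIM)
labels = torch.randint(0, NUM_ANSWERS, (B,))

# --- Stage 1: train the dual-bias model while the target is frozen. ---
for p in target.parameters():
    p.requires_grad_(False)
opt1 = torch.optim.Adam(dual_bias.parameters(), lr=1e-4)

# The shared model sees visual and question counterfactuals separately;
# cross-entropy fits the answers, KL keeps it close to the frozen target.
logits_v = dual_bias(v_cf, q)   # visual-counterfactual branch
logits_q = dual_bias(v, q_cf)   # question-counterfactual branch
with torch.no_grad():
    ref = F.softmax(target(v, q), dim=-1)
ce = F.cross_entropy(logits_v, labels) + F.cross_entropy(logits_q, labels)
kl = F.kl_div(F.log_softmax(logits_v, -1), ref, reduction="batchmean") \
   + F.kl_div(F.log_softmax(logits_q, -1), ref, reduction="batchmean")
(ce + kl).backward(); opt1.step(); opt1.zero_grad()

# --- Stage 2: freeze the dual-bias model, re-train the target. ---
for p in dual_bias.parameters():
    p.requires_grad_(False)
for p in target.parameters():
    p.requires_grad_(True)
opt2 = torch.optim.Adam(target.parameters(), lr=1e-4)

with torch.no_grad():  # pseudo-labels from the frozen dual-bias model
    pseudo = F.softmax(dual_bias(v_cf, q_cf), dim=-1)  # input choice assumed
logits = target(v, q)
p_gt = F.softmax(logits, -1).gather(1, labels.unsqueeze(1)).squeeze(1)
p_bias = pseudo.gather(1, labels.unsqueeze(1)).squeeze(1)
# One plausible margin loss: push the target's ground-truth score above
# the bias model's score by a margin (the paper's exact form may differ).
margin_loss = F.relu(0.2 + p_bias - p_gt).mean()
loss = F.cross_entropy(logits, labels) + margin_loss
loss.backward(); opt2.step(); opt2.zero_grad()
```

The key design point, as stated in the abstract, is the alternating freeze: the target supervises the dual-bias model in stage 1, and the frozen dual-bias model's pseudo-labels drive the margin-based re-training of the target in stage 2.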
Journal Introduction:
The focus of IEEE Transactions on Neural Networks and Learning Systems is to present scholarly articles discussing the theory, design, and applications of neural networks as well as other learning systems. The journal primarily highlights technical and scientific research in this domain.