Enhancing robust VQA via contrastive and self-supervised learning

IF 7.5 | CAS Tier 1 (Computer Science) | Q1 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)
Runlin Cao, Zhixin Li, Zhenjun Tang, Canlong Zhang, Huifang Ma
{"title":"Enhancing robust VQA via contrastive and self-supervised learning","authors":"Runlin Cao ,&nbsp;Zhixin Li ,&nbsp;Zhenjun Tang ,&nbsp;Canlong Zhang ,&nbsp;Huifang Ma","doi":"10.1016/j.patcog.2024.111129","DOIUrl":null,"url":null,"abstract":"<div><div>Visual Question Answering (VQA) aims to evaluate the reasoning abilities of an intelligent agent using visual and textual information. However, recent research indicates that many VQA models rely primarily on learning the correlation between questions and answers in the training dataset rather than demonstrating actual reasoning ability. To address this limitation, we propose a novel training approach called Enhancing Robust VQA via Contrastive and Self-supervised Learning (CSL-VQA) to construct a more robust VQA model. Our approach involves generating two types of negative samples to balance the biased data, using self-supervised auxiliary tasks to help the base VQA model overcome language priors, and filtering out biased training samples. In addition, we construct positive samples by removing spurious correlations in biased samples and perform auxiliary training through contrastive learning. Our approach does not require additional annotations and is compatible with different VQA backbones. Experimental results demonstrate that CSL-VQA significantly outperforms current state-of-the-art approaches, achieving an accuracy of 62.30% on the VQA-CP v2 dataset, while maintaining robust performance on the in-distribution VQA v2 dataset. Moreover, our method shows superior generalization capabilities on challenging datasets such as GQA-OOD and VQA-CE, proving its effectiveness in reducing language bias and enhancing the overall robustness of VQA models.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"159 ","pages":"Article 111129"},"PeriodicalIF":7.5000,"publicationDate":"2024-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S003132032400880X","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

Abstract

Visual Question Answering (VQA) aims to evaluate the reasoning abilities of an intelligent agent using visual and textual information. However, recent research indicates that many VQA models rely primarily on learning the correlation between questions and answers in the training dataset rather than demonstrating actual reasoning ability. To address this limitation, we propose a novel training approach called Enhancing Robust VQA via Contrastive and Self-supervised Learning (CSL-VQA) to construct a more robust VQA model. Our approach involves generating two types of negative samples to balance the biased data, using self-supervised auxiliary tasks to help the base VQA model overcome language priors, and filtering out biased training samples. In addition, we construct positive samples by removing spurious correlations in biased samples and perform auxiliary training through contrastive learning. Our approach does not require additional annotations and is compatible with different VQA backbones. Experimental results demonstrate that CSL-VQA significantly outperforms current state-of-the-art approaches, achieving an accuracy of 62.30% on the VQA-CP v2 dataset, while maintaining robust performance on the in-distribution VQA v2 dataset. Moreover, our method shows superior generalization capabilities on challenging datasets such as GQA-OOD and VQA-CE, proving its effectiveness in reducing language bias and enhancing the overall robustness of VQA models.
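
The abstract describes the training scheme only at a high level. As a rough illustration, the sketch below shows one way negative question-image pairs and an InfoNCE-style contrastive term could be combined with a standard VQA answer loss. Every module name (fuse, classify), batch field (masked_question), and the specific sampling strategy are assumptions made for illustration, not the authors' implementation.

```python
# Hypothetical sketch: two kinds of negative samples plus a contrastive
# auxiliary loss on top of an ordinary VQA answer loss.
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature=0.1):
    """anchor, positive: (B, D); negatives: (B, K, D)."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos_logit = (anchor * positive).sum(-1, keepdim=True)        # (B, 1)
    neg_logits = torch.einsum('bd,bkd->bk', anchor, negatives)   # (B, K)
    logits = torch.cat([pos_logit, neg_logits], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)

def training_step(vqa_model, batch, lambda_ctr=1.0):
    # Standard VQA loss on the original (possibly biased) sample.
    fused = vqa_model.fuse(batch['image'], batch['question'])    # (B, D)
    answer_logits = vqa_model.classify(fused)
    vqa_loss = F.binary_cross_entropy_with_logits(
        answer_logits, batch['answer_target'])

    # Two types of negatives: the question paired with an unrelated image,
    # and the image paired with an unrelated (shuffled) question.
    neg_img = vqa_model.fuse(batch['image'].roll(1, dims=0), batch['question'])
    neg_q   = vqa_model.fuse(batch['image'], batch['question'].roll(1, dims=0))
    negatives = torch.stack([neg_img, neg_q], dim=1)             # (B, 2, D)

    # Positive view: the same pair with spurious, bias-inducing content
    # suppressed (here, a masked question) -- purely illustrative.
    positive = vqa_model.fuse(batch['image'], batch['masked_question'])

    ctr_loss = info_nce(fused, positive, negatives)
    return vqa_loss + lambda_ctr * ctr_loss
```

The design point the abstract emphasizes is that both the negatives and the debiased positive are built from existing training samples, so no extra annotation is required and the scheme can wrap different VQA backbones.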
Source journal

Pattern Recognition (Engineering & Technology / Engineering: Electrical & Electronic)
CiteScore: 14.40
Self-citation rate: 16.20%
Articles published: 683
Review time: 5.6 months

About the journal: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in related areas such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.