Beyond Words: ESC-Net Revolutionizes VQA by Elevating Visual Features and Defying Language Priors

IF 1.8, Zone 4 (Computer Science), JCR Q3 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

Computational Intelligence, Pub Date: 2024-12-03, DOI: 10.1111/coin.70010

Souvik Chowdhury, Badal Soni

Citations: 0

Abstract

Beyond Words: ESC-Net Revolutionizes VQA by Elevating Visual Features and Defying Language Priors

Language prior is a pressing problem in the visual question answering (VQA) domain: a model tends to favor the most frequent answer associated with a question. Existing methods for mitigating the language prior issue include ensemble approaches, balanced data approaches, modified evaluation strategies, and modified training frameworks. In this article, we propose a VQA model, “Ensemble of Spatial and Channel Attention Network (ESC-Net),” to overcome the language bias problem by improving the visual features. We use regional and global image features along with an ensemble of combined channel and spatial attention mechanisms to improve the visual features. The model is a simpler and more effective solution to language bias than existing methods. Extensive experiments show a remarkable performance improvement of 18% on the VQA-CP v2 dataset compared with current state-of-the-art (SOTA) models.
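
To make the language prior problem described in the abstract concrete, the following is a minimal sketch of a question-only baseline that ignores the image entirely and answers with the most frequent training answer for each question type. It is not part of ESC-Net and not the authors' code; the toy data and the names `fit_prior` and `accuracy` are hypothetical and serve only as an illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (question_type, answer) pairs. No images are used.
train = [
    ("what color", "yellow"), ("what color", "yellow"), ("what color", "green"),
    ("how many", "2"), ("how many", "2"), ("how many", "3"),
]
# A toy test split whose answer distribution differs from the training split.
test = [
    ("what color", "green"), ("what color", "green"), ("what color", "yellow"),
    ("how many", "3"), ("how many", "3"), ("how many", "2"),
]


def fit_prior(pairs):
    """Map each question type to its most frequent training answer."""
    counts = defaultdict(Counter)
    for qtype, answer in pairs:
        counts[qtype][answer] += 1
    return {qtype: c.most_common(1)[0][0] for qtype, c in counts.items()}


def accuracy(prior, pairs):
    """Fraction of pairs whose answer matches the prior-only prediction."""
    hits = sum(prior.get(qtype) == answer for qtype, answer in pairs)
    return hits / len(pairs)


prior = fit_prior(train)
train_acc = accuracy(prior, train)
test_acc = accuracy(prior, test)
print(prior)
print(f"train accuracy: {train_acc:.2f}, shifted test accuracy: {test_acc:.2f}")
```

In this toy setting the prior-only baseline does noticeably worse on the shifted split than on the training data, which is the failure mode that a model relying on language priors instead of visual features would share.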

Source journal

Computational Intelligence (Engineering and Technology - Computer Science: Artificial Intelligence)

CiteScore

6.90

Self-citation rate

3.60%

Number of articles published

Review time

>12 weeks

About the journal: This leading international journal promotes and stimulates research in the field of artificial intelligence (AI). Covering a wide range of issues - from the tools and languages of AI to its philosophical implications - Computational Intelligence provides a vigorous forum for the publication of both experimental and theoretical research, as well as surveys and impact studies. The journal is designed to meet the needs of a wide range of AI workers in academic and industrial research.