{"title":"超越文字:ESC-Net通过提升视觉特征和藐视语言先验彻底改变了VQA","authors":"Souvik Chowdhury, Badal Soni","doi":"10.1111/coin.70010","DOIUrl":null,"url":null,"abstract":"<div>\n \n <p>Language prior is a pressing problem in the VQA domain where a model provides an answer favoring the most frequent related answer. There are some methods that are adopted to mitigate language prior issue, for example, ensemble approach, the balanced data approach, the modified evaluation strategy, and the modified training framework. In this article, we propose a VQA model, “Ensemble of Spatial and Channel Attention Network (ESC-Net),” to overcome the language bias problem by improving the visual features. In this work, we have used regional and global image features along with an ensemble of combined channel and spatial attention mechanisms to improve visual features. The model is a simpler and effective solution than existing methods to solve language bias. Extensive experiment show a remarkable performance improvement of 18% on the VQACP v2 dataset with a comparison to current state-of-the-art (SOTA) models.</p>\n </div>","PeriodicalId":55228,"journal":{"name":"Computational Intelligence","volume":"40 6","pages":""},"PeriodicalIF":1.8000,"publicationDate":"2024-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Beyond Words: ESC-Net Revolutionizes VQA by Elevating Visual Features and Defying Language Priors\",\"authors\":\"Souvik Chowdhury, Badal Soni\",\"doi\":\"10.1111/coin.70010\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div>\\n \\n <p>Language prior is a pressing problem in the VQA domain where a model provides an answer favoring the most frequent related answer. There are some methods that are adopted to mitigate language prior issue, for example, ensemble approach, the balanced data approach, the modified evaluation strategy, and the modified training framework. In this article, we propose a VQA model, “Ensemble of Spatial and Channel Attention Network (ESC-Net),” to overcome the language bias problem by improving the visual features. In this work, we have used regional and global image features along with an ensemble of combined channel and spatial attention mechanisms to improve visual features. The model is a simpler and effective solution than existing methods to solve language bias. Extensive experiment show a remarkable performance improvement of 18% on the VQACP v2 dataset with a comparison to current state-of-the-art (SOTA) models.</p>\\n </div>\",\"PeriodicalId\":55228,\"journal\":{\"name\":\"Computational Intelligence\",\"volume\":\"40 6\",\"pages\":\"\"},\"PeriodicalIF\":1.8000,\"publicationDate\":\"2024-12-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computational Intelligence\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1111/coin.70010\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computational Intelligence","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/coin.70010","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Beyond Words: ESC-Net Revolutionizes VQA by Elevating Visual Features and Defying Language Priors
Language prior is a pressing problem in the VQA domain where a model provides an answer favoring the most frequent related answer. There are some methods that are adopted to mitigate language prior issue, for example, ensemble approach, the balanced data approach, the modified evaluation strategy, and the modified training framework. In this article, we propose a VQA model, “Ensemble of Spatial and Channel Attention Network (ESC-Net),” to overcome the language bias problem by improving the visual features. In this work, we have used regional and global image features along with an ensemble of combined channel and spatial attention mechanisms to improve visual features. The model is a simpler and effective solution than existing methods to solve language bias. Extensive experiment show a remarkable performance improvement of 18% on the VQACP v2 dataset with a comparison to current state-of-the-art (SOTA) models.
期刊介绍:
This leading international journal promotes and stimulates research in the field of artificial intelligence (AI). Covering a wide range of issues - from the tools and languages of AI to its philosophical implications - Computational Intelligence provides a vigorous forum for the publication of both experimental and theoretical research, as well as surveys and impact studies. The journal is designed to meet the needs of a wide range of AI workers in academic and industrial research.