Multi-stage reasoning on introspecting and revising bias for visual question answering

IF 4.1 | Zone 4 (Computer Science) | JCR Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on the Web Pub Date : 2023-08-28 DOI:10.1145/3616399

Anjin Liu, Zimu Lu, Ning Xu, Min Liu, Chenggang Yan, Bolun Zheng, Bo Lv, Yulong Duan, Zhuang Shao, Xuanya Li


Citations: 0

Abstract

Visual Question Answering (VQA) is the task of predicting an answer to a question based on the content of an image. However, recent VQA methods rely more on language priors between the question and the answer than on the image content. To address this issue, many debiasing methods have been proposed to reduce language bias in model reasoning. However, bias can be divided into two categories: good bias and bad bias. Good bias can benefit answer prediction, while bad bias may tie the model to unrelated information. Therefore, instead of indiscriminately excluding both good and bad bias, as existing debiasing methods do, we propose a bias discrimination module to distinguish them. Additionally, bad bias may reduce the model's reliance on image content during answer reasoning, so that the model pays little attention to updating image features. To tackle this, we leverage Markov theory to construct a Markov field with image regions and question words as nodes. This helps with feature updating for both image regions and question words, thereby facilitating more accurate and comprehensive reasoning about both the image content and the question. To verify the effectiveness of our network, we evaluate it on the VQA v2 and VQA-CP v2 datasets and conduct extensive quantitative and qualitative studies. Experimental results show that our network achieves significant performance improvements over previous state-of-the-art methods.
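The abstract does not specify how features are updated over the graph of image regions and question words. As a rough illustration only, the sketch below runs one round of generic similarity-weighted message passing over a fully connected graph whose nodes are region and word feature vectors. The function names, the `mix` parameter, the dot-product similarity, and the toy features are all assumptions made for this example; this is not the authors' network.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def update_nodes(features, mix=0.5):
    """One synchronous round of message passing on a fully connected graph.

    Each node receives a similarity-weighted average of all other nodes'
    features and blends it with its own features. `mix` (hypothetical)
    controls how much of the incoming message is kept.
    """
    updated = []
    for i, own in enumerate(features):
        neighbours = [f for j, f in enumerate(features) if j != i]
        weights = softmax([dot(own, f) for f in neighbours])
        message = [
            sum(w * f[d] for w, f in zip(weights, neighbours))
            for d in range(len(own))
        ]
        updated.append([(1 - mix) * o + mix * m for o, m in zip(own, message)])
    return updated


# Toy inputs (assumed): two image-region vectors and one question-word vector.
region_feats = [[1.0, 0.0], [0.0, 1.0]]
word_feats = [[1.0, 1.0]]

nodes = update_nodes(region_feats + word_feats)
new_regions = nodes[: len(region_feats)]
new_words = nodes[len(region_feats):]
```

Because regions and words sit in the same graph, a single update lets question words reshape region features and vice versa, which is the general idea the abstract points to; the actual potential functions and update rule would have to come from the full paper.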


Source journal

ACM Transactions on the Web (Engineering and Technology - Computer Science: Software Engineering)

CiteScore

4.90

Self-citation rate

0.00%

Articles published

Review time

7.5 months

About the journal: Transactions on the Web (TWEB) is a journal publishing refereed articles reporting the results of research on Web content, applications, use, and related enabling technologies. Topics in the scope of TWEB include but are not limited to the following: Browsers and Web Interfaces; Electronic Commerce; Electronic Publishing; Hypertext and Hypermedia; Semantic Web; Web Engineering; Web Services and Service-Oriented Computing; XML. In addition, papers addressing the intersection of the following broader technologies with the Web are also in scope: Accessibility; Business Services; Education; Knowledge Management and Representation; Mobility and Pervasive Computing; Performance and Scalability; Recommender Systems; Searching, Indexing, Classification, Retrieval and Querying, Data Mining and Analysis; Security and Privacy; and User Interfaces. Papers discussing specific Web technologies, applications, content generation, management, and use are within scope. Papers describing novel applications of the Web, as well as papers on the underlying technologies, are also welcome.