Examine before You Answer: Multi-task Learning with Adaptive-attentions for Multiple-choice VQA

Proceedings of the 26th ACM International Conference on Multimedia. Pub Date: 2018-10-15. DOI: 10.1145/3240508.3240687

Lianli Gao, Pengpeng Zeng, Jingkuan Song, Xianglong Liu, Heng Tao Shen


Citations: 24

Abstract

Multiple-choice (MC) Visual Question Answering (VQA) is similar to open-ended VQA but differs from it in an essential way: the answer options are provided. Most existing work tackles both tasks in a unified pipeline, solving a multi-class classification problem to infer the best answer from a predefined answer set; for MC VQA, the option that matches this best answer is then selected. This does not reflect how people reason. People normally examine the question, the answer options, and the reference image before answering an MC VQA question. If the question is not image-related, they deduce the correct answer directly from the question and the answer options; otherwise, they read the question and the answer options and then purposefully search the reference image for the answer. We therefore propose a novel approach, Multi-task Learning with Adaptive-attention (MTA), to simulate this human reasoning for MC VQA. Specifically, we first fuse the answer-option and question features, and then adaptively attend to the visual features to infer the answer. Furthermore, we design our model as a multi-task learning architecture that integrates the open-ended VQA task to further boost MC VQA performance. We evaluate our approach on two standard benchmark datasets, VQA and Visual7W. It sets new records for the MC VQA task on both, reaching 73.5% and 65.9% average accuracy, respectively.
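The pipeline described in the abstract (fuse question and option features, attend adaptively to image regions, train jointly with an open-ended objective) can be illustrated with a minimal NumPy sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the abstract does not specify the fusion operator (an element-wise product is used), the form of the adaptive attention (an extra "text-only" slot competes with the image regions), or the loss weighting (`lam`). All function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse(question, options):
    """Fuse one question vector (d,) with K option vectors (K, d).

    Assumption: element-wise product followed by tanh; the abstract
    does not say which fusion operator is used.
    """
    return np.tanh(question[None, :] * options)  # shape (K, d)


def adaptive_attend(fused, regions):
    """Attend from each fused vector (K, d) to R image regions (R, d).

    Assumption: a zero-scored "text-only" slot is appended so the
    softmax can shift weight away from the image when it is not needed.
    """
    d = fused.shape[1]
    scores = fused @ regions.T / np.sqrt(d)            # (K, R)
    text_slot = np.zeros((fused.shape[0], 1))          # (K, 1)
    weights = softmax(np.concatenate([scores, text_slot], axis=1), axis=1)
    visual = weights[:, :-1] @ regions                 # (K, d)
    gate = weights[:, -1:]                             # weight on the text-only slot
    context = visual + gate * fused                    # convex mix of image and text
    return context, weights


def score_options(fused, context):
    """One scalar score per option: dot product of fused and attended features."""
    return (fused * context).sum(axis=1)               # shape (K,)


def joint_loss(mc_logits, mc_label, oe_logits, oe_label, lam=1.0):
    """Multi-task loss: MC cross-entropy plus a weighted open-ended cross-entropy.

    Assumption: a simple weighted sum with hypothetical weight `lam`.
    """
    mc = -np.log(softmax(mc_logits)[mc_label])
    oe = -np.log(softmax(oe_logits)[oe_label])
    return mc + lam * oe


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    question = rng.normal(size=8)          # question feature, d = 8
    options = rng.normal(size=(4, 8))      # K = 4 answer options
    regions = rng.normal(size=(5, 8))      # R = 5 image regions
    fused = fuse(question, options)
    context, weights = adaptive_attend(fused, regions)
    mc_logits = score_options(fused, context)
    oe_logits = rng.normal(size=10)        # stand-in for an open-ended answer head
    print(joint_loss(mc_logits, 2, oe_logits, 7, lam=0.5))
```

In this sketch the last column of `weights` indicates how much each option relies on text alone, which loosely mirrors the abstract's distinction between image-related and non-image-related questions.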

