Adaptive attention fusion network for visual question answering

2017 IEEE International Conference on Multimedia and Expo (ICME) Pub Date: 2017-07-01 DOI: 10.1109/ICME.2017.8019540

Geonmo Gu, S. T. Kim, Yong Man Ro

Citations: 5

Abstract

Visual Question Answering (VQA) requires automatic understanding of both a reference image and a natural language question about it. Generating a visual attention map that focuses on the image regions related to the context of the question can improve VQA performance. In this paper, we propose an adaptive attention-based VQA network. The proposed method exploits the complementary information in attention maps derived from three levels of embedding (word level, phrase level, and question level), and adaptively fuses that information to represent the image-question pair appropriately. Comparative experiments were conducted on the public COCO-QA database to validate the proposed method. The experimental results show that the proposed method outperforms previous methods in terms of accuracy.
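The abstract does not specify how the three attention maps are computed or fused, so the following is only a minimal sketch of one plausible reading, not the authors' architecture. It assumes a bilinear attention score between each image region and a query vector for each embedding level, and a softmax gate that weights the three attended features before summing them. All names (`attend`, `adaptive_fusion`, `W`, `gate_w`), the dimensions, and the random inputs are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(image_feats, query, W):
    """Assumed bilinear attention: score each region against the query.

    image_feats: (R, D) region features, query: (D,), W: (D, D).
    Returns the attention map (R,) and the attended image feature (D,).
    """
    attn = softmax(image_feats @ (W @ query))
    return attn, attn @ image_feats

def adaptive_fusion(attended, gate_w):
    """Assumed gating: softmax weights over the three levels, then a weighted sum.

    attended: (3, D) one attended feature per level, gate_w: (D,).
    Returns the fusion weights (3,) and the fused feature (D,).
    """
    weights = softmax(attended @ gate_w)
    return weights, weights @ attended

# Random stand-ins for learned parameters and encoder outputs (hypothetical sizes).
rng = np.random.default_rng(0)
R, D = 49, 16                       # 49 image regions, 16-dim features
image_feats = rng.normal(size=(R, D))
W = 0.1 * rng.normal(size=(D, D))
gate_w = rng.normal(size=D)
queries = {lvl: rng.normal(size=D) for lvl in ("word", "phrase", "question")}

# One attention map and one attended feature per embedding level.
attn_maps, attended = {}, []
for lvl, q in queries.items():
    attn_maps[lvl], feat = attend(image_feats, q, W)
    attended.append(feat)

fusion_weights, fused = adaptive_fusion(np.stack(attended), gate_w)
```

In this sketch the fusion weights depend on the attended features themselves, which is one simple way to make the combination input-dependent; a trained model would learn `W` and `gate_w` rather than sample them.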
