{"title":"Deep Cross of Intra and Inter Modalities for Visual Question Answering","authors":"Rishav Bhardwaj","doi":"10.2991/ahis.k.210913.007","DOIUrl":null,"url":null,"abstract":"Visual Question Answering (VQA) has recently attained interest in the deep learning community. The main challenge that exists in VQA is to understand the sense of each modality and how to fuse these features. In this paper, DXMN (Deep Cross Modality Network) is introduced which takes into consideration not only the inter-modality fusion but also the intra-modality fusion. The main idea behind this architecture is to take the positioning of each feature into account and then recognize the relationship between multi-modal features as well as establishing a relationship among themselves in order to learn them in a better way. The architecture is pretrained on question answering datasets like, VQA v2.0, GQA, and Visual Genome which is later fine-tuned to achieve state-of-the-art performance. DXMN achieves an accuracy of 68.65 in test-standard and 68.43 in test-dev of VQA v2.0 dataset.","PeriodicalId":417648,"journal":{"name":"Proceedings of the 3rd International Conference on Integrated Intelligent Computing Communication & Security (ICIIC 2021)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 3rd International Conference on Integrated Intelligent Computing Communication & Security (ICIIC 2021)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2991/ahis.k.210913.007","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Visual Question Answering (VQA) has recently attracted interest in the deep learning community. The main challenge in VQA is to understand the content of each modality and how to fuse their features. In this paper, DXMN (Deep Cross Modality Network) is introduced, which takes into consideration not only inter-modality fusion but also intra-modality fusion. The main idea behind this architecture is to take the position of each feature into account and then model both the relationships between multi-modal features and the relationships within each modality, so that the features are learned more effectively. The architecture is pretrained on question answering datasets such as VQA v2.0, GQA, and Visual Genome, and is later fine-tuned to achieve state-of-the-art performance. DXMN achieves an accuracy of 68.65 on the test-standard split and 68.43 on the test-dev split of the VQA v2.0 dataset.
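To make the intra- and inter-modality fusion idea concrete, below is a minimal PyTorch sketch of one fusion block. It assumes intra-modality fusion is realized as self-attention within each modality and inter-modality fusion as cross-attention between modalities; the class name, dimensions, and layer layout are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ModalityFusionBlock(nn.Module):
    """Hypothetical sketch of combined intra- and inter-modality fusion.

    Assumes visual features are region embeddings and text features are
    token embeddings, both already carrying positional information.
    """

    def __init__(self, dim=768, heads=8):
        super().__init__()
        # Intra-modality fusion: each modality attends to itself.
        self.intra_visual = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.intra_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Inter-modality fusion: each modality attends to the other.
        self.inter_t2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter_v2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, visual, text):
        # visual: (batch, num_regions, dim); text: (batch, num_tokens, dim)
        v, _ = self.intra_visual(visual, visual, visual)  # within-modality
        t, _ = self.intra_text(text, text, text)
        v2, _ = self.inter_t2v(v, t, t)  # visual queries attend to text
        t2, _ = self.inter_v2t(t, v, v)  # text queries attend to visual
        # Residual connections keep both fusion signals in the output.
        return self.norm_v(v + v2), self.norm_t(t + t2)


# Usage with random stand-in features:
block = ModalityFusionBlock()
regions = torch.randn(2, 36, 768)   # e.g., 36 detected image regions
tokens = torch.randn(2, 20, 768)    # e.g., 20 question tokens
fused_v, fused_t = block(regions, tokens)
```

Stacking several such blocks and pooling the fused outputs into an answer classifier would mirror the two-level fusion the abstract describes, though the actual DXMN layer arrangement may differ.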