Multi-modal Contextual Graph Neural Network for Text Visual Question Answering

Yaoyuan Liang, Xin Wang, Xuguang Duan, Wenwu Zhu
{"title":"Multi-modal Contextual Graph Neural Network for Text Visual Question Answering","authors":"Yaoyuan Liang, Xin Wang, Xuguang Duan, Wenwu Zhu","doi":"10.1109/ICPR48806.2021.9412891","DOIUrl":null,"url":null,"abstract":"Text visual question answering (TextVQA) targets at answering the question related to texts appearing in the given images, posing more challenges than VQA by requiring a deeper recognition and understanding of various shapes of human-readable scene texts as well as their meanings in different contexts. Existing works on TextVQA suffer from two weaknesses: i) scene texts and non-textual objects are processed separately and independently without considering their mutual interactions during the question understanding and answering process, ii) scene texts are encoded only through word embeddings without taking the corresponding visual appearance features as well as their potential relationships with other non-textual objects in the images into account. To overcome the weakness of existing works, we propose a novel multi-modal contextual graph neural network (MCG) model for TextVQA. The proposed MCG model can capture the relationships between visual features of scene texts and non-textual objects in the given images as well as utilize richer sources of multi-modal features to improve the model performance. In particular, we encode the scene texts into richer features containing textual, visual and positional features, then model the visual relations between scene texts and non-textual objects through a contextual graph neural network. Our extensive experiments on real-world dataset demonstrate the advantages of the proposed MCG model over baseline approaches.","PeriodicalId":6783,"journal":{"name":"2020 25th International Conference on Pattern Recognition (ICPR)","volume":"79 1","pages":"3491-3498"},"PeriodicalIF":0.0000,"publicationDate":"2021-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 25th International Conference on Pattern Recognition (ICPR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICPR48806.2021.9412891","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

Text visual question answering (TextVQA) aims to answer questions related to texts appearing in given images, posing more challenges than VQA by requiring deeper recognition and understanding of the various shapes of human-readable scene texts as well as their meanings in different contexts. Existing works on TextVQA suffer from two weaknesses: i) scene texts and non-textual objects are processed separately and independently, without considering their mutual interactions during the question understanding and answering process; ii) scene texts are encoded only through word embeddings, without taking into account the corresponding visual appearance features or their potential relationships with other non-textual objects in the images. To overcome these weaknesses, we propose a novel multi-modal contextual graph neural network (MCG) model for TextVQA. The proposed MCG model captures the relationships between visual features of scene texts and non-textual objects in the given images, and it utilizes richer sources of multi-modal features to improve model performance. In particular, we encode the scene texts into richer features containing textual, visual and positional features, then model the visual relations between scene texts and non-textual objects through a contextual graph neural network. Our extensive experiments on a real-world dataset demonstrate the advantages of the proposed MCG model over baseline approaches.
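The abstract only sketches the architecture, so the following is a minimal PyTorch sketch of the general idea, not the authors' implementation: each scene-text or object node fuses textual, visual and positional features, and a contextual graph layer (here approximated with self-attention as soft edges) propagates information between nodes. All dimensions, feature sources (e.g. FastText word vectors, Faster R-CNN region features) and the attention-based message passing are assumptions for illustration.

```python
import torch
import torch.nn as nn


class MultiModalNodeEncoder(nn.Module):
    """Fuse textual, visual, and positional features of a scene-text or
    object node into one embedding (dimensions are hypothetical)."""

    def __init__(self, d_text=300, d_visual=2048, d_pos=4, d_model=512):
        super().__init__()
        self.proj = nn.Linear(d_text + d_visual + d_pos, d_model)

    def forward(self, text_feat, visual_feat, pos_feat):
        # Concatenate the three modalities and project to a shared space.
        fused = torch.cat([text_feat, visual_feat, pos_feat], dim=-1)
        return torch.relu(self.proj(fused))


class ContextualGraphLayer(nn.Module):
    """One round of message passing over a fully connected graph whose nodes
    are scene texts and non-textual objects; attention weights act as soft
    edges between related nodes."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, nodes, pad_mask=None):
        # nodes: (batch, num_nodes, d_model); pad_mask marks padded nodes.
        ctx, _ = self.attn(nodes, nodes, nodes, key_padding_mask=pad_mask)
        nodes = self.norm1(nodes + ctx)
        return self.norm2(nodes + self.ffn(nodes))


if __name__ == "__main__":
    encoder = MultiModalNodeEncoder()
    graph = ContextualGraphLayer()
    # Toy image with 5 OCR tokens and 3 detected objects (8 nodes total).
    text_feat = torch.randn(1, 8, 300)     # e.g. word embeddings of OCR tokens
    visual_feat = torch.randn(1, 8, 2048)  # e.g. region appearance features
    pos_feat = torch.rand(1, 8, 4)         # normalized bounding boxes
    nodes = encoder(text_feat, visual_feat, pos_feat)
    out = graph(nodes)
    print(out.shape)  # torch.Size([1, 8, 512])
```

In this sketch the graph is fully connected and relations are learned implicitly through attention; the paper's actual model may restrict edges (e.g. by spatial proximity between scene texts and objects) and use a different message-passing scheme.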