Prior Visual Relationship Reasoning for Visual Question Answering

Zhuoqian Yang, Zengchang Qin, Jing Yu, T. Wan
{"title":"视觉问答的先验视觉关系推理","authors":"Zhuoqian Yang, Zengchang Qin, Jing Yu, T. Wan","doi":"10.1109/ICIP40778.2020.9190771","DOIUrl":null,"url":null,"abstract":"Visual Question Answering (VQA) is a representative task of cross-modal reasoning where an image and a free-form question in natural language are presented and the correct answer needs to be determined using both visual and textual information. One of the key issues of VQA is to reason with semantic clues in the visual content under the guidance of the question. In this paper, we propose Scene Graph Convolutional Network (SceneGCN) to jointly reason the object properties and their semantic relations for the correct answer. The visual relationship is projected into a deep learned semantic space constrained by visual context and language priors. Based on comprehensive experiments on two challenging datasets: GQA and VQA 2.0, we demonstrate the effectiveness and interpretability of the new model.","PeriodicalId":405734,"journal":{"name":"2020 IEEE International Conference on Image Processing (ICIP)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":"{\"title\":\"Prior Visual Relationship Reasoning For Visual Question Answering\",\"authors\":\"Zhuoqian Yang, Zengchang Qin, Jing Yu, T. Wan\",\"doi\":\"10.1109/ICIP40778.2020.9190771\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Visual Question Answering (VQA) is a representative task of cross-modal reasoning where an image and a free-form question in natural language are presented and the correct answer needs to be determined using both visual and textual information. One of the key issues of VQA is to reason with semantic clues in the visual content under the guidance of the question. In this paper, we propose Scene Graph Convolutional Network (SceneGCN) to jointly reason the object properties and their semantic relations for the correct answer. The visual relationship is projected into a deep learned semantic space constrained by visual context and language priors. Based on comprehensive experiments on two challenging datasets: GQA and VQA 2.0, we demonstrate the effectiveness and interpretability of the new model.\",\"PeriodicalId\":405734,\"journal\":{\"name\":\"2020 IEEE International Conference on Image Processing (ICIP)\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"13\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE International Conference on Image Processing (ICIP)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICIP40778.2020.9190771\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE International Conference on Image Processing (ICIP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICIP40778.2020.9190771","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 13

Abstract

Visual Question Answering (VQA) is a representative cross-modal reasoning task in which an image and a free-form natural-language question are presented, and the correct answer must be determined using both visual and textual information. One of the key issues in VQA is reasoning over semantic clues in the visual content under the guidance of the question. In this paper, we propose the Scene Graph Convolutional Network (SceneGCN), which jointly reasons about object properties and their semantic relations to derive the correct answer. Visual relationships are projected into a deep-learned semantic space constrained by visual context and language priors. Through comprehensive experiments on two challenging datasets, GQA and VQA 2.0, we demonstrate the effectiveness and interpretability of the new model.
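To make the abstract's core idea concrete, the sketch below shows one plausible form of relation-conditioned graph convolution over a scene graph: each object node aggregates messages from its neighbors, with every message conditioned on a learned embedding of the visual relationship on that edge. This is an illustrative sketch only, not the authors' SceneGCN implementation; the module names, dimensionalities, and message function are assumptions made for exposition.

```python
# Minimal sketch of a relation-aware graph convolution over a scene graph.
# NOT the authors' SceneGCN code; names and dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationGraphConv(nn.Module):
    """One message-passing step: object nodes aggregate neighbor messages,
    each conditioned on the semantic embedding of the edge's relationship."""

    def __init__(self, obj_dim: int, rel_dim: int, out_dim: int):
        super().__init__()
        # Message computed from [neighbor object feature ; relation embedding].
        self.message = nn.Linear(obj_dim + rel_dim, out_dim)
        # Node update from [own feature ; aggregated incoming messages].
        self.update = nn.Linear(obj_dim + out_dim, out_dim)

    def forward(self, obj_feats, rel_feats, edges):
        # obj_feats: (N, obj_dim) region features for N detected objects
        # rel_feats: (E, rel_dim) embeddings of E visual relationships
        # edges:     (E, 2) long tensor of (subject, object) node indices
        src, dst = edges[:, 0], edges[:, 1]
        msgs = F.relu(self.message(torch.cat([obj_feats[src], rel_feats], dim=-1)))
        # Sum incoming messages at each destination node.
        agg = torch.zeros(obj_feats.size(0), msgs.size(-1))
        agg.index_add_(0, dst, msgs)
        return F.relu(self.update(torch.cat([obj_feats, agg], dim=-1)))


if __name__ == "__main__":
    # Toy scene graph: 4 objects, 3 relationships.
    conv = RelationGraphConv(obj_dim=8, rel_dim=4, out_dim=8)
    objs = torch.randn(4, 8)
    rels = torch.randn(3, 4)
    edges = torch.tensor([[0, 1], [1, 2], [3, 0]])
    print(conv(objs, rels, edges).shape)  # torch.Size([4, 8])
```

Conditioning each message on a relation embedding, rather than using a relation-agnostic adjacency, is what lets "language priors" about relationships (e.g., "holding" vs. "next to") influence how visual evidence propagates between objects; question-guided attention over the resulting node features would then select the evidence relevant to the answer.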