Cross-modal Relational Reasoning Network for Visual Question Answering

Hongyu Chen, Ruifang Liu, Bo Peng
{"title":"面向视觉问答的跨模态关系推理网络","authors":"Hongyu Chen, Ruifang Liu, Bo Peng","doi":"10.1109/ICCVW54120.2021.00441","DOIUrl":null,"url":null,"abstract":"Visual Question Answering (VQA) is a challenging task that requires a cross-modal understanding of images and questions with relational reasoning leading to the correct answer. To bridge the semantic gap between these two modalities, previous works focus on the word-region alignments of all possible pairs without attending more attention to the corresponding word and object. Treating all pairs equally without consideration of relation consistency hinders the model’s performance. In this paper, to align the relation-consistent pairs and integrate the interpretability of VQA systems, we propose a Cross-modal Relational Reasoning Network (CRRN), to mask the inconsistent attention map and highlight the full latent alignments of corresponding word-region pairs. Specifically, we present two relational masks for inter-modal and intra-modal highlighting, inferring the more and less important words in sentences or regions in images. The attention interrelationship of consistent pairs can be enhanced with the shift of learning focus by masking the unaligned relations. Then, we propose two novel losses ℒCMAM and ℒSMAM with explicit supervision to capture the fine-grained interplay between vision and language. We have conduct thorough experiments to prove the effectiveness and achieve the competitive performance for reaching 61.74% on GQA benchmark.","PeriodicalId":226794,"journal":{"name":"2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Cross-modal Relational Reasoning Network for Visual Question Answering\",\"authors\":\"Hongyu Chen, Ruifang Liu, Bo Peng\",\"doi\":\"10.1109/ICCVW54120.2021.00441\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Visual Question Answering (VQA) is a challenging task that requires a cross-modal understanding of images and questions with relational reasoning leading to the correct answer. To bridge the semantic gap between these two modalities, previous works focus on the word-region alignments of all possible pairs without attending more attention to the corresponding word and object. Treating all pairs equally without consideration of relation consistency hinders the model’s performance. In this paper, to align the relation-consistent pairs and integrate the interpretability of VQA systems, we propose a Cross-modal Relational Reasoning Network (CRRN), to mask the inconsistent attention map and highlight the full latent alignments of corresponding word-region pairs. Specifically, we present two relational masks for inter-modal and intra-modal highlighting, inferring the more and less important words in sentences or regions in images. The attention interrelationship of consistent pairs can be enhanced with the shift of learning focus by masking the unaligned relations. Then, we propose two novel losses ℒCMAM and ℒSMAM with explicit supervision to capture the fine-grained interplay between vision and language. 
We have conduct thorough experiments to prove the effectiveness and achieve the competitive performance for reaching 61.74% on GQA benchmark.\",\"PeriodicalId\":226794,\"journal\":{\"name\":\"2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)\",\"volume\":\"10 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICCVW54120.2021.00441\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCVW54120.2021.00441","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3

Abstract

Visual Question Answering (VQA) is a challenging task that requires a cross-modal understanding of images and questions, with relational reasoning leading to the correct answer. To bridge the semantic gap between the two modalities, previous works focus on word-region alignments over all possible pairs without paying particular attention to the corresponding word and object. Treating all pairs equally, without considering relation consistency, hinders the model's performance. In this paper, to align relation-consistent pairs and improve the interpretability of VQA systems, we propose a Cross-modal Relational Reasoning Network (CRRN) that masks inconsistent attention maps and highlights the full latent alignments of corresponding word-region pairs. Specifically, we present two relational masks, for inter-modal and intra-modal highlighting, which infer the more and less important words in a sentence or regions in an image. By masking unaligned relations, the model shifts its learning focus and strengthens the attention interrelationship of consistent pairs. We then propose two novel losses, ℒCMAM and ℒSMAM, with explicit supervision to capture the fine-grained interplay between vision and language. Thorough experiments demonstrate the effectiveness of the approach, which achieves competitive performance, reaching 61.74% on the GQA benchmark.
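The abstract describes masking relation-inconsistent word-region pairs in the attention map and supervising the attention with explicit losses. No implementation details are given here, so the snippet below is only a minimal PyTorch-style sketch of the general idea: masked cross-modal attention between word and region features, plus a toy supervised alignment loss. The tensor shapes, the mask construction, and the `alignment_loss` stand-in are assumptions for illustration, not the paper's CRRN architecture or its ℒCMAM / ℒSMAM definitions.

```python
import torch
import torch.nn.functional as F

def masked_cross_modal_attention(word_feats, region_feats, relation_mask):
    """Attend question words to image regions, suppressing
    relation-inconsistent pairs with a binary mask.

    word_feats:    (B, Nw, D)  question word features
    region_feats:  (B, Nr, D)  image region features
    relation_mask: (B, Nw, Nr) 1 for relation-consistent pairs, 0 otherwise
    """
    # Raw word-region affinity scores, scaled by sqrt(D).
    scores = torch.bmm(word_feats, region_feats.transpose(1, 2)) / word_feats.size(-1) ** 0.5
    # Mask out inconsistent pairs before normalisation.
    scores = scores.masked_fill(relation_mask == 0, float('-inf'))
    attn = F.softmax(scores, dim=-1)            # (B, Nw, Nr)
    attended = torch.bmm(attn, region_feats)    # (B, Nw, D)
    return attended, attn

def alignment_loss(attn, target_alignment):
    """Toy stand-in for an explicitly supervised attention loss:
    push the attention map toward a given word-region alignment."""
    return F.binary_cross_entropy(attn.clamp(1e-6, 1 - 1e-6), target_alignment)

if __name__ == "__main__":
    B, Nw, Nr, D = 2, 8, 36, 512
    words = torch.randn(B, Nw, D)
    regions = torch.randn(B, Nr, D)
    mask = (torch.rand(B, Nw, Nr) > 0.5).float()
    mask[..., 0] = 1.0  # keep at least one region per word unmasked
    attended, attn = masked_cross_modal_attention(words, regions, mask)
    target = mask / mask.sum(-1, keepdim=True)
    print(attended.shape, alignment_loss(attn, target).item())
```

The key design point the abstract hints at is that masking happens before the softmax, so probability mass is redistributed only over relation-consistent pairs rather than merely down-weighting inconsistent ones afterwards.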