Multi-Granularity Feature Interaction and Multi-Region Selection Based Triplet Visual Question Answering

IF 7.5 · CAS Zone 3, Computer Science · JCR Q1, COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Big Data Pub Date : 2024-09-03 DOI:10.1109/TBDATA.2024.3453750

Heng Liu;Boyue Wang;Yanfeng Sun;Junbin Gao;Xiaoyan Li;Yongli Hu;Baocai Yin

IEEE Transactions on Big Data, vol. 11, no. 3, pp. 1346-1356. Journal Article. https://ieeexplore.ieee.org/document/10663929/

Citation count: 0

Abstract

Accurately locating the question-related regions in a given image is crucial for visual question answering (VQA). Current approaches suffer from two limitations: (1) dividing an image into multiple regions may lose part of the semantic information and the original relationships between regions; (2) choosing only one region, or all regions, to predict the answer may result in insufficient or redundant information, respectively. It is therefore vital to mine the relationships between image regions effectively and to choose the relevant regions. In this paper, we propose a novel Multi-granularity feature interaction and Multi-region selection-based triplet VQA model (M2TVQA). To tackle the first limitation, we propose a multi-granularity feature interaction strategy that adaptively supplements the global coarse-granularity features with the regional fine-granularity features. To overcome the second limitation, we design a Top-$K$ learning strategy that adaptively selects the $K$ image regions most relevant to the question, even if the selected regions are far apart in space. Such a strategy can select as many relevant image regions as possible while limiting the noise introduced. Finally, we construct a multi-modality triplet to predict the VQA answer. Extensive experiments on two public outside-knowledge datasets, OK-VQA and KRVQA, verify the effectiveness of the proposed model.
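The region-selection idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how relevance is scored, how $K$ is learned, or how features are combined, so the code below assumes cosine similarity as the relevance score, a fixed $K$, and a softmax-weighted sum for supplementing the global feature. All function names and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_top_k_regions(question_vec, region_vecs, k):
    """Return (indices of the k most question-relevant regions, all scores).

    Assumption: relevance is plain cosine similarity and k is fixed,
    whereas the paper describes a learned, adaptive selection.
    """
    scores = [cosine(question_vec, r) for r in region_vecs]
    ranked = sorted(range(len(region_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores


def supplement_global(global_vec, region_vecs, indices, scores):
    """Add a softmax-weighted sum of the selected region features to the global feature.

    Assumption: a simple additive fusion standing in for the paper's
    multi-granularity feature interaction.
    """
    exps = [math.exp(scores[i]) for i in indices]
    total = sum(exps)
    fused = list(global_vec)
    for weight, i in zip(exps, indices):
        for d, value in enumerate(region_vecs[i]):
            fused[d] += (weight / total) * value
    return fused


# Toy data: a 3-dimensional question embedding and four region embeddings.
question = [1.0, 0.0, 0.0]
regions = [
    [0.9, 0.1, 0.0],   # region 0: closely aligned with the question
    [0.0, 1.0, 0.0],   # region 1: orthogonal to the question
    [0.8, 0.0, 0.6],   # region 2: partly aligned
    [-1.0, 0.0, 0.0],  # region 3: opposite direction
]
global_feature = [0.5, 0.5, 0.5]

top_indices, relevance = select_top_k_regions(question, regions, k=2)
fused_feature = supplement_global(global_feature, regions, top_indices, relevance)
print(top_indices, fused_feature)
```

Selection here depends only on the relevance score, not on where a region sits in the image, which mirrors the abstract's point that the chosen regions may be far apart in space.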


Source journal

IEEE Transactions on Big Data Multiple-

CiteScore: 11.80

Self-citation rate: 2.80%

Articles published: 114

Journal description: The IEEE Transactions on Big Data publishes peer-reviewed articles focusing on big data. These articles present innovative research ideas and application results across disciplines, including novel theories, algorithms, and applications. Research areas cover a wide range, such as big data analytics, visualization, curation, management, semantics, infrastructure, standards, performance analysis, intelligence extraction, scientific discovery, security, privacy, and legal issues specific to big data. The journal also prioritizes applications of big data in fields generating massive datasets.