Title: Multi-Granularity Feature Interaction and Multi-Region Selection Based Triplet Visual Question Answering
Authors: Heng Liu; Boyue Wang; Yanfeng Sun; Junbin Gao; Xiaoyan Li; Yongli Hu; Baocai Yin
Journal: IEEE Transactions on Big Data, vol. 11, no. 3, pp. 1346-1356 (JCR Q1, Computer Science, Information Systems; impact factor 7.5)
Published: 2024-09-03
DOI: 10.1109/TBDATA.2024.3453750
URL: https://ieeexplore.ieee.org/document/10663929/
Citations: 0
Abstract
Accurately locating the question-related regions in a given image is crucial for visual question answering (VQA). Current approaches suffer from two limitations: (1) dividing an image into multiple regions may lose part of the semantic information and the original relationships between regions; (2) choosing only one region, or all regions, to predict the answer may result in insufficient or redundant information, respectively. It is therefore vital to mine the relationships between image regions effectively and to choose the relevant ones. In this paper, we propose a novel Multi-granularity feature interaction and Multi-region selection-based triplet VQA model (M2TVQA). To tackle the first limitation, we propose a multi-granularity feature interaction strategy that adaptively supplements the global coarse-granularity features with the regional fine-granularity features. To overcome the second limitation, we design a Top-$K$ learning strategy that adaptively selects the $K$ image regions most relevant to the question, even if the selected regions are far apart in space. Such a strategy can select as many relevant image regions as possible while limiting the noise introduced. Finally, we construct a multi-modality triplet to predict the answer. Extensive experiments on two public outside-knowledge datasets, OK-VQA and KRVQA, verify the effectiveness of the proposed model.
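To make the idea of selecting the $K$ regions most relevant to the question concrete, the following is a minimal sketch only. It is not the M2TVQA method: the abstract describes a learned, adaptive Top-$K$ strategy, whereas this example assumes a fixed cosine-similarity score and a hard top-$K$ cut. The function names `cosine` and `select_top_k_regions` and all feature values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_top_k_regions(question_vec, region_vecs, k):
    """Return the indices of the k regions most similar to the question, best first.

    Assumption: relevance is plain cosine similarity. The paper's strategy is
    learned, so this only illustrates the selection step, not the scoring.
    """
    scores = [cosine(question_vec, r) for r in region_vecs]
    ranked = sorted(range(len(region_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy example: four regions with 3-dimensional features (made-up numbers).
question = [1.0, 0.0, 1.0]
regions = [
    [0.9, 0.1, 1.1],     # region 0: nearly parallel to the question
    [0.0, 1.0, 0.0],     # region 1: orthogonal to the question
    [-1.0, 0.0, -1.0],   # region 2: opposite direction
    [1.0, 0.5, 0.8],     # region 3: fairly close to the question
]
top_regions = select_top_k_regions(question, regions, k=2)
print(top_regions)
```

Because the ranking depends only on feature similarity and not on region position, the two selected regions need not be adjacent in the image, which mirrors the abstract's point that relevant regions may be far apart in space.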
About the journal:
The IEEE Transactions on Big Data publishes peer-reviewed articles focusing on big data. These articles present innovative research ideas and application results across disciplines, including novel theories, algorithms, and applications. Research areas cover a wide range, such as big data analytics, visualization, curation, management, semantics, infrastructure, standards, performance analysis, intelligence extraction, scientific discovery, security, privacy, and legal issues specific to big data. The journal also prioritizes applications of big data in fields generating massive datasets.