Dual-Modality Integration Attention with Graph-Based Feature Extraction for Visual Question and Answering
Authors: Jing Lu; Chunlei Wu; Leiquan Wang; Ran Li; Xiuxuan Shen
Journal: Tsinghua Science and Technology, 30 (5), pp. 2133-2145
Publication date: 2025-04-29
Publication type: Journal Article
DOI: 10.26599/TST.2024.9010093
URL: https://ieeexplore.ieee.org/document/10979795/
Impact factor: 6.6 (JCR Q1, Multidisciplinary; Region 1, Computer Science)
Citation count: 0
Abstract
Visual Question and Answering (VQA) has garnered significant attention as a domain that requires the synthesis of visual and textual information to produce accurate responses. While existing methods often rely on Convolutional Neural Networks (CNNs) for feature extraction and attention mechanisms for embedding learning, they frequently fail to capture the nuanced interactions between entities within images, leading to potential ambiguities in answer generation. In this paper, we introduce a novel network architecture, Dual-modality Integration Attention with Graph-based Feature Extraction (DIAGFE), which addresses these limitations by incorporating two key innovations: a Graph-based Feature Extraction (GFE) module that enhances the precision of visual semantics extraction, and a Dual-modality Integration Attention (DIA) mechanism that efficiently fuses visual and question features to guide the model towards more accurate answer generation. Our model is trained with a composite loss function to refine its predictive accuracy. Rigorous experiments on the VQA2.0 dataset demonstrate that DIAGFE outperforms existing methods, underscoring the effectiveness of our approach in advancing VQA research and its potential for cross-modal understanding.
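The abstract names two components, graph-based feature extraction and question-guided fusion of visual features, without giving their equations. The following is a minimal sketch of the general idea only, not the authors' GFE or DIA modules: it assumes mean-neighbor message passing over image-region features and scaled dot-product attention driven by a question vector. All function names, the toy features, and the adjacency matrix are hypothetical.

```python
import math


def graph_aggregate(features, adjacency):
    """One round of mean-neighbor message passing (an assumed stand-in for GFE)."""
    updated = []
    for i, feat in enumerate(features):
        neighbors = [features[j] for j, edge in enumerate(adjacency[i]) if edge and j != i]
        if not neighbors:
            # Isolated region: keep its feature unchanged.
            updated.append(list(feat))
            continue
        mean = [sum(vals) / len(neighbors) for vals in zip(*neighbors)]
        # Average the region's own feature with its neighborhood mean.
        updated.append([(a + b) / 2.0 for a, b in zip(feat, mean)])
    return updated


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def question_guided_attention(features, question):
    """Scaled dot-product attention of region features against a question vector
    (an assumed stand-in for DIA)."""
    scale = math.sqrt(len(question))
    scores = [sum(f * q for f, q in zip(feat, question)) / scale for feat in features]
    weights = softmax(scores)
    attended = [
        sum(w * feat[k] for w, feat in zip(weights, features))
        for k in range(len(question))
    ]
    return weights, attended


# Toy input: three image regions connected in a chain (0 - 1 - 2).
regions = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
question = [1.0, 0.0]  # hypothetical question embedding

context = graph_aggregate(regions, adjacency)
weights, attended = question_guided_attention(context, question)
```

In this toy setup the attention weights rank regions by how well their graph-contextualized features align with the question vector. A real model would learn the region features, the graph edges, and the projections from data.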
About the journal:
Tsinghua Science and Technology (Tsinghua Sci Technol) started publication in 1996. It is an international academic journal sponsored by Tsinghua University and is published bimonthly. The journal aims to present up-to-date scientific achievements in computer science, electronic engineering, and other IT fields. Contributions from all over the world are welcome.