Dual-Modality Integration Attention with Graph-Based Feature Extraction for Visual Question and Answering

Impact factor 3.5; Zone 1 (Computer Science); JCR Q1 (Multidisciplinary)

Tsinghua Science and Technology Pub Date : 2025-04-29 DOI:10.26599/TST.2024.9010093

Jing Lu;Chunlei Wu;Leiquan Wang;Ran Li;Xiuxuan Shen

Tsinghua Science and Technology, vol. 30, no. 5, pp. 2133-2145. Full text: https://ieeexplore.ieee.org/document/10979795/

Abstract

Visual Question and Answering (VQA) has garnered significant attention as a domain that requires the synthesis of visual and textual information to produce accurate responses. While existing methods often rely on Convolutional Neural Networks (CNNs) for feature extraction and attention mechanisms for embedding learning, they frequently fail to capture the nuanced interactions between entities within images, leading to potential ambiguities in answer generation. In this paper, we introduce a novel network architecture, Dual-modality Integration Attention with Graph-based Feature Extraction (DIAGFE), which addresses these limitations by incorporating two key innovations: a Graph-based Feature Extraction (GFE) module that enhances the precision of visual semantics extraction, and a Dual-modality Integration Attention (DIA) mechanism that efficiently fuses visual and question features to guide the model towards more accurate answer generation. Our model is trained with a composite loss function to refine its predictive accuracy. Rigorous experiments on the VQA2.0 dataset demonstrate that DIAGFE outperforms existing methods, underscoring the effectiveness of our approach in advancing VQA research and its potential for cross-modal understanding.
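
The abstract does not describe the internals of the GFE module or the DIA mechanism, so the following is only a minimal sketch of the general idea: refine image-region features over a graph of related regions, then let the question weight those regions and fuse the result with the question vector. Every name here (`graph_refine`, `question_guided_attention`, `fuse`), the mean-aggregation step, the scaled dot-product scoring, and the element-wise fusion are illustrative assumptions, not the authors' method.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def graph_refine(regions, adjacency):
    """One round of mean-aggregation message passing over region features.

    Hypothetical stand-in for graph-based feature extraction: each region
    is averaged 50/50 with the mean of its linked neighbours.
    """
    refined = []
    for i, feat in enumerate(regions):
        neighbours = [regions[j] for j, linked in enumerate(adjacency[i])
                      if linked and j != i]
        if not neighbours:
            refined.append(list(feat))  # isolated node keeps its feature
            continue
        mean = [sum(col) / len(neighbours) for col in zip(*neighbours)]
        refined.append([0.5 * f + 0.5 * m for f, m in zip(feat, mean)])
    return refined


def question_guided_attention(regions, question):
    """Score each region against the question, then pool the regions.

    Assumed scoring rule: scaled dot product followed by softmax.
    """
    scale = math.sqrt(len(question))
    weights = softmax([dot(r, question) / scale for r in regions])
    attended = [sum(w * r[k] for w, r in zip(weights, regions))
                for k in range(len(question))]
    return weights, attended


def fuse(attended, question):
    """Assumed fusion: element-wise product of the two modalities."""
    return [a * q for a, q in zip(attended, question)]


# Toy inputs: three 2-d region features, a chain graph 0-1-2, one question vector.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
question = [1.0, 0.0]

refined = graph_refine(regions, adjacency)
weights, attended = question_guided_attention(refined, question)
joint = fuse(attended, question)
raw_weights, _ = question_guided_attention(regions, question)
```

In a trained model the region features, the graph edges, and the question vector would come from learned encoders, and the fused vector would feed an answer classifier; none of those components are specified in the abstract.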

Source journal

Tsinghua Science and Technology (categories: Computer Science, Information Systems; Computer Science, Software Engineering)

CiteScore: 10.20

Self-citation rate: 10.60%

Number of publications: 2340

About the journal: Tsinghua Science and Technology (Tsinghua Sci Technol) started publication in 1996. It is an international academic journal sponsored by Tsinghua University and is published bimonthly. The journal aims to present up-to-date scientific achievements in computer science, electronic engineering, and other IT fields. Contributions from all over the world are welcome.