Multi-scale dual-stream visual feature extraction and graph reasoning for visual question answering

IF 3.4 | CAS Zone 2, Computer Science | JCR Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Applied Intelligence. Pub Date: 2025-03-18. DOI: 10.1007/s10489-025-06325-4

Abdulganiyu Abdu Yusuf, Chong Feng, Xianling Mao, Xinyan Li, Yunusa Haruna, Ramadhani Ally Duma

Applied Intelligence, volume 55, issue 6. https://link.springer.com/article/10.1007/s10489-025-06325-4

Citations: 0

Abstract

Recent advancements in deep learning algorithms have significantly expanded the capabilities of systems to handle vision-to-language (V2L) tasks. Visual question answering (VQA) presents challenges that require a deep understanding of visual and language content to perform complex reasoning tasks. Existing VQA models often rely on grid-based or region-based visual features, which capture global context and object-specific details, respectively. However, balancing the complementary strengths of each feature type while minimizing fusion noise remains a significant challenge. This study proposes a multi-scale dual-stream visual feature extraction method that combines grid and region features to enhance both global and local visual feature representations. A visual graph relational reasoning (VGRR) approach is also proposed to further improve reasoning: it constructs a graph that models spatial and semantic relationships between visual objects and applies Graph Attention Networks (GATs) for relational reasoning. To enhance the interaction between visual and textual modalities, we further propose a cross-modal self-attention fusion strategy, which enables the model to focus selectively on the most relevant parts of both the image and the question. The proposed model is evaluated on the VQA 2.0 and GQA benchmark datasets, demonstrating competitive performance with significant accuracy improvements compared to state-of-the-art methods. Ablation studies confirm the effectiveness of each module in enhancing visual-textual understanding and answer prediction.
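
The abstract names Graph Attention Networks as the relational reasoning step but gives no implementation detail. As a rough illustration only, the following is a minimal single-head graph attention layer in plain Python, following the generic GAT formulation (project each node, score each neighbor pair with a LeakyReLU, softmax over neighbors, take the weighted sum). It is not the authors' VGRR module: the function name `gat_layer`, the toy features, the adjacency matrix, and the weights `W` and `a` are all assumptions made up for this sketch.

```python
import math


def leaky_relu(x, slope=0.2):
    """LeakyReLU as used for GAT attention scores."""
    return x if x > 0 else slope * x


def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def gat_layer(features, adjacency, W, a):
    """Single-head graph attention over node features (illustrative only).

    features:  N node vectors, e.g. one per detected object (hypothetical).
    adjacency: N x N 0/1 matrix; adjacency[i][j] = 1 if j is a neighbor of i.
               A self-loop is always added for each node.
    W:         projection matrix with shape (out_dim, in_dim).
    a:         attention vector of length 2 * out_dim.
    Returns (updated_features, attention_matrix).
    """
    projected = [matvec(W, h) for h in features]
    n, out_dim = len(features), len(W)
    updated, attention = [], []
    for i in range(n):
        neighbors = [j for j in range(n) if adjacency[i][j] or j == i]
        # Score each neighbor from the concatenated projected pair.
        scores = [
            leaky_relu(sum(w * x for w, x in zip(a, projected[i] + projected[j])))
            for j in neighbors
        ]
        # Numerically stable softmax restricted to the neighborhood.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        row = [0.0] * n
        for j, alpha in zip(neighbors, alphas):
            row[j] = alpha
        attention.append(row)
        # New node feature: attention-weighted sum of projected neighbors.
        updated.append([
            sum(alpha * projected[j][k] for j, alpha in zip(neighbors, alphas))
            for k in range(out_dim)
        ])
    return updated, attention


# Toy example: three "objects" in a chain 0 - 1 - 2 (made-up values).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.5, -0.5, 0.25, 0.75]
updated, attention = gat_layer(features, adjacency, W, a)
```

Each row of `attention` sums to 1 and is zero for non-neighbors, so node 0 never attends directly to node 2 in this chain. In a real model the edges would come from the spatial and semantic relationships the abstract describes, and `W` and `a` would be learned rather than fixed.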

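Source journal

Applied Intelligence (Engineering & Technology: Computer Science, Artificial Intelligence)

CiteScore: 6.60

Self-citation rate: 20.80%

Articles published: 1361

Review time: 5.9 months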

Journal description: With a focus on research in artificial intelligence and neural networks, this journal addresses issues involving solutions of real-life manufacturing, defense, management, government and industrial problems which are too complex to be solved through conventional approaches and require the simulation of intelligent thought processes, heuristics, applications of knowledge, and distributed and parallel processing. The integration of these multiple approaches in solving complex problems is of particular importance. The journal presents new and original research and technological developments, addressing real and complex issues applicable to difficult problems. It provides a medium for exchanging scientific research and technological achievements accomplished by the international community.