ExVQA: a novel stacked attention networks with extended long short-term memory model for visual question answering

IF 4.0 · CAS Region 3 (Computer Science) · JCR Q1 (Computer Science, Hardware & Architecture)
Bui Thanh Hung, Ho Vo Hoang Duy
{"title":"ExVQA: a novel stacked attention networks with extended long short-term memory model for visual question answering","authors":"Bui Thanh Hung,&nbsp;Ho Vo Hoang Duy","doi":"10.1016/j.compeleceng.2025.110439","DOIUrl":null,"url":null,"abstract":"<div><div>Visual Question Answering (VQA) has garnered significant attention in recent years due to its potential for broad applications across fields such as medicine, education, and entertainment. However, existing VQA methods still face several limitations, including challenges in handling abstract and complex questions, poor generalization, lack of explainability, and susceptibility to noise and bias. In this study, we propose a novel ExVQA model that leverages Stacked Attention Networks (SANs) and Extended Long Short-Term Memory (xLSTM) for Visual Question Answering. Image features are extracted using Sigmoid loss for Language-Image Pre-training (SigLIP), while question features are represented using the Autoregressive Transformer Decoder model (GPT-Neo) and Extended Long Short-Term Memory networks to facilitate the answer generation process. By utilizing the strengths of SANs and xLSTM, our approach aims to overcome the limitations of previous models and enhance the performance and reliability of VQA systems. Evaluation results on three datasets: PathVQA, VQA-Med 2019 and GQA show that our proposed ExVQA model achieves better performance than existing methods, demonstrating great application potential in the fields of medicine, education and entertainment.</div></div>","PeriodicalId":50630,"journal":{"name":"Computers & Electrical Engineering","volume":"126 ","pages":"Article 110439"},"PeriodicalIF":4.0000,"publicationDate":"2025-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computers & Electrical Engineering","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0045790625003829","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0

Abstract

Visual Question Answering (VQA) has garnered significant attention in recent years due to its potential for broad applications across fields such as medicine, education, and entertainment. However, existing VQA methods still face several limitations, including challenges in handling abstract and complex questions, poor generalization, lack of explainability, and susceptibility to noise and bias. In this study, we propose a novel ExVQA model that leverages Stacked Attention Networks (SANs) and Extended Long Short-Term Memory (xLSTM) for Visual Question Answering. Image features are extracted using Sigmoid loss for Language-Image Pre-training (SigLIP), while question features are represented using the autoregressive Transformer decoder model GPT-Neo together with Extended Long Short-Term Memory networks to facilitate the answer generation process. By utilizing the strengths of SANs and xLSTM, our approach aims to overcome the limitations of previous models and enhance the performance and reliability of VQA systems. Evaluation results on three datasets (PathVQA, VQA-Med 2019, and GQA) show that our proposed ExVQA model achieves better performance than existing methods, demonstrating great application potential in the fields of medicine, education, and entertainment.
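The abstract describes the architecture only at a high level: SigLIP supplies region-level image features, GPT-Neo and an xLSTM encoder supply the question representation, and stacked attention layers fuse the two before answer generation. As a rough illustration of the stacked-attention fusion step, below is a minimal PyTorch sketch in the style of the original SAN formulation (Yang et al., 2016). All class names, dimensions, layer counts, and the classification head are illustrative assumptions; the paper's actual ExVQA implementation is not reproduced here.

```python
# Illustrative sketch only: a SAN-style attention head over precomputed
# image-region features and a question vector. The SigLIP, GPT-Neo, and
# xLSTM front ends assumed by the paper are stubbed out as plain tensors.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StackedAttention(nn.Module):
    """One attention layer in the style of SAN (Yang et al., 2016)."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_img = nn.Linear(dim, hidden, bias=False)
        self.w_qst = nn.Linear(dim, hidden)
        self.w_att = nn.Linear(hidden, 1)

    def forward(self, img_feats: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, R, D) region features; query: (B, D) question vector.
        h = torch.tanh(self.w_img(img_feats) + self.w_qst(query).unsqueeze(1))
        att = F.softmax(self.w_att(h).squeeze(-1), dim=1)      # (B, R) weights
        attended = (att.unsqueeze(-1) * img_feats).sum(dim=1)  # (B, D) summary
        return query + attended  # refined query fed to the next layer


class VQAFusionHead(nn.Module):
    """Stacks attention layers, then classifies over a fixed answer set."""

    def __init__(self, dim: int = 768, hidden: int = 512,
                 n_layers: int = 2, n_answers: int = 1000):
        super().__init__()
        self.layers = nn.ModuleList(
            [StackedAttention(dim, hidden) for _ in range(n_layers)])
        self.classifier = nn.Linear(dim, n_answers)

    def forward(self, img_feats: torch.Tensor, q_feat: torch.Tensor) -> torch.Tensor:
        # In ExVQA, img_feats would come from SigLIP and q_feat from a
        # GPT-Neo + xLSTM question encoder (assumed, not shown here).
        u = q_feat
        for layer in self.layers:
            u = layer(img_feats, u)
        return self.classifier(u)


if __name__ == "__main__":
    batch, regions, dim = 2, 196, 768
    head = VQAFusionHead(dim=dim)
    logits = head(torch.randn(batch, regions, dim), torch.randn(batch, dim))
    print(logits.shape)  # torch.Size([2, 1000])
```

The residual update `query + attended` is what lets a second attention layer re-query the image with a question vector already grounded by the first pass, which is the core idea behind stacking attention layers.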
Source Journal
Computers & Electrical Engineering
Computers & Electrical Engineering 工程技术-工程:电子与电气
CiteScore: 9.20
Self-citation rate: 7.00%
Annual publications: 661
Review time: 47 days
Journal description: The impact of computers has nowhere been more revolutionary than in electrical engineering. The design, analysis, and operation of electrical and electronic systems are now dominated by computers, a transformation that has been motivated by the natural ease of interface between computers and electrical systems, and the promise of spectacular improvements in speed and efficiency. Published since 1973, Computers & Electrical Engineering provides rapid publication of topical research into the integration of computer technology and computational techniques with electrical and electronic systems. The journal publishes papers featuring novel implementations of computers and computational techniques in areas like signal and image processing, high-performance computing, parallel processing, and communications. Special attention will be paid to papers describing innovative architectures, algorithms, and software tools.