ExVQA: a novel stacked attention networks with extended long short-term memory model for visual question answering

IF 4.0 · CAS Region 3 (Computer Science) · JCR Q1 (Computer Science, Hardware & Architecture)
Bui Thanh Hung, Ho Vo Hoang Duy
{"title":"ExVQA: a novel stacked attention networks with extended long short-term memory model for visual question answering","authors":"Bui Thanh Hung,&nbsp;Ho Vo Hoang Duy","doi":"10.1016/j.compeleceng.2025.110439","DOIUrl":null,"url":null,"abstract":"<div><div>Visual Question Answering (VQA) has garnered significant attention in recent years due to its potential for broad applications across fields such as medicine, education, and entertainment. However, existing VQA methods still face several limitations, including challenges in handling abstract and complex questions, poor generalization, lack of explainability, and susceptibility to noise and bias. In this study, we propose a novel ExVQA model that leverages Stacked Attention Networks (SANs) and Extended Long Short-Term Memory (xLSTM) for Visual Question Answering. Image features are extracted using Sigmoid loss for Language-Image Pre-training (SigLIP), while question features are represented using the Autoregressive Transformer Decoder model (GPT-Neo) and Extended Long Short-Term Memory networks to facilitate the answer generation process. By utilizing the strengths of SANs and xLSTM, our approach aims to overcome the limitations of previous models and enhance the performance and reliability of VQA systems. Evaluation results on three datasets: PathVQA, VQA-Med 2019 and GQA show that our proposed ExVQA model achieves better performance than existing methods, demonstrating great application potential in the fields of medicine, education and entertainment.</div></div>","PeriodicalId":50630,"journal":{"name":"Computers & Electrical Engineering","volume":"126 ","pages":"Article 110439"},"PeriodicalIF":4.0000,"publicationDate":"2025-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computers & Electrical Engineering","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0045790625003829","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0

Abstract

Visual Question Answering (VQA) has garnered significant attention in recent years due to its potential for broad applications across fields such as medicine, education, and entertainment. However, existing VQA methods still face several limitations, including challenges in handling abstract and complex questions, poor generalization, lack of explainability, and susceptibility to noise and bias. In this study, we propose a novel ExVQA model that leverages Stacked Attention Networks (SANs) and Extended Long Short-Term Memory (xLSTM) for Visual Question Answering. Image features are extracted using Sigmoid loss for Language-Image Pre-training (SigLIP), while question features are represented using the autoregressive Transformer decoder model GPT-Neo together with Extended Long Short-Term Memory networks to facilitate the answer generation process. By utilizing the strengths of SANs and xLSTM, our approach aims to overcome the limitations of previous models and enhance the performance and reliability of VQA systems. Evaluation results on three datasets (PathVQA, VQA-Med 2019, and GQA) show that our proposed ExVQA model achieves better performance than existing methods, demonstrating great application potential in the fields of medicine, education, and entertainment.
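The abstract describes the architecture only at a high level: SigLIP supplies region-level image features, GPT-Neo and an xLSTM encoder supply the question representation, and stacked attention layers fuse the two before answer generation. As a rough illustration of the stacked-attention fusion step, below is a minimal PyTorch sketch in the style of the original SAN formulation (Yang et al., 2016). All class names, dimensions, layer counts, and the classification head are illustrative assumptions; the paper's actual ExVQA implementation is not reproduced here.

```python
# Illustrative sketch only: a SAN-style attention head over precomputed
# image-region features and a question vector. The SigLIP, GPT-Neo, and
# xLSTM front ends assumed by the paper are stubbed out as plain tensors.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StackedAttention(nn.Module):
    """One attention layer in the style of SAN (Yang et al., 2016)."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_img = nn.Linear(dim, hidden, bias=False)
        self.w_qst = nn.Linear(dim, hidden)
        self.w_att = nn.Linear(hidden, 1)

    def forward(self, img_feats: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, R, D) region features; query: (B, D) question vector.
        h = torch.tanh(self.w_img(img_feats) + self.w_qst(query).unsqueeze(1))
        att = F.softmax(self.w_att(h).squeeze(-1), dim=1)      # (B, R) weights
        attended = (att.unsqueeze(-1) * img_feats).sum(dim=1)  # (B, D) summary
        return query + attended  # refined query fed to the next layer


class VQAFusionHead(nn.Module):
    """Stacks attention layers, then classifies over a fixed answer set."""

    def __init__(self, dim: int = 768, hidden: int = 512,
                 n_layers: int = 2, n_answers: int = 1000):
        super().__init__()
        self.layers = nn.ModuleList(
            [StackedAttention(dim, hidden) for _ in range(n_layers)])
        self.classifier = nn.Linear(dim, n_answers)

    def forward(self, img_feats: torch.Tensor, q_feat: torch.Tensor) -> torch.Tensor:
        # In ExVQA, img_feats would come from SigLIP and q_feat from a
        # GPT-Neo + xLSTM question encoder (assumed, not shown here).
        u = q_feat
        for layer in self.layers:
            u = layer(img_feats, u)
        return self.classifier(u)


if __name__ == "__main__":
    batch, regions, dim = 2, 196, 768
    head = VQAFusionHead(dim=dim)
    logits = head(torch.randn(batch, regions, dim), torch.randn(batch, dim))
    print(logits.shape)  # torch.Size([2, 1000])
```

The residual update `query + attended` is what lets a second attention layer re-query the image with a question vector already grounded by the first pass, which is the core idea behind stacking attention layers.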
Source Journal
Computers & Electrical Engineering
Computers & Electrical Engineering 工程技术-工程:电子与电气
CiteScore: 9.20
Self-citation rate: 7.00%
Annual publications: 661
Review time: 47 days
Journal description: The impact of computers has nowhere been more revolutionary than in electrical engineering. The design, analysis, and operation of electrical and electronic systems are now dominated by computers, a transformation that has been motivated by the natural ease of interface between computers and electrical systems, and the promise of spectacular improvements in speed and efficiency. Published since 1973, Computers & Electrical Engineering provides rapid publication of topical research into the integration of computer technology and computational techniques with electrical and electronic systems. The journal publishes papers featuring novel implementations of computers and computational techniques in areas like signal and image processing, high-performance computing, parallel processing, and communications. Special attention will be paid to papers describing innovative architectures, algorithms, and software tools.