TRANS-VQA: Fully Transformer-Based Image Question-Answering Model Using Question-guided Vision Attention

Dipali Koshti, Ashutosh Gupta, M. Kalla, Arvind Sharma
{"title":"TRANS-VQA: Fully Transformer-Based Image Question-Answering Model Using Question-guided Vision Attention","authors":"Dipali Koshti, Ashutosh Gupta, M. Kalla, Arvind Sharma","doi":"10.4114/intartif.vol27iss73pp111-128","DOIUrl":null,"url":null,"abstract":"Understanding multiple modalities and relating them is an easy task for humans. But for machines, this is a stimulating task. One such multi-modal reasoning task is Visual question answering which demands the machine to produce an answer for the natural language query asked based on the given image. Although plenty of work is done in this field, there is still a challenge of improving the answer prediction ability of the model and breaching human accuracy. A novel model for answering image-based questions based on a transformer has been proposed. The proposed model is a fully Transformer-based architecture that utilizes the power of a transformer for extracting language features as well as for performing joint understanding of question and image features. The proposed VQA model utilizes F-RCNN for image feature extraction. The retrieved language features and object-level image features are fed to a decoder inspired by the Bi-Directional Encoder Representation Transformer - BERT architecture that learns jointly the image characteristics directed by the question characteristics and rich representations of the image features are obtained. Extensive experimentation has been carried out to observe the effect of various hyperparameters on the performance of the model. The experimental results demonstrate that the model’s ability to predict the answer increases with the increase in the number of layers in the transformer’s encoder and decoder. The proposed model improves upon the previous models and is highly scalable due to the introduction of the BERT. Our best model reports 72.31% accuracy on the test-standard split of the VQAv2 dataset.","PeriodicalId":176050,"journal":{"name":"Inteligencia Artif.","volume":"21 1","pages":"111-128"},"PeriodicalIF":0.0000,"publicationDate":"2024-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Inteligencia Artif.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.4114/intartif.vol27iss73pp111-128","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Understanding multiple modalities and relating them is an easy task for humans, but it remains challenging for machines. One such multi-modal reasoning task is Visual Question Answering (VQA), which requires a machine to produce an answer to a natural language question about a given image. Although plenty of work has been done in this field, improving the answer-prediction ability of models and surpassing human accuracy remain open challenges. A novel Transformer-based model for answering image-based questions is proposed. The proposed model is a fully Transformer-based architecture that uses a Transformer both to extract language features and to perform joint understanding of the question and image features. The proposed VQA model uses Faster R-CNN for image feature extraction. The extracted language features and object-level image features are fed to a decoder, inspired by the Bidirectional Encoder Representations from Transformers (BERT) architecture, that learns the image characteristics guided by the question characteristics, yielding rich representations of the image features. Extensive experiments have been carried out to observe the effect of various hyperparameters on the performance of the model. The experimental results demonstrate that the model's ability to predict the answer increases with the number of layers in the Transformer's encoder and decoder. The proposed model improves upon previous models and is highly scalable due to the introduction of BERT. Our best model reports 72.31% accuracy on the test-standard split of the VQAv2 dataset.
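To make the described pipeline concrete, the sketch below shows one plausible way to wire a question encoder, Faster R-CNN region features, and a BERT-style decoder with question-guided cross-attention in PyTorch. It is a minimal illustration, not the authors' implementation: the 2048-d region features, the 3129-way answer classifier, the layer counts, and the specific fusion direction (question tokens attending over image regions) are all assumptions made for the example.

```python
# Hedged sketch of a TRANS-VQA-style fusion module (not the authors' exact code).
# Assumptions not stated in the abstract: 2048-d Faster R-CNN region features,
# a BERT-style question encoder, and a classifier over 3129 candidate answers,
# as is common in VQAv2 pipelines.
import torch
import torch.nn as nn


class QuestionGuidedVQA(nn.Module):
    def __init__(self, vocab_size=30522, d_model=768, n_heads=8,
                 n_layers=6, region_dim=2048, num_answers=3129):
        super().__init__()
        # Question side: token embedding + Transformer encoder (BERT-style).
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.question_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # Image side: project Faster R-CNN region features to the model width.
        self.region_proj = nn.Linear(region_dim, d_model)
        # Fusion: decoder layers whose cross-attention is guided by the question.
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.fusion_decoder = nn.TransformerDecoder(dec_layer, n_layers)
        # Answer head: pool the fused tokens and classify over the answer set.
        self.classifier = nn.Linear(d_model, num_answers)

    def forward(self, question_ids, region_feats):
        # question_ids: (B, L) token ids; region_feats: (B, R, 2048) regions.
        q = self.question_encoder(self.tok_embed(question_ids))  # (B, L, d)
        v = self.region_proj(region_feats)                       # (B, R, d)
        # Question tokens act as queries; cross-attention reads image regions.
        fused = self.fusion_decoder(tgt=q, memory=v)             # (B, L, d)
        return self.classifier(fused.mean(dim=1))                # (B, num_answers)


if __name__ == "__main__":
    model = QuestionGuidedVQA()
    ids = torch.randint(0, 30522, (2, 14))   # dummy question tokens
    regions = torch.randn(2, 36, 2048)       # 36 region features per image
    print(model(ids, regions).shape)         # torch.Size([2, 3129])
```

Stacking more encoder and decoder layers (the `n_layers` argument here) is the knob the abstract reports as improving answer prediction; the other hyperparameters are placeholders chosen only to make the sketch runnable.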