Cross-Attention Based Text-Image Transformer for Visual Question Answering

Mahdi Rezapour

Recent Advances in Computer Science and Communications (Q3, Computer Science) · DOI: 10.2174/0126662558291150240102111855 · Published 2024-01-30 · Citations: 0

Abstract

Visual question answering (VQA) is a challenging task that requires multimodal reasoning and knowledge. The objective of VQA is to answer natural language questions based on the information present in a given image. The core challenge is to extract visual and textual features and project them into a common space; this also involves detecting the objects present in an image and modeling the relationships between them.

In this study, we explored different feature fusion methods for VQA, using pretrained models to encode the text and image features and then applying different attention mechanisms to fuse them. We evaluated our methods on the DAQUAR dataset.

We used three metrics to measure the performance of our methods: WUPS, accuracy (Acc), and F1. We found that concatenating raw text and image features performs slightly better than self-attention for VQA. We also found that using text as query and image as key and value performs worse than other cross-attention or self-attention methods, possibly because it does not capture the bidirectional interactions between the text and image modalities.

In this paper, we presented a comparative study of different feature fusion methods for VQA, using pretrained models to encode the text and image features and then applying different attention mechanisms to fuse them. We showed that concatenating raw text and image features is a simple but effective method for VQA, while using text as query and image as key and value is suboptimal. We also discussed the limitations and future directions of our work.
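The following is a minimal sketch, not the authors' exact implementation, of two of the fusion strategies the abstract compares: concatenating pooled text and image features, and cross-attention with text as query and image as key/value. The hidden size, the answer-vocabulary size, and the mean-pooled classifier head are illustrative assumptions; the paper uses pretrained text and image encoders whose token features would stand in for the random tensors below.

```python
# Hedged sketch of two VQA feature-fusion strategies (assumptions noted inline).
import torch
import torch.nn as nn

HIDDEN = 768          # assumed shared hidden size after projecting encoder outputs
NUM_ANSWERS = 582     # assumed answer-vocabulary size for DAQUAR-style classification


class ConcatFusion(nn.Module):
    """Concatenate pooled text and image features, then classify."""
    def __init__(self):
        super().__init__()
        self.classifier = nn.Linear(2 * HIDDEN, NUM_ANSWERS)

    def forward(self, text_pooled, image_pooled):
        # text_pooled, image_pooled: (batch, HIDDEN)
        return self.classifier(torch.cat([text_pooled, image_pooled], dim=-1))


class CrossAttentionFusion(nn.Module):
    """Text tokens attend to image tokens (text = query, image = key/value)."""
    def __init__(self, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(HIDDEN, num_heads, batch_first=True)
        self.classifier = nn.Linear(HIDDEN, NUM_ANSWERS)

    def forward(self, text_tokens, image_tokens):
        # text_tokens: (batch, T, HIDDEN); image_tokens: (batch, I, HIDDEN)
        fused, _ = self.cross_attn(query=text_tokens, key=image_tokens, value=image_tokens)
        return self.classifier(fused.mean(dim=1))  # mean-pool the fused text tokens


# Toy usage with random features standing in for pretrained-encoder outputs.
text_tok, img_tok = torch.randn(2, 16, HIDDEN), torch.randn(2, 49, HIDDEN)
logits_concat = ConcatFusion()(text_tok.mean(1), img_tok.mean(1))
logits_cross = CrossAttentionFusion()(text_tok, img_tok)
print(logits_concat.shape, logits_cross.shape)  # both torch.Size([2, 582])
```

A self-attention variant would instead concatenate the text and image token sequences and pass them through a standard transformer encoder layer before pooling; the abstract reports that plain concatenation of raw features slightly outperforms that option on DAQUAR.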
Source journal: Recent Advances in Computer Science and Communications (Computer Science, all) · CiteScore: 2.50 · Self-citation rate: 0.00% · Articles per year: 142