Cross-Attention Based Text-Image Transformer for Visual Question Answering

Q3 Computer Science
Mahdi Rezapour
{"title":"基于交叉注意力的可视化问答文本图像转换器","authors":"Mahdi Rezapour","doi":"10.2174/0126662558291150240102111855","DOIUrl":null,"url":null,"abstract":"\n\nVisual question answering (VQA) is a challenging task that requires\nmultimodal reasoning and knowledge. The objective of VQA is to answer natural language\nquestions based on corresponding present information in a given image. The challenge of VQA\nis to extract visual and textual features and pass them into a common space. However, the\nmethod faces the challenge of object detection being present in an image and finding the relationship between objects.\n\n\n\nIn this study, we explored different methods of feature fusion for VQA, using pretrained models to encode the text and image features and then applying different attention\nmechanisms to fuse them. We evaluated our methods on the DAQUAR dataset.\n\n\n\nWe used three metrics to measure the performance of our methods: WUPS, Acc, and\nF1. We found that concatenating raw text and image features performs slightly better than selfattention for VQA. We also found that using text as query and image as key and value performs worse than other methods of cross-attention or self-attention for VQA because it might\nnot capture the bidirectional interactions between the text and image modalities\n\n\n\nIn this paper, we presented a comparative study of different feature fusion methods for VQA, using pre-trained models to encode the text and image features and then applying\ndifferent attention mechanisms to fuse them. We showed that concatenating raw text and image\nfeatures is a simple but effective method for VQA while using text as query and image as key\nand value is a suboptimal method for VQA. We also discussed the limitations and future directions of our work.\n","PeriodicalId":36514,"journal":{"name":"Recent Advances in Computer Science and Communications","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Cross-Attention Based Text-Image Transformer for Visual Question\\nAnswering\",\"authors\":\"Mahdi Rezapour\",\"doi\":\"10.2174/0126662558291150240102111855\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"\\n\\nVisual question answering (VQA) is a challenging task that requires\\nmultimodal reasoning and knowledge. The objective of VQA is to answer natural language\\nquestions based on corresponding present information in a given image. The challenge of VQA\\nis to extract visual and textual features and pass them into a common space. However, the\\nmethod faces the challenge of object detection being present in an image and finding the relationship between objects.\\n\\n\\n\\nIn this study, we explored different methods of feature fusion for VQA, using pretrained models to encode the text and image features and then applying different attention\\nmechanisms to fuse them. We evaluated our methods on the DAQUAR dataset.\\n\\n\\n\\nWe used three metrics to measure the performance of our methods: WUPS, Acc, and\\nF1. We found that concatenating raw text and image features performs slightly better than selfattention for VQA. 
We also found that using text as query and image as key and value performs worse than other methods of cross-attention or self-attention for VQA because it might\\nnot capture the bidirectional interactions between the text and image modalities\\n\\n\\n\\nIn this paper, we presented a comparative study of different feature fusion methods for VQA, using pre-trained models to encode the text and image features and then applying\\ndifferent attention mechanisms to fuse them. We showed that concatenating raw text and image\\nfeatures is a simple but effective method for VQA while using text as query and image as key\\nand value is a suboptimal method for VQA. We also discussed the limitations and future directions of our work.\\n\",\"PeriodicalId\":36514,\"journal\":{\"name\":\"Recent Advances in Computer Science and Communications\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-01-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Recent Advances in Computer Science and Communications\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.2174/0126662558291150240102111855\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"Computer Science\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Recent Advances in Computer Science and Communications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2174/0126662558291150240102111855","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"Computer Science","Score":null,"Total":0}
Citations: 0

Abstract

Visual question answering (VQA) is a challenging task that requires multimodal reasoning and knowledge. The objective of VQA is to answer natural language questions based on the information present in a given image. The central challenge is to extract visual and textual features and project them into a common representation space, which in turn requires detecting the objects present in the image and capturing the relationships between them.

In this study, we explored different feature fusion methods for VQA, using pretrained models to encode the text and image features and then applying different attention mechanisms to fuse them. We evaluated our methods on the DAQUAR dataset.

We used three metrics to measure performance: WUPS, accuracy (Acc), and F1. We found that concatenating raw text and image features performs slightly better than self-attention for VQA. We also found that using text as the query and the image as the key and value performs worse than the other cross-attention and self-attention variants, possibly because it does not capture the bidirectional interactions between the text and image modalities.

In this paper, we presented a comparative study of different feature fusion methods for VQA, using pretrained models to encode the text and image features and then applying different attention mechanisms to fuse them. We showed that concatenating raw text and image features is a simple but effective method for VQA, while using text as the query and the image as the key and value is suboptimal. We also discussed the limitations and future directions of our work.
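To make the compared fusion strategies concrete, the sketch below illustrates the three configurations described in the abstract in PyTorch: concatenating pooled text and image features, self-attention over the joined token sequence, and cross-attention with the question text as the query and the image patches as the key and value. The hidden size, answer-vocabulary size, pooling choice, and module names are assumptions for illustration only and do not reproduce the authors' exact architecture.

```python
import torch
import torch.nn as nn

DIM = 768           # shared hidden size after projecting both encoders' outputs (assumed)
NUM_ANSWERS = 1000  # placeholder answer-vocabulary size, not the actual DAQUAR label count

class ConcatFusion(nn.Module):
    """Mean-pool each modality, concatenate, and classify (the 'raw concatenation' baseline)."""
    def __init__(self, dim=DIM, num_answers=NUM_ANSWERS):
        super().__init__()
        self.classifier = nn.Linear(2 * dim, num_answers)

    def forward(self, text_feats, image_feats):
        # text_feats: (B, T, dim) from a pretrained text encoder such as BERT
        # image_feats: (B, P, dim) from a pretrained image encoder such as ViT
        pooled = torch.cat([text_feats.mean(dim=1), image_feats.mean(dim=1)], dim=-1)
        return self.classifier(pooled)

class SelfAttentionFusion(nn.Module):
    """Self-attention over the concatenated sequence of text and image tokens."""
    def __init__(self, dim=DIM, num_answers=NUM_ANSWERS, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, text_feats, image_feats):
        tokens = torch.cat([text_feats, image_feats], dim=1)  # (B, T + P, dim)
        fused, _ = self.attn(tokens, tokens, tokens)          # every token attends to every token
        return self.classifier(fused.mean(dim=1))

class TextQueryCrossAttention(nn.Module):
    """Cross-attention with text as query and image as key/value (one-directional)."""
    def __init__(self, dim=DIM, num_answers=NUM_ANSWERS, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, text_feats, image_feats):
        fused, _ = self.attn(text_feats, image_feats, image_feats)  # text attends to image only
        return self.classifier(fused.mean(dim=1))

if __name__ == "__main__":
    text = torch.randn(4, 20, DIM)    # a batch of 4 questions, 20 tokens each
    image = torch.randn(4, 197, DIM)  # a batch of 4 images, 197 ViT patch tokens each
    for model in (ConcatFusion(), SelfAttentionFusion(), TextQueryCrossAttention()):
        print(model(text, image).shape)  # torch.Size([4, 1000]) for each fusion variant
```

The concatenation variant fuses only globally pooled vectors, whereas the text-as-query cross-attention variant lets text tokens attend to image tokens but not the reverse; this one-directional flow is consistent with the abstract's explanation of why that configuration underperforms.

The abstract also reports WUPS alongside accuracy and F1. As background, the following is a minimal sketch of the WUPS measure commonly used with DAQUAR, based on the thresholded Wu-Palmer similarity from WordNet via NLTK; the 0.9 threshold and the exact-match fallback are the usual conventions, and the paper's own implementation details are not specified here.

```python
# Minimal sketch of the WUPS score (Wu-Palmer similarity based measure) often
# reported on DAQUAR. Requires NLTK with the WordNet corpus:
#   pip install nltk && python -m nltk.downloader wordnet
from nltk.corpus import wordnet as wn

def soft_match(a, b, threshold=0.9):
    """Thresholded Wu-Palmer similarity between two answer words."""
    syns_a, syns_b = wn.synsets(a), wn.synsets(b)
    if not syns_a or not syns_b:
        return float(a == b)  # fall back to exact match for out-of-vocabulary words
    best = max((sa.wup_similarity(sb) or 0.0) for sa in syns_a for sb in syns_b)
    return best if best >= threshold else 0.1 * best  # down-weight weak matches

def wups(pred, truth, threshold=0.9):
    """Symmetric WUPS between a predicted and a ground-truth answer word list."""
    def one_sided(xs, ys):
        score = 1.0
        for x in xs:
            score *= max(soft_match(x, y, threshold) for y in ys)
        return score
    return min(one_sided(pred, truth), one_sided(truth, pred))

# Example: wups(["table"], ["desk"]) gives partial credit for a near-synonym,
# while wups(["table"], ["table"]) returns 1.0.
```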
Source journal: Recent Advances in Computer Science and Communications (Computer Science, all)
CiteScore: 2.50
Self-citation rate: 0.00%
Articles published: 142