{"title":"Cross-Attention Based Text-Image Transformer for Visual Question\nAnswering","authors":"Mahdi Rezapour","doi":"10.2174/0126662558291150240102111855","DOIUrl":null,"url":null,"abstract":"\n\nVisual question answering (VQA) is a challenging task that requires\nmultimodal reasoning and knowledge. The objective of VQA is to answer natural language\nquestions based on corresponding present information in a given image. The challenge of VQA\nis to extract visual and textual features and pass them into a common space. However, the\nmethod faces the challenge of object detection being present in an image and finding the relationship between objects.\n\n\n\nIn this study, we explored different methods of feature fusion for VQA, using pretrained models to encode the text and image features and then applying different attention\nmechanisms to fuse them. We evaluated our methods on the DAQUAR dataset.\n\n\n\nWe used three metrics to measure the performance of our methods: WUPS, Acc, and\nF1. We found that concatenating raw text and image features performs slightly better than selfattention for VQA. We also found that using text as query and image as key and value performs worse than other methods of cross-attention or self-attention for VQA because it might\nnot capture the bidirectional interactions between the text and image modalities\n\n\n\nIn this paper, we presented a comparative study of different feature fusion methods for VQA, using pre-trained models to encode the text and image features and then applying\ndifferent attention mechanisms to fuse them. We showed that concatenating raw text and image\nfeatures is a simple but effective method for VQA while using text as query and image as key\nand value is a suboptimal method for VQA. We also discussed the limitations and future directions of our work.\n","PeriodicalId":36514,"journal":{"name":"Recent Advances in Computer Science and Communications","volume":"53 3","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Recent Advances in Computer Science and Communications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2174/0126662558291150240102111855","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"Computer Science","Score":null,"Total":0}
Abstract
Visual question answering (VQA) is a challenging task that requires multimodal reasoning and knowledge. The objective of VQA is to answer natural language questions based on the information present in a given image. The central challenge of VQA is to extract visual and textual features and project them into a common space. In addition, the task requires detecting the objects present in an image and modeling the relationships between them.
In this study, we explored different feature fusion methods for VQA, using pre-trained models to encode the text and image features and then applying different attention mechanisms to fuse them. We evaluated our methods on the DAQUAR dataset.
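The abstract does not give implementation details; the following is a minimal, hypothetical PyTorch sketch of the concatenation-based fusion baseline, assuming pooled outputs from pre-trained text and image encoders and an illustrative answer-vocabulary size (not the authors' actual architecture).

```python
import torch
import torch.nn as nn

class ConcatFusionVQA(nn.Module):
    """Hypothetical sketch: fuse pooled text and image features by concatenation,
    then classify over a fixed answer vocabulary, as in classification-style VQA."""

    def __init__(self, text_dim=768, image_dim=768, hidden_dim=512, num_answers=500):
        super().__init__()
        # num_answers is illustrative; it depends on the dataset's answer vocabulary.
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, text_feat, image_feat):
        # text_feat:  (batch, text_dim)  pooled output of a pre-trained text encoder
        # image_feat: (batch, image_dim) pooled output of a pre-trained image encoder
        fused = torch.cat([text_feat, image_feat], dim=-1)  # simple concatenation fusion
        return self.classifier(fused)  # logits over candidate answers
```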
We used three metrics to measure the performance of our methods: WUPS, accuracy (Acc), and F1. We found that concatenating raw text and image features performs slightly better than self-attention for VQA. We also found that using text as the query and the image as key and value performs worse than other cross-attention or self-attention methods, likely because it does not capture the bidirectional interactions between the text and image modalities.
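As an illustration of the cross-attention variant discussed above, below is a hypothetical sketch in which question tokens act as queries and image patch features act as keys and values; the dimensions, pooling, and module names are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TextQueryCrossAttention(nn.Module):
    """Hypothetical sketch of the 'text as query, image as key/value' variant:
    question tokens attend over image patch features before pooling."""

    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=dim, num_heads=num_heads, batch_first=True
        )

    def forward(self, text_tokens, image_patches):
        # text_tokens:   (batch, num_tokens, dim)  from a pre-trained text encoder
        # image_patches: (batch, num_patches, dim) from a pre-trained image encoder
        attended, _ = self.cross_attn(
            query=text_tokens, key=image_patches, value=image_patches
        )
        # Pool the attended token sequence into a single fused vector per example.
        return attended.mean(dim=1)
```

Note that this block only lets text attend to the image; a symmetric block with image features as queries (or attention in both directions) would be needed to model the bidirectional interactions whose absence the abstract identifies as the likely weakness of this variant.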
In this paper, we presented a comparative study of different feature fusion methods for VQA, using pre-trained models to encode the text and image features and then applying different attention mechanisms to fuse them. We showed that concatenating raw text and image features is a simple but effective method for VQA, while using text as the query and the image as key and value is suboptimal. We also discussed the limitations and future directions of our work.