Question-Driven Multiple Attention (QDMA) Model for Visual Question Answer
Jinmeng Wu, Lei Ma, Fulin Ge, Y. Hao, Pengcheng Shu
2022 International Conference on Artificial Intelligence and Computer Information Technology (AICIT), 2022-09-16. DOI: 10.1109/AICIT55386.2022.9930294
Abstract
Visual Question Answering (VQA) is a typical multimodal problem at the intersection of computer vision and natural language processing: given an image and an open-ended question about it, the system must answer the question accurately. Existing VQA models inevitably introduce redundant and inaccurate visual information when modeling the rich interactions between complex image targets and question text, and they fail to focus effectively on the relevant targets in the scene. To address this problem, the Question-Driven Multiple Attention (QDMA) model is proposed. First, Faster R-CNN and an LSTM are used to extract the visual features of the image and the textual features of the question, respectively. A question-driven attention network is then designed to locate the question-relevant regions of interest in the image, so that the model can accurately attend to the relevant targets in complex scenes. To establish dense interaction between the image regions of interest and the question words, a co-attention network consisting of self-attention and guided-attention units is introduced. Finally, the question features and image features are fed into an answer prediction module consisting of a two-layer Multi-Layer Perceptron to obtain the answer. The proposed method is empirically compared with other methods on the VQA 2.0 dataset. The results show that the model outperforms the other methods, demonstrating the effectiveness of the framework.
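The abstract outlines a pipeline of question-driven region attention, intra-modal self-attention, question-guided attention, and a two-layer MLP classifier, but gives no implementation details. The sketch below is a minimal PyTorch approximation of that pipeline under stated assumptions: the hidden size (512), the Faster R-CNN feature dimension (2048), the word-embedding vocabulary, the 3129-way answer vocabulary, and all module names (`QDMASketch`, `GuidedAttention`, `region_score`) are illustrative choices, not the authors' published code.

```python
# Hedged sketch of the QDMA pipeline described in the abstract.
# Dimensions, vocabulary sizes, and module names are assumptions.
import torch
import torch.nn as nn


class GuidedAttention(nn.Module):
    """Question-guided attention over image regions (guided-attentive unit)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, regions, question):
        # Regions attend to question words: queries come from image regions,
        # keys/values come from the encoded question.
        out, _ = self.attn(query=regions, key=question, value=question)
        return out


class QDMASketch(nn.Module):
    def __init__(self, dim: int = 512, num_answers: int = 3129):
        super().__init__()
        # Question encoder: word embeddings -> LSTM. Visual features are
        # assumed to be precomputed Faster R-CNN region features (2048-d).
        self.embed = nn.Embedding(20000, 300)            # vocab size assumed
        self.lstm = nn.LSTM(300, dim, batch_first=True)
        self.proj_v = nn.Linear(2048, dim)

        # Question-driven attention: score each region by its relevance to
        # the pooled question vector, keeping question-relevant regions.
        self.region_score = nn.Linear(2 * dim, 1)

        # Co-attention: self-attention within each modality plus
        # question-guided attention over the re-weighted regions.
        self.self_attn_q = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.self_attn_v = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.guided = GuidedAttention(dim)

        # Answer prediction: two-layer MLP over the fused representation.
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, 2 * dim), nn.ReLU(),
            nn.Linear(2 * dim, num_answers),
        )

    def forward(self, question_ids, region_feats):
        q, _ = self.lstm(self.embed(question_ids))        # (B, T, dim)
        v = self.proj_v(region_feats)                     # (B, R, dim)

        # Question-driven attention over regions.
        q_pool = q.mean(dim=1, keepdim=True)              # (B, 1, dim)
        scores = self.region_score(
            torch.cat([v, q_pool.expand_as(v)], dim=-1)
        ).softmax(dim=1)                                  # (B, R, 1)
        v = v * scores                                    # re-weighted regions

        # Co-attention: intra-modal self-attention, then guided attention.
        q, _ = self.self_attn_q(q, q, q)
        v, _ = self.self_attn_v(v, v, v)
        v = self.guided(v, q)

        # Fuse pooled question and image features, then classify.
        fused = torch.cat([q.mean(dim=1), v.mean(dim=1)], dim=-1)
        return self.mlp(fused)
```

A forward pass takes tokenized question indices of shape (B, T) and region features of shape (B, R, 2048) and returns logits over the assumed answer vocabulary; the paper's actual attention formulations and training setup may differ.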