{"title":"视觉问答的剩余自我注意","authors":"Daojian Zeng, Guanhong Zhou, Jin Wang","doi":"10.1109/ICECIE47765.2019.8974765","DOIUrl":null,"url":null,"abstract":"Over these years, many attention mechanism-based neural network models have been put forward in the visual question answering (VQA) by some researchers. However, the semantic gap between the multimodality cannot be bridged simply, and the combination of the questions and the images may be arbitrary, which both hinder the jointly representing learning and make the VQA model easy for overfitting. Therefore, in this paper, a multi-stage attention model has been put forward, that is, for the image, the bottom-up attention and residual self-attention for the image itself and the question-guided double-headed soft (top-down) attention method were used to extract the image features; for the question, the pre-trained GloVe word embeddings were used in this paper to represent the semantics and the GRU was used to encode the questions into fixed-length sentence embedding; finally, the image features were fused with the question embedding through the Hadamard product and then input into the sigmoid multi-classifier. The experiment showed that with the model the overall accuracy of 67.26% in the VQA 2.0 dataset was finally achieved, higher than that of other advanced models.","PeriodicalId":154051,"journal":{"name":"2019 1st International Conference on Electrical, Control and Instrumentation Engineering (ICECIE)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Residual Self-Attention for Visual Question Answering\",\"authors\":\"Daojian Zeng, Guanhong Zhou, Jin Wang\",\"doi\":\"10.1109/ICECIE47765.2019.8974765\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Over these years, many attention mechanism-based neural network models have been put forward in the visual question answering (VQA) by some researchers. However, the semantic gap between the multimodality cannot be bridged simply, and the combination of the questions and the images may be arbitrary, which both hinder the jointly representing learning and make the VQA model easy for overfitting. Therefore, in this paper, a multi-stage attention model has been put forward, that is, for the image, the bottom-up attention and residual self-attention for the image itself and the question-guided double-headed soft (top-down) attention method were used to extract the image features; for the question, the pre-trained GloVe word embeddings were used in this paper to represent the semantics and the GRU was used to encode the questions into fixed-length sentence embedding; finally, the image features were fused with the question embedding through the Hadamard product and then input into the sigmoid multi-classifier. 
The experiment showed that with the model the overall accuracy of 67.26% in the VQA 2.0 dataset was finally achieved, higher than that of other advanced models.\",\"PeriodicalId\":154051,\"journal\":{\"name\":\"2019 1st International Conference on Electrical, Control and Instrumentation Engineering (ICECIE)\",\"volume\":\"2 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 1st International Conference on Electrical, Control and Instrumentation Engineering (ICECIE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICECIE47765.2019.8974765\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 1st International Conference on Electrical, Control and Instrumentation Engineering (ICECIE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICECIE47765.2019.8974765","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Residual Self-Attention for Visual Question Answering
In recent years, many attention-based neural network models have been proposed for visual question answering (VQA). However, the semantic gap between the two modalities cannot be bridged trivially, and the pairing of questions with images can be largely arbitrary; both factors hinder joint representation learning and make VQA models prone to overfitting. This paper therefore proposes a multi-stage attention model. For the image, bottom-up attention and residual self-attention over the image itself, together with question-guided double-headed soft (top-down) attention, are used to extract image features. For the question, pre-trained GloVe word embeddings represent the word semantics and a GRU encodes the question into a fixed-length sentence embedding. Finally, the image features are fused with the question embedding through the Hadamard product and fed into a sigmoid multi-label classifier. Experiments show that the model achieves an overall accuracy of 67.26% on the VQA 2.0 dataset, higher than that of other advanced models.
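To make the pipeline in the abstract concrete, below is a minimal PyTorch sketch of the described architecture. It is an illustrative reconstruction, not the authors' implementation: the module names, feature dimensions, number of attention heads, and the exact attention formulations are all assumptions. Image features are taken as pre-extracted bottom-up region features, refined by a residual self-attention layer, pooled with question-guided (top-down) attention, fused with a GRU/GloVe question embedding via the Hadamard product, and scored with a sigmoid multi-label classifier.

```python
import torch
import torch.nn as nn


class ResidualSelfAttention(nn.Module):
    """Scaled dot-product self-attention over image regions with a residual connection."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, regions):                       # regions: (B, K, dim)
        attn = torch.softmax(
            self.q(regions) @ self.k(regions).transpose(1, 2) * self.scale, dim=-1)
        return regions + attn @ self.v(regions)        # residual connection


class VQAModel(nn.Module):
    def __init__(self, vocab_size, num_answers, img_dim=2048, q_dim=1024,
                 emb_dim=300, heads=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # would be initialized from GloVe
        self.gru = nn.GRU(emb_dim, q_dim, batch_first=True)
        self.self_attn = ResidualSelfAttention(img_dim)
        # question-guided "double-headed" top-down attention: two soft attention maps per image
        self.top_down = nn.Linear(img_dim + q_dim, heads)
        self.img_proj = nn.Linear(heads * img_dim, q_dim)
        self.classifier = nn.Linear(q_dim, num_answers)

    def forward(self, regions, question_tokens):
        # encode the question into a fixed-length sentence embedding
        _, h = self.gru(self.embed(question_tokens))     # h: (1, B, q_dim)
        q = h.squeeze(0)                                  # (B, q_dim)

        # refine bottom-up region features with residual self-attention
        v = self.self_attn(regions)                       # (B, K, img_dim)

        # question-guided soft attention over regions (two heads, concatenated)
        joint = torch.cat([v, q.unsqueeze(1).expand(-1, v.size(1), -1)], dim=-1)
        weights = torch.softmax(self.top_down(joint), dim=1)   # (B, K, heads)
        pooled = torch.einsum('bkh,bkd->bhd', weights, v).flatten(1)

        # Hadamard-product fusion and sigmoid multi-label scoring
        fused = self.img_proj(pooled) * q
        return torch.sigmoid(self.classifier(fused))


# usage sketch with random tensors standing in for region features and token ids
model = VQAModel(vocab_size=20000, num_answers=3129)
scores = model(torch.randn(4, 36, 2048), torch.randint(0, 20000, (4, 14)))
print(scores.shape)  # torch.Size([4, 3129])
```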