{"title":"图像情感信息增强的多模态视觉问答模型","authors":"Jin Cai, Guoyong Cai","doi":"10.1109/ICNLP58431.2023.00056","DOIUrl":null,"url":null,"abstract":"Visual Question Answering is a multimedia understanding task that gives an image and natural language questions related to its content and allows the computer to answer them correctly. The early visual question answering models often ignore the emotional information in the image, resulting in insufficient performance in answering emotional-related questions; on the other hand, the existing visual question answering models that integrate emotional information do not make full use of the key areas of the image and text keywords, and do not understand fine-grained questions deeply enough, resulting in low accuracy. In order to fully integrate image emotional information into the visual question answering model and enhance the ability of the model to answer questions, a multimodal visual question answering model (IEMVQA) enhanced by image emotional information is proposed, and experiments are carried out on the visual question answering benchmark dataset. 
The final experiment shows that the IEMVQA model performs better than other comparison methods in comprehensive indicators, and verifies the effectiveness of using emotional information to assist visual question answering model.","PeriodicalId":53637,"journal":{"name":"Icon","volume":"24 1","pages":"268-273"},"PeriodicalIF":0.0000,"publicationDate":"2023-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multimodal Visual Question Answering Model Enhanced with Image Emotional Information\",\"authors\":\"Jin Cai, Guoyong Cai\",\"doi\":\"10.1109/ICNLP58431.2023.00056\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Visual Question Answering is a multimedia understanding task that gives an image and natural language questions related to its content and allows the computer to answer them correctly. The early visual question answering models often ignore the emotional information in the image, resulting in insufficient performance in answering emotional-related questions; on the other hand, the existing visual question answering models that integrate emotional information do not make full use of the key areas of the image and text keywords, and do not understand fine-grained questions deeply enough, resulting in low accuracy. In order to fully integrate image emotional information into the visual question answering model and enhance the ability of the model to answer questions, a multimodal visual question answering model (IEMVQA) enhanced by image emotional information is proposed, and experiments are carried out on the visual question answering benchmark dataset. 
The final experiment shows that the IEMVQA model performs better than other comparison methods in comprehensive indicators, and verifies the effectiveness of using emotional information to assist visual question answering model.\",\"PeriodicalId\":53637,\"journal\":{\"name\":\"Icon\",\"volume\":\"24 1\",\"pages\":\"268-273\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-03-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Icon\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICNLP58431.2023.00056\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"Arts and Humanities\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Icon","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICNLP58431.2023.00056","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"Arts and Humanities","Score":null,"Total":0}
Multimodal Visual Question Answering Model Enhanced with Image Emotional Information
Visual Question Answering (VQA) is a multimedia understanding task in which a computer is given an image and natural-language questions about its content and must answer them correctly. Early VQA models often ignore the emotional information in an image, so they perform poorly on emotion-related questions; on the other hand, existing VQA models that do integrate emotional information make insufficient use of key image regions and question keywords, and do not understand fine-grained questions deeply enough, resulting in low accuracy. To fully integrate image emotional information into VQA and strengthen the model's ability to answer questions, a multimodal visual question answering model enhanced with image emotional information (IEMVQA) is proposed, and experiments are conducted on a VQA benchmark dataset. The experiments show that IEMVQA outperforms the comparison methods on comprehensive metrics, verifying the effectiveness of using emotional information to assist a visual question answering model.
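The abstract does not specify IEMVQA's architecture, but the core idea — injecting an image emotion feature alongside visual-region and question features before answer prediction — can be illustrated with a minimal late-fusion sketch. All dimensions, names, and the linear scorer below are hypothetical, chosen only to show the fusion pattern, not the paper's actual method.

```python
import random

random.seed(0)

# Hypothetical feature sizes -- illustrative only, not from the paper.
D_IMG, D_EMO, D_TXT, N_ANSWERS = 8, 4, 6, 5

def fuse_and_classify(img_feat, emo_feat, q_feat, weights, bias):
    """Late fusion by concatenation, then a linear answer scorer.

    img_feat : features of salient image regions
    emo_feat : image emotional features (the extra signal an
               emotion-enhanced VQA model would inject)
    q_feat   : question text features
    """
    fused = img_feat + emo_feat + q_feat  # list concatenation = feature concat
    scores = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    # Return the index of the highest-scoring candidate answer.
    return max(range(N_ANSWERS), key=scores.__getitem__)

# Toy inputs and a randomly initialized classifier.
img = [random.gauss(0, 1) for _ in range(D_IMG)]
emo = [random.gauss(0, 1) for _ in range(D_EMO)]
txt = [random.gauss(0, 1) for _ in range(D_TXT)]
W = [[random.gauss(0, 1) for _ in range(D_IMG + D_EMO + D_TXT)]
     for _ in range(N_ANSWERS)]
b = [0.0] * N_ANSWERS

pred = fuse_and_classify(img, emo, txt, W, b)
```

Omitting `emo_feat` from the concatenation recovers a plain VQA baseline; the abstract's claim is that adding this emotional channel improves answers to emotion-related questions.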