{"title":"基于CAM和GCN的可视化问答模型","authors":"Ping Wen, Matthew Li, Zhang Zhen, Wang Ze","doi":"10.1145/3573942.3574090","DOIUrl":null,"url":null,"abstract":"Visual Question Answering (VQA) is a challenging problem that needs to combine concepts from computer vision and natural language processing. In recent years, researchers have proposed many methods for this typical multimodal problem. Most existing methods use a two-stream strategy, i.e., compute image and question features separately and fuse them using various techniques, rarely relying on higher-level image representations, to capture semantic and spatial relationships. Based on the above problems, a visual question answering model (CAM-GCN) based on Cooperative Attention Mechanism (CAM) and Graph Convolutional Network (GCN) is proposed. First, the graph learning module and the concept of graph convolution are combined to learn the problem-specific graph representation of the input image and capture the interactive image representation of the specific problem. Image region dependence, and finally, continue to optimize the fused features through feature enhancement. The test results on the VQA v2 dataset show that the CAM-GCN model achieves better classification results than the current representative models.","PeriodicalId":103293,"journal":{"name":"Proceedings of the 2022 5th International Conference on Artificial Intelligence and Pattern Recognition","volume":"12 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Visual Question Answering Model Based on CAM and GCN\",\"authors\":\"Ping Wen, Matthew Li, Zhang Zhen, Wang Ze\",\"doi\":\"10.1145/3573942.3574090\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Visual Question Answering (VQA) is a challenging problem that needs to combine concepts from computer vision and natural language processing. In recent years, researchers have proposed many methods for this typical multimodal problem. Most existing methods use a two-stream strategy, i.e., compute image and question features separately and fuse them using various techniques, rarely relying on higher-level image representations, to capture semantic and spatial relationships. Based on the above problems, a visual question answering model (CAM-GCN) based on Cooperative Attention Mechanism (CAM) and Graph Convolutional Network (GCN) is proposed. First, the graph learning module and the concept of graph convolution are combined to learn the problem-specific graph representation of the input image and capture the interactive image representation of the specific problem. Image region dependence, and finally, continue to optimize the fused features through feature enhancement. 
The test results on the VQA v2 dataset show that the CAM-GCN model achieves better classification results than the current representative models.\",\"PeriodicalId\":103293,\"journal\":{\"name\":\"Proceedings of the 2022 5th International Conference on Artificial Intelligence and Pattern Recognition\",\"volume\":\"12 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-09-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2022 5th International Conference on Artificial Intelligence and Pattern Recognition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3573942.3574090\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2022 5th International Conference on Artificial Intelligence and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3573942.3574090","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Visual Question Answering Model Based on CAM and GCN
Visual Question Answering (VQA) is a challenging problem that combines concepts from computer vision and natural language processing. In recent years, researchers have proposed many methods for this quintessentially multimodal task. Most existing methods follow a two-stream strategy: image and question features are computed separately and then fused with various techniques, rarely relying on higher-level image representations that capture semantic and spatial relationships. To address these problems, a visual question answering model (CAM-GCN) based on a Cooperative Attention Mechanism (CAM) and a Graph Convolutional Network (GCN) is proposed. First, a graph learning module is combined with graph convolution to learn a question-specific graph representation of the input image, capturing an interactive image representation that reflects question-specific dependencies among image regions. Finally, the fused features are further optimized through feature enhancement. Test results on the VQA v2 dataset show that CAM-GCN achieves better classification results than current representative models.
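To make the question-conditioned graph idea concrete, the sketch below shows one plausible way to learn a question-specific adjacency over detected image regions and apply a single graph-convolution step. It is a minimal illustration of the general technique the abstract describes, not the authors' CAM-GCN implementation: the module names, feature dimensions, and the affinity-based adjacency construction are all assumptions.

```python
# Illustrative sketch only: a question-conditioned graph learner followed by one
# GCN propagation step over image-region features. Dimensions and adjacency
# construction are assumptions, not the CAM-GCN authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QuestionConditionedGCN(nn.Module):
    def __init__(self, region_dim=2048, question_dim=1024, hidden_dim=512):
        super().__init__()
        # Project region and question features into a shared space for graph learning.
        self.region_proj = nn.Linear(region_dim, hidden_dim)
        self.question_proj = nn.Linear(question_dim, hidden_dim)
        # Single graph-convolution weight matrix (one layer for brevity).
        self.gcn_weight = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, regions, question):
        # regions:  (batch, num_regions, region_dim), e.g. object-detector features
        # question: (batch, question_dim), e.g. an RNN sentence encoding
        r = self.region_proj(regions)                       # (B, N, H)
        q = self.question_proj(question).unsqueeze(1)       # (B, 1, H)

        # Question-conditioned node features: modulate each region by the question.
        nodes = r * q                                       # (B, N, H)

        # Learn a dense, question-specific adjacency from pairwise node affinities,
        # row-normalized so each node aggregates a convex combination of neighbors.
        affinity = torch.bmm(nodes, nodes.transpose(1, 2))  # (B, N, N)
        adjacency = F.softmax(affinity, dim=-1)

        # One GCN step: aggregate neighbors, transform, apply non-linearity.
        aggregated = torch.bmm(adjacency, nodes)            # (B, N, H)
        out = F.relu(self.gcn_weight(aggregated))           # (B, N, H)
        return out, adjacency


if __name__ == "__main__":
    model = QuestionConditionedGCN()
    regions = torch.randn(2, 36, 2048)   # 36 detected regions per image (assumed)
    question = torch.randn(2, 1024)
    out, adj = model(regions, question)
    print(out.shape, adj.shape)          # (2, 36, 512), (2, 36, 36)
```

In a full VQA pipeline of the kind the abstract outlines, the graph-convolved region features would then be fused with the question representation (e.g. via co-attention) and passed through a feature-enhancement and classification head; those stages are omitted here.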