Image-Text Integration Using a Multimodal Fusion Network Module for Movie Genre Classification

Leodécio Braz, Vinicius Teixeira, H. Pedrini, Z. Dias

11th International Conference of Pattern Recognition Systems (ICPRS 2021). DOI: 10.1049/icp.2021.1456
Multimodal models have received increasing attention from researchers because they exploit the complementarity of data to obtain better inferences from a dataset. Such models have been applied to several deep learning tasks, such as emotion recognition, video classification, and audio-visual speech enhancement. In this paper, we propose a multimodal method with two branches, one for text classification and another for image classification. In the image classification branch, we use the Class Activation Mapping (CAM) method as an attention module to identify relevant regions of the images. To validate our method, we used the MM-IMDB dataset, which consists of 25,959 movies with their respective plot outlines, posters, and genres. Our results showed that our method averaged 0.6749 in F1-Weight, 0.6734 in F1-Samples, 0.6750 in F1-Micro, and 0.6159 in F1-Macro, outperforming the state of the art on the F1-Weight and F1-Macro metrics and achieving the second-best result on the F1-Samples and F1-Micro metrics.
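As a rough illustration of the architecture the abstract describes, the sketch below wires a two-branch model with CAM-style attention pooling on the image side and late fusion by concatenation. Everything concrete here is an assumption: the ResNet-50 backbone, the mean-pooled-embedding text encoder, the layer sizes, and the 23-genre output (MM-IMDB's usual label count) are stand-ins, since the abstract does not specify the paper's exact components.

```python
# A minimal sketch of the two-branch image-text fusion described in the
# abstract. Backbone, text encoder, and sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models


class CAMAttentionPool(nn.Module):
    """Pools a CNN feature map with class-activation-style weights.

    A 1x1 convolution produces a per-location relevance map (in the spirit
    of CAM); the feature map is averaged with those weights so that the
    most relevant regions dominate the pooled descriptor.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.scorer = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) -> spatial attention weights (B, 1, H, W)
        attn = torch.softmax(self.scorer(feats).flatten(2), dim=-1)
        attn = attn.view(feats.size(0), 1, feats.size(2), feats.size(3))
        return (feats * attn).sum(dim=(2, 3))  # pooled descriptor (B, C)


class TwoBranchGenreClassifier(nn.Module):
    def __init__(self, vocab_size: int, num_genres: int = 23):
        super().__init__()
        # Image branch: CNN backbone truncated before global pooling.
        backbone = models.resnet50(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        self.cam_pool = CAMAttentionPool(2048)
        # Text branch: mean-pooled word embeddings (placeholder encoder).
        self.embed = nn.EmbeddingBag(vocab_size, 300, mode="mean")
        # Late fusion by concatenation, then a multi-label head.
        self.head = nn.Linear(2048 + 300, num_genres)

    def forward(self, poster, plot_tokens, plot_offsets):
        img_feat = self.cam_pool(self.cnn(poster))
        txt_feat = self.embed(plot_tokens, plot_offsets)
        # Sigmoid outputs: each movie may belong to several genres at once.
        return torch.sigmoid(self.head(torch.cat([img_feat, txt_feat], dim=1)))
```

The four reported metrics correspond to scikit-learn's averaging modes for multi-label F1 ("weighted", "samples", "micro", "macro"); a minimal check with placeholder multi-hot labels:

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[1, 0, 1], [0, 1, 1]])  # ground-truth genre labels
y_pred = np.array([[1, 0, 0], [0, 1, 1]])  # thresholded predictions

for avg in ("weighted", "samples", "micro", "macro"):
    print(avg, f1_score(y_true, y_pred, average=avg))
```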