Image-Text Integration Using a Multimodal Fusion Network Module for Movie Genre Classification

Leodécio Braz, Vinicius Teixeira, H. Pedrini, Z. Dias

11th International Conference of Pattern Recognition Systems (ICPRS 2021). DOI: 10.1049/icp.2021.1456
Multimodal models have received increasing attention from researchers because they exploit the complementarity of data to obtain better inferences from a dataset. Such models have been applied to several deep learning tasks, such as emotion recognition, video classification, and audio-visual speech enhancement. In this paper, we propose a multimodal method with two branches, one for text classification and another for image classification. In the image classification branch, we use the Class Activation Mapping (CAM) method as an attention module to identify relevant regions of the images. To validate our method, we used the MM-IMDB dataset, which consists of 25,959 movies with their respective plot outlines, posters, and genres. Our results showed that our method averaged 0.6749 in F1-Weight, 0.6734 in F1-Samples, 0.6750 in F1-Micro, and 0.6159 in F1-Macro, outperforming the state of the art on the F1-Weight and F1-Macro metrics and achieving the second-best result on the F1-Samples and F1-Micro metrics.
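As a rough illustration of the architecture the abstract describes, the sketch below wires a two-branch model with CAM-style attention pooling on the image side and late fusion by concatenation. Everything concrete here is an assumption: the ResNet-50 backbone, the mean-pooled-embedding text encoder, the layer sizes, and the 23-genre output (MM-IMDB's usual label count) are stand-ins, since the abstract does not specify the paper's exact components.

```python
# A minimal sketch of the two-branch image-text fusion described in the
# abstract. Backbone, text encoder, and sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models


class CAMAttentionPool(nn.Module):
    """Pools a CNN feature map with class-activation-style weights.

    A 1x1 convolution produces a per-location relevance map (in the spirit
    of CAM); the feature map is averaged with those weights so that the
    most relevant regions dominate the pooled descriptor.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.scorer = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) -> spatial attention weights (B, 1, H, W)
        attn = torch.softmax(self.scorer(feats).flatten(2), dim=-1)
        attn = attn.view(feats.size(0), 1, feats.size(2), feats.size(3))
        return (feats * attn).sum(dim=(2, 3))  # pooled descriptor (B, C)


class TwoBranchGenreClassifier(nn.Module):
    def __init__(self, vocab_size: int, num_genres: int = 23):
        super().__init__()
        # Image branch: CNN backbone truncated before global pooling.
        backbone = models.resnet50(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        self.cam_pool = CAMAttentionPool(2048)
        # Text branch: mean-pooled word embeddings (placeholder encoder).
        self.embed = nn.EmbeddingBag(vocab_size, 300, mode="mean")
        # Late fusion by concatenation, then a multi-label head.
        self.head = nn.Linear(2048 + 300, num_genres)

    def forward(self, poster, plot_tokens, plot_offsets):
        img_feat = self.cam_pool(self.cnn(poster))
        txt_feat = self.embed(plot_tokens, plot_offsets)
        # Sigmoid outputs: each movie may belong to several genres at once.
        return torch.sigmoid(self.head(torch.cat([img_feat, txt_feat], dim=1)))
```

The four reported metrics correspond to scikit-learn's averaging modes for multi-label F1 ("weighted", "samples", "micro", "macro"); a minimal check with placeholder multi-hot labels:

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[1, 0, 1], [0, 1, 1]])  # ground-truth genre labels
y_pred = np.array([[1, 0, 0], [0, 1, 1]])  # thresholded predictions

for avg in ("weighted", "samples", "micro", "macro"):
    print(avg, f1_score(y_true, y_pred, average=avg))
```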