{"title":"基于时空特征多任务学习的电影视频暴力场景检测","authors":"Z. Zheng, Wei Zhong, Long Ye, Li Fang, Qin Zhang","doi":"10.1109/MIPR51284.2021.00067","DOIUrl":null,"url":null,"abstract":"In this paper, we propose a new framework for the violent scene detection of film videos based on multi-task learning of temporal-spatial features. In the proposed framework, for the violent behavior representation of film clips, we employ a temporal excitation and aggregation network to extract the temporal-spatial deep features in the visual modality. And on the other hand, a recurrent neural network with local attention is utilized to extract the utterance-level representation of affective analysis in the audio modality. In the process of feature mapping, we concern the task of violent scene detection together with that of affective analysis and then propose a multi-task learning strategy to effectively predict the violent scene of film clips. To evaluate the effectiveness of the proposed method, the experiments are done on the task of violent scenes detection 2015. The experimental results show that our model outperforms most of the state of the art methods, validating the innovation of considering the task of violent scene detection jointly with the violence emotion analysis.","PeriodicalId":139543,"journal":{"name":"2021 IEEE 4th International Conference on Multimedia Information Processing and Retrieval (MIPR)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Violent Scene Detection of Film Videos Based on Multi-Task Learning of Temporal-Spatial Features\",\"authors\":\"Z. Zheng, Wei Zhong, Long Ye, Li Fang, Qin Zhang\",\"doi\":\"10.1109/MIPR51284.2021.00067\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper, we propose a new framework for the violent scene detection of film videos based on multi-task learning of temporal-spatial features. In the proposed framework, for the violent behavior representation of film clips, we employ a temporal excitation and aggregation network to extract the temporal-spatial deep features in the visual modality. And on the other hand, a recurrent neural network with local attention is utilized to extract the utterance-level representation of affective analysis in the audio modality. In the process of feature mapping, we concern the task of violent scene detection together with that of affective analysis and then propose a multi-task learning strategy to effectively predict the violent scene of film clips. To evaluate the effectiveness of the proposed method, the experiments are done on the task of violent scenes detection 2015. 
The experimental results show that our model outperforms most of the state of the art methods, validating the innovation of considering the task of violent scene detection jointly with the violence emotion analysis.\",\"PeriodicalId\":139543,\"journal\":{\"name\":\"2021 IEEE 4th International Conference on Multimedia Information Processing and Retrieval (MIPR)\",\"volume\":\"51 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE 4th International Conference on Multimedia Information Processing and Retrieval (MIPR)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/MIPR51284.2021.00067\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE 4th International Conference on Multimedia Information Processing and Retrieval (MIPR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MIPR51284.2021.00067","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Violent Scene Detection of Film Videos Based on Multi-Task Learning of Temporal-Spatial Features
In this paper, we propose a new framework for violent scene detection in film videos based on multi-task learning of temporal-spatial features. In the proposed framework, to represent violent behavior in film clips, we employ a temporal excitation and aggregation network to extract temporal-spatial deep features from the visual modality. On the other hand, a recurrent neural network with local attention is utilized to extract an utterance-level affective representation from the audio modality. In the feature mapping process, we consider the task of violent scene detection jointly with that of affective analysis, and propose a multi-task learning strategy to effectively predict violent scenes in film clips. To evaluate the effectiveness of the proposed method, experiments are conducted on the Violent Scenes Detection 2015 task. The experimental results show that our model outperforms most state-of-the-art methods, validating the benefit of considering violent scene detection jointly with violence emotion analysis.
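The paper itself does not include code; the following PyTorch sketch is only a rough illustration of the multi-task idea described in the abstract. It pairs pre-extracted visual features (standing in for the temporal excitation and aggregation backbone) with an utterance-level audio branch (a GRU with a simple windowed attention, standing in for the local-attention RNN) and trains two heads jointly. All module names, dimensions, the window size, the affect label set, and the loss weight are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LocalAttentionRNN(nn.Module):
    """Hypothetical stand-in for the audio branch: a GRU over frame-level
    audio features, pooled to an utterance-level vector by attending over
    a local window of the most recent hidden states."""
    def __init__(self, in_dim, hid_dim, window=5):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hid_dim, batch_first=True)
        self.score = nn.Linear(hid_dim, 1)
        self.window = window

    def forward(self, x):                        # x: (B, T, in_dim)
        h, _ = self.rnn(x)                       # h: (B, T, hid_dim)
        w = self.score(h).squeeze(-1)            # attention logits: (B, T)
        # Local attention: softmax restricted to the last `window` steps.
        w = w[:, -self.window:].softmax(dim=1)   # (B, window)
        return (w.unsqueeze(-1) * h[:, -self.window:]).sum(dim=1)  # (B, hid_dim)

class MultiTaskViolenceNet(nn.Module):
    """Shared audio-visual fusion with two task heads: violent scene
    detection and affective analysis, as in a generic multi-task setup."""
    def __init__(self, vis_dim, aud_dim, hid=256):
        super().__init__()
        self.audio = LocalAttentionRNN(aud_dim, hid)
        self.fuse = nn.Sequential(nn.Linear(vis_dim + hid, hid), nn.ReLU())
        self.violence_head = nn.Linear(hid, 2)   # violent / non-violent
        self.affect_head = nn.Linear(hid, 3)     # assumed 3 affect classes

    def forward(self, vis_feat, aud_frames):
        # vis_feat: (B, vis_dim) clip-level visual feature (assumed precomputed)
        z = self.fuse(torch.cat([vis_feat, self.audio(aud_frames)], dim=-1))
        return self.violence_head(z), self.affect_head(z)

def joint_loss(v_logits, a_logits, v_label, a_label, lam=0.5):
    """Joint objective: weighted sum of the two per-task cross-entropy losses."""
    ce = nn.functional.cross_entropy
    return ce(v_logits, v_label) + lam * ce(a_logits, a_label)
```

In this kind of setup the two heads share the fused representation, so gradients from the affective-analysis loss also shape the features used for violence detection, which is the intuition behind training the two tasks jointly rather than separately.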