{"title":"基于ConvLSTM的视频游戏元数据暴力检测","authors":"Helena A. Correia, José Henrique Brito","doi":"10.1109/SEGAH52098.2021.9551853","DOIUrl":null,"url":null,"abstract":"The automatic detection of violent situations is relevant to monitor exposure to violence, both in the context of the analysis of real video and video generated in virtual environments, namely in simulated scenarios or virtual/mixed reality applications, such as serious games. In this paper, we propose a deep neural network to identify violent videos, with an approach capable of working in real video and synthetic video. An efficient detector of the 2D pose together with a multiple person tracker is used to extract motion features from the video sequence that will be fed directly into the proposed network. The proposed convolutional neural network is a recurrent neural network for a Spatio-temporal prediction that has convolutional structures in both the input-to-state and state-to-state transitions, which enables the analysis of local motion in the video. By stacking multiple ConvLSTM layers and forming an encoding-forecasting structure, we obtain a network model for the violence detection problem and more general spatiotemporal sequence forecasting problems. The inputs for the model correspond to sequences of keypoints extracted from the skeletons present in each frame originating an output corresponding to the classification of the video. The model was trained and evaluated with an innovative dataset that contains violent videos from a popular fighting game and non-violent videos related to people's daily lives. Comparison of the results obtained with the state-of-the-art techniques revealed the promising capability of the proposed method in recognizing violent videos with 100% precision, although it is not as robust as other datasets. Conv-LSTM units are shown to be an effective means for modelling and predicting video sequences.","PeriodicalId":189731,"journal":{"name":"2021 IEEE 9th International Conference on Serious Games and Applications for Health(SeGAH)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Violence detection in video game metadata using ConvLSTM\",\"authors\":\"Helena A. Correia, José Henrique Brito\",\"doi\":\"10.1109/SEGAH52098.2021.9551853\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The automatic detection of violent situations is relevant to monitor exposure to violence, both in the context of the analysis of real video and video generated in virtual environments, namely in simulated scenarios or virtual/mixed reality applications, such as serious games. In this paper, we propose a deep neural network to identify violent videos, with an approach capable of working in real video and synthetic video. An efficient detector of the 2D pose together with a multiple person tracker is used to extract motion features from the video sequence that will be fed directly into the proposed network. The proposed convolutional neural network is a recurrent neural network for a Spatio-temporal prediction that has convolutional structures in both the input-to-state and state-to-state transitions, which enables the analysis of local motion in the video. 
By stacking multiple ConvLSTM layers and forming an encoding-forecasting structure, we obtain a network model for the violence detection problem and more general spatiotemporal sequence forecasting problems. The inputs for the model correspond to sequences of keypoints extracted from the skeletons present in each frame originating an output corresponding to the classification of the video. The model was trained and evaluated with an innovative dataset that contains violent videos from a popular fighting game and non-violent videos related to people's daily lives. Comparison of the results obtained with the state-of-the-art techniques revealed the promising capability of the proposed method in recognizing violent videos with 100% precision, although it is not as robust as other datasets. Conv-LSTM units are shown to be an effective means for modelling and predicting video sequences.\",\"PeriodicalId\":189731,\"journal\":{\"name\":\"2021 IEEE 9th International Conference on Serious Games and Applications for Health(SeGAH)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-08-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE 9th International Conference on Serious Games and Applications for Health(SeGAH)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SEGAH52098.2021.9551853\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE 9th International Conference on Serious Games and Applications for Health(SeGAH)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SEGAH52098.2021.9551853","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Violence detection in video game metadata using ConvLSTM
The automatic detection of violent situations is relevant for monitoring exposure to violence, both in the analysis of real video and of video generated in virtual environments, namely in simulated scenarios or virtual/mixed reality applications such as serious games. In this paper, we propose a deep neural network to identify violent videos, with an approach capable of working on both real and synthetic video. An efficient 2D pose detector, together with a multi-person tracker, is used to extract motion features from the video sequence, which are fed directly into the proposed network. The proposed network is a recurrent neural network for spatio-temporal prediction that has convolutional structures in both the input-to-state and state-to-state transitions, which enables the analysis of local motion in the video. By stacking multiple ConvLSTM layers and forming an encoding-forecasting structure, we obtain a network model for the violence detection problem and for more general spatio-temporal sequence forecasting problems. The inputs to the model are sequences of keypoints extracted from the skeletons present in each frame, and the output is the classification of the video. The model was trained and evaluated on a new dataset that contains violent videos from a popular fighting game and non-violent videos of people's daily lives. Comparison with state-of-the-art techniques shows the promising capability of the proposed method, which recognizes violent videos with 100% precision, although it is not as robust on other datasets. ConvLSTM units are shown to be an effective means of modelling and predicting video sequences.
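
The abstract describes per-frame skeleton keypoints fed into stacked ConvLSTM layers that end in a binary (violent vs. non-violent) classification of the video. The following is a minimal Keras sketch of that kind of architecture, assuming the keypoints are rasterized into per-frame heatmaps; the input dimensions, number of layers, and filter sizes are illustrative assumptions, not the authors' reported configuration.

```python
# Minimal sketch of a stacked ConvLSTM video classifier over pose-keypoint
# heatmaps. All sizes below are assumptions for illustration only.
import tensorflow as tf

FRAMES, HEIGHT, WIDTH, KEYPOINTS = 32, 64, 64, 17  # assumed input dimensions

model = tf.keras.Sequential([
    tf.keras.Input(shape=(FRAMES, HEIGHT, WIDTH, KEYPOINTS)),
    # Convolutional structures in both the input-to-state and state-to-state
    # transitions let the recurrent cells capture local motion patterns.
    tf.keras.layers.ConvLSTM2D(32, kernel_size=3, padding="same",
                               return_sequences=True),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.ConvLSTM2D(32, kernel_size=3, padding="same",
                               return_sequences=False),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.GlobalAveragePooling2D(),
    # Single sigmoid output: violent vs. non-violent video.
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.Precision()])
model.summary()
```

In such a setup, each training sample is one clip of FRAMES pose heatmaps, and the final hidden state of the last ConvLSTM layer is pooled and mapped to a single violence probability; an encoding-forecasting variant, as mentioned in the abstract, would instead keep `return_sequences=True` and add decoding layers to predict future frames.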