Multi-modal Fusion Using Spatio-temporal and Static Features for Group Emotion Recognition

Mo Sun, Jian Li, Hui Feng, Wei Gou, Haifeng Shen, Jian-Bo Tang, Yi Yang, Jieping Ye

Proceedings of the 2020 International Conference on Multimodal Interaction. Published 2020-10-21. DOI: 10.1145/3382507.3417971
Citations: 9
Abstract
This paper presents our approach to the Audio-Video Group Emotion Recognition sub-challenge of EmotiW 2020. The task is to classify a video into one of three group emotions: positive, neutral, or negative. Our approach exploits two feature levels for this task: spatio-temporal features and static features. At the spatio-temporal level, we feed multiple input modalities (RGB, RGB difference, optical flow, and warped optical flow) into multiple video-classification networks to train the spatio-temporal models. At the static level, we crop all faces and bodies in each image with a state-of-the-art human pose estimation method and train several CNNs on the image-level group-emotion labels. Finally, we fuse the results of all 14 models and achieve third place in the sub-challenge, with classification accuracies of 71.93% on the validation set and 70.77% on the test set.
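The abstract does not say how the 14 model outputs are combined, so the sketch below shows only one plausible scheme: score-level late fusion with uniform weights, assuming each model emits per-class softmax scores over the three group emotions. The `fuse_predictions` helper and the toy score arrays are hypothetical, for illustration only.

```python
import numpy as np

CLASSES = ["positive", "neutral", "negative"]

def fuse_predictions(model_scores, weights=None):
    """Score-level late fusion (a sketch, not the paper's exact scheme).

    model_scores: list of (num_videos, 3) softmax score arrays, one per model.
    weights: optional per-model weights; uniform if omitted (an assumption,
             since the abstract gives no fusion weights).
    Returns the fused class index for each video.
    """
    scores = np.stack(model_scores)            # (num_models, num_videos, 3)
    if weights is None:
        weights = np.ones(len(model_scores))   # uniform weighting (assumed)
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()                   # normalize to a convex combination
    fused = np.tensordot(weights, scores, axes=1)  # weighted mean -> (num_videos, 3)
    return fused.argmax(axis=1)

# Toy example: three models scoring two videos.
m1 = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
m2 = np.array([[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]])
m3 = np.array([[0.5, 0.4, 0.1], [0.3, 0.3, 0.4]])
preds = fuse_predictions([m1, m2, m3])
print([CLASSES[i] for i in preds])  # ['positive', 'negative']
```

In practice the weights could instead be tuned on the validation set, which is a common way such ensembles reach their reported accuracy.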