Multi-modal Fusion Using Spatio-temporal and Static Features for Group Emotion Recognition

Mo Sun, Jian Li, Hui Feng, Wei Gou, Haifeng Shen, Jian-Bo Tang, Yi Yang, Jieping Ye
{"title":"基于时空和静态特征的多模态融合群体情绪识别","authors":"Mo Sun, Jian Li, Hui Feng, Wei Gou, Haifeng Shen, Jian-Bo Tang, Yi Yang, Jieping Ye","doi":"10.1145/3382507.3417971","DOIUrl":null,"url":null,"abstract":"This paper presents our approach for Audio-video Group Emotion Recognition sub-challenge in the EmotiW 2020. The task is to classify a video into one of the group emotions such as positive, neutral, and negative. Our approach exploits two different feature levels for this task, spatio-temporal feature and static feature level. In spatio-temporal feature level, we adopt multiple input modalities (RGB, RGB difference, optical flow, warped optical flow) into multiple video classification network to train the spatio-temporal model. In static feature level, we crop all faces and bodies in an image with the state-of the-art human pose estimation method and train kinds of CNNs with the image-level labels of group emotions. Finally, we fuse all 14 models result together, and achieve the third place in this sub-challenge with classification accuracies of 71.93% and 70.77% on the validation set and test set, respectively.","PeriodicalId":402394,"journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction","volume":"195 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":"{\"title\":\"Multi-modal Fusion Using Spatio-temporal and Static Features for Group Emotion Recognition\",\"authors\":\"Mo Sun, Jian Li, Hui Feng, Wei Gou, Haifeng Shen, Jian-Bo Tang, Yi Yang, Jieping Ye\",\"doi\":\"10.1145/3382507.3417971\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper presents our approach for Audio-video Group Emotion Recognition sub-challenge in the EmotiW 2020. The task is to classify a video into one of the group emotions such as positive, neutral, and negative. Our approach exploits two different feature levels for this task, spatio-temporal feature and static feature level. In spatio-temporal feature level, we adopt multiple input modalities (RGB, RGB difference, optical flow, warped optical flow) into multiple video classification network to train the spatio-temporal model. In static feature level, we crop all faces and bodies in an image with the state-of the-art human pose estimation method and train kinds of CNNs with the image-level labels of group emotions. 
Finally, we fuse all 14 models result together, and achieve the third place in this sub-challenge with classification accuracies of 71.93% and 70.77% on the validation set and test set, respectively.\",\"PeriodicalId\":402394,\"journal\":{\"name\":\"Proceedings of the 2020 International Conference on Multimodal Interaction\",\"volume\":\"195 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-10-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"9\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2020 International Conference on Multimodal Interaction\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3382507.3417971\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2020 International Conference on Multimodal Interaction","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3382507.3417971","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 9

Abstract

This paper presents our approach for the Audio-video Group Emotion Recognition sub-challenge of EmotiW 2020. The task is to classify a video into one of three group emotions: positive, neutral, or negative. Our approach exploits two feature levels for this task: spatio-temporal features and static features. At the spatio-temporal level, we feed multiple input modalities (RGB, RGB difference, optical flow, and warped optical flow) into multiple video classification networks to train the spatio-temporal models. At the static level, we crop all faces and bodies in each image with a state-of-the-art human pose estimation method and train several kinds of CNNs with the image-level group emotion labels. Finally, we fuse the results of all 14 models and achieve third place in this sub-challenge, with classification accuracies of 71.93% and 70.77% on the validation set and test set, respectively.
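The paper does not publish code, and the abstract does not specify its exact fusion scheme. As a minimal sketch of the kind of late fusion described (combining the outputs of many per-modality models into one prediction), the Python snippet below takes a weighted average of per-model softmax probabilities and picks the top class. All names and values here (`model_probs`, `weights`, the three example models) are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

# Hypothetical per-model class probabilities for one video clip.
# Rows: one softmax output per model (the paper fuses 14 models in total);
# columns: the three group-emotion classes (positive, neutral, negative).
model_probs = np.array([
    [0.70, 0.20, 0.10],   # e.g. an RGB spatio-temporal model
    [0.55, 0.30, 0.15],   # e.g. an optical-flow model
    [0.60, 0.25, 0.15],   # e.g. a face-level static CNN
])

# Per-model fusion weights (assumed here to be tuned on the validation
# set; the paper does not state how its model weights were chosen).
weights = np.array([0.4, 0.3, 0.3])

# Weighted average of class probabilities, then argmax over classes.
fused = weights @ model_probs  # shape (3,): one score per class
label = ["positive", "neutral", "negative"][int(fused.argmax())]
print(fused, label)
```

With 14 models, the same pattern applies with a 14-row probability matrix and a 14-element weight vector; weighting on held-out validation accuracy is one common way such ensembles are tuned.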