Multi-modal Fusion Using Spatio-temporal and Static Features for Group Emotion Recognition

Mo Sun, Jian Li, Hui Feng, Wei Gou, Haifeng Shen, Jian-Bo Tang, Yi Yang, Jieping Ye

Proceedings of the 2020 International Conference on Multimodal Interaction. Published 2020-10-21. DOI: 10.1145/3382507.3417971
Citations: 9
Abstract
This paper presents our approach to the Audio-Video Group Emotion Recognition sub-challenge of EmotiW 2020. The task is to classify a video into one of three group emotions: positive, neutral, or negative. Our approach exploits two feature levels for this task: spatio-temporal features and static features. At the spatio-temporal level, we feed multiple input modalities (RGB, RGB difference, optical flow, and warped optical flow) into multiple video-classification networks to train the spatio-temporal models. At the static level, we crop all faces and bodies in each image with a state-of-the-art human pose estimation method and train several CNNs on the image-level group-emotion labels. Finally, we fuse the results of all 14 models and achieve third place in the sub-challenge, with classification accuracies of 71.93% on the validation set and 70.77% on the test set.
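The abstract does not say how the 14 model outputs are combined, so the sketch below shows only one plausible scheme: score-level late fusion with uniform weights, assuming each model emits per-class softmax scores over the three group emotions. The `fuse_predictions` helper and the toy score arrays are hypothetical, for illustration only.

```python
import numpy as np

CLASSES = ["positive", "neutral", "negative"]

def fuse_predictions(model_scores, weights=None):
    """Score-level late fusion (a sketch, not the paper's exact scheme).

    model_scores: list of (num_videos, 3) softmax score arrays, one per model.
    weights: optional per-model weights; uniform if omitted (an assumption,
             since the abstract gives no fusion weights).
    Returns the fused class index for each video.
    """
    scores = np.stack(model_scores)            # (num_models, num_videos, 3)
    if weights is None:
        weights = np.ones(len(model_scores))   # uniform weighting (assumed)
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()                   # normalize to a convex combination
    fused = np.tensordot(weights, scores, axes=1)  # weighted mean -> (num_videos, 3)
    return fused.argmax(axis=1)

# Toy example: three models scoring two videos.
m1 = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
m2 = np.array([[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]])
m3 = np.array([[0.5, 0.4, 0.1], [0.3, 0.3, 0.4]])
preds = fuse_predictions([m1, m2, m3])
print([CLASSES[i] for i in preds])  # ['positive', 'negative']
```

In practice the weights could instead be tuned on the validation set, which is a common way such ensembles reach their reported accuracy.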