{"title":"Audio-Visual Group-based Emotion Recognition using Local and Global Feature Aggregation based Multi-Task Learning","authors":"Sunan Li, Hailun Lian, Cheng Lu, Yan Zhao, Chuangao Tang, Yuan Zong, Wenming Zheng","doi":"10.1145/3577190.3616544","DOIUrl":null,"url":null,"abstract":"Audio-video group emotion recognition is a challenging task and has attracted more attention in recent decades. Recently, deep learning models have shown tremendous advances in analyzing human emotion. However, due to its difficulties such as hard to gather a broad range of potential information to obtain meaningful emotional representations and hard to associate implicit contextual knowledge like humans. To tackle these problems, in this paper, we proposed the Local and Global Feature Aggregation based Multi-Task Learning (LGFAM) method to tackle the Group Emotion Recognition problem. The framework consists of three parallel feature extraction networks that were verified in previous work. After that, an attention network using MLP as a backbone with specially designed loss functions was used to fuse features from different modalities. In the experiment section, we present its performance on the EmotiW2023 Audio-Visual Group-based Emotion Recognition subchallenge which aims to classify a video into one of the three emotions. According to the feedback results, the best result achieved 70.63 WAR and 70.38 UAR on the test set. Such improvement proves the effectiveness of our method.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"35 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Companion Publication of the 2020 International Conference on Multimodal Interaction","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3577190.3616544","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Audio-video group emotion recognition is a challenging task that has attracted growing attention in recent decades. Deep learning models have recently shown tremendous advances in analyzing human emotion. However, the task remains difficult: it is hard to gather the broad range of potentially relevant information needed to obtain meaningful emotional representations, and hard to associate implicit contextual knowledge the way humans do. To tackle these problems, in this paper we propose the Local and Global Feature Aggregation based Multi-Task Learning (LGFAM) method for group emotion recognition. The framework consists of three parallel feature extraction networks that were verified in previous work. An attention network with an MLP backbone and specially designed loss functions then fuses the features from the different modalities. In the experiments, we report performance on the EmotiW2023 Audio-Visual Group-based Emotion Recognition subchallenge, which aims to classify a video into one of three emotions. According to the challenge feedback, our best submission achieved 70.63 WAR and 70.38 UAR on the test set, demonstrating the effectiveness of our method.
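To make the fusion step concrete, below is a minimal sketch of the kind of MLP-based attention fusion the abstract describes: per-modality features from parallel extractors are scored by an MLP, softmax-normalized into attention weights, and combined before classification. The module name, feature dimensions, and modality count are illustrative assumptions, not values from the paper, and the specially designed loss functions are not reproduced here.

```python
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    """Hypothetical MLP-attention fusion over per-modality features."""

    def __init__(self, feat_dim: int = 512, num_classes: int = 3):
        super().__init__()
        # MLP backbone that assigns a scalar score to each modality's feature.
        self.score_mlp = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // 2),
            nn.ReLU(),
            nn.Linear(feat_dim // 2, 1),
        )
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_modalities, feat_dim), one row per parallel
        # extractor (e.g. audio, face, and scene streams).
        scores = self.score_mlp(feats)           # (B, M, 1)
        weights = torch.softmax(scores, dim=1)   # attention over modalities
        fused = (weights * feats).sum(dim=1)     # (B, feat_dim)
        return self.classifier(fused)            # emotion logits


# Usage: three modality feature vectors for a batch of 4 clips.
feats = torch.randn(4, 3, 512)
logits = AttentionFusion()(feats)
print(logits.shape)  # torch.Size([4, 3])
```

One design note: weighting whole modality features with a learned softmax lets the model downweight an unreliable stream (e.g. noisy audio) per sample, which is the usual motivation for attention-based fusion over simple concatenation.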