{"title":"Fine-Grained Emotion Comprehension: Semisupervised Multimodal Emotion and Intensity Recognition","authors":"Zheng Fang;Zhen Liu;Tingting Liu;Chih-Chieh Hung","doi":"10.1109/TCSS.2024.3475511","DOIUrl":null,"url":null,"abstract":"The rapid advancement of deep learning and the exponential growth of multimodal data have led to increased attention on multimodal emotion analysis and comprehension in affect computing. While existing multimodal works have achieved notable results in emotion recognition, several challenges remain. First, the scarcity of public large-scale multimodal emotion datasets is attributed to the high cost of manual annotation and the subjectivity of handcrafted labels. Second, most approaches only focus on learning emotion category information, disregarding the crucial evaluation indicator of emotion intensity, which hampers the development of fine-grained emotion recognition. Third, a significant emotion semantic discrepancy exists in different modalities, and current methodologies struggle to bridge the cross-modal gap and effectively utilize a vast amount of unlabeled emotion data, hindering the production of high-quality pseudolabels and superior classification performance. To address these challenges, based on the multitask learning architecture, we propose a novel semisupervised fine-grained emotion recognition model SMEIR-net for multimodal emotion and intensity recognition. Concretely, in semisupervised learning (SSL) phase, we design multistage self-training and consistency regularization paradigm to generate high-quality pseudolabels. Then, in supervised learning phase, we leverage multimodal transformer fusion and adversarial learning to eliminate the cross-modal semantic discrepancy. Extensive experiments are conducted on three benchmark datasets, namely RAVDESS, eNTERFACE, and Lombard-GRID, to evaluate the proposed model. The series sets of experimental results demonstrate that our SSL model successfully utilizes multimodal data and available labels to transfer emotion and intensity information from labeled to unlabeled datasets. Moreover, the corresponding evaluation metrics demonstrate that the utilize high-quality pseudolabels can achieve superior emotion and intensity classification performance, which outperforms other state-of-the-art baselines under the same condition.","PeriodicalId":13044,"journal":{"name":"IEEE Transactions on Computational Social Systems","volume":"12 3","pages":"1145-1163"},"PeriodicalIF":4.5000,"publicationDate":"2024-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Computational Social Systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10737896/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, CYBERNETICS","Score":null,"Total":0}
Abstract
The rapid advancement of deep learning and the exponential growth of multimodal data have drawn increasing attention to multimodal emotion analysis and comprehension in affective computing. While existing multimodal works have achieved notable results in emotion recognition, several challenges remain. First, public large-scale multimodal emotion datasets are scarce, owing to the high cost of manual annotation and the subjectivity of handcrafted labels. Second, most approaches focus only on learning emotion category information and disregard emotion intensity, a crucial evaluation indicator, which hampers the development of fine-grained emotion recognition. Third, a significant emotion semantic discrepancy exists across modalities, and current methodologies struggle to bridge this cross-modal gap and to effectively exploit the vast amount of unlabeled emotion data, hindering the production of high-quality pseudolabels and superior classification performance. To address these challenges, we propose SMEIR-net, a novel semisupervised fine-grained emotion recognition model for multimodal emotion and intensity recognition built on a multitask learning architecture. Concretely, in the semisupervised learning (SSL) phase, we design a multistage self-training and consistency regularization paradigm to generate high-quality pseudolabels. Then, in the supervised learning phase, we leverage multimodal transformer fusion and adversarial learning to eliminate the cross-modal semantic discrepancy. Extensive experiments are conducted on three benchmark datasets, namely RAVDESS, eNTERFACE, and Lombard-GRID, to evaluate the proposed model. The experimental results demonstrate that our SSL model successfully exploits multimodal data and the available labels to transfer emotion and intensity information from labeled to unlabeled data. Moreover, the corresponding evaluation metrics show that using the high-quality pseudolabels yields superior emotion and intensity classification performance, outperforming other state-of-the-art baselines under the same conditions.
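The abstract does not spell out the exact formulation of the SSL phase, but a minimal sketch of what confidence-thresholded pseudolabeling combined with consistency regularization typically looks like (a generic FixMatch-style recipe in PyTorch, not the authors' SMEIR-net; the model, threshold, and augmented "views" below are illustrative assumptions) is:

```python
# Illustrative sketch only: confidence-thresholded pseudolabeling with
# consistency regularization between two augmented views of the same
# unlabeled sample. This is a generic FixMatch-style recipe, not the
# authors' SMEIR-net; all names and hyperparameters are assumptions.
import torch
import torch.nn.functional as F


def pseudolabel_consistency_loss(model, x_weak, x_strong, threshold=0.95):
    """Generate pseudolabels from the weakly augmented view and enforce
    consistency on the strongly augmented view of the same samples."""
    with torch.no_grad():
        probs = F.softmax(model(x_weak), dim=-1)   # "teacher" predictions
        conf, pseudo = probs.max(dim=-1)           # confidence and hard labels
        mask = (conf >= threshold).float()         # keep only confident samples
    logits_strong = model(x_strong)                # "student" predictions
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (loss * mask).mean(), mask.mean()       # unsupervised loss, kept ratio


if __name__ == "__main__":
    # Tiny demo with a toy classifier over flattened multimodal features.
    model = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(),
                                torch.nn.Linear(64, 8))   # 8 emotion classes
    x_weak = torch.randn(16, 32)                          # weakly augmented view
    x_strong = x_weak + 0.1 * torch.randn(16, 32)         # strongly augmented view
    loss, kept = pseudolabel_consistency_loss(model, x_weak, x_strong)
    print(f"unsupervised loss = {loss.item():.4f}, kept ratio = {kept.item():.2f}")
```

The confidence mask is the point of such schemes: only predictions above the threshold contribute to the unsupervised loss, which is one common way to keep pseudolabel quality high, in the spirit of the "high-quality pseudolabels" the paper targets through its multistage self-training.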
Journal Introduction:
IEEE Transactions on Computational Social Systems focuses on such topics as modeling, simulation, analysis, and understanding of social systems from the quantitative and/or computational perspective. "Systems" include man-man, man-machine, and machine-machine organizations and adversarial situations, as well as social media structures and their dynamics. More specifically, the Transactions publishes articles on modeling the dynamics of social systems, methodologies for incorporating and representing socio-cultural and behavioral aspects in computational modeling, analysis of social system behavior and structure, and paradigms for social systems modeling and simulation. The journal also features articles on social network dynamics, social intelligence and cognition, social systems design and architectures, socio-cultural modeling and representation, computational behavior modeling, and their applications.