Proceedings of the 2nd on Multimodal Sentiment Analysis Challenge: Latest Publications

Hybrid Mutimodal Fusion for Dimensional Emotion Recognition
Pub Date: 2021-10-15 | DOI: 10.1145/3475957.3484457
Ziyu Ma, Fuyan Ma, Bin Sun, Shutao Li
{"title":"Hybrid Mutimodal Fusion for Dimensional Emotion Recognition","authors":"Ziyu Ma, Fuyan Ma, Bin Sun, Shutao Li","doi":"10.1145/3475957.3484457","DOIUrl":"https://doi.org/10.1145/3475957.3484457","url":null,"abstract":"In this paper, we extensively present our solutions for the MuSe-Stress sub-challenge and the MuSe-Physio sub-challenge of Multimodal Sentiment Challenge (MuSe) 2021. The goal of MuSe-Stress sub-challenge is to predict the level of emotional arousal and valence in a time-continuous manner from audio-visual recordings and the goal of MuSe-Physio sub-challenge is to predict the level of psycho-physiological arousal from a) human annotations fused with b) galvanic skin response (also known as Electrodermal Activity (EDA)) signals from the stressed people. The Ulm-TSST dataset which is a novel subset of the audio-visual textual Ulm-Trier Social Stress dataset that features German speakers in a Trier Social Stress Test (TSST) induced stress situation is used in both sub-challenges. For the MuSe-Stress sub-challenge, we highlight our solutions in three aspects: 1) the audio-visual features and the bio-signal features are used for emotional state recognition. 2) the Long Short-Term Memory (LSTM) with the self-attention mechanism is utilized to capture complex temporal dependencies within the feature sequences. 3) the late fusion strategy is adopted to further boost the model's recognition performance by exploiting complementary information scattered across multimodal sequences. Our proposed model achieves CCC of 0.6159 and 0.4609 for valence and arousal respectively on the test set, which both rank in the top 3. For the MuSe-Physio sub-challenge, we first extract the audio-visual features and the bio-signal features from multiple modalities. Then, the LSTM module with the self-attention mechanism, and the Gated Convolutional Neural Networks (GCNN) as well as the LSTM network are utilized for modeling the complex temporal dependencies in the sequence. Finally, the late fusion strategy is used. Our proposed method also achieves CCC of 0.5412 on the test set, which ranks in the top 3.","PeriodicalId":313996,"journal":{"name":"Proceedings of the 2nd on Multimodal Sentiment Analysis Challenge","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122990422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
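Both sub-challenges above are evaluated with the Concordance Correlation Coefficient (CCC). As a hedged illustration only (not the authors' implementation), a minimal PyTorch sketch of the metric, which is commonly turned into a loss as 1 - CCC:

```python
import torch

def ccc(pred: torch.Tensor, gold: torch.Tensor) -> torch.Tensor:
    """Concordance Correlation Coefficient between two 1-D time series."""
    pred_mean, gold_mean = pred.mean(), gold.mean()
    pred_var, gold_var = pred.var(unbiased=False), gold.var(unbiased=False)
    cov = ((pred - pred_mean) * (gold - gold_mean)).mean()
    return 2.0 * cov / (pred_var + gold_var + (pred_mean - gold_mean) ** 2)

# Example: 1 - ccc(prediction, annotation) can serve as a regression loss.
```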
Fusion of Acoustic and Linguistic Information using Supervised Autoencoder for Improved Emotion Recognition
Pub Date: 2021-10-15 | DOI: 10.1145/3475957.3484448
Bogdan Vlasenko, R. Prasad, M. Magimai.-Doss
{"title":"Fusion of Acoustic and Linguistic Information using Supervised Autoencoder for Improved Emotion Recognition","authors":"Bogdan Vlasenko, R. Prasad, M. Magimai.-Doss","doi":"10.1145/3475957.3484448","DOIUrl":"https://doi.org/10.1145/3475957.3484448","url":null,"abstract":"Automatic recognition of human emotion has a wide range of applications and has always attracted increasing attention. Expressions of human emotions can apparently be identified across different modalities of communication, such as speech, text, mimics, etc. The \"Multimodal Sentiment Analysis in Real-life Media' (MuSe) 2021 challenge provides an environment to develop new techniques to recognize human emotions or sentiments using multiple modalities (audio, video, and text) over in-the-wild data. The challenge encourages to jointly model the information across audio, video and text modalities, for improving emotion recognition. The present paper describes our attempt towards the MuSe-Sent task in the challenge. The goal of the sub-challenge is to perform turn-level prediction of emotions within the arousal and valence dimensions. In the paper, we investigate different approaches to optimally fuse linguistic and acoustic information for emotion recognition systems. The proposed systems employ features derived from these modalities, and uses different deep learning architectures to explore their cross-dependencies. Wide range of acoustic and linguistic features provided by organizers and recently established acoustic embedding wav2vec 2.0 are used for modeling the inherent emotions. In this paper we compare discriminative characteristics of hand-crafted and data-driven acoustic features in a context of emotional classification in arousal and valence dimensions. Ensemble based classifiers were compared with advanced supervised autoendcoder (SAE) technique with Bayesian Optimizer hyperparameter tuning approach. Comparison of uni- and bi-modal classification techniques showed that joint modeling of acoustic and linguistic cues could improve classification performance compared to individual modalities. Experimental results show improvement over the proposed baseline system, which focuses on fusion of acoustic and text based information, on the test set evaluation.","PeriodicalId":313996,"journal":{"name":"Proceedings of the 2nd on Multimodal Sentiment Analysis Challenge","volume":"107 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116460699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
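The supervised autoencoder (SAE) named in the abstract combines a reconstruction objective with a supervised head on the latent code. A minimal sketch under that general reading, with illustrative layer sizes and names rather than the authors' architecture:

```python
import torch
import torch.nn as nn

class SupervisedAutoencoder(nn.Module):
    def __init__(self, in_dim: int, latent_dim: int = 64, n_classes: int = 3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim))
        self.classifier = nn.Linear(latent_dim, n_classes)

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), self.classifier(z)

def sae_loss(x, recon, logits, labels, alpha: float = 0.5):
    # Reconstruction term keeps the latent space informative; the supervised
    # term shapes it for the sentiment classes. alpha is an assumed weight.
    return alpha * nn.functional.mse_loss(recon, x) + \
           (1 - alpha) * nn.functional.cross_entropy(logits, labels)
```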
Multimodal Sentiment Analysis based on Recurrent Neural Network and Multimodal Attention
Pub Date: 2021-10-15 | DOI: 10.1145/3475957.3484454
Cong Cai, Yu He, Licai Sun, Zheng Lian, B. Liu, J. Tao, Mingyu Xu, Kexin Wang
{"title":"Multimodal Sentiment Analysis based on Recurrent Neural Network and Multimodal Attention","authors":"Cong Cai, Yu He, Licai Sun, Zheng Lian, B. Liu, J. Tao, Mingyu Xu, Kexin Wang","doi":"10.1145/3475957.3484454","DOIUrl":"https://doi.org/10.1145/3475957.3484454","url":null,"abstract":"Automatic estimation of emotional state has a wide application in human-computer interaction. In this paper, we present our solutions for the MuSe-Stress and MuSe-Physio sub-challenge of Multimodal Sentiment Analysis (MuSe 2021). The goal of these two sub-challenges is to perform continuous emotion predictions from people in stressed dispositions. To this end, we first extract both handcrafted features and deep representations from multiple modalities. Then, we explore the Long Short-Term Memory network and Transformer Encoder with Multimodal Multi-head Attention to model the complex temporal dependencies in the sequence. Finally, we adopt the early fusion, late fusion and model fusion to boost the model's performance by exploiting complementary information from different modalities. Our method achieves CCC of 0.6648, 0.3054 and 0.5781 for valence, arousal and arousal plus EDA (anno12_EDA). The results of valence and anno12_EDA outperform the baseline system with corresponding CCC of 0.5614 and 0.4908, and both rank Top3 in these challenges.","PeriodicalId":313996,"journal":{"name":"Proceedings of the 2nd on Multimodal Sentiment Analysis Challenge","volume":"112 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127973040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
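One way to realise the "multimodal multi-head attention" mentioned above is cross-modal attention, where one modality's sequence attends over another's. A minimal sketch under that assumption; dimensions, names, and the residual/normalisation choices are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_seq, context_seq):
        # query_seq, context_seq: (batch, time, dim), temporally aligned
        attended, _ = self.attn(query_seq, context_seq, context_seq)
        return self.norm(query_seq + attended)  # residual connection

audio = torch.randn(2, 100, 128)   # dummy audio feature sequence
video = torch.randn(2, 100, 128)   # dummy video feature sequence
fused = CrossModalAttention()(audio, video)  # audio attending to video
```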
Multi-modal Stress Recognition Using Temporal Convolution and Recurrent Network with Positional Embedding
Pub Date: 2021-10-15 | DOI: 10.1145/3475957.3484453
Anh-Quang Duong, Ngoc-Huynh Ho, Hyung-Jeong Yang, Gueesang Lee, Soohyung Kim
{"title":"Multi-modal Stress Recognition Using Temporal Convolution and Recurrent Network with Positional Embedding","authors":"Anh-Quang Duong, Ngoc-Huynh Ho, Hyung-Jeong Yang, Gueesang Lee, Soohyung Kim","doi":"10.1145/3475957.3484453","DOIUrl":"https://doi.org/10.1145/3475957.3484453","url":null,"abstract":"Chronic stress causes cancer, cardiovascular disease, depression, and diabetes, therefore, it is profoundly harmful to physiologic and psychological health. Various works have examined ways to identify, prevent, and manage people's stress conditions by using deep learning techniques. The 2nd Multimodal Sentiment Analysis Challenge (MuSe 2021) provides a testing bed for recognizing human emotion in stressed dispositions. In this study, we present our proposal to the Muse-Stress sub-challenge of MuSe 2021. There are several modalities including frontal frame sequence, audio signals, and transcripts. Our model uses temporal convolution and recurrent network with positional embedding. As result, our model achieved a concordance correlation coefficient of 0.5095, which is the average of valence and arousal. Moreover, we ranked 3rd in this competition under the team name CNU_SCLab.","PeriodicalId":313996,"journal":{"name":"Proceedings of the 2nd on Multimodal Sentiment Analysis Challenge","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114459601","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
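As a rough illustration of the "temporal convolution and recurrent network with positional embedding" idea, the sketch below stacks a 1-D convolution, a learned positional embedding, and a GRU for frame-level regression; every size and name is an assumption rather than the CNU_SCLab configuration:

```python
import torch
import torch.nn as nn

class TemporalConvRecurrent(nn.Module):
    def __init__(self, in_dim: int, hid: int = 64, max_len: int = 2000):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, hid, kernel_size=3, padding=1)
        self.pos = nn.Embedding(max_len, hid)
        self.gru = nn.GRU(hid, hid, batch_first=True)
        self.head = nn.Linear(hid, 1)  # one valence or arousal value per frame

    def forward(self, x):                                 # x: (batch, time, in_dim)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (batch, time, hid)
        positions = torch.arange(h.size(1), device=h.device)
        h = h + self.pos(positions)                       # add positional embedding
        h, _ = self.gru(h)
        return self.head(h).squeeze(-1)                   # (batch, time)
```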
Multimodal Fusion Strategies for Physiological-emotion Analysis
Pub Date: 2021-10-15 | DOI: 10.1145/3475957.3484452
Tenggan Zhang, Zhaopei Huang, Ruichen Li, Jinming Zhao, Qin Jin
{"title":"Multimodal Fusion Strategies for Physiological-emotion Analysis","authors":"Tenggan Zhang, Zhaopei Huang, Ruichen Li, Jinming Zhao, Qin Jin","doi":"10.1145/3475957.3484452","DOIUrl":"https://doi.org/10.1145/3475957.3484452","url":null,"abstract":"Physiological-emotion analysis is a novel aspect of automatic emotion analysis. It can support revealing a subject's emotional state, even if he/she consciously suppresses the emotional expression. In this paper, we present our solutions for the MuSe-Physio sub-challenge of Multimodal Sentiment Analysis (MuSe) 2021. The aim of this task is to predict the level of psycho-physiological arousal from combined audio-visual signals and the galvanic skin response (also known as Electrodermal Activity signals) of subjects under a highly stress-induced free speech scenario. In the scenarios, the speaker's emotion can be conveyed in different modalities including acoustic, visual, textual, and physiological signal modalities. Due to the complementarity of different modalities, the fusion of the multiple modalities has a large impact on emotion analysis. In this paper, we highlight two aspects of our solutions: 1) we explore various efficient low-level and high-level features from different modalities for this task, 2) we propose two effective multi-modal fusion strategies to make full use of the different modalities. Our solutions achieve the best CCC performance of 0.5728 on the challenge testing set, which significantly outperforms the baseline system with corresponding CCC of 0.4908. The experimental results show that our proposed various effective features and efficient fusion strategies have a strong generalization ability and can bring more robust performance.","PeriodicalId":313996,"journal":{"name":"Proceedings of the 2nd on Multimodal Sentiment Analysis Challenge","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129445766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
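Multimodal fusion strategies such as those described above often include a late-fusion step that weights per-modality predictions. A minimal sketch of CCC-weighted late fusion, offered only as one plausible variant; the weights and modality names are assumptions:

```python
import numpy as np

def late_fusion(preds: dict, val_ccc: dict) -> np.ndarray:
    """preds: modality -> (time,) prediction; val_ccc: modality -> dev-set CCC."""
    weights = {m: max(c, 0.0) for m, c in val_ccc.items()}
    total = sum(weights.values())
    # Weighted average, giving more trust to modalities that did well on dev.
    return sum(weights[m] / total * preds[m] for m in preds)

preds = {"audio": np.random.rand(300), "video": np.random.rand(300),
         "eda": np.random.rand(300)}
fused = late_fusion(preds, {"audio": 0.45, "video": 0.50, "eda": 0.30})
```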
Multimodal Emotion Recognition and Sentiment Analysis via Attention Enhanced Recurrent Model
Pub Date: 2021-10-15 | DOI: 10.1145/3475957.3484456
Licai Sun, Mingyu Xu, Zheng Lian, B. Liu, J. Tao, Meng Wang, Yuan Cheng
{"title":"Multimodal Emotion Recognition and Sentiment Analysis via Attention Enhanced Recurrent Model","authors":"Licai Sun, Mingyu Xu, Zheng Lian, B. Liu, J. Tao, Meng Wang, Yuan Cheng","doi":"10.1145/3475957.3484456","DOIUrl":"https://doi.org/10.1145/3475957.3484456","url":null,"abstract":"With the proliferation of user-generated videos in online websites, it becomes particularly important to achieve automatic perception and understanding of human emotion/sentiment from these videos. In this paper, we present our solutions to the MuSe-Wilder and MuSe-Sent sub-challenges in MuSe 2021 Multimodal Sentiment Analysis Challenge. MuSe-Wilder focuses on continuous emotion (i.e., arousal and valence) recognition while the task of MuSe-Sent concentrates on discrete sentiment classification. To this end, we first extract a variety of features from three common modalities (i.e., audio, visual, and text), including both low-level handcrafted features and high-level deep representations from supervised/unsupervised pre-trained models. Then, the long short-term memory recurrent neural network, as well as the self-attention mechanism is employed to model the complex temporal dependencies in the feature sequence. The concordance correlation coefficient (CCC) loss and F1-loss are used to guide continuous regression and discrete classification, respectively. To further boost the model's performance, we adopt late fusion to exploit complementary information from different modalities. Our proposed method achieves CCCs of 0.4117 and 0.6649 for arousal and valence respectively on the test set of MuSe-Wilder, which outperforms the baseline system (i.e., 0.3386 and 0.5974) by a large margin. For MuSe-Sent, F1-scores of 0.3614 and 0.4451 for arousal and valence are obtained, which also outperforms the baseline system significantly (i.e., 0.3512 and 0.3291). With these promising results, we ranked top3 in both sub-challenges.","PeriodicalId":313996,"journal":{"name":"Proceedings of the 2nd on Multimodal Sentiment Analysis Challenge","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127120791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 15
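For the turn-level MuSe-Sent setting described above, an LSTM with self-attention can pool a feature sequence into a single sentiment prediction. A minimal sketch along those lines; the layer sizes, class count, and pooling choice are illustrative, not the authors' model:

```python
import torch
import torch.nn as nn

class AttentiveLSTMClassifier(nn.Module):
    def __init__(self, in_dim: int, hid: int = 128, n_classes: int = 5):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hid, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hid, 1)        # additive attention scores
        self.head = nn.Linear(2 * hid, n_classes)

    def forward(self, x):                        # x: (batch, time, in_dim)
        h, _ = self.lstm(x)                      # (batch, time, 2*hid)
        weights = torch.softmax(self.attn(h), dim=1)
        pooled = (weights * h).sum(dim=1)        # attention-pooled turn embedding
        return self.head(pooled)                 # class logits per turn

# Example with an assumed 88-dimensional acoustic feature set.
logits = AttentiveLSTMClassifier(in_dim=88)(torch.randn(4, 200, 88))
```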
Getting Really Wild: Challenges and Opportunities of Real-World Multimodal Affect Detection
Pub Date: 2021-10-15 | DOI: 10.1145/3475957.3482900
S. D’Mello
{"title":"Getting Really Wild: Challenges and Opportunities of Real-World Multimodal Affect Detection","authors":"S. D’Mello","doi":"10.1145/3475957.3482900","DOIUrl":"https://doi.org/10.1145/3475957.3482900","url":null,"abstract":"Affect detection in the \"real\" wild - where people go about their daily routines in their homes and workplaces - is arguably a different problem than affect detection in the lab or in the \"quasi\" wild (e.g., YouTube videos). How will our affect detection systems hold up when put to the test in the real wild? Some in the Affective Computing community had an opportunity to address this question as part of the MOSAIC (Multimodal Objective Sensing to Assess Individuals with Context [1]) program which ran from 2017 to 2020. Results were sobering, but informative. I'll discuss those efforts with an emphasis on performance achieved, insights gleaned, challenges faced, and lessons learned.","PeriodicalId":313996,"journal":{"name":"Proceedings of the 2nd on Multimodal Sentiment Analysis Challenge","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126660868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Multi-modal Fusion for Continuous Emotion Recognition by Using Auto-Encoders
Pub Date: 2021-10-15 | DOI: 10.1145/3475957.3484455
Salam Hamieh, V. Heiries, Hussein Al Osman, C. Godin
{"title":"Multi-modal Fusion for Continuous Emotion Recognition by Using Auto-Encoders","authors":"Salam Hamieh, V. Heiries, Hussein Al Osman, C. Godin","doi":"10.1145/3475957.3484455","DOIUrl":"https://doi.org/10.1145/3475957.3484455","url":null,"abstract":"Human stress detection is of great importance for monitoring mental health. The Multimodal Sentiment Analysis Challenge (MuSe) 2021 focuses on emotion, physiological-emotion, and stress recognition as well as sentiment classification by exploiting several modalities. In this paper, we present our solution for the Muse-Stress sub-challenge. The target of this sub-challenge is continuous prediction of arousal and valence for people under stressful conditions where text transcripts, audio and video recordings are provided. To this end, we utilize bidirectional Long Short-Term Memory (LSTM) and Gated Recurrent Unit networks (GRU) to explore high-level and low-level features from different modalities. We employ Concordance Correlation Coefficient (CCC) as a loss function and evaluation metric for our model. To improve the unimodal predictions, we add difficulty indicators of the data obtained by using Auto-Encoders. Finally, we perform late fusion on our unimodal predictions in addition to the difficulty indicators to obtain our final predictions. With this approach, we achieve CCC of 0.4278 and 0.5951 for arousal and valence respectively on the test set, our submission to MuSe 2021 ranks in the top three for arousal, fourth for valence, and in top three for combined results.","PeriodicalId":313996,"journal":{"name":"Proceedings of the 2nd on Multimodal Sentiment Analysis Challenge","volume":"424 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132682585","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
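The "difficulty indicators obtained using auto-encoders" can be read, for illustration, as per-frame reconstruction errors that flag hard segments. A hedged sketch of that reading only; the architecture, normalisation, and how the indicator is consumed at fusion time are assumptions:

```python
import torch
import torch.nn as nn

class FeatureAutoencoder(nn.Module):
    def __init__(self, in_dim: int, latent: int = 32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                 nn.Linear(64, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 64), nn.ReLU(),
                                 nn.Linear(64, in_dim))

    def forward(self, x):
        return self.dec(self.enc(x))

def difficulty_indicator(model: FeatureAutoencoder, feats: torch.Tensor):
    # High reconstruction error marks frames the unimodal model is likely to
    # find hard; the indicator could be appended as an extra input at fusion.
    with torch.no_grad():
        err = ((model(feats) - feats) ** 2).mean(dim=-1)
    return err / (err.max() + 1e-8)   # normalised to [0, 1]
```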
New Directions in Emotion Theory
Pub Date: 2021-10-15 | DOI: 10.1145/3475957.3482901
Panagiotis Tzirakis
{"title":"New Directions in Emotion Theory","authors":"Panagiotis Tzirakis","doi":"10.1145/3475957.3482901","DOIUrl":"https://doi.org/10.1145/3475957.3482901","url":null,"abstract":"Emotional intelligence is a fundamental component towards a complete and natural interaction between human and machine. Towards this goal several emotion theories have been exploited in the affective computing domain. Along with the studies developed in the theories of emotion, there are two major approaches to characterize emotional models: categorical models and dimensional models. Whereas, categorical models indicate there are a few basic emotions that are independent on the race (e.g. Ekman's model), dimensional approaches suggest that emotions are not independent, but related to one another in a systematic manner (e.g. Circumplex of Affect). Although these models have been dominating in the affective computing research, recent studies in emotion theories have shown that these models only capture a small fraction of the variance of what people perceive. In this talk, I will present the new directions in emotion theory that can better capture the emotional behavior of individuals. First, I will discuss the statistical analysis behind key emotions that are conveyed in human vocalizations, speech prosody, and facial expressions, and how these relate to conventional categorical and dimensional models. Based on these new emotional models, I will describe new datasets we have collected at Hume AI, and show the different patterns captured when training deep neural network models.","PeriodicalId":313996,"journal":{"name":"Proceedings of the 2nd on Multimodal Sentiment Analysis Challenge","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116000126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Physiologically-Adapted Gold Standard for Arousal during Stress
Pub Date: 2021-07-27 | DOI: 10.1145/3475957.3484446
Alice Baird, Lukas Stappen, Lukas Christ, Lea Schumann, Eva-Maria Messner, Björn Schuller
{"title":"A Physiologically-Adapted Gold Standard for Arousal during Stress","authors":"Alice Baird, Lukas Stappen, Lukas Christ, Lea Schumann, Eva-Maria Messner, Björn Schuller","doi":"10.1145/3475957.3484446","DOIUrl":"https://doi.org/10.1145/3475957.3484446","url":null,"abstract":"Emotion is an inherently subjective psycho-physiological human state and to produce an agreed-upon representation (gold standard) for continuously perceived emotion requires time-consuming and costly training of multiple human annotators. With this in mind, there is strong evidence in the literature that physiological signals are an objective marker for states of emotion, particularly arousal. In this contribution, we utilise a multimodal dataset captured during a Trier Social Stress Test to explore the benefit of fusing physiological signals - Heartbeats per Minute ($BPM$), Electrodermal Activity (EDA), and Respiration-rate - for recognition of continuously perceived arousal utilising a Long Short-Term Memory, Recurrent Neural Network architecture, and various audio, video, and textual based features. We use the MuSe-Toolbox to create a gold standard that considers annotator delay and agreement weighting. An improvement in Concordance Correlation Coefficient (CCC) is seen across features sets when fusing EDA with arousal, compared to the arousal only gold standard results. Additionally, BERT-based textual features' results improved for arousal plus all physiological signals, obtaining up to .3344 CCC (.2118 CCC for arousal only). Multimodal fusion also improves CCC. Audio plus video features obtain up to .6157 CCC for arousal plus EDA, BPM.","PeriodicalId":313996,"journal":{"name":"Proceedings of the 2nd on Multimodal Sentiment Analysis Challenge","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123173955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
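The paper above fuses physiological signals with the annotation-based arousal trace to build an adapted gold standard. As a loose, hedged illustration only (the actual MuSe-Toolbox procedure involves annotator delay compensation and agreement weighting, which are not reproduced here), a sketch of weighting a z-normalised EDA trace into the arousal signal:

```python
import numpy as np

def zscore(x: np.ndarray) -> np.ndarray:
    return (x - x.mean()) / (x.std() + 1e-8)

def adapted_gold_standard(arousal: np.ndarray, eda: np.ndarray,
                          weight_physio: float = 0.5) -> np.ndarray:
    """Both inputs are assumed already resampled to the annotation frame rate;
    the 50/50 weighting is an assumption, not the toolbox default."""
    fused = (1 - weight_physio) * zscore(arousal) + weight_physio * zscore(eda)
    return zscore(fused)   # keep the fused trace on a comparable scale
```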