Proceedings of the 3rd International on Multimodal Sentiment Analysis Workshop and Challenge: Latest Publications

Towards Multimodal Prediction of Time-continuous Emotion using Pose Feature Engineering and a Transformer Encoder
Ho-min Park, Ilho Yun, Ajit Kumar, A. Singh, Bong Jun Choi, Dhananjay Singh, W. D. Neve
{"title":"Towards Multimodal Prediction of Time-continuous Emotion using Pose Feature Engineering and a Transformer Encoder","authors":"Ho-min Park, Ilho Yun, Ajit Kumar, A. Singh, Bong Jun Choi, Dhananjay Singh, W. D. Neve","doi":"10.1145/3551876.3554807","DOIUrl":"https://doi.org/10.1145/3551876.3554807","url":null,"abstract":"MuSe-Stress 2022 aims at building sequence regression models for predicting valence and physiological arousal levels of persons who are facing stressful conditions. To that end, audio-visual recordings, transcripts, and physiological signals can be leveraged. In this paper, we describe the approach we developed for Muse-Stress 2022. Specifically, we engineered a new pose feature that captures the movement of human body keypoints. We also trained a Long Short-Term Memory (LSTM) network and a Transformer encoder on different types of feature sequences and different combinations thereof. In addition, we adopted a two-pronged strategy to tune the hyperparameters that govern the different ways the available features can be used. Finally, we made use of late fusion to combine the predictions obtained for the different unimodal features. Our experimental results show that the newly engineered pose feature obtains the second highest development CCC among the seven unimodal features available. Furthermore, our Transformer encoder obtains the highest development CCC for five out of fourteen possible combinations of features and emotion dimensions, with this number increasing from five to nine when performing late fusion. In addition, when searching for optimal hyperparameter settings, our two-pronged hyperparameter tuning strategy leads to noticeable improvements in maximum development CCC, especially when the underlying models are based on an LSTM. In summary, we can conclude that our approach is able to achieve a test CCC of 0.6196 and 0.6351 for arousal and valence, respectively, securing a Top-3 rank in Muse-Stress 2022.","PeriodicalId":434392,"journal":{"name":"Proceedings of the 3rd International on Multimodal Sentiment Analysis Workshop and Challenge","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134182891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Emotional Reaction Analysis based on Multi-Label Graph Convolutional Networks and Dynamic Facial Expression Recognition Transformer
Kexin Wang, Zheng Lian, Licai Sun, B. Liu, Jianhua Tao, Yin Fan
{"title":"Emotional Reaction Analysis based on Multi-Label Graph Convolutional Networks and Dynamic Facial Expression Recognition Transformer","authors":"Kexin Wang, Zheng Lian, Licai Sun, B. Liu, Jianhua Tao, Yin Fan","doi":"10.1145/3551876.3554810","DOIUrl":"https://doi.org/10.1145/3551876.3554810","url":null,"abstract":"Automatically predicting and understanding human emotional reactions have wide applications in human-computer interaction. In this paper, we present our solutions to the MuSe-Reaction sub-challenge in MuSe 2022. The task of this sub-challenge is to predict the intensity of 7 emotional expressions from human reactions to a wide range of emotionally evocative stimuli. Specifically, we design an end-to-end model, which is composed of a Spatio-Temporal Transformer for dynamic facial representation learning and a multi-label graph convolutional network for emotion dependency modeling.We also explore the effects of a temporal model with a variety of features from acoustic and visual modalities. Our proposed method achieves mean Pearson's correlation coefficient of 0.3375 on the test set of MuSe-Reaction, which outperforms the baseline system(i.e., 0.2801) by a large margin.","PeriodicalId":434392,"journal":{"name":"Proceedings of the 3rd International on Multimodal Sentiment Analysis Workshop and Challenge","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130605203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
Integrating Cross-modal Interactions via Latent Representation Shift for Multi-modal Humor Detection
Chengxin Chen, Pengyuan Zhang
{"title":"Integrating Cross-modal Interactions via Latent Representation Shift for Multi-modal Humor Detection","authors":"Chengxin Chen, Pengyuan Zhang","doi":"10.1145/3551876.3554805","DOIUrl":"https://doi.org/10.1145/3551876.3554805","url":null,"abstract":"Multi-modal sentiment analysis has been an active research area and has attracted increasing attention from multi-disciplinary communities. However, it is still challenging to fuse the information from different modalities in an efficient way. In prior studies, the late fusion strategy has been commonly adopted due to its simplicity and efficacy. Unfortunately, it failed to model the interactions across different modalities. In this paper, we propose a transformer-based hierarchical framework to effectively model both the intrinsic semantics and cross-modal interactions of the relevant modalities. Specifically, the features from each modality are first encoded via standard transformers. Later, the cross-modal interactions from one modality to other modalities are calculated using cross-modal transformers. The derived intrinsic semantics and cross-modal interactions are used to determine the latent representation shift of a particular modality. We evaluate the proposed approach on the MuSe-Humor sub-challenge of Multi-modal Sentiment Analysis Challenge (MuSe) 2022. Experimental results show that an Area Under the Curve (AUC) of 0.9065 can be achieved on the test set of MuSe-Humor. With the promising results, our best submission ranked first place in the sub-challenge.","PeriodicalId":434392,"journal":{"name":"Proceedings of the 3rd International on Multimodal Sentiment Analysis Workshop and Challenge","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130753266","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
Bridging the Gap: End-to-End Domain Adaptation for Emotional Vocalization Classification using Adversarial Learning
Dominik Schiller, Silvan Mertes, P. V. Rijn, E. André
{"title":"Bridging the Gap: End-to-End Domain Adaptation for Emotional Vocalization Classification using Adversarial Learning","authors":"Dominik Schiller, Silvan Mertes, P. V. Rijn, E. André","doi":"10.1145/3551876.3554816","DOIUrl":"https://doi.org/10.1145/3551876.3554816","url":null,"abstract":"Good classification performance on a hold-out partition can only be expected if the data distribution of the test data matches the training data. However, in many real-life use cases, this constraint is not met. In this work, we explore if it is feasible to use existing methods of an adversarial domain transfer to bridge this inter-domain gap. To do so, we use a CycleGAN that was trained on converting between the domains. We demonstrate that the quality of the generated data has a substantial impact on the effectiveness of the domain adaptation, and propose an additional step to overcome this problem. To evaluate the approach, we classify emotions in female and male vocalizations. Furthermore, we show that our model successfully approximates the distribution of acoustic features and that our approach can be employed to improve emotion classification performance. Since the presented approach is domain and feature independent it can therefore be applied to any classification task.","PeriodicalId":434392,"journal":{"name":"Proceedings of the 3rd International on Multimodal Sentiment Analysis Workshop and Challenge","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121472135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
ViPER
L. Vaiani, Moreno La Quatra, Luca Cagliero, P. Garza
{"title":"ViPER","authors":"L. Vaiani, Moreno La Quatra, Luca Cagliero, P. Garza","doi":"10.1145/3551876.3554806","DOIUrl":"https://doi.org/10.1145/3551876.3554806","url":null,"abstract":"Recognizing human emotions from videos requires a deep understanding of the underlying multimodal sources, including images, audio, and text. Since the input data sources are highly variable across different modality combinations, leveraging multiple modalities often requires ad hoc fusion networks. To predict the emotional arousal of a person reacting to a given video clip we present ViPER, a multimodal architecture leveraging a modality-agnostic transformer based model to combine video frames, audio recordings, and textual annotations. Specifically, it relies on a modality-agnostic late fusion network which makes ViPER easily adaptable to different modalities. The experiments carried out on the Hume-Reaction datasets of the MuSe-Reaction challenge confirm the effectiveness of the proposed approach.","PeriodicalId":434392,"journal":{"name":"Proceedings of the 3rd International on Multimodal Sentiment Analysis Workshop and Challenge","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122467788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
A Personalised Approach to Audiovisual Humour Recognition and its Individual-level Fairness
Alexander Kathan, S. Amiriparian, Lukas Christ, Andreas Triantafyllopoulos, Niklas Müller, Andreas König, B. Schuller
{"title":"A Personalised Approach to Audiovisual Humour Recognition and its Individual-level Fairness","authors":"Alexander Kathan, S. Amiriparian, Lukas Christ, Andreas Triantafyllopoulos, Niklas Müller, Andreas König, B. Schuller","doi":"10.1145/3551876.3554800","DOIUrl":"https://doi.org/10.1145/3551876.3554800","url":null,"abstract":"Humour is one of the most subtle and contextualised behavioural patterns to study in social psychology and has a major impact on human emotions, social cognition, behaviour, and relations. Consequently, an automatic understanding of humour is crucial and challenging for a naturalistic human-robot interaction. Recent artificial intelligence (AI)-based methods have shown progress in multimodal humour recognition. However, such methods lack a mechanism in adapting to each individual's characteristics, resulting in a decreased performance, e.g., due to different facial expressions. Further, these models are faced with generalisation problems when being applied for recognition of different styles of humour. We aim to address these challenges by introducing a novel multimodal humour recognition approach in which the models are personalised for each individual in the Passau Spontaneous Football Coach Humour (Passau-SFCH) dataset. We begin by training a model on all individuals in the dataset. Subsequently, we fine-tune all layers of this model with the data from each individual. Finally, we use these models for the prediction task. Using the proposed personalised models, it is possible to significantly (two-tailed t-test, p < 0.05) outperform the non-personalised models. In particular, the mean Area Under the Curve (AUC) is increased from .7573 to .7731 for the audio modality, and from .9203 to .9256 for the video modality. In addition, we apply a weighted late fusion approach which increases the overall performance to an AUC of .9308, demonstrating the complementarity of the features. Finally, we evaluate the individual-level fairness of our approach and show which group of subjects benefits most of using personalisation.","PeriodicalId":434392,"journal":{"name":"Proceedings of the 3rd International on Multimodal Sentiment Analysis Workshop and Challenge","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134494823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
The Dos and Don'ts of Affect Analysis
S. Amiriparian
{"title":"The Dos and Don'ts of Affect Analysis","authors":"S. Amiriparian","doi":"10.1145/3551876.3554815","DOIUrl":"https://doi.org/10.1145/3551876.3554815","url":null,"abstract":"As an inseparable and crucial component of communication affects play a substantial role in human-device and human-human interaction. They convey information about a person's specific traits and states [1, 4, 5], how one feels about the aims of a conversation, the trustworthiness of one's verbal communication [3], and the degree of adaptation in interpersonal speech [2]. This multifaceted nature of human affects poses a great challenge when it comes to applying machine learning systems for their automatic recognition and understanding. Contemporary self-supervised learning architectures such as Transformers, which define state-of-the-art (SOTA) in this area, have shown noticeable deficits in terms of explainability, while more conventional, non-deep machine learning methods, which provide more transparency, often fall (far) behind SOTA systems. So, is it possible to get the best of these two 'worlds'? And more importantly, at what price? In this talk, I provide a set of Dos and Don'ts guidelines for addressing affective computing tasks w. r. t. (i) preserving privacy for affective data and individuals/groups, (ii) being efficient in computing such data in a transparent way, (iii) ensuring reproducibility of the results, (iv) knowing the differences between causation and correlation, and (v) properly applying social and ethical protocols.","PeriodicalId":434392,"journal":{"name":"Proceedings of the 3rd International on Multimodal Sentiment Analysis Workshop and Challenge","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127922781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Uncovering the Nuanced Structure of Expressive Behavior Across Modalities
Alan S. Cowen
{"title":"Uncovering the Nuanced Structure of Expressive Behavior Across Modalities","authors":"Alan S. Cowen","doi":"10.1145/3551876.3554814","DOIUrl":"https://doi.org/10.1145/3551876.3554814","url":null,"abstract":"Guided by semantic space theory, large-scale computational studies have advanced our understanding of the structure and function of expressive behavior. I will integrate findings from experimental studies of facial expression (N=19,656), vocal bursts (N=12,616), speech prosody (N=20,109), multimodal reactions (N=8,056), and an ongoing study of dyadic interactions (N=1,000+). These studies combine methods from psychology and computer science to yield new insights into what expressive behaviors signal, how they are perceived, and how they shape social interaction. Using machine learning to extract cross-cultural dimensions of behavior while minimizing biases due to demographics and context, we arrive at objective measures of the structural dimensions that make up human expression. Expressions are consistently found to be high-dimensional and blended, with their meaning across cultures being efficiently conceptualized in terms of a wide range of specific emotion concepts. Altogether, these findings generate a comprehensive new atlas of expressive behavior, which I will explore through a variety of visualizations. This new taxonomy departs from models such as the basic six and affective circumplex, suggesting a new way forward for expression understanding and sentiment analysis.","PeriodicalId":434392,"journal":{"name":"Proceedings of the 3rd International on Multimodal Sentiment Analysis Workshop and Challenge","volume":"71 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114157384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Multimodal Temporal Attention in Sentiment Analysis
Yu He, Licai Sun, Zheng Lian, B. Liu, Jianhua Tao, Meng Wang, Yuan Cheng
{"title":"Multimodal Temporal Attention in Sentiment Analysis","authors":"Yu He, Licai Sun, Zheng Lian, B. Liu, Jianhua Tao, Meng Wang, Yuan Cheng","doi":"10.1145/3551876.3554811","DOIUrl":"https://doi.org/10.1145/3551876.3554811","url":null,"abstract":"In this paper, we present the solution to the MuSe-Stress sub-challenge in the MuSe 2022 Multimodal Sentiment Analysis Challenge. The task of MuSe-Stress is to predict a time-continuous value (i.e., physiological arousal and valence) based on multimodal data of audio, visual, text, and physiological signals. In this competition, we find that multimodal fusion has good performance for physiological arousal on the validation set, but poor prediction performance on the test set. We believe that problem may be due to the over-fitting caused by the model's over-reliance on some specific modal features. To deal with the above problem, we propose Multimodal Temporal Attention (MMTA), which considers the temporal effects of all modalities on each unimodal branch, realizing the interaction between unimodal branches and adaptive inter-modal balance. The concordance correlation coefficient (CCC) of physiological arousal and valence are 0.6818 with MMTA and 0.6841 with early fusion, respectively, both ranking Top 1, outperforming the baseline system by a large margin (i.e., 0.4761 and 0.4931) on the test set.","PeriodicalId":434392,"journal":{"name":"Proceedings of the 3rd International on Multimodal Sentiment Analysis Workshop and Challenge","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128276502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Comparing Biosignal and Acoustic feature Representation for Continuous Emotion Recognition
Sarthak Yadav, Tilak Purohit, Z. Mostaani, Bogdan Vlasenko, M. Magimai.-Doss
{"title":"Comparing Biosignal and Acoustic feature Representation for Continuous Emotion Recognition","authors":"Sarthak Yadav, Tilak Purohit, Z. Mostaani, Bogdan Vlasenko, M. Magimai.-Doss","doi":"10.1145/3551876.3554812","DOIUrl":"https://doi.org/10.1145/3551876.3554812","url":null,"abstract":"Automatic recognition of human emotion has a wide range of applications. Human emotions can be identified across different modalities, such as biosignal, speech, text, and mimics. This paper is focusing on time-continuous prediction of level of valence and psycho-physiological arousal. In that regard, we investigate, (a) the use of different feature embeddings obtained from neural networks pre-trained on different speech tasks (e.g., phone classification, speech emotion recognition) and self-supervised neural networks, (b) estimation of arousal and valence from physiological signals in an end-to-end manner and (c) combining different neural embeddings. Our investigations on the MuSe-Stress sub-challenge shows that (a) the embeddings extracted from physiological signals using CNNs trained in an end-to-end manner improves over the baseline approach of modeling physiological signals, (b) neural embeddings obtained from phone classification neural network and speech emotion recognition neural network trained on auxiliary language data sets yield improvement over baseline systems purely trained on the target data, and (c) task-specific neural embeddings yield improved performance over self-supervised neural embeddings for both arousal and valence. Our best performing system on test-set surpass the DeepSpectrum baseline (combined score) by a relative 7.7% margin","PeriodicalId":434392,"journal":{"name":"Proceedings of the 3rd International on Multimodal Sentiment Analysis Workshop and Challenge","volume":"11 7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125764651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4