Fusion of Acoustic and Linguistic Information using Supervised Autoencoder for Improved Emotion Recognition

Bogdan Vlasenko, R. Prasad, M. Magimai.-Doss
{"title":"基于监督自编码器的声音和语言信息融合改进情绪识别","authors":"Bogdan Vlasenko, R. Prasad, M. Magimai.-Doss","doi":"10.1145/3475957.3484448","DOIUrl":null,"url":null,"abstract":"Automatic recognition of human emotion has a wide range of applications and has always attracted increasing attention. Expressions of human emotions can apparently be identified across different modalities of communication, such as speech, text, mimics, etc. The \"Multimodal Sentiment Analysis in Real-life Media' (MuSe) 2021 challenge provides an environment to develop new techniques to recognize human emotions or sentiments using multiple modalities (audio, video, and text) over in-the-wild data. The challenge encourages to jointly model the information across audio, video and text modalities, for improving emotion recognition. The present paper describes our attempt towards the MuSe-Sent task in the challenge. The goal of the sub-challenge is to perform turn-level prediction of emotions within the arousal and valence dimensions. In the paper, we investigate different approaches to optimally fuse linguistic and acoustic information for emotion recognition systems. The proposed systems employ features derived from these modalities, and uses different deep learning architectures to explore their cross-dependencies. Wide range of acoustic and linguistic features provided by organizers and recently established acoustic embedding wav2vec 2.0 are used for modeling the inherent emotions. In this paper we compare discriminative characteristics of hand-crafted and data-driven acoustic features in a context of emotional classification in arousal and valence dimensions. Ensemble based classifiers were compared with advanced supervised autoendcoder (SAE) technique with Bayesian Optimizer hyperparameter tuning approach. Comparison of uni- and bi-modal classification techniques showed that joint modeling of acoustic and linguistic cues could improve classification performance compared to individual modalities. Experimental results show improvement over the proposed baseline system, which focuses on fusion of acoustic and text based information, on the test set evaluation.","PeriodicalId":313996,"journal":{"name":"Proceedings of the 2nd on Multimodal Sentiment Analysis Challenge","volume":"107 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Fusion of Acoustic and Linguistic Information using Supervised Autoencoder for Improved Emotion Recognition\",\"authors\":\"Bogdan Vlasenko, R. Prasad, M. Magimai.-Doss\",\"doi\":\"10.1145/3475957.3484448\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Automatic recognition of human emotion has a wide range of applications and has always attracted increasing attention. Expressions of human emotions can apparently be identified across different modalities of communication, such as speech, text, mimics, etc. The \\\"Multimodal Sentiment Analysis in Real-life Media' (MuSe) 2021 challenge provides an environment to develop new techniques to recognize human emotions or sentiments using multiple modalities (audio, video, and text) over in-the-wild data. The challenge encourages to jointly model the information across audio, video and text modalities, for improving emotion recognition. The present paper describes our attempt towards the MuSe-Sent task in the challenge. 
The goal of the sub-challenge is to perform turn-level prediction of emotions within the arousal and valence dimensions. In the paper, we investigate different approaches to optimally fuse linguistic and acoustic information for emotion recognition systems. The proposed systems employ features derived from these modalities, and uses different deep learning architectures to explore their cross-dependencies. Wide range of acoustic and linguistic features provided by organizers and recently established acoustic embedding wav2vec 2.0 are used for modeling the inherent emotions. In this paper we compare discriminative characteristics of hand-crafted and data-driven acoustic features in a context of emotional classification in arousal and valence dimensions. Ensemble based classifiers were compared with advanced supervised autoendcoder (SAE) technique with Bayesian Optimizer hyperparameter tuning approach. Comparison of uni- and bi-modal classification techniques showed that joint modeling of acoustic and linguistic cues could improve classification performance compared to individual modalities. Experimental results show improvement over the proposed baseline system, which focuses on fusion of acoustic and text based information, on the test set evaluation.\",\"PeriodicalId\":313996,\"journal\":{\"name\":\"Proceedings of the 2nd on Multimodal Sentiment Analysis Challenge\",\"volume\":\"107 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-10-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2nd on Multimodal Sentiment Analysis Challenge\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3475957.3484448\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2nd on Multimodal Sentiment Analysis Challenge","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3475957.3484448","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 5

Abstract

Automatic recognition of human emotion has a wide range of applications and continues to attract growing attention. Human emotions can be identified across different modalities of communication, such as speech, text, and facial expressions. The "Multimodal Sentiment Analysis in Real-life Media" (MuSe) 2021 challenge provides an environment for developing new techniques that recognize human emotions or sentiments from multiple modalities (audio, video, and text) over in-the-wild data. The challenge encourages joint modeling of information across the audio, video, and text modalities to improve emotion recognition. This paper describes our contribution to the MuSe-Sent task of the challenge, whose goal is turn-level prediction of emotions in the arousal and valence dimensions. We investigate different approaches to optimally fuse linguistic and acoustic information for emotion recognition. The proposed systems employ features derived from these modalities and use different deep learning architectures to explore their cross-dependencies. A wide range of acoustic and linguistic features provided by the organizers, as well as the recently established wav2vec 2.0 acoustic embeddings, are used to model the underlying emotions. We compare the discriminative characteristics of hand-crafted and data-driven acoustic features in the context of emotion classification along the arousal and valence dimensions. Ensemble-based classifiers are compared with a supervised autoencoder (SAE) technique whose hyperparameters are tuned with a Bayesian optimization approach. A comparison of uni- and bi-modal classification techniques shows that joint modeling of acoustic and linguistic cues improves classification performance over the individual modalities. On the test set evaluation, experimental results show an improvement over the proposed baseline system through the fusion of acoustic and text-based information.
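The abstract gives no implementation details, so the sketch below only illustrates the core idea of a supervised autoencoder (SAE) fusing acoustic and linguistic features: pooled acoustic embeddings (e.g. wav2vec 2.0) and text embeddings are concatenated, an encoder-decoder pair is trained with a reconstruction loss, and a classification head on the latent code is trained jointly on turn-level arousal or valence labels. The use of PyTorch, the 768-dimensional inputs, the layer sizes, the 5-class targets, and the loss weighting alpha are all illustrative assumptions rather than the authors' configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SupervisedAutoencoder(nn.Module):
    def __init__(self, acoustic_dim=768, linguistic_dim=768,
                 latent_dim=128, num_classes=5):
        super().__init__()
        input_dim = acoustic_dim + linguistic_dim
        # Encoder compresses the concatenated bimodal feature vector.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 512), nn.ReLU(),
            nn.Linear(512, latent_dim), nn.ReLU())
        # Decoder reconstructs the input (unsupervised objective).
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, input_dim))
        # Classification head predicts turn-level arousal or valence
        # classes from the latent code (supervised objective).
        self.classifier = nn.Linear(latent_dim, num_classes)

    def forward(self, acoustic, linguistic):
        x = torch.cat([acoustic, linguistic], dim=-1)
        z = self.encoder(x)
        return self.decoder(z), self.classifier(z), x

def sae_loss(x_hat, x, logits, labels, alpha=0.5):
    # Joint objective: reconstruction plus classification,
    # mixed by an (assumed) weighting coefficient alpha.
    recon = F.mse_loss(x_hat, x)
    clf = F.cross_entropy(logits, labels)
    return alpha * recon + (1 - alpha) * clf

# Usage with random turn-level feature vectors (batch of 4 turns).
model = SupervisedAutoencoder()
acoustic = torch.randn(4, 768)        # e.g. pooled wav2vec 2.0 embeddings
linguistic = torch.randn(4, 768)      # e.g. pooled text embeddings
labels = torch.randint(0, 5, (4,))    # turn-level sentiment class labels
x_hat, logits, x = model(acoustic, linguistic)
loss = sae_loss(x_hat, x, logits, labels)
loss.backward()

The reconstruction term acts as a regularizer on the shared latent space, which is a common motivation for preferring an SAE over a plain feed-forward classifier; the Bayesian hyperparameter optimization mentioned in the abstract would typically search over quantities such as latent_dim and alpha.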