紧急呼叫中心对话中语音情感识别的变压器编码器和融合策略研究。

Companion Publication of the 2022 International Conference on Multimodal Interaction Pub Date : 2022-11-07 DOI:10.1145/3536220.3558038

Théo Deschamps-Berger, L. Lamel, L. Devillers

{"title":"紧急呼叫中心对话中语音情感识别的变压器编码器和融合策略研究。","authors":"Théo Deschamps-Berger, L. Lamel, L. Devillers","doi":"10.1145/3536220.3558038","DOIUrl":null,"url":null,"abstract":"There has been growing interest in using deep learning techniques to recognize emotions from speech. However, real-life emotion datasets collected in call centers are relatively rare and small, making the use of deep learning techniques quite challenging. This research focuses on the study of Transformer-based models to improve the speech emotion recognition of patients’ speech in French emergency call center dialogues. The experiments were conducted on a corpus called CEMO, which was collected in a French emergency call center. It includes telephone conversations with more than 800 callers and 6 agents. Four emotion classes were selected for these experiments: Anger, Fear, Positive and Neutral state. We compare different Transformer encoders based on the wav2vec2 and BERT models, and explore their fine-tuning as well as fusion of the encoders for emotion recognition from speech. Our objective is to explore how to use these pre-trained models to improve model robustness in the context of a real-life application. We show that the use of specific pre-trained Transformer encoders improves the model performance for emotion recognition in the CEMO corpus. The Unweighted Accuracy (UA) of the french pre-trained wav2vec2 adapted to our task is 73.1%, whereas the UA of our baseline model (Temporal CNN-LSTM without pre-training) is 55.8%. We also tested BERT encoders models: in particular FlauBERT obtained good performance for both manual 67.1% and automatic 67.9% transcripts. The late and model-level fusion of the speech and text models also improve performance (77.1% (late) - 76.9% (model-level)) compared to our best speech pre-trained model, 73.1% UA. In order to place our work in the scientific community, we also report results on the widely used IEMOCAP corpus with our best fusion strategy, 70.8% UA. Our results are promising for constructing more robust speech emotion recognition system for real-world applications.","PeriodicalId":186796,"journal":{"name":"Companion Publication of the 2022 International Conference on Multimodal Interaction","volume":"32 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":"{\"title\":\"Investigating Transformer Encoders and Fusion Strategies for Speech Emotion Recognition in Emergency Call Center Conversations.\",\"authors\":\"Théo Deschamps-Berger, L. Lamel, L. Devillers\",\"doi\":\"10.1145/3536220.3558038\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"There has been growing interest in using deep learning techniques to recognize emotions from speech. However, real-life emotion datasets collected in call centers are relatively rare and small, making the use of deep learning techniques quite challenging. This research focuses on the study of Transformer-based models to improve the speech emotion recognition of patients’ speech in French emergency call center dialogues. The experiments were conducted on a corpus called CEMO, which was collected in a French emergency call center. It includes telephone conversations with more than 800 callers and 6 agents. Four emotion classes were selected for these experiments: Anger, Fear, Positive and Neutral state. We compare different Transformer encoders based on the wav2vec2 and BERT models, and explore their fine-tuning as well as fusion of the encoders for emotion recognition from speech. Our objective is to explore how to use these pre-trained models to improve model robustness in the context of a real-life application. We show that the use of specific pre-trained Transformer encoders improves the model performance for emotion recognition in the CEMO corpus. The Unweighted Accuracy (UA) of the french pre-trained wav2vec2 adapted to our task is 73.1%, whereas the UA of our baseline model (Temporal CNN-LSTM without pre-training) is 55.8%. We also tested BERT encoders models: in particular FlauBERT obtained good performance for both manual 67.1% and automatic 67.9% transcripts. The late and model-level fusion of the speech and text models also improve performance (77.1% (late) - 76.9% (model-level)) compared to our best speech pre-trained model, 73.1% UA. In order to place our work in the scientific community, we also report results on the widely used IEMOCAP corpus with our best fusion strategy, 70.8% UA. Our results are promising for constructing more robust speech emotion recognition system for real-world applications.\",\"PeriodicalId\":186796,\"journal\":{\"name\":\"Companion Publication of the 2022 International Conference on Multimodal Interaction\",\"volume\":\"32 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-11-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"8\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Companion Publication of the 2022 International Conference on Multimodal Interaction\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3536220.3558038\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Companion Publication of the 2022 International Conference on Multimodal Interaction","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3536220.3558038","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 8

摘要

人们对使用深度学习技术从语音中识别情绪的兴趣越来越大。然而，在呼叫中心收集的真实情感数据集相对较少且规模较小，这使得使用深度学习技术相当具有挑战性。本研究主要研究基于transformer的模型，以提高法语急救呼叫中心对话中患者语音的语音情感识别。实验是在一个名为CEMO的语料库上进行的，该语料库是在法国紧急呼叫中心收集的。它包括与800多名呼叫者和6名代理人的电话交谈。实验选取了四个情绪类别:愤怒、恐惧、积极和中性状态。我们比较了基于wav2vec2和BERT模型的不同Transformer编码器，并探讨了它们的微调和融合编码器用于语音情感识别。我们的目标是探索如何使用这些预训练的模型来提高现实应用环境中的模型鲁棒性。我们表明，使用特定的预训练Transformer编码器可以提高CEMO语料库中情感识别的模型性能。法国预训练的wav2vec2适应我们的任务的未加权精度(UA)为73.1%，而我们的基线模型(未经预训练的Temporal CNN-LSTM)的UA为55.8%。我们还测试了BERT编码器模型:特别是福楼拜在手动67.1%和自动67.9%的转录本中都获得了良好的表现。与我们最好的语音预训练模型(73.1% UA)相比，语音和文本模型的后期和模型级融合也提高了性能(77.1%(后期)- 76.9%(模型级))。为了把我们的工作放在科学界，我们还报告了广泛使用的IEMOCAP语料库的结果，我们的最佳融合策略为70.8% UA。我们的研究结果有望为构建更强大的语音情感识别系统提供现实应用。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Investigating Transformer Encoders and Fusion Strategies for Speech Emotion Recognition in Emergency Call Center Conversations.

There has been growing interest in using deep learning techniques to recognize emotions from speech. However, real-life emotion datasets collected in call centers are relatively rare and small, making the use of deep learning techniques quite challenging. This research focuses on the study of Transformer-based models to improve the speech emotion recognition of patients’ speech in French emergency call center dialogues. The experiments were conducted on a corpus called CEMO, which was collected in a French emergency call center. It includes telephone conversations with more than 800 callers and 6 agents. Four emotion classes were selected for these experiments: Anger, Fear, Positive and Neutral state. We compare different Transformer encoders based on the wav2vec2 and BERT models, and explore their fine-tuning as well as fusion of the encoders for emotion recognition from speech. Our objective is to explore how to use these pre-trained models to improve model robustness in the context of a real-life application. We show that the use of specific pre-trained Transformer encoders improves the model performance for emotion recognition in the CEMO corpus. The Unweighted Accuracy (UA) of the french pre-trained wav2vec2 adapted to our task is 73.1%, whereas the UA of our baseline model (Temporal CNN-LSTM without pre-training) is 55.8%. We also tested BERT encoders models: in particular FlauBERT obtained good performance for both manual 67.1% and automatic 67.9% transcripts. The late and model-level fusion of the speech and text models also improve performance (77.1% (late) - 76.9% (model-level)) compared to our best speech pre-trained model, 73.1% UA. In order to place our work in the scientific community, we also report results on the widely used IEMOCAP corpus with our best fusion strategy, 70.8% UA. Our results are promising for constructing more robust speech emotion recognition system for real-world applications.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Companion Publication of the 2022 International Conference on Multimodal Interaction

自引率

0.00%

发文量