Luis Felipe Parra-Gallego, Tomás Arias-Vergara, Juan Rafael Orozco-Arroyave

Digital Signal Processing, Volume 156, Article 104820. Published 2024-10-28. DOI: 10.1016/j.dsp.2024.104820
Multimodal evaluation of customer satisfaction from voicemails using speech and language representations
Customer satisfaction (CS) evaluation in call centers is essential for assessing service quality but commonly relies on human evaluators. Automatic evaluation systems can perform CS analyses at scale, enabling the evaluation of much larger datasets. This paper focuses on CS analysis through a multimodal approach that employs speech and language representations derived from real-world voicemails. Additionally, given the similarity between evaluating a provided service (which may elicit different emotions in customers) and automatically classifying emotions in speech, we also explore emotion recognition on the well-known IEMOCAP corpus, which comprises four classes corresponding to different emotional states. We combine a language representation based on word embeddings processed by a CNN-LSTM model with three self-supervised learning (SSL) speech encoders: Wav2Vec2.0, HuBERT, and WavLM. A bidirectional alignment network based on attention mechanisms synchronizes the speech and language representations, and three different fusion strategies are explored. According to our results, the GGF model outperformed both unimodal and other multimodal methods in the 4-class emotion recognition task on the IEMOCAP dataset and in the binary CS classification task on the KONECTADB dataset. The study also demonstrated the superior performance of our methodology compared to previous work on KONECTADB in both unimodal and multimodal settings.
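The abstract does not detail the bidirectional alignment network. As a rough illustration only, here is a minimal one-directional cross-attention sketch in NumPy, assuming scaled dot-product attention and hypothetical shapes for the frame-level speech features and word embeddings (the actual model is bidirectional and learned end-to-end):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_align(speech, text):
    """Align speech frames to word tokens via scaled dot-product attention.

    speech: (T_s, d) frame-level speech embeddings (e.g., from an SSL encoder)
    text:   (T_w, d) word-level text embeddings
    Returns (T_w, d): for each word, an attention-weighted summary of the frames.
    """
    d = speech.shape[1]
    scores = text @ speech.T / np.sqrt(d)   # (T_w, T_s) word-frame similarities
    weights = softmax(scores, axis=1)       # each word attends over all frames
    return weights @ speech                 # (T_w, d) aligned speech representation

# Hypothetical dimensions for illustration.
rng = np.random.default_rng(0)
speech = rng.standard_normal((50, 16))  # 50 frames of speech features
text = rng.standard_normal((8, 16))     # 8 word embeddings
aligned = cross_attention_align(speech, text)
print(aligned.shape)  # (8, 16)
```

A bidirectional variant would additionally let each speech frame attend over the word tokens, producing a text representation aligned to the speech time axis.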
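The abstract does not expand the acronym GGF or specify the three fusion strategies. The following is a generic gated-fusion sketch, a common pattern in multimodal classification, with entirely hypothetical weights, shown only to illustrate how a learned gate can blend utterance-level speech and text embeddings:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(h_speech, h_text, W, b):
    """Fuse speech and text vectors with a learned scalar gate.

    h_speech, h_text: (d,) utterance-level modality embeddings
    W: (2*d,) gate weights, b: scalar bias (learned in a real model)
    Returns (d,): a convex combination of the two modalities.
    """
    g = sigmoid(np.concatenate([h_speech, h_text]) @ W + b)  # gate in (0, 1)
    return g * h_speech + (1.0 - g) * h_text

# Hypothetical dimensions and randomly initialized gate parameters.
rng = np.random.default_rng(1)
d = 16
h_s, h_t = rng.standard_normal(d), rng.standard_normal(d)
W, b = rng.standard_normal(2 * d) * 0.1, 0.0
fused = gated_fusion(h_s, h_t, W, b)
print(fused.shape)  # (16,)
```

Because the gate is a scalar in (0, 1), each element of the fused vector lies between the corresponding speech and text elements; a per-dimension gate (a vector `g`) is another common choice.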
About the journal:
Digital Signal Processing: A Review Journal is one of the oldest and most established journals in the field of signal processing, yet it aims to be the most innovative. The Journal invites top-quality research articles at the frontiers of research in all aspects of signal processing. Our objective is to provide a platform for the publication of ground-breaking research in signal processing with both academic and industrial appeal.
The journal has a special emphasis on statistical signal processing methodology such as Bayesian signal processing, and encourages articles on emerging applications of signal processing such as:
• big data
• machine learning
• internet of things
• information security
• systems biology and computational biology
• financial time series analysis
• autonomous vehicles
• quantum computing
• neuromorphic engineering
• human-computer interaction and intelligent user interfaces
• environmental signal processing
• geophysical signal processing, including seismic signal processing
• chemoinformatics and bioinformatics
• audio, visual and performance arts
• disaster management and prevention
• renewable energy