Luis Felipe Parra-Gallego, Tomás Arias-Vergara, Juan Rafael Orozco-Arroyave

Digital Signal Processing, Volume 156, Article 104820. Published 2024-10-28. DOI: 10.1016/j.dsp.2024.104820
Multimodal evaluation of customer satisfaction from voicemails using speech and language representations
Customer satisfaction (CS) evaluation in call centers is essential for assessing service quality but commonly relies on human evaluators. Automatic evaluation systems can perform CS analyses at scale, enabling the evaluation of much larger datasets. This paper focuses on CS analysis through a multimodal approach that employs speech and language representations derived from real-world voicemails. Additionally, given the similarity between evaluating a provided service (which may elicit different emotions in customers) and automatically classifying emotions in speech, we also explore emotion recognition on the well-known IEMOCAP corpus, which comprises four classes corresponding to different emotional states. We combine a language representation based on word embeddings processed by a CNN-LSTM model with three self-supervised learning (SSL) speech encoders: Wav2Vec2.0, HuBERT, and WavLM. A bidirectional alignment network based on attention mechanisms synchronizes the speech and language representations, and three different fusion strategies are explored. According to our results, the GGF model outperformed both unimodal and other multimodal methods in the 4-class emotion recognition task on the IEMOCAP dataset and in the binary CS classification task on the KONECTADB dataset. The study also demonstrated the superior performance of our methodology compared to previous work on KONECTADB in both unimodal and multimodal settings.
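The abstract does not detail the bidirectional alignment network. As a rough illustration only, here is a minimal one-directional cross-attention sketch in NumPy, assuming scaled dot-product attention and hypothetical shapes for the frame-level speech features and word embeddings (the actual model is bidirectional and learned end-to-end):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_align(speech, text):
    """Align speech frames to word tokens via scaled dot-product attention.

    speech: (T_s, d) frame-level speech embeddings (e.g., from an SSL encoder)
    text:   (T_w, d) word-level text embeddings
    Returns (T_w, d): for each word, an attention-weighted summary of the frames.
    """
    d = speech.shape[1]
    scores = text @ speech.T / np.sqrt(d)   # (T_w, T_s) word-frame similarities
    weights = softmax(scores, axis=1)       # each word attends over all frames
    return weights @ speech                 # (T_w, d) aligned speech representation

# Hypothetical dimensions for illustration.
rng = np.random.default_rng(0)
speech = rng.standard_normal((50, 16))  # 50 frames of speech features
text = rng.standard_normal((8, 16))     # 8 word embeddings
aligned = cross_attention_align(speech, text)
print(aligned.shape)  # (8, 16)
```

A bidirectional variant would additionally let each speech frame attend over the word tokens, producing a text representation aligned to the speech time axis.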
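The abstract does not expand the acronym GGF or specify the three fusion strategies. The following is a generic gated-fusion sketch, a common pattern in multimodal classification, with entirely hypothetical weights, shown only to illustrate how a learned gate can blend utterance-level speech and text embeddings:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(h_speech, h_text, W, b):
    """Fuse speech and text vectors with a learned scalar gate.

    h_speech, h_text: (d,) utterance-level modality embeddings
    W: (2*d,) gate weights, b: scalar bias (learned in a real model)
    Returns (d,): a convex combination of the two modalities.
    """
    g = sigmoid(np.concatenate([h_speech, h_text]) @ W + b)  # gate in (0, 1)
    return g * h_speech + (1.0 - g) * h_text

# Hypothetical dimensions and randomly initialized gate parameters.
rng = np.random.default_rng(1)
d = 16
h_s, h_t = rng.standard_normal(d), rng.standard_normal(d)
W, b = rng.standard_normal(2 * d) * 0.1, 0.0
fused = gated_fusion(h_s, h_t, W, b)
print(fused.shape)  # (16,)
```

Because the gate is a scalar in (0, 1), each element of the fused vector lies between the corresponding speech and text elements; a per-dimension gate (a vector `g`) is another common choice.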
About the journal:
Digital Signal Processing: A Review Journal is one of the oldest and most established journals in the field of signal processing, yet it aims to be the most innovative. The Journal invites top-quality research articles at the frontiers of research in all aspects of signal processing. Our objective is to provide a platform for the publication of ground-breaking research in signal processing with both academic and industrial appeal.
The journal has a special emphasis on statistical signal processing methodology such as Bayesian signal processing, and encourages articles on emerging applications of signal processing such as:
• big data
• machine learning
• internet of things
• information security
• systems biology and computational biology
• financial time series analysis
• autonomous vehicles
• quantum computing
• neuromorphic engineering
• human-computer interaction and intelligent user interfaces
• environmental signal processing
• geophysical signal processing, including seismic signal processing
• chemoinformatics and bioinformatics
• audio, visual and performance arts
• disaster management and prevention
• renewable energy