Kshirod Sarmah, Swapnanil Gogoi, Hem Chandra Das, Bikram Patir, M. J. Sarma
{"title":"深度学习技术在语音情感识别中的应用现状综述","authors":"Kshirod Sarmah, Swapnanil Gogoi, Hem Chandra Das, Bikram Patir, M. J. Sarma","doi":"10.52783/jes.3745","DOIUrl":null,"url":null,"abstract":"In sophisticated Human-Computer Interfaces (HCI), the emotional state of the user is becoming a crucial component that is closely linked to emotional speech recognition. Spoken expressions, which can be a part of human-machine interaction, are an important source of emotional information. Speech emotion recognition (SER) in deep learning (DL) continues to be a hot topic, especially in the field of emotional computing. Current deep learning (DL) and neural network methods are applied in this highly active field of research. This is as a result of its expanding potential, advancements in algorithms, and practical uses. Quantitative factors such as pitch, intensity, accent and Mel-Frequency Cepstral Coefficients (MFCC) can be employed to model the paralinguistic data contained in human speech. To achieve SER, three key procedures are usually followed: data processing, feature selection/extraction, and classification based on the underlying emotional qualities. The nature of these processes and the peculiarities of human speech lend support to the employment of DL techniques for SER implementation. A variety of DL methods have been used for SER tasks in recent affective computing research works; however, only a small number of them capture the underlying ideas and methodologies that can be used to facilitate the three main steps of SER implementation. With a focus on the three SER implementation processes, we provide a state-of-the-art assessment of research conducted over the last ten years that tackled SER tasks from DL perspectives in this work. Various issues are covered in detail, including the problem of low classification accuracy of Speaker-Independent experiments and the related remedies. 
The review offers principles for SER evaluation as well, emphasizing indicators that can be experimented with and common baselines. ","PeriodicalId":44451,"journal":{"name":"Journal of Electrical Systems","volume":null,"pages":null},"PeriodicalIF":0.5000,"publicationDate":"2024-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A State-of-arts Review of Deep Learning Techniques for Speech Emotion Recognition\",\"authors\":\"Kshirod Sarmah, Swapnanil Gogoi, Hem Chandra Das, Bikram Patir, M. J. Sarma\",\"doi\":\"10.52783/jes.3745\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In sophisticated Human-Computer Interfaces (HCI), the emotional state of the user is becoming a crucial component that is closely linked to emotional speech recognition. Spoken expressions, which can be a part of human-machine interaction, are an important source of emotional information. Speech emotion recognition (SER) in deep learning (DL) continues to be a hot topic, especially in the field of emotional computing. Current deep learning (DL) and neural network methods are applied in this highly active field of research. This is as a result of its expanding potential, advancements in algorithms, and practical uses. Quantitative factors such as pitch, intensity, accent and Mel-Frequency Cepstral Coefficients (MFCC) can be employed to model the paralinguistic data contained in human speech. To achieve SER, three key procedures are usually followed: data processing, feature selection/extraction, and classification based on the underlying emotional qualities. The nature of these processes and the peculiarities of human speech lend support to the employment of DL techniques for SER implementation. 
A variety of DL methods have been used for SER tasks in recent affective computing research works; however, only a small number of them capture the underlying ideas and methodologies that can be used to facilitate the three main steps of SER implementation. With a focus on the three SER implementation processes, we provide a state-of-the-art assessment of research conducted over the last ten years that tackled SER tasks from DL perspectives in this work. Various issues are covered in detail, including the problem of low classification accuracy of Speaker-Independent experiments and the related remedies. The review offers principles for SER evaluation as well, emphasizing indicators that can be experimented with and common baselines. \",\"PeriodicalId\":44451,\"journal\":{\"name\":\"Journal of Electrical Systems\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.5000,\"publicationDate\":\"2024-05-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Electrical Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.52783/jes.3745\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Electrical Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.52783/jes.3745","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citation count: 0
A State-of-arts Review of Deep Learning Techniques for Speech Emotion Recognition
In sophisticated Human-Computer Interfaces (HCI), the user's emotional state is becoming a crucial component, one closely linked to emotional speech recognition. Spoken expressions, a natural part of human-machine interaction, are an important source of emotional information. Speech emotion recognition (SER) with deep learning (DL) remains a highly active research topic, particularly in affective computing, driven by its expanding potential, advances in algorithms, and practical applications. Quantitative features such as pitch, intensity, accent, and Mel-Frequency Cepstral Coefficients (MFCCs) can be employed to model the paralinguistic information contained in human speech. SER is typically implemented in three key stages: data processing, feature selection/extraction, and classification based on the underlying emotional qualities. The nature of these stages and the peculiarities of human speech support the use of DL techniques for SER. A variety of DL methods have been applied to SER tasks in recent affective computing research; however, only a few works capture the underlying ideas and methodologies that facilitate the three main steps of SER implementation. Focusing on these three stages, this work provides a state-of-the-art assessment of research conducted over the last ten years that tackled SER tasks from DL perspectives. Various issues are covered in detail, including the low classification accuracy of speaker-independent experiments and the related remedies. The review also offers principles for SER evaluation, emphasizing indicators that can be experimented with and common baselines.
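The three-stage pipeline the abstract describes (data processing, feature selection/extraction, classification) can be sketched in miniature. The following is an illustrative toy example, not the reviewed authors' method: it frames a synthetic waveform, extracts simple prosodic stand-ins for the features named above (log-energy for intensity, zero-crossing rate as a crude pitch correlate; a real system would typically use an MFCC front end and a DL classifier), and classifies with a nearest-centroid rule. All signal parameters and class names here are invented for illustration.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    # Stage 1: data processing -- slice the waveform into overlapping frames
    # (25 ms frames with a 10 ms hop at 16 kHz).
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

def extract_features(frames):
    # Stage 2: feature extraction -- per-frame log-energy and zero-crossing
    # rate, pooled into a fixed-size utterance vector (mean + std).
    energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    feats = np.stack([energy, zcr], axis=1)
    return np.concatenate([feats.mean(axis=0), feats.std(axis=0)])

class NearestCentroid:
    # Stage 3: classification -- a toy stand-in for a trained DL classifier.
    def fit(self, X, y):
        self.labels = sorted(set(y))
        self.centroids = {c: np.mean([x for x, t in zip(X, y) if t == c], axis=0)
                          for c in self.labels}
        return self

    def predict(self, x):
        return min(self.labels,
                   key=lambda c: np.linalg.norm(x - self.centroids[c]))

# Toy data: "calm" = quiet low-frequency tone, "excited" = loud noisy signal.
rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
calm = 0.1 * np.sin(2 * np.pi * 120 * t)
excited = 0.9 * np.sin(2 * np.pi * 300 * t) + 0.5 * rng.standard_normal(sr)

X = [extract_features(frame_signal(s)) for s in (calm, excited)]
clf = NearestCentroid().fit(X, ["calm", "excited"])
test = 0.12 * np.sin(2 * np.pi * 130 * t)  # unseen quiet tone
print(clf.predict(extract_features(frame_signal(test))))  # -> calm
```

The utterance-level mean/std pooling mirrors a common SER design choice: frame-level features are aggregated to a fixed-size vector so any classifier, from nearest-centroid to a deep network, can operate on whole utterances.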