An Audio Processing Approach using Ensemble Learning for Speech-Emotion Recognition for Children with ASD

Damian Valles, Rezwan Matin
{"title":"基于集成学习的音频处理方法在ASD儿童语音情绪识别中的应用","authors":"Damian Valles, Rezwan Matin","doi":"10.1109/AIIoT52608.2021.9454174","DOIUrl":null,"url":null,"abstract":"Children with Autism Spectrum Disorder (ASD) find it difficult to detect human emotions in social interactions. A speech emotion recognition system was developed in this work, which aims to help these children to better identify the emotions of their communication partner. The system was developed using machine learning and deep learning techniques. Through the use of ensemble learning, multiple machine learning algorithms were joined to provide a final prediction on the recorded input utterances. The ensemble of models includes a Support Vector Machine (SVM), a Multi-Layer Perceptron (MLP), and a Recurrent Neural Network (RNN). All three models were trained on the Ryerson Audio-Visual Database of Emotional Speech and Songs (RAVDESS), the Toronto Emotional Speech Set (TESS), and the Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D). A fourth dataset was used, which was created by adding background noise to the clean speech files from the datasets previously mentioned. The paper describes the audio processing of the samples, the techniques used to include the background noise, and the feature extraction coefficients considered for the development and training of the models. This study presents the performance evaluation of the individual models to each of the datasets, inclusion of the background noises, and the combination of using all of the samples in all three datasets. The evaluation was made to select optimal hyperparameters configuration of the models to evaluate the performance of the ensemble learning approach through majority voting. The overall performance of the ensemble learning reached a peak accuracy of 66.5%, reaching a higher performance emotion classification accuracy than the MLP model which reached 65.7%.","PeriodicalId":443405,"journal":{"name":"2021 IEEE World AI IoT Congress (AIIoT)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"An Audio Processing Approach using Ensemble Learning for Speech-Emotion Recognition for Children with ASD\",\"authors\":\"Damian Valles, Rezwan Matin\",\"doi\":\"10.1109/AIIoT52608.2021.9454174\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Children with Autism Spectrum Disorder (ASD) find it difficult to detect human emotions in social interactions. A speech emotion recognition system was developed in this work, which aims to help these children to better identify the emotions of their communication partner. The system was developed using machine learning and deep learning techniques. Through the use of ensemble learning, multiple machine learning algorithms were joined to provide a final prediction on the recorded input utterances. The ensemble of models includes a Support Vector Machine (SVM), a Multi-Layer Perceptron (MLP), and a Recurrent Neural Network (RNN). All three models were trained on the Ryerson Audio-Visual Database of Emotional Speech and Songs (RAVDESS), the Toronto Emotional Speech Set (TESS), and the Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D). A fourth dataset was used, which was created by adding background noise to the clean speech files from the datasets previously mentioned. 
The paper describes the audio processing of the samples, the techniques used to include the background noise, and the feature extraction coefficients considered for the development and training of the models. This study presents the performance evaluation of the individual models to each of the datasets, inclusion of the background noises, and the combination of using all of the samples in all three datasets. The evaluation was made to select optimal hyperparameters configuration of the models to evaluate the performance of the ensemble learning approach through majority voting. The overall performance of the ensemble learning reached a peak accuracy of 66.5%, reaching a higher performance emotion classification accuracy than the MLP model which reached 65.7%.\",\"PeriodicalId\":443405,\"journal\":{\"name\":\"2021 IEEE World AI IoT Congress (AIIoT)\",\"volume\":\"26 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-05-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE World AI IoT Congress (AIIoT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/AIIoT52608.2021.9454174\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE World AI IoT Congress (AIIoT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/AIIoT52608.2021.9454174","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3

Abstract

Children with Autism Spectrum Disorder (ASD) find it difficult to detect human emotions in social interactions. A speech emotion recognition system was developed in this work, which aims to help these children better identify the emotions of their communication partner. The system was developed using machine learning and deep learning techniques. Through ensemble learning, multiple machine learning algorithms were combined to provide a final prediction on the recorded input utterances. The ensemble of models includes a Support Vector Machine (SVM), a Multi-Layer Perceptron (MLP), and a Recurrent Neural Network (RNN). All three models were trained on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Toronto Emotional Speech Set (TESS), and the Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D). A fourth dataset was created by adding background noise to the clean speech files from the datasets mentioned above. The paper describes the audio processing of the samples, the techniques used to add the background noise, and the feature extraction coefficients considered for the development and training of the models. This study presents the performance evaluation of the individual models on each of the datasets, on the noise-augmented data, and on the combination of all samples from the three datasets. The evaluation was used to select the optimal hyperparameter configuration of each model before assessing the ensemble learning approach through majority voting. The ensemble reached a peak accuracy of 66.5%, a higher emotion classification accuracy than the MLP model, which reached 65.7%.
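
The abstract names the main technical steps: adding background noise to clean recordings, extracting feature coefficients from each utterance, and combining the SVM, MLP, and RNN predictions by majority voting. As a rough illustration of those steps (not the authors' implementation), the sketch below mixes noise at a chosen SNR, computes time-averaged MFCCs with librosa, and takes a per-utterance majority vote over three lists of predicted labels. The SNR-based mixing procedure, the choice of MFCCs, n_mfcc=13, and the time-averaging are assumptions made here for illustration only; the paper documents its own audio processing and feature extraction.

```python
# Minimal sketch (not the authors' code) of noise augmentation, feature
# extraction, and majority voting as named in the abstract. The mixing rule,
# MFCCs, n_mfcc=13, and time-averaging are illustrative assumptions.
from collections import Counter

import numpy as np
import librosa


def mix_at_snr(clean, noise, snr_db):
    """Add background noise to a clean utterance at a target SNR (in dB)."""
    if len(noise) < len(clean):  # loop the noise clip if it is shorter than the speech
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(clean_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise


def mfcc_vector(y, sr, n_mfcc=13):
    """Time-averaged MFCCs as a fixed-length feature vector for one utterance."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, frames)
    return mfcc.mean(axis=1)


def majority_vote(svm_preds, mlp_preds, rnn_preds):
    """Final label per utterance by majority vote over the three models' predictions."""
    final = []
    for votes in zip(svm_preds, mlp_preds, rnn_preds):
        final.append(Counter(votes).most_common(1)[0][0])  # ties resolve to the earliest-listed model
    return final


# Example: combine per-utterance predictions from three already-trained models.
print(majority_vote(["happy", "sad"], ["happy", "angry"], ["angry", "angry"]))
# -> ['happy', 'angry']
```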