A Novel Speech Emotion Model Based on CNN and LSTM Networks
Benguo Ye, Xiaofeng Yuan, Gang Peng, Weizhen Zeng
2022 6th Asian Conference on Artificial Intelligence Technology (ACAIT), published 2022-12-09
DOI: 10.1109/ACAIT56212.2022.10137926
LSTM is a sequential model whose long short-term memory cells act as gated recurrent units. Compared with a traditional RNN, LSTM introduces three gates, which mitigate the exploding- and vanishing-gradient problems of RNNs. In this paper, we propose a new speech emotion model that combines CNN and LSTM. The model is implemented on the CASIA dataset, using the Python librosa library and the openSMILE tool to extract fused multi-feature acoustic representations; features obtained from different openSMILE configurations are then compared to evaluate recognition accuracy. The experimental results show that features extracted with the emobase2010 configuration achieve 84% recognition accuracy on the CASIA dataset. Compared with other models, the recognition accuracy of the model introduced in this paper is 3.3% higher than that of the SVM model, but 6.3% lower than that of the ConvLSTM model.
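The gating mechanism the abstract credits with easing gradient problems can be sketched in a few lines of NumPy. This is an illustrative single-step LSTM cell, not the paper's implementation; the weight layout, sizes, and initialization are assumptions chosen for clarity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step showing the three gates (forget, input, output).

    W has shape (4*H, D+H): stacked weights for the three gates plus
    the candidate cell state; b has shape (4*H,). These shapes are
    hypothetical, chosen only to keep the sketch self-contained.
    """
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    f = sigmoid(z[0:H])         # forget gate: how much old cell state to keep
    i = sigmoid(z[H:2 * H])     # input gate: how much new information to write
    o = sigmoid(z[2 * H:3 * H]) # output gate: how much cell state to expose
    g = np.tanh(z[3 * H:4 * H]) # candidate cell state
    c = f * c_prev + i * g      # additive update: gradients flow through c
    h = o * np.tanh(c)          # hidden state passed to the next timestep
    return h, c

rng = np.random.default_rng(0)
D, H = 8, 4                         # feature and hidden sizes (arbitrary)
W = rng.normal(scale=0.1, size=(4 * H, D + H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for t in range(10):                 # run over a short random feature sequence
    h, c = lstm_step(rng.normal(size=D), h, c, W, b)
print(h.shape)  # (4,)
```

The key line is the cell update `c = f * c_prev + i * g`: because the old state enters additively, scaled only by the forget gate, gradients along the cell state are not repeatedly squashed through a nonlinearity as in a plain RNN.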