{"title":"SeeSpeech: See Emotions in The Speech","authors":"Jianing Geng, Hao Zhu, Xiang-Yang Li","doi":"10.1145/3456126.3456129","DOIUrl":null,"url":null,"abstract":"At present, the understanding of speech by machines mostly focuses on the understanding of semantics, but speech should also include emotions in the speech. Emotion can not only strengthen semantics, but can even change semantic information. The paper discusses how to realize the emotion classification, which is called SeeSpeech. SeeSpeech chooses MCEP as the speech emotion feature, and inputs it into CNN and Transformer respectively. In order to obtain richer features, CNN uses batch normalization, while Transformer uses layer normalization, and then combines the output of CNN and Transformer. Finally, the type of emotion is obtained through SoftMax. SeeSpeech obtained the highest classification accuracy rate of 97% on the RAVDESS data set, and also obtained the classification accuracy rate of 85% on the actual edge gateway test. It can be seen from the results that SeeSpeech has encouraging performance in speech emotion classification and has a wide range of application prospects in human-computer interaction.","PeriodicalId":431685,"journal":{"name":"2021 2nd Asia Service Sciences and Software Engineering Conference","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 2nd Asia Service Sciences and Software Engineering Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3456126.3456129","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
At present, machine understanding of speech focuses mostly on semantics, yet speech also carries emotion. Emotion can not only reinforce semantics but even change the semantic information itself. This paper presents SeeSpeech, a system for speech emotion classification. SeeSpeech uses MCEP (mel-cepstral coefficients) as the speech emotion feature and feeds it into a CNN and a Transformer in parallel. To obtain richer features, the CNN branch applies batch normalization while the Transformer branch applies layer normalization, and the outputs of the two branches are then combined. Finally, the emotion class is obtained through a softmax layer. SeeSpeech achieves a best classification accuracy of 97% on the RAVDESS dataset and 85% in a test on an actual edge gateway. These results indicate that SeeSpeech performs encouragingly on speech emotion classification and has broad application prospects in human-computer interaction.
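To make the described dual-branch design concrete, the following is a minimal sketch in PyTorch, not the authors' implementation: MCEP frames go through a CNN branch with batch normalization and a Transformer encoder branch (which uses layer normalization internally), the two branch outputs are concatenated, and a softmax produces the emotion class probabilities. The feature dimension (40 MCEP coefficients), layer sizes, and the 8-class output (assuming the 8 RAVDESS emotion categories) are illustrative assumptions.

```python
# Illustrative sketch of the SeeSpeech-style dual-branch classifier.
# Assumed input: MCEP features shaped (batch, n_frames, n_mcep).
import torch
import torch.nn as nn

class SeeSpeechSketch(nn.Module):
    def __init__(self, n_mcep=40, d_model=128, n_classes=8):
        super().__init__()
        # CNN branch: 1-D convolution over time, with batch normalization.
        self.cnn = nn.Sequential(
            nn.Conv1d(n_mcep, d_model, kernel_size=5, padding=2),
            nn.BatchNorm1d(d_model),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time -> (batch, d_model, 1)
        )
        # Transformer branch: encoder layers use layer normalization internally.
        self.proj = nn.Linear(n_mcep, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # Classifier over the concatenated outputs of the two branches.
        self.classifier = nn.Linear(2 * d_model, n_classes)

    def forward(self, mcep):  # mcep: (batch, n_frames, n_mcep)
        cnn_out = self.cnn(mcep.transpose(1, 2)).squeeze(-1)        # (batch, d_model)
        trans_out = self.transformer(self.proj(mcep)).mean(dim=1)   # (batch, d_model)
        fused = torch.cat([cnn_out, trans_out], dim=-1)             # combine branches
        return torch.softmax(self.classifier(fused), dim=-1)        # emotion probabilities

# Example usage with a dummy batch of 16 utterances, 200 frames each:
# model = SeeSpeechSketch()
# probs = model(torch.randn(16, 200, 40))   # -> (16, 8) class probabilities
```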