Recognition of Emotion in Speech-related Audio Files with LSTM-Transformer
Felicia Andayani, Lau Bee Theng, Mark Tee Kit Tsun, C. Chua
2022 5th International Conference on Computing and Informatics (ICCI), published 2022-03-09
DOI: 10.1109/icci54321.2022.9756100
Citations: 2
Abstract
In everyday listening, almost any speech audio that humans receive carries some emotional information. Speech Emotion Recognition (SER) has therefore become an important research field in the last decade. SER identifies human emotional states from speech or daily conversation, and it plays a crucial role in developing Human-Computer Interaction (HCI) and signal processing systems. Moreover, human emotions change naturally over time, so a good model must learn the long-term dependencies in the speech signal. In this paper, a hybrid model that combines two widely used deep learning methods is proposed. The proposed model combines the Long Short-Term Memory (LSTM) and Transformer architectures to learn long-term dependencies from extracted Mel Frequency Cepstral Coefficient (MFCC) features. Preliminary results of the proposed model, evaluated on the publicly available RAVDESS dataset, are presented. The model achieved a weighted accuracy (WA) of 75.33% and an unweighted accuracy (UA) of 73.12% on RAVDESS. The experimental results indicate the effectiveness of the proposed model in learning temporal information from the frequency distributions captured by the MFCC features.
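To make the described pipeline concrete, below is a minimal sketch of an LSTM followed by a Transformer encoder operating on MFCC frame sequences, written in PyTorch. It is not the authors' implementation: the layer sizes, the number of MFCC coefficients (40), the pooling strategy, and the 8-class output (matching the RAVDESS emotion labels) are all illustrative assumptions.

```python
# Hypothetical sketch of an LSTM-Transformer classifier over MFCC sequences.
# Hyperparameters and architecture details are assumptions, not the paper's.
import torch
import torch.nn as nn


class LSTMTransformerSER(nn.Module):
    def __init__(self, n_mfcc=40, hidden=128, n_heads=4, n_layers=2, n_classes=8):
        super().__init__()
        # LSTM models local temporal structure in the MFCC sequence.
        self.lstm = nn.LSTM(input_size=n_mfcc, hidden_size=hidden, batch_first=True)
        # Transformer encoder attends over the full sequence to capture
        # long-term dependencies.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, mfcc):            # mfcc: (batch, time, n_mfcc)
        x, _ = self.lstm(mfcc)          # (batch, time, hidden)
        x = self.encoder(x)             # (batch, time, hidden)
        x = x.mean(dim=1)               # mean-pool over time
        return self.classifier(x)       # (batch, n_classes) emotion logits


# Usage: a batch of 4 utterances, each 300 frames of 40 MFCCs
# (e.g. extracted with librosa.feature.mfcc and transposed to time-major).
model = LSTMTransformerSER()
logits = model(torch.randn(4, 300, 40))
print(logits.shape)  # torch.Size([4, 8])
```

The ordering (LSTM before the Transformer encoder) and the mean pooling over time are design guesses; the paper itself should be consulted for how the two components are actually combined and how the sequence is aggregated before classification.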