Feature Extraction Technique Based on Conv1D and Conv2D Network for Thai Speech Emotion Recognition
Naris Prombut, S. Waijanya, Nuttachot Promrit
Proceedings of the 2021 5th International Conference on Natural Language Processing and Information Retrieval, 2021-12-17
DOI: https://doi.org/10.1145/3508230.3508238
Speech emotion recognition is one of the challenges in the Natural Language Processing (NLP) area. Many factors are used to identify emotions in speech, such as pitch, intensity, frequency, duration, and speakers' nationality. This paper implements a speech emotion recognition model specifically for the Thai language, classifying speech into 5 emotions: Angry, Frustrated, Neutral, Sad, and Happy. The research uses a dataset from the VISTEC-depa AI Research Institute of Thailand containing 21,562 sounds (scripts), divided into 70% training data and 30% test data. We use the Mel spectrogram and Mel-frequency Cepstral Coefficients (MFCC) for feature extraction, and a 1D Convolutional Neural Network (Conv1D) together with a 2D Convolutional Neural Network (Conv2D) to classify emotions. MFCC with Conv2D provides the highest accuracy, 80.59%, which exceeds the 71.35% of the baseline study.
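The two feature types named in the abstract can be illustrated with a minimal pure-Python sketch for a single audio frame. It is not the authors' pipeline: the 16 kHz sample rate, 256-sample frame, 20 mel bands, 13 coefficients, Hamming window, and all function names are assumptions chosen for illustration, and a real system would normally use an FFT-based audio library rather than the naive DFT shown here.

```python
import math

def hz_to_mel(hz):
    # Common 2595 * log10(1 + f/700) mel-scale formula
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def power_spectrum(frame):
    """Power spectrum (bins 0..N/2) of one Hamming-windowed frame via a naive DFT."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(windowed))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(windowed))
        spectrum.append((re * re + im * im) / n)
    return spectrum

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2)
    # n_mels + 2 edge frequencies, expressed as fractional FFT bin positions
    edges = [mel_to_hz(mel_max * i / (n_mels + 1)) * n_fft / sample_rate
             for i in range(n_mels + 2)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for b in range(n_bins):
            if lo < b <= mid:
                row.append((b - lo) / (mid - lo))   # rising slope
            elif mid < b < hi:
                row.append((hi - b) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def log_mel_energies(frame, sample_rate, n_mels=20):
    """One column of a log-Mel spectrogram."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(n_mels, len(frame), sample_rate)
    # Small floor avoids log(0) for empty bands
    return [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10) for row in bank]

def mfcc(frame, sample_rate, n_mels=20, n_coeffs=13):
    """MFCCs: unnormalized DCT-II of the log-mel energies, first n_coeffs kept."""
    log_mel = log_mel_energies(frame, sample_rate, n_mels)
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / n_mels)
                for j, e in enumerate(log_mel))
            for c in range(n_coeffs)]

# Synthetic 440 Hz tone standing in for one frame of speech (assumed input)
sample_rate = 16000
frame = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(256)]
log_mel = log_mel_energies(frame, sample_rate)
coeffs = mfcc(frame, sample_rate)
print(len(log_mel), len(coeffs))  # prints: 20 13
```

Computing these vectors for successive frames and stacking them over time gives a two-dimensional feature array. A Conv2D network can treat that array like an image, while a Conv1D network convolves along the time axis only; how the paper itself shapes its inputs is not stated in the abstract.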