Sumeth Yuenyong, Narit Hnoohom, K. Wongpatikaseree, Sattaya Singkul
Published in: 2022 7th International Conference on Business and Industrial Research (ICBIR), 2022-05-19
DOI: 10.1109/ICBIR54589.2022.9786444 (https://doi.org/10.1109/ICBIR54589.2022.9786444)
Real-Time Thai Speech Emotion Recognition With Speech Enhancement Using Time-Domain Contrastive Predictive Coding and Conv-Tasnet
Speech emotion recognition (SER) is an important part of human-computer interaction. SER faces many challenges, such as the acoustic environment of the speech and the amount of data available for training. Thai poses additional challenges: the language is tonal, and the available datasets are relatively small. In this work we propose Thai Speech Emotion Recognition With Speech Enhancement (TH-SERSE). TH-SERSE consists of speech enhancement using Conv-TasNet, followed by pre-training using contrastive predictive coding. The pre-trained model is then fine-tuned for emotion classification. We experimented on two datasets, EMOLA and ThaiSER, which have open and closed acoustic environments, respectively. The experiments show that our method outperforms recently proposed methods.
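The abstract does not say how the contrastive predictive coding objective is set up, so the following is only a minimal sketch of the generic InfoNCE loss commonly associated with contrastive predictive coding, not the authors' implementation. The function name `info_nce_loss`, the dot-product score, and the use of the other targets in the batch as negatives are all assumptions made for illustration.

```python
import math


def info_nce_loss(predictions, targets):
    """Generic InfoNCE loss over a batch of (prediction, target) vector pairs.

    Assumption for illustration: predictions[i] is a context-based prediction
    of a future encoded frame, targets[i] is the frame that actually occurred,
    and every other targets[j] (j != i) acts as a negative sample.
    """
    n = len(predictions)
    total = 0.0
    for i in range(n):
        # Dot-product score between prediction i and every candidate target.
        scores = [
            sum(p * t for p, t in zip(predictions[i], targets[j]))
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over all candidates.
        m = max(scores)
        log_denom = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability of picking the true target.
        total += log_denom - scores[i]
    return total / n


if __name__ == "__main__":
    aligned = [[5.0, 0.0], [0.0, 5.0]]
    swapped = [[0.0, 5.0], [5.0, 0.0]]
    print(info_nce_loss(aligned, aligned))  # small: predictions match targets
    print(info_nce_loss(aligned, swapped))  # large: predictions match negatives
```

The loss is low when each prediction scores highest against its own target and high when it scores higher against a negative, which is what pushes an encoder to keep information that is predictive of future speech frames.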