Sumeth Yuenyong, Narit Hnoohom, K. Wongpatikaseree, Sattaya Singkul
Real-Time Thai Speech Emotion Recognition With Speech Enhancement Using Time-Domain Contrastive Predictive Coding and Conv-Tasnet
2022 7th International Conference on Business and Industrial Research (ICBIR), published 2022-05-19
DOI: 10.1109/ICBIR54589.2022.9786444 (https://doi.org/10.1109/ICBIR54589.2022.9786444)
Abstract
Speech emotion recognition (SER) is an important part of human-computer interaction. SER faces many challenges, such as the acoustic environment of the speech and the amount of data available for training. For Thai in particular, the tonal nature of the language poses an additional challenge, and the available datasets are relatively small. In this work we propose Thai Speech Emotion Recognition With Speech Enhancement (TH-SERSE). TH-SERSE consists of speech enhancement using Conv-TasNet followed by pre-training using contrastive predictive coding. The pre-trained model was then fine-tuned for emotion classification. We experimented on two datasets, EMOLA and ThaiSER, which have open and closed acoustic environments, respectively. The experiments show that our method outperforms recently proposed methods.