{"title":"基于asr和语音知识表示的语音情感识别","authors":"Shuang Liang, Xiang Xie, Qingran Zhan, Hao-bo Cheng","doi":"10.1145/3529466.3529488","DOIUrl":null,"url":null,"abstract":"Speech emotion recognition (SER) is a challenging problem due to the insufficient dataset. This paper deals with this problem from two aspects. First, we exploit two levels of speech representations for SER task, one for automatic speech recognition (ASR)-based representations and the other for phonological knowledge representations. Second, we use transfer learning, pre-train models and transfer knowledge from other large corpus for none-SER task. In our system, the whole model is divided into two parts: two-representation learning module and SER module. We fuse acoustic features with ASR-based and phonological knowledge representations which are both extracted from pre-trained models, and the fusion features are used in SER training. Then a novel multi-task learning approach is proposed where a shared encoder-multi decoder model is used for the phonological knowledge representation learning. The Conformer structure is introduced for the SER task, and our study indicates that Conformer is effective for SER. Finally, experimental results on IEMOCAP show that the proposed method can achieve 77.35 weighted accuracy and 77.99 unweighted accuracy respectively.","PeriodicalId":375562,"journal":{"name":"Proceedings of the 2022 6th International Conference on Innovation in Artificial Intelligence","volume":"7 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Speech Emotion Recognition Exploiting ASR-based and Phonological Knowledge Representations\",\"authors\":\"Shuang Liang, Xiang Xie, Qingran Zhan, Hao-bo Cheng\",\"doi\":\"10.1145/3529466.3529488\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Speech emotion recognition (SER) is a challenging problem due to the insufficient dataset. This paper deals with this problem from two aspects. First, we exploit two levels of speech representations for SER task, one for automatic speech recognition (ASR)-based representations and the other for phonological knowledge representations. Second, we use transfer learning, pre-train models and transfer knowledge from other large corpus for none-SER task. In our system, the whole model is divided into two parts: two-representation learning module and SER module. We fuse acoustic features with ASR-based and phonological knowledge representations which are both extracted from pre-trained models, and the fusion features are used in SER training. Then a novel multi-task learning approach is proposed where a shared encoder-multi decoder model is used for the phonological knowledge representation learning. The Conformer structure is introduced for the SER task, and our study indicates that Conformer is effective for SER. Finally, experimental results on IEMOCAP show that the proposed method can achieve 77.35 weighted accuracy and 77.99 unweighted accuracy respectively.\",\"PeriodicalId\":375562,\"journal\":{\"name\":\"Proceedings of the 2022 6th International Conference on Innovation in Artificial Intelligence\",\"volume\":\"7 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-03-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2022 6th International Conference on Innovation in Artificial Intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3529466.3529488\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2022 6th International Conference on Innovation in Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3529466.3529488","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Speech Emotion Recognition Exploiting ASR-based and Phonological Knowledge Representations
Speech emotion recognition (SER) is a challenging problem due to the insufficient dataset. This paper deals with this problem from two aspects. First, we exploit two levels of speech representations for SER task, one for automatic speech recognition (ASR)-based representations and the other for phonological knowledge representations. Second, we use transfer learning, pre-train models and transfer knowledge from other large corpus for none-SER task. In our system, the whole model is divided into two parts: two-representation learning module and SER module. We fuse acoustic features with ASR-based and phonological knowledge representations which are both extracted from pre-trained models, and the fusion features are used in SER training. Then a novel multi-task learning approach is proposed where a shared encoder-multi decoder model is used for the phonological knowledge representation learning. The Conformer structure is introduced for the SER task, and our study indicates that Conformer is effective for SER. Finally, experimental results on IEMOCAP show that the proposed method can achieve 77.35 weighted accuracy and 77.99 unweighted accuracy respectively.