Speech Emotion Recognition Exploiting ASR-based and Phonological Knowledge Representations

Proceedings of the 2022 6th International Conference on Innovation in Artificial Intelligence, Pub Date: 2022-03-04, DOI: 10.1145/3529466.3529488

Shuang Liang, Xiang Xie, Qingran Zhan, Hao-bo Cheng



Abstract


Speech emotion recognition (SER) is a challenging problem due to insufficient data. This paper addresses the problem from two aspects. First, we exploit two levels of speech representation for the SER task: automatic speech recognition (ASR)-based representations and phonological knowledge representations. Second, we use transfer learning: models are pre-trained on large corpora for non-SER tasks, and their knowledge is transferred to SER. The system is divided into two parts, a two-representation learning module and an SER module. We fuse acoustic features with the ASR-based and phonological knowledge representations, both extracted from pre-trained models, and use the fused features for SER training. We also propose a novel multi-task learning approach in which a shared-encoder, multi-decoder model learns the phonological knowledge representations. The Conformer structure is introduced for the SER task, and our study indicates that Conformer is effective for SER. Finally, experimental results on IEMOCAP show that the proposed method achieves a weighted accuracy of 77.35% and an unweighted accuracy of 77.99%.
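The abstract reports weighted and unweighted accuracy without defining them. The sketch below assumes the convention common in SER work, where weighted accuracy is the overall fraction of correctly classified utterances and unweighted accuracy is the mean of the per-class recalls. The function names and the toy labels are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct utterances divided by all utterances."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean per-class recall, so each emotion class counts equally."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    total = defaultdict(int)  # utterances per true class
    hit = defaultdict(int)    # correctly classified utterances per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            hit[t] += 1
    recalls = [hit[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical toy labels: 4 neutral and 2 angry utterances.
y_true = ["neutral", "neutral", "neutral", "neutral", "angry", "angry"]
y_pred = ["neutral", "neutral", "neutral", "angry", "angry", "sad"]

wa = weighted_accuracy(y_true, y_pred)    # 4 correct out of 6
ua = unweighted_accuracy(y_true, y_pred)  # mean of 3/4 and 1/2
print(f"WA = {wa:.4f}, UA = {ua:.4f}")
```

On class-imbalanced data the two numbers diverge, which is presumably why both are reported: weighted accuracy is dominated by the frequent classes, while unweighted accuracy penalizes a model that neglects the rare ones.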

