Ziping Zhao;Jixin Liu;Haishuai Wang;Danushka Bandara;Jianhua Tao
{"title":"基于知识提取的语音情感识别方法","authors":"Ziping Zhao;Jixin Liu;Haishuai Wang;Danushka Bandara;Jianhua Tao","doi":"10.1109/TAFFC.2025.3574178","DOIUrl":null,"url":null,"abstract":"Due to rapid advancements in deep learning, Transformer-based architectures have proven effective in speech emotion recognition (SER), largely due to their ability to model long-term dependencies more effectively than recurrent networks. The current Transformer architecture is not well-suited for SER because its large parameter number demands significant computational resources, making it less feasible in environments with limited resources. Furthermore, its application to SER is limited because human emotions, which are expressed in long segments of continuous speech, are inherently complex and ambiguous. Therefore, designing specialized Transformer models tailored for SER is essential. To address these challenges, we propose a novel knowledge distillation framework that combines meta-knowledge and curriculum-based distillation. Specifically, we fine-tune the teacher model to optimize it for the SER task. For the student model, we embed individual sequence time points into variable tokens, which are used to aggregate the global speech representation. Additionally, we combine supervised contrastive and cross-entropy loss to increase the inter-class distance between learnable features. Finally, we optimize the student model using both meta-knowledge and the curriculum-based distillation framework. 
Experimental results on two benchmark datasets, IEMOCAP and MELD, demonstrate that our method performs competitively with state-of-the-art approaches in SER.","PeriodicalId":13131,"journal":{"name":"IEEE Transactions on Affective Computing","volume":"16 3","pages":"1307-1317"},"PeriodicalIF":9.8000,"publicationDate":"2025-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Knowledge Distillation-Based Approach to Speech Emotion Recognition\",\"authors\":\"Ziping Zhao;Jixin Liu;Haishuai Wang;Danushka Bandara;Jianhua Tao\",\"doi\":\"10.1109/TAFFC.2025.3574178\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Due to rapid advancements in deep learning, Transformer-based architectures have proven effective in speech emotion recognition (SER), largely due to their ability to model long-term dependencies more effectively than recurrent networks. The current Transformer architecture is not well-suited for SER because its large parameter number demands significant computational resources, making it less feasible in environments with limited resources. Furthermore, its application to SER is limited because human emotions, which are expressed in long segments of continuous speech, are inherently complex and ambiguous. Therefore, designing specialized Transformer models tailored for SER is essential. To address these challenges, we propose a novel knowledge distillation framework that combines meta-knowledge and curriculum-based distillation. Specifically, we fine-tune the teacher model to optimize it for the SER task. For the student model, we embed individual sequence time points into variable tokens, which are used to aggregate the global speech representation. Additionally, we combine supervised contrastive and cross-entropy loss to increase the inter-class distance between learnable features. 
Finally, we optimize the student model using both meta-knowledge and the curriculum-based distillation framework. Experimental results on two benchmark datasets, IEMOCAP and MELD, demonstrate that our method performs competitively with state-of-the-art approaches in SER.\",\"PeriodicalId\":13131,\"journal\":{\"name\":\"IEEE Transactions on Affective Computing\",\"volume\":\"16 3\",\"pages\":\"1307-1317\"},\"PeriodicalIF\":9.8000,\"publicationDate\":\"2025-06-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Affective Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11023201/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Affective Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11023201/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
A Knowledge Distillation-Based Approach to Speech Emotion Recognition
Due to rapid advancements in deep learning, Transformer-based architectures have proven effective in speech emotion recognition (SER), largely because they model long-term dependencies more effectively than recurrent networks. However, the standard Transformer architecture is not well suited to SER: its large number of parameters demands significant computational resources, making it impractical in resource-constrained environments. Furthermore, its application to SER is limited because human emotions, expressed across long segments of continuous speech, are inherently complex and ambiguous. Designing specialized Transformer models tailored for SER is therefore essential. To address these challenges, we propose a novel knowledge distillation framework that combines meta-knowledge and curriculum-based distillation. Specifically, we fine-tune the teacher model to optimize it for the SER task. For the student model, we embed individual sequence time points into variable tokens, which are used to aggregate a global speech representation. Additionally, we combine supervised contrastive and cross-entropy losses to increase the inter-class distance between learned features. Finally, we optimize the student model using both meta-knowledge and the curriculum-based distillation framework. Experimental results on two benchmark datasets, IEMOCAP and MELD, demonstrate that our method performs competitively with state-of-the-art approaches in SER.
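The student objective described in the abstract combines three standard ingredients: hard-label cross-entropy, a supervised contrastive term that pulls same-emotion embeddings together, and a temperature-scaled distillation term against the teacher's soft labels. The sketch below is an illustrative reconstruction of that combination, not the paper's implementation: the weights `alpha` and `beta`, the temperature `T`, and all function names are assumptions, and the supervised contrastive term follows the widely used Khosla et al. formulation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the true class."""
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def sup_con_loss(feats, labels, tau=0.1):
    """Supervised contrastive loss: for each anchor, maximize similarity
    to other samples sharing its emotion label, relative to all others."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T / tau
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    logits = np.where(self_mask, -np.inf, sim)        # exclude self-pairs
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    per_anchor = np.where(
        pos.any(axis=1),
        np.where(pos, log_prob, 0.0).sum(axis=1) / np.maximum(pos.sum(axis=1), 1),
        0.0,
    )
    return -per_anchor.mean()

def student_loss(student_logits, teacher_logits, labels, feats,
                 T=2.0, alpha=0.5, beta=0.3):
    """Combined objective: hard-label CE + temperature-scaled KL to the
    teacher's soft labels + supervised contrastive term on embeddings.
    All weights are illustrative hyperparameters, not the paper's values."""
    ce = cross_entropy(student_logits, labels)
    ps = softmax(student_logits / T)
    pt = softmax(teacher_logits / T)
    # KL(teacher || student), rescaled by T^2 as in Hinton-style distillation
    kd = np.mean(np.sum(pt * (np.log(pt + 1e-12) - np.log(ps + 1e-12)), axis=1)) * T * T
    sc = sup_con_loss(feats, labels)
    return alpha * ce + (1 - alpha) * kd + beta * sc
```

In this sketch the contrastive term operates on the student's pooled utterance embeddings (the aggregated global speech representation), while the distillation term matches the fine-tuned teacher's class distribution; curriculum-based distillation would additionally schedule which samples or how much teacher signal enters the loss over training, which is omitted here.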
Journal Introduction:
The IEEE Transactions on Affective Computing is an international and interdisciplinary journal. Its primary goal is to share research findings on the development of systems capable of recognizing, interpreting, and simulating human emotions and related affective phenomena. The journal publishes original research on the underlying principles and theories that explain how and why affective factors shape human-technology interactions. It also focuses on how techniques for sensing and simulating affect can enhance our understanding of human emotions and processes. Additionally, the journal explores the design, implementation, and evaluation of systems that prioritize the consideration of affect in their usability. We also welcome surveys of existing work that provide new perspectives on the historical and future directions of this field.