Ziping Zhao;Jixin Liu;Haishuai Wang;Danushka Bandara;Jianhua Tao
{"title":"基于知识提取的语音情感识别方法","authors":"Ziping Zhao;Jixin Liu;Haishuai Wang;Danushka Bandara;Jianhua Tao","doi":"10.1109/TAFFC.2025.3574178","DOIUrl":null,"url":null,"abstract":"Due to rapid advancements in deep learning, Transformer-based architectures have proven effective in speech emotion recognition (SER), largely due to their ability to model long-term dependencies more effectively than recurrent networks. The current Transformer architecture is not well-suited for SER because its large parameter number demands significant computational resources, making it less feasible in environments with limited resources. Furthermore, its application to SER is limited because human emotions, which are expressed in long segments of continuous speech, are inherently complex and ambiguous. Therefore, designing specialized Transformer models tailored for SER is essential. To address these challenges, we propose a novel knowledge distillation framework that combines meta-knowledge and curriculum-based distillation. Specifically, we fine-tune the teacher model to optimize it for the SER task. For the student model, we embed individual sequence time points into variable tokens, which are used to aggregate the global speech representation. Additionally, we combine supervised contrastive and cross-entropy loss to increase the inter-class distance between learnable features. Finally, we optimize the student model using both meta-knowledge and the curriculum-based distillation framework. 
Experimental results on two benchmark datasets, IEMOCAP and MELD, demonstrate that our method performs competitively with state-of-the-art approaches in SER.","PeriodicalId":13131,"journal":{"name":"IEEE Transactions on Affective Computing","volume":"16 3","pages":"1307-1317"},"PeriodicalIF":9.8000,"publicationDate":"2025-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Knowledge Distillation-Based Approach to Speech Emotion Recognition\",\"authors\":\"Ziping Zhao;Jixin Liu;Haishuai Wang;Danushka Bandara;Jianhua Tao\",\"doi\":\"10.1109/TAFFC.2025.3574178\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Due to rapid advancements in deep learning, Transformer-based architectures have proven effective in speech emotion recognition (SER), largely due to their ability to model long-term dependencies more effectively than recurrent networks. The current Transformer architecture is not well-suited for SER because its large parameter number demands significant computational resources, making it less feasible in environments with limited resources. Furthermore, its application to SER is limited because human emotions, which are expressed in long segments of continuous speech, are inherently complex and ambiguous. Therefore, designing specialized Transformer models tailored for SER is essential. To address these challenges, we propose a novel knowledge distillation framework that combines meta-knowledge and curriculum-based distillation. Specifically, we fine-tune the teacher model to optimize it for the SER task. For the student model, we embed individual sequence time points into variable tokens, which are used to aggregate the global speech representation. Additionally, we combine supervised contrastive and cross-entropy loss to increase the inter-class distance between learnable features. 
Finally, we optimize the student model using both meta-knowledge and the curriculum-based distillation framework. Experimental results on two benchmark datasets, IEMOCAP and MELD, demonstrate that our method performs competitively with state-of-the-art approaches in SER.\",\"PeriodicalId\":13131,\"journal\":{\"name\":\"IEEE Transactions on Affective Computing\",\"volume\":\"16 3\",\"pages\":\"1307-1317\"},\"PeriodicalIF\":9.8000,\"publicationDate\":\"2025-06-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Affective Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11023201/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Affective Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11023201/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
A Knowledge Distillation-Based Approach to Speech Emotion Recognition
Due to rapid advancements in deep learning, Transformer-based architectures have proven effective in speech emotion recognition (SER), largely because they model long-term dependencies more effectively than recurrent networks. However, the standard Transformer architecture is not well suited to SER: its large number of parameters demands significant computational resources, making it impractical in resource-constrained environments. Furthermore, its application to SER is limited because human emotions, expressed across long segments of continuous speech, are inherently complex and ambiguous. Designing specialized Transformer models tailored for SER is therefore essential. To address these challenges, we propose a novel knowledge distillation framework that combines meta-knowledge and curriculum-based distillation. Specifically, we fine-tune the teacher model to optimize it for the SER task. For the student model, we embed individual sequence time points into variable tokens, which are used to aggregate a global speech representation. Additionally, we combine supervised contrastive and cross-entropy losses to increase the inter-class distance between learned features. Finally, we optimize the student model using both meta-knowledge and the curriculum-based distillation framework. Experimental results on two benchmark datasets, IEMOCAP and MELD, demonstrate that our method performs competitively with state-of-the-art approaches in SER.
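The student objective described in the abstract combines three standard ingredients: hard-label cross-entropy, a supervised contrastive term that pulls same-emotion embeddings together, and a temperature-scaled distillation term against the teacher's soft labels. The sketch below is an illustrative reconstruction of that combination, not the paper's implementation: the weights `alpha` and `beta`, the temperature `T`, and all function names are assumptions, and the supervised contrastive term follows the widely used Khosla et al. formulation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the true class."""
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def sup_con_loss(feats, labels, tau=0.1):
    """Supervised contrastive loss: for each anchor, maximize similarity
    to other samples sharing its emotion label, relative to all others."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T / tau
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    logits = np.where(self_mask, -np.inf, sim)        # exclude self-pairs
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    per_anchor = np.where(
        pos.any(axis=1),
        np.where(pos, log_prob, 0.0).sum(axis=1) / np.maximum(pos.sum(axis=1), 1),
        0.0,
    )
    return -per_anchor.mean()

def student_loss(student_logits, teacher_logits, labels, feats,
                 T=2.0, alpha=0.5, beta=0.3):
    """Combined objective: hard-label CE + temperature-scaled KL to the
    teacher's soft labels + supervised contrastive term on embeddings.
    All weights are illustrative hyperparameters, not the paper's values."""
    ce = cross_entropy(student_logits, labels)
    ps = softmax(student_logits / T)
    pt = softmax(teacher_logits / T)
    # KL(teacher || student), rescaled by T^2 as in Hinton-style distillation
    kd = np.mean(np.sum(pt * (np.log(pt + 1e-12) - np.log(ps + 1e-12)), axis=1)) * T * T
    sc = sup_con_loss(feats, labels)
    return alpha * ce + (1 - alpha) * kd + beta * sc
```

In this sketch the contrastive term operates on the student's pooled utterance embeddings (the aggregated global speech representation), while the distillation term matches the fine-tuned teacher's class distribution; curriculum-based distillation would additionally schedule which samples or how much teacher signal enters the loss over training, which is omitted here.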
Journal Introduction:
The IEEE Transactions on Affective Computing is an international and interdisciplinary journal. Its primary goal is to share research findings on the development of systems capable of recognizing, interpreting, and simulating human emotions and related affective phenomena. The journal publishes original research on the underlying principles and theories that explain how and why affective factors shape human-technology interactions. It also focuses on how techniques for sensing and simulating affect can enhance our understanding of human emotions and processes. Additionally, the journal explores the design, implementation, and evaluation of systems that prioritize the consideration of affect in their usability. We also welcome surveys of existing work that provide new perspectives on the historical and future directions of this field.