Xin Qi, Yujun Wen, Junpeng Gong, Pengzhou Zhang, Yao Zheng
{"title":"语音情感识别的多模态解纠缠隐式蒸馏","authors":"Xin Qi, Yujun Wen, Junpeng Gong, Pengzhou Zhang, Yao Zheng","doi":"10.1016/j.ipm.2025.104213","DOIUrl":null,"url":null,"abstract":"<div><div>Audio signals are generally utilized with textual data for speech emotion recognition. Nevertheless, cross-modal interactions suffer from distribution discrepancy and information redundancy, leading to an inaccurate multimodal representation. Hence, this paper proposes a multimodal disentanglement implicit distillation model (MDID) that excavates and exploits each modality’s sentiment and specific characteristics. Specifically, the pre-trained models extract high-level acoustic and textual features and align them via an attention mechanism. Then, each modality is disentangled into modality sentiment-specific features. Subsequently, feature-level and logit-level distillation distill the purified modality-specific feature into the modality-sentiment feature. Compared to the adaptive fusion feature, solely employing the refined modality-sentiment feature yields superior performance for emotion recognition. Comprehensive experiments on the IEMOCAP and RAVDESS datasets indicate that MDID outperforms state-of-the-art approaches.</div></div>","PeriodicalId":50365,"journal":{"name":"Information Processing & Management","volume":"62 5","pages":"Article 104213"},"PeriodicalIF":7.4000,"publicationDate":"2025-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multimodal disentanglement implicit distillation for speech emotion recognition\",\"authors\":\"Xin Qi, Yujun Wen, Junpeng Gong, Pengzhou Zhang, Yao Zheng\",\"doi\":\"10.1016/j.ipm.2025.104213\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Audio signals are generally utilized with textual data for speech emotion recognition. Nevertheless, cross-modal interactions suffer from distribution discrepancy and information redundancy, leading to an inaccurate multimodal representation. Hence, this paper proposes a multimodal disentanglement implicit distillation model (MDID) that excavates and exploits each modality’s sentiment and specific characteristics. Specifically, the pre-trained models extract high-level acoustic and textual features and align them via an attention mechanism. Then, each modality is disentangled into modality sentiment-specific features. Subsequently, feature-level and logit-level distillation distill the purified modality-specific feature into the modality-sentiment feature. Compared to the adaptive fusion feature, solely employing the refined modality-sentiment feature yields superior performance for emotion recognition. 
Comprehensive experiments on the IEMOCAP and RAVDESS datasets indicate that MDID outperforms state-of-the-art approaches.</div></div>\",\"PeriodicalId\":50365,\"journal\":{\"name\":\"Information Processing & Management\",\"volume\":\"62 5\",\"pages\":\"Article 104213\"},\"PeriodicalIF\":7.4000,\"publicationDate\":\"2025-05-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Processing & Management\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0306457325001542\",\"RegionNum\":1,\"RegionCategory\":\"管理学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Processing & Management","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0306457325001542","RegionNum":1,"RegionCategory":"管理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Multimodal disentanglement implicit distillation for speech emotion recognition
Audio signals are generally utilized together with textual data for speech emotion recognition. Nevertheless, cross-modal interactions suffer from distribution discrepancy and information redundancy, leading to inaccurate multimodal representations. Hence, this paper proposes a multimodal disentanglement implicit distillation model (MDID) that mines and exploits each modality's sentiment-related and modality-specific characteristics. Specifically, pre-trained models extract high-level acoustic and textual features, which are aligned via an attention mechanism. Each modality is then disentangled into a modality-sentiment feature and a modality-specific feature. Subsequently, feature-level and logit-level distillation transfer the purified modality-specific feature into the modality-sentiment feature. Compared with an adaptive fusion feature, employing only the refined modality-sentiment feature yields superior emotion recognition performance. Comprehensive experiments on the IEMOCAP and RAVDESS datasets indicate that MDID outperforms state-of-the-art approaches.
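The abstract describes the pipeline only at a high level. The PyTorch sketch below illustrates one plausible reading of it: cross-modal attention alignment of pre-extracted acoustic and textual features, disentanglement of each modality into a sentiment branch and a specific branch, and feature-level (MSE) plus logit-level (KL) distillation from the modality-specific branch into the modality-sentiment branch. All module names, dimensions, loss choices, and weights are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of an MDID-style pipeline; architecture details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MDIDSketch(nn.Module):
    def __init__(self, dim=256, num_classes=4):
        super().__init__()
        # Cross-modal attention: each modality attends to the other for alignment.
        self.audio_to_text = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.text_to_audio = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # Disentanglement encoders: one sentiment branch and one specific branch per modality.
        self.audio_sent = nn.Linear(dim, dim)
        self.audio_spec = nn.Linear(dim, dim)
        self.text_sent = nn.Linear(dim, dim)
        self.text_spec = nn.Linear(dim, dim)
        # Classifiers used for logit-level distillation and the final prediction.
        self.sent_head = nn.Linear(2 * dim, num_classes)
        self.spec_head = nn.Linear(2 * dim, num_classes)

    def forward(self, audio_feat, text_feat):
        # audio_feat, text_feat: (batch, seq_len, dim) from pre-trained encoders,
        # assumed to be already projected to a common dimension `dim`.
        a_aligned, _ = self.audio_to_text(audio_feat, text_feat, text_feat)
        t_aligned, _ = self.text_to_audio(text_feat, audio_feat, audio_feat)
        a_pooled, t_pooled = a_aligned.mean(dim=1), t_aligned.mean(dim=1)

        # Disentangle each modality into sentiment and specific components.
        a_sent, a_spec = self.audio_sent(a_pooled), self.audio_spec(a_pooled)
        t_sent, t_spec = self.text_sent(t_pooled), self.text_spec(t_pooled)

        sent_feat = torch.cat([a_sent, t_sent], dim=-1)
        spec_feat = torch.cat([a_spec, t_spec], dim=-1)
        return sent_feat, spec_feat, self.sent_head(sent_feat), self.spec_head(spec_feat)


def distillation_losses(sent_feat, spec_feat, sent_logits, spec_logits, temperature=2.0):
    """Feature-level (MSE) and logit-level (KL) distillation from the
    modality-specific branch (teacher, detached) into the modality-sentiment branch."""
    feat_loss = F.mse_loss(sent_feat, spec_feat.detach())
    logit_loss = F.kl_div(
        F.log_softmax(sent_logits / temperature, dim=-1),
        F.softmax(spec_logits.detach() / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return feat_loss, logit_loss


if __name__ == "__main__":
    model = MDIDSketch()
    audio = torch.randn(8, 100, 256)   # placeholder acoustic features
    text = torch.randn(8, 40, 256)     # placeholder textual features
    sent_f, spec_f, sent_l, spec_l = model(audio, text)
    labels = torch.randint(0, 4, (8,))
    task_loss = F.cross_entropy(sent_l, labels)  # only the sentiment branch is used at inference
    feat_loss, logit_loss = distillation_losses(sent_f, spec_f, sent_l, spec_l)
    total = task_loss + 0.5 * feat_loss + 0.5 * logit_loss
    print(total.item())
```

Detaching the modality-specific branch treats it as the teacher, which matches the abstract's stated direction of distillation; at inference only the refined modality-sentiment branch would be used, consistent with the claim that it alone outperforms the adaptive fusion feature.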
Journal introduction:
Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology, marketing, and social computing.
We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.