DISD-Net: A Dynamic Interactive Network With Self-Distillation for Cross-Subject Multi-Modal Emotion Recognition

IF 9.7 | CAS Zone 1 (Computer Science) | JCR Q1, COMPUTER SCIENCE, INFORMATION SYSTEMS
Cheng Cheng;Wenzhe Liu;Xinying Wang;Lin Feng;Ziyu Jia
{"title":"DISD-Net: A Dynamic Interactive Network With Self-Distillation for Cross-Subject Multi-Modal Emotion Recognition","authors":"Cheng Cheng;Wenzhe Liu;Xinying Wang;Lin Feng;Ziyu Jia","doi":"10.1109/TMM.2025.3535344","DOIUrl":null,"url":null,"abstract":"Multi-modal Emotion Recognition (MER) has demonstrated competitive performance in affective computing, owing to synthesizing information from diverse modalities. However, many existing approaches still face unresolved challenges, such as: (i) how to learn compact yet representative features from multi-modal data simultaneously and (ii) how to address differences among subjects and enhance the generalization of the emotion recognition model, given the diverse nature of individual biological signals. To this end, we propose a Dynamic Interactive Network with Self-Distillation (DISD-Net) for cross-subject MER. The DISD-Net incorporates a dynamin interactive module to capture the intra- and inter-modal interactions from multi-modal data. Additionally, to enhance compactness in modal representations, we leverage the soft labels generated by the DISD-Net model as supplemental training guidance. This involves incorporating self-distillation, aiming to transfer the knowledge that the DISD-Net model contains hard and soft labels to each modality. Finally, domain adaptation (DA) is seamlessly integrated into the dynamic interactive and self-distillation components, forming a unified framework to extract subject-invariant multi-modal emotional features. Experimental results indicate that the proposed model achieves a mean accuracy of 75.00% with a standard deviation of 7.68% for the DEAP dataset and a mean accuracy of 65.65% with a standard deviation of 5.08% for the SEED-IV dataset.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4643-4655"},"PeriodicalIF":9.7000,"publicationDate":"2025-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10857425/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Multi-modal Emotion Recognition (MER) has demonstrated competitive performance in affective computing, owing to its ability to synthesize information from diverse modalities. However, many existing approaches still face unresolved challenges, such as: (i) how to learn compact yet representative features from multi-modal data simultaneously and (ii) how to address differences among subjects and enhance the generalization of the emotion recognition model, given the diverse nature of individual biological signals. To this end, we propose a Dynamic Interactive Network with Self-Distillation (DISD-Net) for cross-subject MER. The DISD-Net incorporates a dynamic interactive module to capture the intra- and inter-modal interactions from multi-modal data. Additionally, to enhance the compactness of the modal representations, we leverage the soft labels generated by the DISD-Net model as supplemental training guidance. This involves incorporating self-distillation, which transfers the knowledge contained in the DISD-Net model's hard and soft labels to each modality. Finally, domain adaptation (DA) is seamlessly integrated with the dynamic interactive and self-distillation components, forming a unified framework that extracts subject-invariant multi-modal emotional features. Experimental results indicate that the proposed model achieves a mean accuracy of 75.00% with a standard deviation of 7.68% on the DEAP dataset and a mean accuracy of 65.65% with a standard deviation of 5.08% on the SEED-IV dataset.
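The abstract does not include any code, so the sketch below is only a rough illustration of how a self-distillation objective of the kind described above is commonly assembled: each modality branch is supervised by the ground-truth (hard) labels and, via a temperature-scaled KL term, by the softened predictions of the fused head, and a gradient-reversal layer is one standard way to attach a subject-adversarial domain-adaptation term. The names `self_distillation_loss`, `modality_logits`, `fused_logits`, and the hyperparameters `T` and `alpha` are hypothetical; this is not the authors' DISD-Net implementation.

```python
# Minimal sketch (assumed formulation, not the authors' code): per-modality
# self-distillation from the fused prediction, plus a gradient-reversal layer
# that a subject classifier could use for domain adaptation.
import torch
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips and scales gradients in the
    backward pass. Commonly used to learn subject/domain-invariant features."""

    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


def self_distillation_loss(modality_logits, fused_logits, hard_labels,
                           T=2.0, alpha=0.5):
    """Combine hard-label supervision with soft-label distillation from the
    fused (teacher) prediction into every modality (student) branch.

    modality_logits: list of [B, C] tensors, one per modality branch.
    fused_logits:    [B, C] tensor from the fused multi-modal head.
    hard_labels:     [B] ground-truth emotion labels.
    T, alpha:        temperature and hard/soft mixing weight (assumed values).
    """
    # The fused head is trained on the ground-truth (hard) labels.
    loss = F.cross_entropy(fused_logits, hard_labels)

    # Softened teacher targets; detached so the teacher is not pulled
    # toward each student branch.
    soft_targets = F.softmax(fused_logits.detach() / T, dim=1)

    for logits in modality_logits:
        hard_term = F.cross_entropy(logits, hard_labels)
        soft_term = F.kl_div(F.log_softmax(logits / T, dim=1),
                             soft_targets, reduction="batchmean") * (T * T)
        loss = loss + (1.0 - alpha) * hard_term + alpha * soft_term

    return loss
```

One conventional way to realize the domain-adaptation component is to feed `GradReverse.apply(shared_features)` into a subject classifier trained with cross-entropy on subject IDs, which pushes the shared features toward subject invariance; the paper's exact DA formulation may differ.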
Source journal

IEEE Transactions on Multimedia (Engineering & Technology - Telecommunications)
CiteScore: 11.70
Self-citation rate: 11.00%
Articles published: 576
Review time: 5.5 months

About the journal: The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.