{"title":"对话中多模态情感识别的交叉模态门控特征增强。","authors":"Shiyun Zhao, Jinchang Ren, Xiaojuan Zhou","doi":"10.1038/s41598-025-11989-6","DOIUrl":null,"url":null,"abstract":"<p><p>Emotion recognition in conversations (ERC), which involves identifying the emotional state of each utterance within a dialogue, plays a vital role in developing empathetic artificial intelligence systems. In practical applications, such as video-based recruitment interviews, customer service, health monitoring, intelligent personal assistants, and online education, ERC can facilitate the analysis of emotional cues, improve decision-making processes, and enhance user interaction and satisfaction. Current multimodal emotion recognition research faces several challenges, such as ineffective emotional information extraction from single modalities, underused complementary features, and inter-modal redundancy. To tackle these issues, this paper introduces a cross-modal gated attention mechanism for emotion recognition. The method extracts and fuses visual, textual, and auditory features to enhance accuracy and stability. A cross-modal guided gating mechanism is designed to strengthen single-modality features and utilize a third modality to improve bimodal feature fusion, boosting multimodal feature representation. Furthermore, a cross-modal distillation loss function is proposed to reduce redundancy and improve feature discrimination. This function employs a dual-supervision mechanism with teacher and student models, ensuring consistency in single-modal, bimodal, and trimodal feature representations. Experimental results on the IEMOCAP and MELD datasets indicate that the proposed method achieves higher accuracy and comparable F1 scores than existing approaches, highlighting its effectiveness in capturing multimodal dependencies and balancing modality contributions.</p>","PeriodicalId":21811,"journal":{"name":"Scientific Reports","volume":"15 1","pages":"30004"},"PeriodicalIF":3.9000,"publicationDate":"2025-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12357892/pdf/","citationCount":"0","resultStr":"{\"title\":\"Cross-modal gated feature enhancement for multimodal emotion recognition in conversations.\",\"authors\":\"Shiyun Zhao, Jinchang Ren, Xiaojuan Zhou\",\"doi\":\"10.1038/s41598-025-11989-6\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Emotion recognition in conversations (ERC), which involves identifying the emotional state of each utterance within a dialogue, plays a vital role in developing empathetic artificial intelligence systems. In practical applications, such as video-based recruitment interviews, customer service, health monitoring, intelligent personal assistants, and online education, ERC can facilitate the analysis of emotional cues, improve decision-making processes, and enhance user interaction and satisfaction. Current multimodal emotion recognition research faces several challenges, such as ineffective emotional information extraction from single modalities, underused complementary features, and inter-modal redundancy. To tackle these issues, this paper introduces a cross-modal gated attention mechanism for emotion recognition. The method extracts and fuses visual, textual, and auditory features to enhance accuracy and stability. 
A cross-modal guided gating mechanism is designed to strengthen single-modality features and utilize a third modality to improve bimodal feature fusion, boosting multimodal feature representation. Furthermore, a cross-modal distillation loss function is proposed to reduce redundancy and improve feature discrimination. This function employs a dual-supervision mechanism with teacher and student models, ensuring consistency in single-modal, bimodal, and trimodal feature representations. Experimental results on the IEMOCAP and MELD datasets indicate that the proposed method achieves higher accuracy and comparable F1 scores than existing approaches, highlighting its effectiveness in capturing multimodal dependencies and balancing modality contributions.</p>\",\"PeriodicalId\":21811,\"journal\":{\"name\":\"Scientific Reports\",\"volume\":\"15 1\",\"pages\":\"30004\"},\"PeriodicalIF\":3.9000,\"publicationDate\":\"2025-08-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12357892/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Scientific Reports\",\"FirstCategoryId\":\"103\",\"ListUrlMain\":\"https://doi.org/10.1038/s41598-025-11989-6\",\"RegionNum\":2,\"RegionCategory\":\"综合性期刊\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"MULTIDISCIPLINARY SCIENCES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Scientific Reports","FirstCategoryId":"103","ListUrlMain":"https://doi.org/10.1038/s41598-025-11989-6","RegionNum":2,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}
Cross-modal gated feature enhancement for multimodal emotion recognition in conversations.
Emotion recognition in conversations (ERC), which involves identifying the emotional state of each utterance within a dialogue, plays a vital role in developing empathetic artificial intelligence systems. In practical applications such as video-based recruitment interviews, customer service, health monitoring, intelligent personal assistants, and online education, ERC can facilitate the analysis of emotional cues, improve decision-making processes, and enhance user interaction and satisfaction. Current multimodal emotion recognition research faces several challenges, including ineffective emotional information extraction from single modalities, underused complementary features, and inter-modal redundancy. To tackle these issues, this paper introduces a cross-modal gated attention mechanism for emotion recognition. The method extracts and fuses visual, textual, and auditory features to enhance accuracy and stability. A cross-modal guided gating mechanism is designed to strengthen single-modality features and to leverage a third modality to improve bimodal feature fusion, thereby boosting multimodal feature representation. Furthermore, a cross-modal distillation loss function is proposed to reduce redundancy and improve feature discrimination. This function employs a dual-supervision mechanism with teacher and student models, ensuring consistency across single-modal, bimodal, and trimodal feature representations. Experimental results on the IEMOCAP and MELD datasets indicate that the proposed method achieves higher accuracy than existing approaches, with comparable F1 scores, highlighting its effectiveness in capturing multimodal dependencies and balancing modality contributions.
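To make the two mechanisms in the abstract concrete, below is a minimal PyTorch sketch of (a) a gate over a bimodal pair that is guided by a third modality, and (b) a consistency-style distillation loss that pulls uni- and bi-modal "student" logits toward a trimodal "teacher". All names (GuidedBimodalGate, consistency_distillation_loss, the feature dimension) are hypothetical illustrations of one plausible reading of the abstract, not the authors' actual implementation.

```python
# Hedged sketch: a generic guided-gating fusion module and a
# teacher-student consistency loss. Not the paper's exact method.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GuidedBimodalGate(nn.Module):
    """Fuse two modalities, with a third modality steering the gate."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)   # candidate bimodal fusion
        self.gate = nn.Linear(3 * dim, dim)   # gate conditioned on all three

    def forward(self, x_a, x_b, x_guide):
        # Candidate fused representation of the bimodal pair.
        fused = torch.tanh(self.proj(torch.cat([x_a, x_b], dim=-1)))
        # Gate values in (0, 1), guided by the third modality.
        g = torch.sigmoid(self.gate(torch.cat([x_a, x_b, x_guide], dim=-1)))
        # Gated residual: fall back to the pair's average where the gate closes.
        return g * fused + (1 - g) * 0.5 * (x_a + x_b)


def consistency_distillation_loss(uni, bi, tri, temperature=2.0):
    """KL consistency between trimodal 'teacher' logits and
    uni-/bi-modal 'student' logits (a generic KD formulation)."""
    teacher = F.softmax(tri / temperature, dim=-1).detach()
    loss = 0.0
    for student in list(uni) + list(bi):
        log_p = F.log_softmax(student / temperature, dim=-1)
        loss = loss + F.kl_div(log_p, teacher, reduction="batchmean")
    return loss * temperature ** 2


if __name__ == "__main__":
    d, n_cls = 128, 6  # e.g. six emotion classes, one common IEMOCAP setup
    t, a, v = (torch.randn(4, d) for _ in range(3))  # toy text/audio/vision
    gate = GuidedBimodalGate(d)
    ta = gate(t, a, v)        # text+audio fusion, guided by vision
    tav = gate(ta, v, t)      # fold vision in, guided by text (toy trimodal)
    head = nn.Linear(d, n_cls)
    loss = consistency_distillation_loss(
        uni=[head(t), head(a), head(v)],
        bi=[head(ta)],
        tri=head(tav),
    )
    print(float(loss))
```

In this reading, the gate lets a reliable third modality open or close the bimodal fusion per feature dimension, and the distillation term supervises every lower-order representation against the richest (trimodal) one, which matches the abstract's stated goals of reducing redundancy and keeping single-, bi-, and tri-modal features consistent.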
Journal introduction:
We publish original research from all areas of the natural sciences, psychology, medicine and engineering. You can learn more about what we publish by browsing our specific scientific subject areas below or explore Scientific Reports by browsing all articles and collections.
Scientific Reports has a 2-year impact factor of 4.380 (2021) and is the 6th most-cited journal in the world, with more than 540,000 citations in 2020 (Clarivate Analytics, 2021).
•Engineering
Engineering covers all aspects of engineering, technology, and applied science. It plays a crucial role in the development of technologies to address some of the world's biggest challenges, helping to save lives and improve the way we live.
•Physical sciences
Physical sciences are those academic disciplines that aim to uncover the underlying laws of nature — often written in the language of mathematics. It is a collective term for areas of study including astronomy, chemistry, materials science and physics.
•Earth and environmental sciences
Earth and environmental sciences cover all aspects of Earth and planetary science and broadly encompass solid Earth processes, surface and atmospheric dynamics, Earth system history, climate and climate change, marine and freshwater systems, and ecology. These fields also consider the interactions between humans and these systems.
•Biological sciences
Biological sciences encompass all the divisions of natural sciences examining various aspects of vital processes. The field includes anatomy, physiology, cell biology, biochemistry and biophysics, and covers all organisms, from microorganisms and animals to plants.
•Health sciences
The health sciences study health, disease and healthcare. This field of study aims to develop knowledge, interventions and technology for use in healthcare to improve the treatment of patients.