Haifeng Chen, Jing Li, Yan Li, Jian Li, Lang He, Dongmei Jiang
{"title":"为谈话中情绪识别建立说话人特定的长期语境模型","authors":"Haifeng Chen , Jing Li , Yan Li , Jian Li , Lang He , Dongmei Jiang","doi":"10.1016/j.inffus.2025.103785","DOIUrl":null,"url":null,"abstract":"<div><div>Emotion recognition in conversation (ERC) is essential for enabling empathetic responses and fostering harmonious human-computer interaction. Modeling speaker-specific temporal dependencies can enhance the capture of speaker-sensitive emotional representations, thereby improving the understanding of emotional dynamics among speakers within a conversation. However, prior research has primarily focused on information available during speaking moments, neglecting contextual cues during silent moments, leading to incomplete and discontinuous representation of each speaker’s emotional context. This study addresses these limitations by proposing a novel framework named the Speaker-specific Long-term Context Encoding Network (SLCNet) for the ERC task. SLCNet is designed to capture the complete speaker-specific long-term context, including both speaking and non-speaking moments. Specifically, an attention-based multimodal fusion network is first employed to dynamically focus on key modalities for effective multimodal fusion. Then, two well-designed graph neural networks are utilized for feature completion by leveraging intra-speaker temporal context and inter-speaker interaction influence, respectively. Finally, a shared LSTM models the temporally complete and speaker-sensitive context for each speaker. The proposed SLCNet is jointly optimized for multiple speakers and trained in an end-to-end manner. Extensive experiments on benchmark datasets demonstrate the superior performance of SLCNet and its ability to effectively complete emotional representations during silent moments, highlighting its potential to advance ERC research.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"127 ","pages":"Article 103785"},"PeriodicalIF":15.5000,"publicationDate":"2025-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Modeling speaker-specific long-term context for emotion recognition in conversation\",\"authors\":\"Haifeng Chen , Jing Li , Yan Li , Jian Li , Lang He , Dongmei Jiang\",\"doi\":\"10.1016/j.inffus.2025.103785\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Emotion recognition in conversation (ERC) is essential for enabling empathetic responses and fostering harmonious human-computer interaction. Modeling speaker-specific temporal dependencies can enhance the capture of speaker-sensitive emotional representations, thereby improving the understanding of emotional dynamics among speakers within a conversation. However, prior research has primarily focused on information available during speaking moments, neglecting contextual cues during silent moments, leading to incomplete and discontinuous representation of each speaker’s emotional context. This study addresses these limitations by proposing a novel framework named the Speaker-specific Long-term Context Encoding Network (SLCNet) for the ERC task. SLCNet is designed to capture the complete speaker-specific long-term context, including both speaking and non-speaking moments. Specifically, an attention-based multimodal fusion network is first employed to dynamically focus on key modalities for effective multimodal fusion. 
Then, two well-designed graph neural networks are utilized for feature completion by leveraging intra-speaker temporal context and inter-speaker interaction influence, respectively. Finally, a shared LSTM models the temporally complete and speaker-sensitive context for each speaker. The proposed SLCNet is jointly optimized for multiple speakers and trained in an end-to-end manner. Extensive experiments on benchmark datasets demonstrate the superior performance of SLCNet and its ability to effectively complete emotional representations during silent moments, highlighting its potential to advance ERC research.</div></div>\",\"PeriodicalId\":50367,\"journal\":{\"name\":\"Information Fusion\",\"volume\":\"127 \",\"pages\":\"Article 103785\"},\"PeriodicalIF\":15.5000,\"publicationDate\":\"2025-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Fusion\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1566253525008474\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1566253525008474","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Modeling speaker-specific long-term context for emotion recognition in conversation
Emotion recognition in conversation (ERC) is essential for enabling empathetic responses and fostering harmonious human-computer interaction. Modeling speaker-specific temporal dependencies can enhance the capture of speaker-sensitive emotional representations, thereby improving the understanding of emotional dynamics among speakers within a conversation. However, prior research has primarily focused on information available during speaking moments, neglecting contextual cues during silent moments, leading to incomplete and discontinuous representation of each speaker’s emotional context. This study addresses these limitations by proposing a novel framework named the Speaker-specific Long-term Context Encoding Network (SLCNet) for the ERC task. SLCNet is designed to capture the complete speaker-specific long-term context, including both speaking and non-speaking moments. Specifically, an attention-based multimodal fusion network is first employed to dynamically focus on key modalities for effective multimodal fusion. Then, two well-designed graph neural networks are utilized for feature completion by leveraging intra-speaker temporal context and inter-speaker interaction influence, respectively. Finally, a shared LSTM models the temporally complete and speaker-sensitive context for each speaker. The proposed SLCNet is jointly optimized for multiple speakers and trained in an end-to-end manner. Extensive experiments on benchmark datasets demonstrate the superior performance of SLCNet and its ability to effectively complete emotional representations during silent moments, highlighting its potential to advance ERC research.
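The pipeline outlined in the abstract (attention-based multimodal fusion, graph-based feature completion within and across speakers, then a shared LSTM over the completed utterance sequence) can be illustrated with a rough sketch. The PyTorch code below is a minimal illustrative sketch, not the authors' SLCNet implementation: the module names, single-round message passing, tensor shapes, and adjacency inputs are all assumptions made for clarity.

```python
# Illustrative sketch of the abstract's pipeline; all module names, shapes,
# and graph construction are assumptions, not the authors' released code.
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    """Attention over modalities: weights each modality's feature per utterance."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, modality_feats):            # (n_utt, n_modalities, dim)
        weights = torch.softmax(self.score(modality_feats), dim=1)
        return (weights * modality_feats).sum(dim=1)   # (n_utt, dim)


class GraphCompletion(nn.Module):
    """One round of normalized message passing over a dialogue graph; a stand-in
    for the feature-completion GNNs described in the abstract."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, adj):                    # x: (n_utt, dim), adj: (n_utt, n_utt)
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        return torch.relu(self.proj(adj @ x / deg) + x)


class SLCNetSketch(nn.Module):
    def __init__(self, dim=128, n_classes=6):
        super().__init__()
        self.fusion = AttentionFusion(dim)
        self.intra_speaker_gnn = GraphCompletion(dim)   # temporal context within a speaker
        self.inter_speaker_gnn = GraphCompletion(dim)   # interaction influence across speakers
        self.shared_lstm = nn.LSTM(dim, dim, batch_first=True)
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, modality_feats, intra_adj, inter_adj):
        # modality_feats: (n_utt, n_modalities, dim) for one conversation
        fused = self.fusion(modality_feats)                    # fuse modalities per utterance
        completed = self.intra_speaker_gnn(fused, intra_adj)   # complete speaker-specific context
        completed = self.inter_speaker_gnn(completed, inter_adj)
        hidden, _ = self.shared_lstm(completed.unsqueeze(0))   # long-term context modeling
        return self.classifier(hidden.squeeze(0))              # (n_utt, n_classes)


if __name__ == "__main__":
    n_utt, n_mod, dim = 10, 3, 128             # e.g. text, audio, visual features
    feats = torch.randn(n_utt, n_mod, dim)
    adj = torch.eye(n_utt)                     # placeholder adjacency matrices
    logits = SLCNetSketch(dim)(feats, adj, adj)
    print(logits.shape)                        # torch.Size([10, 6])
```

In the actual framework, the intra-speaker and inter-speaker graphs would be built from the conversation's turn structure so that silent moments receive completed features; the identity adjacency above is only a placeholder to make the sketch runnable.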
Journal introduction:
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.