Haifeng Chen, Jing Li, Yan Li, Jian Li, Lang He, Dongmei Jiang
{"title":"为谈话中情绪识别建立说话人特定的长期语境模型","authors":"Haifeng Chen , Jing Li , Yan Li , Jian Li , Lang He , Dongmei Jiang","doi":"10.1016/j.inffus.2025.103785","DOIUrl":null,"url":null,"abstract":"<div><div>Emotion recognition in conversation (ERC) is essential for enabling empathetic responses and fostering harmonious human-computer interaction. Modeling speaker-specific temporal dependencies can enhance the capture of speaker-sensitive emotional representations, thereby improving the understanding of emotional dynamics among speakers within a conversation. However, prior research has primarily focused on information available during speaking moments, neglecting contextual cues during silent moments, leading to incomplete and discontinuous representation of each speaker’s emotional context. This study addresses these limitations by proposing a novel framework named the Speaker-specific Long-term Context Encoding Network (SLCNet) for the ERC task. SLCNet is designed to capture the complete speaker-specific long-term context, including both speaking and non-speaking moments. Specifically, an attention-based multimodal fusion network is first employed to dynamically focus on key modalities for effective multimodal fusion. Then, two well-designed graph neural networks are utilized for feature completion by leveraging intra-speaker temporal context and inter-speaker interaction influence, respectively. Finally, a shared LSTM models the temporally complete and speaker-sensitive context for each speaker. The proposed SLCNet is jointly optimized for multiple speakers and trained in an end-to-end manner. Extensive experiments on benchmark datasets demonstrate the superior performance of SLCNet and its ability to effectively complete emotional representations during silent moments, highlighting its potential to advance ERC research.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"127 ","pages":"Article 103785"},"PeriodicalIF":15.5000,"publicationDate":"2025-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Modeling speaker-specific long-term context for emotion recognition in conversation\",\"authors\":\"Haifeng Chen , Jing Li , Yan Li , Jian Li , Lang He , Dongmei Jiang\",\"doi\":\"10.1016/j.inffus.2025.103785\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Emotion recognition in conversation (ERC) is essential for enabling empathetic responses and fostering harmonious human-computer interaction. Modeling speaker-specific temporal dependencies can enhance the capture of speaker-sensitive emotional representations, thereby improving the understanding of emotional dynamics among speakers within a conversation. However, prior research has primarily focused on information available during speaking moments, neglecting contextual cues during silent moments, leading to incomplete and discontinuous representation of each speaker’s emotional context. This study addresses these limitations by proposing a novel framework named the Speaker-specific Long-term Context Encoding Network (SLCNet) for the ERC task. SLCNet is designed to capture the complete speaker-specific long-term context, including both speaking and non-speaking moments. Specifically, an attention-based multimodal fusion network is first employed to dynamically focus on key modalities for effective multimodal fusion. 
Then, two well-designed graph neural networks are utilized for feature completion by leveraging intra-speaker temporal context and inter-speaker interaction influence, respectively. Finally, a shared LSTM models the temporally complete and speaker-sensitive context for each speaker. The proposed SLCNet is jointly optimized for multiple speakers and trained in an end-to-end manner. Extensive experiments on benchmark datasets demonstrate the superior performance of SLCNet and its ability to effectively complete emotional representations during silent moments, highlighting its potential to advance ERC research.</div></div>\",\"PeriodicalId\":50367,\"journal\":{\"name\":\"Information Fusion\",\"volume\":\"127 \",\"pages\":\"Article 103785\"},\"PeriodicalIF\":15.5000,\"publicationDate\":\"2025-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Fusion\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1566253525008474\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1566253525008474","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Modeling speaker-specific long-term context for emotion recognition in conversation
Emotion recognition in conversation (ERC) is essential for enabling empathetic responses and fostering harmonious human-computer interaction. Modeling speaker-specific temporal dependencies can enhance the capture of speaker-sensitive emotional representations, thereby improving the understanding of emotional dynamics among speakers within a conversation. However, prior research has primarily focused on information available during speaking moments, neglecting contextual cues during silent moments, leading to incomplete and discontinuous representation of each speaker’s emotional context. This study addresses these limitations by proposing a novel framework named the Speaker-specific Long-term Context Encoding Network (SLCNet) for the ERC task. SLCNet is designed to capture the complete speaker-specific long-term context, including both speaking and non-speaking moments. Specifically, an attention-based multimodal fusion network is first employed to dynamically focus on key modalities for effective multimodal fusion. Then, two well-designed graph neural networks are utilized for feature completion by leveraging intra-speaker temporal context and inter-speaker interaction influence, respectively. Finally, a shared LSTM models the temporally complete and speaker-sensitive context for each speaker. The proposed SLCNet is jointly optimized for multiple speakers and trained in an end-to-end manner. Extensive experiments on benchmark datasets demonstrate the superior performance of SLCNet and its ability to effectively complete emotional representations during silent moments, highlighting its potential to advance ERC research.
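The pipeline outlined in the abstract (attention-based multimodal fusion, graph-based feature completion within and across speakers, then a shared LSTM over the completed utterance sequence) can be illustrated with a rough sketch. The PyTorch code below is a minimal illustrative sketch, not the authors' SLCNet implementation: the module names, single-round message passing, tensor shapes, and adjacency inputs are all assumptions made for clarity.

```python
# Illustrative sketch of the abstract's pipeline; all module names, shapes,
# and graph construction are assumptions, not the authors' released code.
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    """Attention over modalities: weights each modality's feature per utterance."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, modality_feats):            # (n_utt, n_modalities, dim)
        weights = torch.softmax(self.score(modality_feats), dim=1)
        return (weights * modality_feats).sum(dim=1)   # (n_utt, dim)


class GraphCompletion(nn.Module):
    """One round of normalized message passing over a dialogue graph; a stand-in
    for the feature-completion GNNs described in the abstract."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, adj):                    # x: (n_utt, dim), adj: (n_utt, n_utt)
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        return torch.relu(self.proj(adj @ x / deg) + x)


class SLCNetSketch(nn.Module):
    def __init__(self, dim=128, n_classes=6):
        super().__init__()
        self.fusion = AttentionFusion(dim)
        self.intra_speaker_gnn = GraphCompletion(dim)   # temporal context within a speaker
        self.inter_speaker_gnn = GraphCompletion(dim)   # interaction influence across speakers
        self.shared_lstm = nn.LSTM(dim, dim, batch_first=True)
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, modality_feats, intra_adj, inter_adj):
        # modality_feats: (n_utt, n_modalities, dim) for one conversation
        fused = self.fusion(modality_feats)                    # fuse modalities per utterance
        completed = self.intra_speaker_gnn(fused, intra_adj)   # complete speaker-specific context
        completed = self.inter_speaker_gnn(completed, inter_adj)
        hidden, _ = self.shared_lstm(completed.unsqueeze(0))   # long-term context modeling
        return self.classifier(hidden.squeeze(0))              # (n_utt, n_classes)


if __name__ == "__main__":
    n_utt, n_mod, dim = 10, 3, 128             # e.g. text, audio, visual features
    feats = torch.randn(n_utt, n_mod, dim)
    adj = torch.eye(n_utt)                     # placeholder adjacency matrices
    logits = SLCNetSketch(dim)(feats, adj, adj)
    print(logits.shape)                        # torch.Size([10, 6])
```

In the actual framework, the intra-speaker and inter-speaker graphs would be built from the conversation's turn structure so that silent moments receive completed features; the identity adjacency above is only a placeholder to make the sketch runnable.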
Journal introduction:
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.