{"title":"运用递进对比深度监督研究谈话中说话人未知情绪识别","authors":"Siyuan Shen;Feng Liu;Hanyang Wang;Aimin Zhou","doi":"10.1109/TAFFC.2025.3558222","DOIUrl":null,"url":null,"abstract":"Emotion recognition in conversation has attained increasing attention for perceiving user emotion in practical conversational applications. Conversational utterances spoken alternately by different speakers inspire most studies to leverage speaker information based on golden speaker labels. In this work, we challenge the existing paradigm of utilizing available speaker labels with a more realistic scenario, where the speaker identity of each utterance is unknown during inference. We propose Progressive Contrastive Deep Supervision for multimodal emotion recognition in conversation (PCDS), incorporating speaker diarization and emotion recognition into one unified framework. To facilitate joint task learning, we inject speaker and emotion bias into the network progressively via contrastive deep supervision, with the task-irrelevant contrast being the intermediate transition. To obtain explicit speaker dependency, we propose a speaker contrast and clustering module (SCC) to endow the capability of partitioning speakers into groups even when neither speaker label nor number of speakers is known as a priori. Experiments on two ERC benchmarks, including IEMOCAP and MELD demonstrate the effectiveness of the proposed method. We also show that progressive contrastive deep supervision helps reconcile the underlying tension between speaker diarization and emotion recognition.","PeriodicalId":13131,"journal":{"name":"IEEE Transactions on Affective Computing","volume":"16 3","pages":"2261-2273"},"PeriodicalIF":9.8000,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Towards Speaker-Unknown Emotion Recognition in Conversation via Progressive Contrastive Deep Supervision\",\"authors\":\"Siyuan Shen;Feng Liu;Hanyang Wang;Aimin Zhou\",\"doi\":\"10.1109/TAFFC.2025.3558222\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Emotion recognition in conversation has attained increasing attention for perceiving user emotion in practical conversational applications. Conversational utterances spoken alternately by different speakers inspire most studies to leverage speaker information based on golden speaker labels. In this work, we challenge the existing paradigm of utilizing available speaker labels with a more realistic scenario, where the speaker identity of each utterance is unknown during inference. We propose Progressive Contrastive Deep Supervision for multimodal emotion recognition in conversation (PCDS), incorporating speaker diarization and emotion recognition into one unified framework. To facilitate joint task learning, we inject speaker and emotion bias into the network progressively via contrastive deep supervision, with the task-irrelevant contrast being the intermediate transition. To obtain explicit speaker dependency, we propose a speaker contrast and clustering module (SCC) to endow the capability of partitioning speakers into groups even when neither speaker label nor number of speakers is known as a priori. Experiments on two ERC benchmarks, including IEMOCAP and MELD demonstrate the effectiveness of the proposed method. 
We also show that progressive contrastive deep supervision helps reconcile the underlying tension between speaker diarization and emotion recognition.\",\"PeriodicalId\":13131,\"journal\":{\"name\":\"IEEE Transactions on Affective Computing\",\"volume\":\"16 3\",\"pages\":\"2261-2273\"},\"PeriodicalIF\":9.8000,\"publicationDate\":\"2025-04-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Affective Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10949847/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Affective Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10949847/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Towards Speaker-Unknown Emotion Recognition in Conversation via Progressive Contrastive Deep Supervision
Emotion recognition in conversation has attained increasing attention for perceiving user emotion in practical conversational applications. Conversational utterances spoken alternately by different speakers inspire most studies to leverage speaker information based on golden speaker labels. In this work, we challenge the existing paradigm of utilizing available speaker labels with a more realistic scenario, where the speaker identity of each utterance is unknown during inference. We propose Progressive Contrastive Deep Supervision for multimodal emotion recognition in conversation (PCDS), incorporating speaker diarization and emotion recognition into one unified framework. To facilitate joint task learning, we inject speaker and emotion bias into the network progressively via contrastive deep supervision, with the task-irrelevant contrast being the intermediate transition. To obtain explicit speaker dependency, we propose a speaker contrast and clustering module (SCC) to endow the capability of partitioning speakers into groups even when neither speaker label nor number of speakers is known as a priori. Experiments on two ERC benchmarks, including IEMOCAP and MELD demonstrate the effectiveness of the proposed method. We also show that progressive contrastive deep supervision helps reconcile the underlying tension between speaker diarization and emotion recognition.
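The abstract gives no implementation details, but contrastive deep supervision generally means attaching contrastive losses to intermediate layers rather than only to the output. The sketch below is a minimal, hypothetical PyTorch illustration under that reading: a SupCon-style loss (Khosla et al., 2020) applied to shallow features with speaker labels and to deep features with emotion labels, mirroring the progressive speaker-then-emotion bias described above. The function, the layer placement, and the toy labels are assumptions, not the authors' PCDS implementation.

```python
import torch
import torch.nn.functional as F

def supcon_loss(features: torch.Tensor, labels: torch.Tensor,
                temperature: float = 0.1) -> torch.Tensor:
    """Supervised contrastive loss over a batch: samples that share a
    label are positives for each other; all other samples are negatives."""
    feats = F.normalize(features, dim=1)
    sim = feats @ feats.T / temperature                   # (N, N) scaled cosine similarities
    n = feats.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=feats.device)
    sim = sim.masked_fill(self_mask, float("-inf"))       # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_count = pos_mask.sum(dim=1)
    anchors = pos_count > 0                               # anchors with at least one positive
    # Zero out non-positive entries before summing (avoids -inf * 0 = NaN).
    per_anchor = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)
    return (per_anchor[anchors] / pos_count[anchors]).mean()

# Hypothetical progressive schedule: shallow features contrast on speaker
# identity, deep features on emotion, added to the usual classification loss.
shallow, deep = torch.randn(8, 64), torch.randn(8, 64)    # stand-in layer features
speaker_ids = torch.tensor([0, 0, 1, 1, 0, 1, 0, 1])
emotion_ids = torch.tensor([2, 0, 2, 1, 0, 1, 2, 0])
aux_loss = supcon_loss(shallow, speaker_ids) + supcon_loss(deep, emotion_ids)
print(float(aux_loss))
```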
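Likewise, partitioning utterances by speaker when the number of speakers is unknown is, at bottom, clustering with an unknown cluster count. As a rough stand-in for the SCC module (whose internals the abstract does not specify), the snippet below uses scikit-learn's agglomerative clustering with a cosine distance threshold in place of a fixed cluster count; the threshold value and the toy embeddings are illustrative only.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def group_speakers(utterance_embs: np.ndarray,
                   distance_threshold: float = 0.5) -> np.ndarray:
    """Group utterance embeddings by (unknown) speaker: clusters merge until
    the average cosine distance between them exceeds the threshold, so the
    number of speakers falls out of the data rather than being given."""
    clusterer = AgglomerativeClustering(
        n_clusters=None,                  # number of speakers not known a priori
        metric="cosine",                  # scikit-learn >= 1.2 (older versions use affinity=)
        linkage="average",
        distance_threshold=distance_threshold,
    )
    return clusterer.fit_predict(utterance_embs)

# Toy check: six embeddings drawn around two distinct speaker centroids.
rng = np.random.default_rng(0)
embs = np.vstack([rng.normal(loc=m, scale=0.05, size=(3, 8)) for m in (1.0, -1.0)])
print(group_speakers(embs))               # two groups recovered, e.g. [0 0 0 1 1 1]
```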
Journal Introduction:
The IEEE Transactions on Affective Computing is an international and interdisciplinary journal. Its primary goal is to share research findings on the development of systems capable of recognizing, interpreting, and simulating human emotions and related affective phenomena. The journal publishes original research on the underlying principles and theories that explain how and why affective factors shape human-technology interactions. It also focuses on how techniques for sensing and simulating affect can enhance our understanding of human emotions and processes. Additionally, the journal explores the design, implementation, and evaluation of systems that prioritize the consideration of affect in their usability. We also welcome surveys of existing work that provide new perspectives on the historical and future directions of this field.