{"title":"运用递进对比深度监督研究谈话中说话人未知情绪识别","authors":"Siyuan Shen;Feng Liu;Hanyang Wang;Aimin Zhou","doi":"10.1109/TAFFC.2025.3558222","DOIUrl":null,"url":null,"abstract":"Emotion recognition in conversation has attained increasing attention for perceiving user emotion in practical conversational applications. Conversational utterances spoken alternately by different speakers inspire most studies to leverage speaker information based on golden speaker labels. In this work, we challenge the existing paradigm of utilizing available speaker labels with a more realistic scenario, where the speaker identity of each utterance is unknown during inference. We propose Progressive Contrastive Deep Supervision for multimodal emotion recognition in conversation (PCDS), incorporating speaker diarization and emotion recognition into one unified framework. To facilitate joint task learning, we inject speaker and emotion bias into the network progressively via contrastive deep supervision, with the task-irrelevant contrast being the intermediate transition. To obtain explicit speaker dependency, we propose a speaker contrast and clustering module (SCC) to endow the capability of partitioning speakers into groups even when neither speaker label nor number of speakers is known as a priori. Experiments on two ERC benchmarks, including IEMOCAP and MELD demonstrate the effectiveness of the proposed method. We also show that progressive contrastive deep supervision helps reconcile the underlying tension between speaker diarization and emotion recognition.","PeriodicalId":13131,"journal":{"name":"IEEE Transactions on Affective Computing","volume":"16 3","pages":"2261-2273"},"PeriodicalIF":9.8000,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Towards Speaker-Unknown Emotion Recognition in Conversation via Progressive Contrastive Deep Supervision\",\"authors\":\"Siyuan Shen;Feng Liu;Hanyang Wang;Aimin Zhou\",\"doi\":\"10.1109/TAFFC.2025.3558222\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Emotion recognition in conversation has attained increasing attention for perceiving user emotion in practical conversational applications. Conversational utterances spoken alternately by different speakers inspire most studies to leverage speaker information based on golden speaker labels. In this work, we challenge the existing paradigm of utilizing available speaker labels with a more realistic scenario, where the speaker identity of each utterance is unknown during inference. We propose Progressive Contrastive Deep Supervision for multimodal emotion recognition in conversation (PCDS), incorporating speaker diarization and emotion recognition into one unified framework. To facilitate joint task learning, we inject speaker and emotion bias into the network progressively via contrastive deep supervision, with the task-irrelevant contrast being the intermediate transition. To obtain explicit speaker dependency, we propose a speaker contrast and clustering module (SCC) to endow the capability of partitioning speakers into groups even when neither speaker label nor number of speakers is known as a priori. Experiments on two ERC benchmarks, including IEMOCAP and MELD demonstrate the effectiveness of the proposed method. 
We also show that progressive contrastive deep supervision helps reconcile the underlying tension between speaker diarization and emotion recognition.\",\"PeriodicalId\":13131,\"journal\":{\"name\":\"IEEE Transactions on Affective Computing\",\"volume\":\"16 3\",\"pages\":\"2261-2273\"},\"PeriodicalIF\":9.8000,\"publicationDate\":\"2025-04-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Affective Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10949847/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Affective Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10949847/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Towards Speaker-Unknown Emotion Recognition in Conversation via Progressive Contrastive Deep Supervision
Emotion recognition in conversation has attained increasing attention for perceiving user emotion in practical conversational applications. Conversational utterances spoken alternately by different speakers inspire most studies to leverage speaker information based on golden speaker labels. In this work, we challenge the existing paradigm of utilizing available speaker labels with a more realistic scenario, where the speaker identity of each utterance is unknown during inference. We propose Progressive Contrastive Deep Supervision for multimodal emotion recognition in conversation (PCDS), incorporating speaker diarization and emotion recognition into one unified framework. To facilitate joint task learning, we inject speaker and emotion bias into the network progressively via contrastive deep supervision, with the task-irrelevant contrast being the intermediate transition. To obtain explicit speaker dependency, we propose a speaker contrast and clustering module (SCC) to endow the capability of partitioning speakers into groups even when neither speaker label nor number of speakers is known as a priori. Experiments on two ERC benchmarks, including IEMOCAP and MELD demonstrate the effectiveness of the proposed method. We also show that progressive contrastive deep supervision helps reconcile the underlying tension between speaker diarization and emotion recognition.
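The abstract gives no implementation details, but contrastive deep supervision generally means attaching contrastive losses to intermediate layers rather than only to the output. The sketch below is a minimal, hypothetical PyTorch illustration under that reading: a SupCon-style loss (Khosla et al., 2020) applied to shallow features with speaker labels and to deep features with emotion labels, mirroring the progressive speaker-then-emotion bias described above. The function, the layer placement, and the toy labels are assumptions, not the authors' PCDS implementation.

```python
import torch
import torch.nn.functional as F

def supcon_loss(features: torch.Tensor, labels: torch.Tensor,
                temperature: float = 0.1) -> torch.Tensor:
    """Supervised contrastive loss over a batch: samples that share a
    label are positives for each other; all other samples are negatives."""
    feats = F.normalize(features, dim=1)
    sim = feats @ feats.T / temperature                   # (N, N) scaled cosine similarities
    n = feats.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=feats.device)
    sim = sim.masked_fill(self_mask, float("-inf"))       # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_count = pos_mask.sum(dim=1)
    anchors = pos_count > 0                               # anchors with at least one positive
    # Zero out non-positive entries before summing (avoids -inf * 0 = NaN).
    per_anchor = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)
    return (per_anchor[anchors] / pos_count[anchors]).mean()

# Hypothetical progressive schedule: shallow features contrast on speaker
# identity, deep features on emotion, added to the usual classification loss.
shallow, deep = torch.randn(8, 64), torch.randn(8, 64)    # stand-in layer features
speaker_ids = torch.tensor([0, 0, 1, 1, 0, 1, 0, 1])
emotion_ids = torch.tensor([2, 0, 2, 1, 0, 1, 2, 0])
aux_loss = supcon_loss(shallow, speaker_ids) + supcon_loss(deep, emotion_ids)
print(float(aux_loss))
```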
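Likewise, partitioning utterances by speaker when the number of speakers is unknown is, at bottom, clustering with an unknown cluster count. As a rough stand-in for the SCC module (whose internals the abstract does not specify), the snippet below uses scikit-learn's agglomerative clustering with a cosine distance threshold in place of a fixed cluster count; the threshold value and the toy embeddings are illustrative only.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def group_speakers(utterance_embs: np.ndarray,
                   distance_threshold: float = 0.5) -> np.ndarray:
    """Group utterance embeddings by (unknown) speaker: clusters merge until
    the average cosine distance between them exceeds the threshold, so the
    number of speakers falls out of the data rather than being given."""
    clusterer = AgglomerativeClustering(
        n_clusters=None,                  # number of speakers not known a priori
        metric="cosine",                  # scikit-learn >= 1.2 (older versions use affinity=)
        linkage="average",
        distance_threshold=distance_threshold,
    )
    return clusterer.fit_predict(utterance_embs)

# Toy check: six embeddings drawn around two distinct speaker centroids.
rng = np.random.default_rng(0)
embs = np.vstack([rng.normal(loc=m, scale=0.05, size=(3, 8)) for m in (1.0, -1.0)])
print(group_speakers(embs))               # two groups recovered, e.g. [0 0 0 1 1 1]
```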
Journal Introduction:
The IEEE Transactions on Affective Computing is an international and interdisciplinary journal. Its primary goal is to share research findings on the development of systems capable of recognizing, interpreting, and simulating human emotions and related affective phenomena. The journal publishes original research on the underlying principles and theories that explain how and why affective factors shape human-technology interactions. It also focuses on how techniques for sensing and simulating affect can enhance our understanding of human emotions and processes. Additionally, the journal explores the design, implementation, and evaluation of systems that prioritize the consideration of affect in their usability. We also welcome surveys of existing work that provide new perspectives on the historical and future directions of this field.