{"title":"提取和融合情感线索,实现对话中的多模态情感预测和识别","authors":"Haoxiang Shi, Ziqi Liang, Jun Yu","doi":"arxiv-2408.04547","DOIUrl":null,"url":null,"abstract":"Emotion Prediction in Conversation (EPC) aims to forecast the emotions of\nforthcoming utterances by utilizing preceding dialogues. Previous EPC\napproaches relied on simple context modeling for emotion extraction,\noverlooking fine-grained emotion cues at the word level. Additionally, prior\nworks failed to account for the intrinsic differences between modalities,\nresulting in redundant information. To overcome these limitations, we propose\nan emotional cues extraction and fusion network, which consists of two stages:\na modality-specific learning stage that utilizes word-level labels and prosody\nlearning to construct emotion embedding spaces for each modality, and a\ntwo-step fusion stage for integrating multi-modal features. Moreover, the\nemotion features extracted by our model are also applicable to the Emotion\nRecognition in Conversation (ERC) task. Experimental results validate the\nefficacy of the proposed method, demonstrating superior performance on both\nIEMOCAP and MELD datasets.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"6 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Emotional Cues Extraction and Fusion for Multi-modal Emotion Prediction and Recognition in Conversation\",\"authors\":\"Haoxiang Shi, Ziqi Liang, Jun Yu\",\"doi\":\"arxiv-2408.04547\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Emotion Prediction in Conversation (EPC) aims to forecast the emotions of\\nforthcoming utterances by utilizing preceding dialogues. Previous EPC\\napproaches relied on simple context modeling for emotion extraction,\\noverlooking fine-grained emotion cues at the word level. Additionally, prior\\nworks failed to account for the intrinsic differences between modalities,\\nresulting in redundant information. To overcome these limitations, we propose\\nan emotional cues extraction and fusion network, which consists of two stages:\\na modality-specific learning stage that utilizes word-level labels and prosody\\nlearning to construct emotion embedding spaces for each modality, and a\\ntwo-step fusion stage for integrating multi-modal features. Moreover, the\\nemotion features extracted by our model are also applicable to the Emotion\\nRecognition in Conversation (ERC) task. 
Experimental results validate the\\nefficacy of the proposed method, demonstrating superior performance on both\\nIEMOCAP and MELD datasets.\",\"PeriodicalId\":501480,\"journal\":{\"name\":\"arXiv - CS - Multimedia\",\"volume\":\"6 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Multimedia\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2408.04547\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.04547","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Emotional Cues Extraction and Fusion for Multi-modal Emotion Prediction and Recognition in Conversation
Emotion Prediction in Conversation (EPC) aims to forecast the emotions of forthcoming utterances from the preceding dialogue. Previous EPC approaches relied on simple context modeling for emotion extraction, overlooking fine-grained emotion cues at the word level. Additionally, prior work failed to account for the intrinsic differences between modalities, resulting in redundant information. To overcome these limitations, we propose an emotional cues extraction and fusion network, which consists of two stages: a modality-specific learning stage that utilizes word-level labels and prosody learning to construct emotion embedding spaces for each modality, and a two-step fusion stage for integrating multi-modal features. Moreover, the emotion features extracted by our model are also applicable to the Emotion Recognition in Conversation (ERC) task. Experimental results validate the efficacy of the proposed method, demonstrating superior performance on both the IEMOCAP and MELD datasets.
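
The abstract only outlines the architecture at a high level, so the following is a minimal PyTorch sketch of that two-stage idea, not the authors' implementation: per-modality encoders that map each modality into its own emotion embedding space, followed by a two-step fusion (here assumed to be cross-modal attention plus gating). All module names, dimensions, the choice of text and audio as the two modalities, and the fusion operators are illustrative assumptions.

```python
# Hedged sketch of a two-stage multi-modal emotion model:
# (1) modality-specific encoders -> per-modality emotion embeddings,
# (2) two-step fusion (cross-attention, then gated combination) -> emotion logits.
# Dimensions and fusion choices are assumptions; the paper's details are not in the abstract.
import torch
import torch.nn as nn


class ModalityEncoder(nn.Module):
    """Projects pre-extracted features of one modality into an emotion embedding space."""

    def __init__(self, in_dim: int, emb_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, emb_dim),
            nn.ReLU(),
            nn.Linear(emb_dim, emb_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, in_dim) -> (batch, seq_len, emb_dim)
        return self.net(x)


class TwoStepFusion(nn.Module):
    """Step 1: cross-modal attention; step 2: gated combination of the two streams."""

    def __init__(self, emb_dim: int = 256, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(emb_dim, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * emb_dim, emb_dim)

    def forward(self, text: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # Step 1: let the text stream attend to the audio stream.
        attended, _ = self.cross_attn(query=text, key=audio, value=audio)
        # Step 2: gate the attended audio information into the text stream.
        g = torch.sigmoid(self.gate(torch.cat([text, attended], dim=-1)))
        return g * text + (1.0 - g) * attended


class EmotionModel(nn.Module):
    """Per-modality encoders -> two-step fusion -> utterance-level emotion logits."""

    def __init__(self, text_dim: int = 768, audio_dim: int = 128, n_emotions: int = 6):
        super().__init__()
        self.text_enc = ModalityEncoder(text_dim)
        self.audio_enc = ModalityEncoder(audio_dim)
        self.fusion = TwoStepFusion()
        self.classifier = nn.Linear(256, n_emotions)

    def forward(self, text_feats: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
        fused = self.fusion(self.text_enc(text_feats), self.audio_enc(audio_feats))
        # Mean-pool over the utterance sequence before classification.
        return self.classifier(fused.mean(dim=1))


if __name__ == "__main__":
    model = EmotionModel()
    text = torch.randn(2, 10, 768)   # e.g. word-level text features (hypothetical shape)
    audio = torch.randn(2, 10, 128)  # e.g. aligned prosody/audio features (hypothetical shape)
    print(model(text, audio).shape)  # -> torch.Size([2, 6])
```

The same fused utterance representation could, in principle, feed either an EPC head (predicting the next utterance's emotion) or an ERC head (classifying the current utterance), which mirrors the abstract's claim that the extracted emotion features transfer to ERC; how the paper actually shares those features is not specified here.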