Enhancing Modal Fusion by Alignment and Label Matching for Multimodal Emotion Recognition
Qifei Li, Yingming Gao, Yuhua Wen, Cong Wang, Ya Li
arXiv:2408.09438 · arXiv - CS - Sound · 2024-08-18
To address the performance limitations of multimodal emotion recognition (MER) that arise from inter-modal information fusion, we propose Foal-Net, a novel MER framework based on multitask learning in which fusion occurs after alignment. The framework is designed to improve the effectiveness of modality fusion and includes two auxiliary tasks: audio-video emotion alignment (AVEL) and cross-modal emotion label matching (MEM). First, AVEL aligns the emotional information in the audio and video representations through contrastive learning. A modal fusion network then integrates the aligned features. Meanwhile, MEM assesses whether the emotions of the current sample pair are the same, assisting modal information fusion and guiding the model to focus more on emotional information. Experimental results on the IEMOCAP corpus show that Foal-Net outperforms state-of-the-art methods and that emotion alignment is necessary before modal fusion.
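
The abstract describes the architecture only at a high level. Purely as an illustration of how a fuse-after-align multitask objective of this kind can be wired together, the PyTorch-style sketch below combines a contrastive audio-video alignment loss, a fusion network over the aligned features, and an auxiliary label-matching head. It is not the authors' implementation: the feature dimensions, concatenation-based fusion, the InfoNCE formulation of AVEL, the random-permutation pairing used for MEM, and the equal loss weighting are all assumptions made for the example.

import torch
import torch.nn as nn
import torch.nn.functional as F


class FoalNetSketch(nn.Module):
    """Toy stand-in for the fuse-after-align structure described in the abstract."""

    def __init__(self, audio_dim=512, video_dim=512, embed_dim=256, num_emotions=4):
        super().__init__()
        # Project each modality into a shared space used for contrastive alignment (AVEL).
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        self.video_proj = nn.Linear(video_dim, embed_dim)
        # Fusion network over the aligned features; concatenation + MLP is an assumption.
        self.fusion = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )
        self.emotion_head = nn.Linear(embed_dim, num_emotions)  # main MER task
        self.match_head = nn.Linear(2 * embed_dim, 2)           # MEM: same emotion or not

    def forward(self, audio_feat, video_feat):
        a = F.normalize(self.audio_proj(audio_feat), dim=-1)
        v = F.normalize(self.video_proj(video_feat), dim=-1)
        fused = self.fusion(torch.cat([a, v], dim=-1))
        return a, v, self.emotion_head(fused)


def alignment_loss(a, v, temperature=0.07):
    # InfoNCE-style objective pulling paired audio/video embeddings together,
    # used here as a stand-in for the AVEL auxiliary task.
    logits = a @ v.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def label_match_loss(model, a, v, labels):
    # MEM stand-in: pair each audio embedding with a randomly permuted video embedding
    # and predict whether the two samples carry the same emotion label.
    perm = torch.randperm(v.size(0), device=v.device)
    logits = model.match_head(torch.cat([a, v[perm]], dim=-1))
    targets = (labels == labels[perm]).long()
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    model = FoalNetSketch()
    audio = torch.randn(8, 512)   # placeholder utterance-level audio features
    video = torch.randn(8, 512)   # placeholder utterance-level video features
    labels = torch.randint(0, 4, (8,))

    a, v, emotion_logits = model(audio, video)
    loss = (F.cross_entropy(emotion_logits, labels)
            + alignment_loss(a, v)
            + label_match_loss(model, a, v, labels))  # equal weights are an assumption
    loss.backward()
    print(float(loss))

The point of the sketch is the ordering the abstract emphasizes: the contrastive loss acts on the per-modality embeddings before fusion, so the fusion network only ever sees representations that have already been pulled toward a shared emotional space, while the label-matching head provides an extra signal about whether a cross-modal pair carries the same emotion.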