MPFBL: Modal pairing-based cross-fusion bootstrap learning for multimodal emotion recognition

IF 6.5 | Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Yong Zhang, Yongqing Liu, HongKai Li, Cheng Cheng, Ziyu Jia
{"title":"基于模态配对的多模态情感识别交叉融合自举学习","authors":"Yong Zhang ,&nbsp;Yongqing Liu ,&nbsp;HongKai Li ,&nbsp;Cheng Cheng ,&nbsp;Ziyu Jia","doi":"10.1016/j.neucom.2025.131577","DOIUrl":null,"url":null,"abstract":"<div><div>Multimodal emotion recognition (MER), a key technology in human-computer interaction, deciphers complex emotional states by integrating heterogeneous data sources such as text, audio, and video. However, previous works either retained only private information or focused solely on public information, resulting in a conflict between the strategies used in each approach. Existing methods often lose critical modality-specific attributes during feature extraction or struggle to align semantically divergent representations across modalities during fusion, resulting in incomplete emotional context modeling. To address these challenges, we propose the Modal Pairing-based Cross-Fusion Bootstrap Learning (MPFBL) framework, which integrates modal feature extraction, cross-modal bootstrap learning, and multi-modal cross-fusion into a unified approach. Firstly, the feature extraction module employs a Uni-Modal Transformer (UMT) and a Multi-Modal Transformer (MMT) to jointly capture modality-specific and modality-invariant information, addressing feature degradation in single-encoder paradigms, while alleviating inter-modal heterogeneity by explicitly distinguishing between modality-specific and shared representations. Subsequently, cross-modal bootstrap learning employs attention-guided optimization to align heterogeneous modalities and refine modality-specific representations, enhancing semantic consistency. Finally, a multi-modal cross-fusion network integrates convolutional mapping and adaptive attention to dynamically weight cross-modal dependencies, mitigating spatial-semantic misalignment induced by inter-modal heterogeneity in fusion processes. Extensive experimental results on CMU-MOSEI and CMU-MOSI demonstrate that MPFBL outperforms state-of-the-art methods, while ablation studies further confirm its effectiveness.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"658 ","pages":"Article 131577"},"PeriodicalIF":6.5000,"publicationDate":"2025-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"MPFBL: Modal pairing-based cross-fusion bootstrap learning for multimodal emotion recognition\",\"authors\":\"Yong Zhang ,&nbsp;Yongqing Liu ,&nbsp;HongKai Li ,&nbsp;Cheng Cheng ,&nbsp;Ziyu Jia\",\"doi\":\"10.1016/j.neucom.2025.131577\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Multimodal emotion recognition (MER), a key technology in human-computer interaction, deciphers complex emotional states by integrating heterogeneous data sources such as text, audio, and video. However, previous works either retained only private information or focused solely on public information, resulting in a conflict between the strategies used in each approach. Existing methods often lose critical modality-specific attributes during feature extraction or struggle to align semantically divergent representations across modalities during fusion, resulting in incomplete emotional context modeling. To address these challenges, we propose the Modal Pairing-based Cross-Fusion Bootstrap Learning (MPFBL) framework, which integrates modal feature extraction, cross-modal bootstrap learning, and multi-modal cross-fusion into a unified approach. 
Firstly, the feature extraction module employs a Uni-Modal Transformer (UMT) and a Multi-Modal Transformer (MMT) to jointly capture modality-specific and modality-invariant information, addressing feature degradation in single-encoder paradigms, while alleviating inter-modal heterogeneity by explicitly distinguishing between modality-specific and shared representations. Subsequently, cross-modal bootstrap learning employs attention-guided optimization to align heterogeneous modalities and refine modality-specific representations, enhancing semantic consistency. Finally, a multi-modal cross-fusion network integrates convolutional mapping and adaptive attention to dynamically weight cross-modal dependencies, mitigating spatial-semantic misalignment induced by inter-modal heterogeneity in fusion processes. Extensive experimental results on CMU-MOSEI and CMU-MOSI demonstrate that MPFBL outperforms state-of-the-art methods, while ablation studies further confirm its effectiveness.</div></div>\",\"PeriodicalId\":19268,\"journal\":{\"name\":\"Neurocomputing\",\"volume\":\"658 \",\"pages\":\"Article 131577\"},\"PeriodicalIF\":6.5000,\"publicationDate\":\"2025-09-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Neurocomputing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0925231225022490\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231225022490","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Multimodal emotion recognition (MER), a key technology in human-computer interaction, deciphers complex emotional states by integrating heterogeneous data sources such as text, audio, and video. However, previous works either retained only private information or focused solely on public information, resulting in a conflict between the strategies used in each approach. Existing methods often lose critical modality-specific attributes during feature extraction or struggle to align semantically divergent representations across modalities during fusion, resulting in incomplete emotional context modeling. To address these challenges, we propose the Modal Pairing-based Cross-Fusion Bootstrap Learning (MPFBL) framework, which integrates modal feature extraction, cross-modal bootstrap learning, and multi-modal cross-fusion into a unified approach. Firstly, the feature extraction module employs a Uni-Modal Transformer (UMT) and a Multi-Modal Transformer (MMT) to jointly capture modality-specific and modality-invariant information, addressing feature degradation in single-encoder paradigms, while alleviating inter-modal heterogeneity by explicitly distinguishing between modality-specific and shared representations. Subsequently, cross-modal bootstrap learning employs attention-guided optimization to align heterogeneous modalities and refine modality-specific representations, enhancing semantic consistency. Finally, a multi-modal cross-fusion network integrates convolutional mapping and adaptive attention to dynamically weight cross-modal dependencies, mitigating spatial-semantic misalignment induced by inter-modal heterogeneity in fusion processes. Extensive experimental results on CMU-MOSEI and CMU-MOSI demonstrate that MPFBL outperforms state-of-the-art methods, while ablation studies further confirm its effectiveness.
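To give a concrete picture of the three-stage pipeline the abstract outlines, the sketch below lays out one plausible PyTorch arrangement of per-modality transformer encoders, attention-guided cross-modal alignment, and adaptive cross-fusion. This is an illustrative sketch only, not the authors' implementation: the class names, dimensions, gating scheme, and the text-anchored pairing are assumptions, and the actual UMT/MMT design and bootstrap objective are described in the full paper.

```python
# Illustrative sketch of a modal-pairing cross-fusion pipeline (assumed layout,
# not the MPFBL authors' released code).
import torch
import torch.nn as nn


class UniModalTransformer(nn.Module):
    """Per-modality encoder for modality-specific features (UMT role)."""

    def __init__(self, dim: int = 128, heads: int = 4, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (B, T, dim)
        return self.encoder(x)


class CrossFusion(nn.Module):
    """Attention-guided pairing of two modalities plus a 1-D convolutional
    mapping and a learned gate, loosely mirroring the cross-fusion idea."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, query_mod: torch.Tensor, key_mod: torch.Tensor) -> torch.Tensor:
        # Align the key modality to the query modality via cross-attention.
        aligned, _ = self.cross_attn(query_mod, key_mod, key_mod)
        # Convolutional mapping over the temporal axis.
        mapped = self.conv(aligned.transpose(1, 2)).transpose(1, 2)
        # Adaptive weighting between the mapped and original streams.
        weight = torch.sigmoid(self.gate(torch.cat([query_mod, mapped], dim=-1)))
        return weight * mapped + (1 - weight) * query_mod


class MPFBLSketch(nn.Module):
    """End-to-end sketch: encode text/audio/video, pair-fuse, then classify."""

    def __init__(self, dim: int = 128, num_classes: int = 2):
        super().__init__()
        self.text_enc = UniModalTransformer(dim)
        self.audio_enc = UniModalTransformer(dim)
        self.video_enc = UniModalTransformer(dim)
        self.fuse_ta = CrossFusion(dim)  # text paired with audio
        self.fuse_tv = CrossFusion(dim)  # text paired with video
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, text, audio, video):
        t, a, v = self.text_enc(text), self.audio_enc(audio), self.video_enc(video)
        fused = torch.cat([self.fuse_ta(t, a), self.fuse_tv(t, v)], dim=-1)
        return self.head(fused.mean(dim=1))  # pool over time, predict emotion


if __name__ == "__main__":
    model = MPFBLSketch()
    out = model(torch.randn(2, 20, 128), torch.randn(2, 20, 128), torch.randn(2, 20, 128))
    print(out.shape)  # torch.Size([2, 2])
```

The gated residual in CrossFusion stands in for the adaptive attention weighting the abstract mentions; the shared multi-modal transformer (MMT) branch and the bootstrap learning objective are omitted here because the abstract does not specify them.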
Source journal: Neurocomputing (Engineering & Technology - Computer Science: Artificial Intelligence)
CiteScore: 13.10
Self-citation rate: 10.00%
Articles published: 1382
Review time: 70 days
Aims and scope: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.