MPFBL: Modal pairing-based cross-fusion bootstrap learning for multimodal emotion recognition
Yong Zhang, Yongqing Liu, HongKai Li, Cheng Cheng, Ziyu Jia
Neurocomputing, Volume 658, Article 131577 (published 2025-09-30). DOI: 10.1016/j.neucom.2025.131577
Available at: https://www.sciencedirect.com/science/article/pii/S0925231225022490
Abstract
Multimodal emotion recognition (MER), a key technology in human-computer interaction, deciphers complex emotional states by integrating heterogeneous data sources such as text, audio, and video. However, previous works either retain only modality-specific (private) information or focus solely on shared (public) information, and the two strategies are difficult to reconcile. As a result, existing methods often lose critical modality-specific attributes during feature extraction, or struggle to align semantically divergent representations across modalities during fusion, leading to incomplete modeling of emotional context. To address these challenges, we propose the Modal Pairing-based Cross-Fusion Bootstrap Learning (MPFBL) framework, which integrates modal feature extraction, cross-modal bootstrap learning, and multimodal cross-fusion into a unified approach. First, the feature extraction module employs a Uni-Modal Transformer (UMT) and a Multi-Modal Transformer (MMT) to jointly capture modality-specific and modality-invariant information; this addresses feature degradation in single-encoder paradigms and alleviates inter-modal heterogeneity by explicitly distinguishing between modality-specific and shared representations. Next, cross-modal bootstrap learning applies attention-guided optimization to align heterogeneous modalities and refine modality-specific representations, enhancing semantic consistency. Finally, a multimodal cross-fusion network integrates convolutional mapping and adaptive attention to dynamically weight cross-modal dependencies, mitigating the spatial-semantic misalignment that inter-modal heterogeneity induces during fusion. Extensive experiments on CMU-MOSEI and CMU-MOSI show that MPFBL outperforms state-of-the-art methods, and ablation studies further confirm its effectiveness.
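To make the three-stage pipeline described above concrete, the following is a minimal, illustrative PyTorch sketch of a unimodal-encoding plus cross-attention-fusion architecture in the spirit of the abstract. All class names, layer sizes, head counts, and the CMU-MOSEI-style feature dimensions are assumptions introduced here for illustration; the paper's actual UMT/MMT configurations, bootstrap-learning objective, and fusion network are not specified in the abstract.

# Illustrative sketch only: module names, hidden sizes, and head counts are
# assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class UniModalEncoder(nn.Module):
    """Modality-specific Transformer encoder (stands in for the UMT)."""

    def __init__(self, in_dim: int, d_model: int = 128, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, in_dim) -> (batch, time, d_model)
        return self.encoder(self.proj(x))


class CrossModalFusion(nn.Module):
    """Cross-attention fusion: one modality queries another, then the result
    passes through a 1-D convolutional mapping (a stand-in for the
    convolutional-mapping + adaptive-attention fusion named in the abstract)."""

    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv_map = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)

    def forward(self, query_seq: torch.Tensor, key_seq: torch.Tensor) -> torch.Tensor:
        fused, _ = self.cross_attn(query_seq, key_seq, key_seq)       # align modalities
        fused = self.conv_map(fused.transpose(1, 2)).transpose(1, 2)  # convolutional mapping
        return fused


class ToyMultimodalModel(nn.Module):
    """Minimal text-audio-video pipeline: encode each modality, fuse text with
    audio and with video via cross-attention, then classify pooled features."""

    def __init__(self, dims: dict, n_classes: int = 7, d_model: int = 128):
        super().__init__()
        self.encoders = nn.ModuleDict({m: UniModalEncoder(d, d_model) for m, d in dims.items()})
        self.fuse_text_audio = CrossModalFusion(d_model)
        self.fuse_text_video = CrossModalFusion(d_model)
        self.head = nn.Linear(2 * d_model, n_classes)

    def forward(self, batch: dict) -> torch.Tensor:
        h = {m: enc(batch[m]) for m, enc in self.encoders.items()}
        ta = self.fuse_text_audio(h["text"], h["audio"]).mean(dim=1)  # pool over time
        tv = self.fuse_text_video(h["text"], h["video"]).mean(dim=1)
        return self.head(torch.cat([ta, tv], dim=-1))


if __name__ == "__main__":
    dims = {"text": 300, "audio": 74, "video": 35}  # CMU-MOSEI-style feature sizes (assumed)
    model = ToyMultimodalModel(dims)
    batch = {m: torch.randn(2, 20, d) for m, d in dims.items()}
    print(model(batch).shape)  # torch.Size([2, 7])

Running the script prints the logits shape for a toy two-sample batch. The real MPFBL model would additionally include the shared MMT encoder and the attention-guided bootstrap-learning objective described above, which this sketch omits.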
Journal introduction:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Its essential topics span neurocomputing theory, practice, and applications.