MPFBL: Modal pairing-based cross-fusion bootstrap learning for multimodal emotion recognition
Yong Zhang, Yongqing Liu, HongKai Li, Cheng Cheng, Ziyu Jia
Neurocomputing, Volume 658, Article 131577 (published 2025-09-30). DOI: 10.1016/j.neucom.2025.131577
Available at: https://www.sciencedirect.com/science/article/pii/S0925231225022490
Abstract
Multimodal emotion recognition (MER), a key technology in human-computer interaction, deciphers complex emotional states by integrating heterogeneous data sources such as text, audio, and video. However, previous works either retain only modality-specific (private) information or focus solely on shared (public) information, and the two strategies are difficult to reconcile. As a result, existing methods often lose critical modality-specific attributes during feature extraction, or struggle to align semantically divergent representations across modalities during fusion, leading to incomplete modeling of emotional context. To address these challenges, we propose the Modal Pairing-based Cross-Fusion Bootstrap Learning (MPFBL) framework, which integrates modal feature extraction, cross-modal bootstrap learning, and multimodal cross-fusion into a unified approach. First, the feature extraction module employs a Uni-Modal Transformer (UMT) and a Multi-Modal Transformer (MMT) to jointly capture modality-specific and modality-invariant information; this addresses feature degradation in single-encoder paradigms and alleviates inter-modal heterogeneity by explicitly distinguishing between modality-specific and shared representations. Next, cross-modal bootstrap learning applies attention-guided optimization to align heterogeneous modalities and refine modality-specific representations, enhancing semantic consistency. Finally, a multimodal cross-fusion network integrates convolutional mapping and adaptive attention to dynamically weight cross-modal dependencies, mitigating the spatial-semantic misalignment that inter-modal heterogeneity induces during fusion. Extensive experiments on CMU-MOSEI and CMU-MOSI show that MPFBL outperforms state-of-the-art methods, and ablation studies further confirm its effectiveness.
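To make the three-stage pipeline described above concrete, the following is a minimal, illustrative PyTorch sketch of a unimodal-encoding plus cross-attention-fusion architecture in the spirit of the abstract. All class names, layer sizes, head counts, and the CMU-MOSEI-style feature dimensions are assumptions introduced here for illustration; the paper's actual UMT/MMT configurations, bootstrap-learning objective, and fusion network are not specified in the abstract.

# Illustrative sketch only: module names, hidden sizes, and head counts are
# assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class UniModalEncoder(nn.Module):
    """Modality-specific Transformer encoder (stands in for the UMT)."""

    def __init__(self, in_dim: int, d_model: int = 128, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, in_dim) -> (batch, time, d_model)
        return self.encoder(self.proj(x))


class CrossModalFusion(nn.Module):
    """Cross-attention fusion: one modality queries another, then the result
    passes through a 1-D convolutional mapping (a stand-in for the
    convolutional-mapping + adaptive-attention fusion named in the abstract)."""

    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv_map = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)

    def forward(self, query_seq: torch.Tensor, key_seq: torch.Tensor) -> torch.Tensor:
        fused, _ = self.cross_attn(query_seq, key_seq, key_seq)       # align modalities
        fused = self.conv_map(fused.transpose(1, 2)).transpose(1, 2)  # convolutional mapping
        return fused


class ToyMultimodalModel(nn.Module):
    """Minimal text-audio-video pipeline: encode each modality, fuse text with
    audio and with video via cross-attention, then classify pooled features."""

    def __init__(self, dims: dict, n_classes: int = 7, d_model: int = 128):
        super().__init__()
        self.encoders = nn.ModuleDict({m: UniModalEncoder(d, d_model) for m, d in dims.items()})
        self.fuse_text_audio = CrossModalFusion(d_model)
        self.fuse_text_video = CrossModalFusion(d_model)
        self.head = nn.Linear(2 * d_model, n_classes)

    def forward(self, batch: dict) -> torch.Tensor:
        h = {m: enc(batch[m]) for m, enc in self.encoders.items()}
        ta = self.fuse_text_audio(h["text"], h["audio"]).mean(dim=1)  # pool over time
        tv = self.fuse_text_video(h["text"], h["video"]).mean(dim=1)
        return self.head(torch.cat([ta, tv], dim=-1))


if __name__ == "__main__":
    dims = {"text": 300, "audio": 74, "video": 35}  # CMU-MOSEI-style feature sizes (assumed)
    model = ToyMultimodalModel(dims)
    batch = {m: torch.randn(2, 20, d) for m, d in dims.items()}
    print(model(batch).shape)  # torch.Size([2, 7])

Running the script prints the logits shape for a toy two-sample batch. The real MPFBL model would additionally include the shared MMT encoder and the attention-guided bootstrap-learning objective described above, which this sketch omits.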
Journal introduction:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Its essential topics span neurocomputing theory, practice, and applications.