{"title":"HRDF-MER: Hierarchical feature refinement and cascaded dynamic fusion for multimodal emotion recognition","authors":"Jianjun Lei , Zhenmei Mu , Ying Wang","doi":"10.1016/j.csl.2026.101978","DOIUrl":null,"url":null,"abstract":"<div><div>Multimodal Emotion Recognition (MER) is challenged by modality misalignment, shallow temporal cue modeling, and inefficient fusion. This paper proposes HRDF-MER, a framework that integrates hierarchical refinement and cascaded dynamic fusion for more robust emotion recognition. To improve cross-modal alignment and unimodal representation, HRDF-MER introduces a novel Hierarchical Cross-modal Feature Refinement (HCFR) strategy, which integrates Cross-modal Adaptive Alignment (CAA) and Hierarchical Feature Enhancement (HFE). The CAA module employs multi-head cross-attention to construct hierarchical correlation matrices for precise acoustic-text alignment, and the HFE employs a Transformer with cross-modal residual connections to further enhance unimodal representations for robust feature learning. We further propose a Cascaded Multimodal Dynamic Fusion (CMDF) strategy, where a cross-attention encoder captures fine-grained inter-modal dependencies and a gated fusion unit adaptively weights modalities to progressively produce highly discriminative multimodal representations. Moreover, a multi-objective training scheme is proposed to jointly optimize feature alignment and classification by integrating Cross-modal Label Contrastive Loss (CLC Loss) with cross-entropy loss. Extensive experiments on the IEMOCAP and MELD datasets demonstrate that HRDF-MER significantly outperforms state-of-the-art models, while ablation studies further confirm the effectiveness and necessity of each proposed component.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101978"},"PeriodicalIF":3.4000,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0885230826000410","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2026/3/6 0:00:00","PubModel":"Epub","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Abstract
Multimodal Emotion Recognition (MER) is challenged by modality misalignment, shallow temporal cue modeling, and inefficient fusion. This paper proposes HRDF-MER, a framework that integrates hierarchical refinement and cascaded dynamic fusion for more robust emotion recognition. To improve cross-modal alignment and unimodal representation, HRDF-MER introduces a novel Hierarchical Cross-modal Feature Refinement (HCFR) strategy, which combines Cross-modal Adaptive Alignment (CAA) and Hierarchical Feature Enhancement (HFE). The CAA module employs multi-head cross-attention to construct hierarchical correlation matrices for precise acoustic-text alignment, and the HFE module employs a Transformer with cross-modal residual connections to further enhance unimodal representations for robust feature learning. We further propose a Cascaded Multimodal Dynamic Fusion (CMDF) strategy, where a cross-attention encoder captures fine-grained inter-modal dependencies and a gated fusion unit adaptively weights modalities to progressively produce highly discriminative multimodal representations. Moreover, a multi-objective training scheme is proposed to jointly optimize feature alignment and classification by integrating Cross-modal Label Contrastive Loss (CLC Loss) with cross-entropy loss. Extensive experiments on the IEMOCAP and MELD datasets demonstrate that HRDF-MER significantly outperforms state-of-the-art models, while ablation studies further confirm the effectiveness and necessity of each proposed component.
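The abstract describes the mechanisms only at a high level. As a concrete illustration, the sketch below shows how a multi-head cross-attention alignment step, a gated fusion unit, and a label-aware contrastive term combined with cross-entropy could be wired together in PyTorch. This is a minimal sketch under stated assumptions, not the paper's implementation: the class and function names (CrossModalAlignment, GatedFusion, label_contrastive_loss), the mean pooling, the dimensions, and the 0.5 loss weight are all illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalAlignment(nn.Module):
    """Multi-head cross-attention with a cross-modal residual connection.

    A rough stand-in for the CAA idea: text tokens query acoustic frames,
    and the attended acoustic context is folded back into the text stream.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # text: (B, L_t, D), audio: (B, L_a, D)
        attended, _ = self.attn(query=text, key=audio, value=audio)
        return self.norm(text + attended)  # residual keeps the unimodal signal


class GatedFusion(nn.Module):
    """Adaptive modality weighting: a learned gate blends the two streams."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, acoustic: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([acoustic, text], dim=-1)))
        return g * acoustic + (1.0 - g) * text


def label_contrastive_loss(z_a, z_t, labels, tau: float = 0.1):
    """Sketch of a cross-modal label-contrastive objective: acoustic and text
    embeddings of utterances sharing an emotion label are pulled together."""
    z_a, z_t = F.normalize(z_a, dim=-1), F.normalize(z_t, dim=-1)
    sim = z_a @ z_t.t() / tau                                # (B, B) similarities
    same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    return -(log_prob * same).sum(1).div(same.sum(1).clamp(min=1.0)).mean()


# Joint objective: cross-entropy on the fused vector plus the CLC-style term.
B, L_t, L_a, D, num_classes = 8, 20, 50, 256, 6   # illustrative sizes
text = torch.randn(B, L_t, D)
audio = torch.randn(B, L_a, D)
labels = torch.randint(0, num_classes, (B,))

align = CrossModalAlignment(D)
fuse = GatedFusion(D)
clf = nn.Linear(D, num_classes)

text_enh = align(text, audio)                     # acoustic-aware text tokens
a_vec, t_vec = audio.mean(1), text_enh.mean(1)    # simple pooling for the sketch
fused = fuse(a_vec, t_vec)
loss = F.cross_entropy(clf(fused), labels) \
    + 0.5 * label_contrastive_loss(a_vec, t_vec, labels)
loss.backward()
```

The gate lets the model lean on whichever modality is more reliable for a given utterance, which matches the intuition behind the adaptive modality weighting described for CMDF; the contrastive term pushes cross-modal embeddings of same-label utterances together, complementing the classification loss.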
Journal overview:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing have become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.