{"title":"HRDF-MER: Hierarchical feature refinement and cascaded dynamic fusion for multimodal emotion recognition","authors":"Jianjun Lei , Zhenmei Mu , Ying Wang","doi":"10.1016/j.csl.2026.101978","DOIUrl":null,"url":null,"abstract":"<div><div>Multimodal Emotion Recognition (MER) is challenged by modality misalignment, shallow temporal cue modeling, and inefficient fusion. This paper proposes HRDF-MER, a framework that integrates hierarchical refinement and cascaded dynamic fusion for more robust emotion recognition. To improve cross-modal alignment and unimodal representation, HRDF-MER introduces a novel Hierarchical Cross-modal Feature Refinement (HCFR) strategy, which integrates Cross-modal Adaptive Alignment (CAA) and Hierarchical Feature Enhancement (HFE). The CAA module employs multi-head cross-attention to construct hierarchical correlation matrices for precise acoustic-text alignment, and the HFE employs a Transformer with cross-modal residual connections to further enhance unimodal representations for robust feature learning. We further propose a Cascaded Multimodal Dynamic Fusion (CMDF) strategy, where a cross-attention encoder captures fine-grained inter-modal dependencies and a gated fusion unit adaptively weights modalities to progressively produce highly discriminative multimodal representations. Moreover, a multi-objective training scheme is proposed to jointly optimize feature alignment and classification by integrating Cross-modal Label Contrastive Loss (CLC Loss) with cross-entropy loss. Extensive experiments on the IEMOCAP and MELD datasets demonstrate that HRDF-MER significantly outperforms state-of-the-art models, while ablation studies further confirm the effectiveness and necessity of each proposed component.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101978"},"PeriodicalIF":3.4000,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0885230826000410","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2026/3/6 0:00:00","PubModel":"Epub","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Abstract
Multimodal Emotion Recognition (MER) is challenged by modality misalignment, shallow temporal cue modeling, and inefficient fusion. This paper proposes HRDF-MER, a framework that integrates hierarchical refinement and cascaded dynamic fusion for more robust emotion recognition. To improve cross-modal alignment and unimodal representation, HRDF-MER introduces a novel Hierarchical Cross-modal Feature Refinement (HCFR) strategy, which combines Cross-modal Adaptive Alignment (CAA) and Hierarchical Feature Enhancement (HFE). The CAA module employs multi-head cross-attention to construct hierarchical correlation matrices for precise acoustic-text alignment, and the HFE module employs a Transformer with cross-modal residual connections to further enhance unimodal representations for robust feature learning. We further propose a Cascaded Multimodal Dynamic Fusion (CMDF) strategy, where a cross-attention encoder captures fine-grained inter-modal dependencies and a gated fusion unit adaptively weights modalities to progressively produce highly discriminative multimodal representations. Moreover, a multi-objective training scheme is proposed to jointly optimize feature alignment and classification by integrating Cross-modal Label Contrastive Loss (CLC Loss) with cross-entropy loss. Extensive experiments on the IEMOCAP and MELD datasets demonstrate that HRDF-MER significantly outperforms state-of-the-art models, while ablation studies further confirm the effectiveness and necessity of each proposed component.
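The abstract describes the mechanisms only at a high level. As a concrete illustration, the sketch below shows how a multi-head cross-attention alignment step, a gated fusion unit, and a label-aware contrastive term combined with cross-entropy could be wired together in PyTorch. This is a minimal sketch under stated assumptions, not the paper's implementation: the class and function names (CrossModalAlignment, GatedFusion, label_contrastive_loss), the mean pooling, the dimensions, and the 0.5 loss weight are all illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalAlignment(nn.Module):
    """Multi-head cross-attention with a cross-modal residual connection.

    A rough stand-in for the CAA idea: text tokens query acoustic frames,
    and the attended acoustic context is folded back into the text stream.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # text: (B, L_t, D), audio: (B, L_a, D)
        attended, _ = self.attn(query=text, key=audio, value=audio)
        return self.norm(text + attended)  # residual keeps the unimodal signal


class GatedFusion(nn.Module):
    """Adaptive modality weighting: a learned gate blends the two streams."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, acoustic: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([acoustic, text], dim=-1)))
        return g * acoustic + (1.0 - g) * text


def label_contrastive_loss(z_a, z_t, labels, tau: float = 0.1):
    """Sketch of a cross-modal label-contrastive objective: acoustic and text
    embeddings of utterances sharing an emotion label are pulled together."""
    z_a, z_t = F.normalize(z_a, dim=-1), F.normalize(z_t, dim=-1)
    sim = z_a @ z_t.t() / tau                                # (B, B) similarities
    same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    return -(log_prob * same).sum(1).div(same.sum(1).clamp(min=1.0)).mean()


# Joint objective: cross-entropy on the fused vector plus the CLC-style term.
B, L_t, L_a, D, num_classes = 8, 20, 50, 256, 6   # illustrative sizes
text = torch.randn(B, L_t, D)
audio = torch.randn(B, L_a, D)
labels = torch.randint(0, num_classes, (B,))

align = CrossModalAlignment(D)
fuse = GatedFusion(D)
clf = nn.Linear(D, num_classes)

text_enh = align(text, audio)                     # acoustic-aware text tokens
a_vec, t_vec = audio.mean(1), text_enh.mean(1)    # simple pooling for the sketch
fused = fuse(a_vec, t_vec)
loss = F.cross_entropy(clf(fused), labels) \
    + 0.5 * label_contrastive_loss(a_vec, t_vec, labels)
loss.backward()
```

The gate lets the model lean on whichever modality is more reliable for a given utterance, which matches the intuition behind the adaptive modality weighting described for CMDF; the contrastive term pushes cross-modal embeddings of same-label utterances together, complementing the classification loss.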
Journal overview:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing have become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.