Multimodal speech emotion recognition via modality constraint with hierarchical bottleneck feature fusion

IF 2.4 3区计算机科学 Q2 ACOUSTICS

Speech Communication Pub Date : 2025-07-10 DOI:10.1016/j.specom.2025.103278

Ying Wang , Jianjun Lei , Xiangwei Zhu , Tao Zhang

{"title":"Multimodal speech emotion recognition via modality constraint with hierarchical bottleneck feature fusion","authors":"Ying Wang , Jianjun Lei , Xiangwei Zhu , Tao Zhang","doi":"10.1016/j.specom.2025.103278","DOIUrl":null,"url":null,"abstract":"<div><div>Multimodal can combine different channels of information simultaneously to improve the modeling capabilities. Many recent studies focus on overcoming challenges arising from inter-modal conflicts and incomplete intra-modal learning for multimodal architectures. In this paper, we propose a scalable multimodal speech emotion recognition (SER) framework incorporating a hierarchical bottleneck feature (HBF) fusion approach. Furthermore, we design an intra-modal and inter-modal contrastive learning mechanism that enables self-supervised calibration of both modality-specific and cross-modal feature distributions. This approach achieves adaptive feature fusion and alignment while significantly reducing reliance on rigid feature alignment constraints. Meanwhile, by restricting the learning path of modality encoders, we design a modality representation constraint (MRC) method to mitigate conflicts between modalities. Furthermore, we present a modality bargaining (MB) strategy that facilitates learning within modalities through a mechanism of mutual bargaining and balance, which can avoid falling into suboptimal modal representations by allowing the learning of different modalities to perform alternating interchangeability. Our aggressive and disciplined training strategies enable our architecture to perform well on some multimodal emotion datasets such as CREMA-D, IEMOCAP, and MELD. Finally, we also conduct extensive experiments to demonstrate the effectiveness of our proposed architecture on various modal encoders and different modal combination methods.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"173 ","pages":"Article 103278"},"PeriodicalIF":2.4000,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Speech Communication","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167639325000937","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}

引用次数: 0

Abstract

Multimodal can combine different channels of information simultaneously to improve the modeling capabilities. Many recent studies focus on overcoming challenges arising from inter-modal conflicts and incomplete intra-modal learning for multimodal architectures. In this paper, we propose a scalable multimodal speech emotion recognition (SER) framework incorporating a hierarchical bottleneck feature (HBF) fusion approach. Furthermore, we design an intra-modal and inter-modal contrastive learning mechanism that enables self-supervised calibration of both modality-specific and cross-modal feature distributions. This approach achieves adaptive feature fusion and alignment while significantly reducing reliance on rigid feature alignment constraints. Meanwhile, by restricting the learning path of modality encoders, we design a modality representation constraint (MRC) method to mitigate conflicts between modalities. Furthermore, we present a modality bargaining (MB) strategy that facilitates learning within modalities through a mechanism of mutual bargaining and balance, which can avoid falling into suboptimal modal representations by allowing the learning of different modalities to perform alternating interchangeability. Our aggressive and disciplined training strategies enable our architecture to perform well on some multimodal emotion datasets such as CREMA-D, IEMOCAP, and MELD. Finally, we also conduct extensive experiments to demonstrate the effectiveness of our proposed architecture on various modal encoders and different modal combination methods.

查看原文本刊更多论文

基于层次化瓶颈特征融合的多模态语音情感识别

多模态模型可以同时结合不同的信息通道，提高建模能力。近年来，许多研究都集中在克服多模式架构的多模态冲突和不完全的内模态学习所带来的挑战。在本文中，我们提出了一个可扩展的多模态语音情感识别（SER）框架，该框架结合了分层瓶颈特征（HBF）融合方法。此外，我们设计了一个模态内和模态间的对比学习机制，使模态特定和跨模态特征分布的自监督校准成为可能。该方法实现了自适应特征融合和对齐，同时显著减少了对刚性特征对齐约束的依赖。同时，通过限制模态编码器的学习路径，设计了模态表示约束（MRC）方法来缓解模态之间的冲突。此外，我们提出了一种模态讨价还价（MB）策略，该策略通过相互讨价还价和平衡机制促进模态内的学习，通过允许不同模态的学习执行交替互换性，可以避免陷入次优模态表征。我们积极和有纪律的训练策略使我们的架构能够在一些多模态情感数据集（如CREMA-D， IEMOCAP和MELD）上表现良好。最后，我们还进行了大量的实验来证明我们提出的架构在各种模态编码器和不同模态组合方法上的有效性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Speech Communication 工程技术-计算机：跨学科应用

CiteScore

6.80

自引率

6.20%

发文量

审稿时长

19.2 weeks

期刊介绍： Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal''s primary objectives are: • to present a forum for the advancement of human and human-machine speech communication science; • to stimulate cross-fertilization between different fields of this domain; • to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.