MER-CAPF: Audio-text emotion recognition through cross-attention mechanism and multi-granularity pooling strategy

Pattern Recognition Letters · IF 3.3 · JCR Q2 (Computer Science, Artificial Intelligence) · CAS Region 3 (Computer Science)
Published: 2026-03-01 (Epub: 2026-01-13) · DOI: 10.1016/j.patrec.2026.01.008
Chengming Chen, Pengyuan Liu, Zhicheng Dong, Zhuo He, Zhijian Li
Citations: 0

Abstract

In the field of Human–Computer Interaction (HCI), emotion recognition is regarded as a critical yet challenging task due to its multimodal nature and limitations in data acquisition. To recognize multimodal emotional information such as speech and text accurately, this paper proposes a novel multimodal emotion recognition framework, MER-CAPF (Multimodal Emotion Recognition with Cross-Attention and Pooling Fusion). The framework employs a hierarchically frozen BERT model for the text modality and a depthwise separable convolutional neural network (DSCNN) combined with a Bi-LSTM for the audio modality. During the feature fusion stage, a multi-head cross-attention mechanism and a multi-granularity pooling strategy are introduced to fully capture semantic and acoustic associations across modalities. In addition, the model incorporates parallel modality encoders with a progressive modality alignment mechanism to achieve synergistic alignment and deep interaction between speech and text features. Experiments on three public benchmark datasets (IEMOCAP, MELD, and CMU-MOSEI) show that MER-CAPF achieves accuracies of 74.73%, 63.26%, and 67.38%, respectively, outperforming most existing methods and matching recent state-of-the-art models, thereby validating the effectiveness and robustness of the proposed framework.
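For readers unfamiliar with the two fusion operations named in the abstract, the NumPy sketch below illustrates the general idea, not the authors' implementation: a cross-attention step in which text features act as queries over audio keys/values, followed by a "multi-granularity" pooling that condenses the fused sequence with both coarse (mean) and salient (max) statistics. All shapes, dimensions, and function names are illustrative assumptions; the paper's model uses multiple attention heads and learned projections, which are omitted here for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text, audio, d_k):
    # text: (T_t, d) queries; audio: (T_a, d) keys/values.
    # Learned Q/K/V projections are omitted in this sketch.
    scores = text @ audio.T / np.sqrt(d_k)   # (T_t, T_a) similarity
    weights = softmax(scores, axis=-1)       # each text step attends over audio steps
    return weights @ audio                   # (T_t, d) audio-informed text features

def multi_granularity_pool(seq):
    # Condense a variable-length sequence into one fixed-size vector by
    # concatenating mean (global trend) and max (salient peaks) over time.
    return np.concatenate([seq.mean(axis=0), seq.max(axis=0)])

rng = np.random.default_rng(0)
text_feats  = rng.standard_normal((12, 64))   # 12 text tokens, 64-dim
audio_feats = rng.standard_normal((30, 64))   # 30 audio frames, 64-dim

fused  = cross_attention(text_feats, audio_feats, d_k=64)
pooled = multi_granularity_pool(fused)
print(pooled.shape)   # (128,)
```

The pooled vector (one granularity per statistic, concatenated) would then feed a classifier head; in the full model this is done per head and per modality direction before alignment.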
Source journal: Pattern Recognition Letters (Engineering/Technology – Computer Science: Artificial Intelligence)
CiteScore: 12.40
Self-citation rate: 5.90%
Articles per year: 287
Review time: 9.1 months
Journal description: Pattern Recognition Letters aims at rapid publication of concise articles of broad interest in pattern recognition. Subject areas include all current fields of interest represented by the Technical Committees of the International Association of Pattern Recognition, and other developing themes involving learning and recognition.