AC2Net: Hybrid attention convolution and compression fusion network for multimodal emotion recognition

IF 2.9 | Zone 3, Engineering and Technology | Q2, ENGINEERING, ELECTRICAL & ELECTRONIC

Digital Signal Processing | Pub Date: 2025-04-25 | DOI: 10.1016/j.dsp.2025.105261

Lixun Xie, Weiqing Sun, Jingyi Zhang, Xiaohu Zhao

Digital Signal Processing, Volume 164, Article 105261. Journal Article, not open access.
https://www.sciencedirect.com/science/article/pii/S1051200425002830

Citations: 0

Abstract

AC2Net: Hybrid attention convolution and compression fusion network for multimodal emotion recognition

The rapid development of human-computer interaction systems helps computers better understand human intentions, and comprehensive, accurate emotion recognition is a key link in that interaction. However, current research on multimodal sentiment analysis tends to extract information from each modality separately and then simply combine the per-modality features. This approach lacks effective interaction between the features extracted from each modality and ignores the close relationship between the multimodal data and the recognition task. In addition, the multimodal features are merely added or concatenated, which ignores how strongly each modality's information correlates with, and contributes to, the final recognition result. To address this, the paper proposes AC2Net, a hybrid attention convolution and compression fusion network for multimodal emotion recognition. A hybrid attention convolution model focuses on interactive extraction of multimodal features and is designed to capture key emotion-related features across three signal types: facial image sequences, EEG, and peripheral physiological signals. A compression fusion unit then aggregates the multimodal features, achieving deep fusion of heterogeneous multimodal information by compressing and exciting the dense layers in the modality branches. The model was evaluated on the DEAP dataset, where binary classification accuracy reached 99.39% for valence and 99.37% for arousal, and four-class accuracy reached 99.08%. The authors report that the model compares favorably with existing unimodal and multimodal emotion recognition methods.
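The abstract does not give the layer-level design of the compression fusion unit. As a rough illustration only, the sketch below shows one common way to "compress and excite" dense features from several modality branches, in the spirit of squeeze-and-excitation gating: the branches are concatenated and projected to a small bottleneck, and that bottleneck produces per-feature gates for each branch. The function names (`se_fuse`, `random_matrix`), the feature sizes, the bottleneck size, and the random weights are all assumptions made for this example and are not taken from the paper.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(weights, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def random_matrix(rows, cols, rng):
    """Small random weights standing in for trained parameters (assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def se_fuse(branches, w_squeeze, w_excite):
    """Gate each modality's dense features with jointly computed weights.

    branches:  dict of modality name -> feature vector (list of floats)
    w_squeeze: matrix mapping the concatenated features to a small bottleneck
    w_excite:  dict of modality name -> matrix mapping the bottleneck back
               to that branch's feature size
    """
    # Compress: concatenate all branches and project to a shared bottleneck.
    joint = [x for name in branches for x in branches[name]]
    bottleneck = [max(0.0, z) for z in matvec(w_squeeze, joint)]  # ReLU

    fused, gates = [], {}
    for name, feats in branches.items():
        # Excite: per-feature gates between 0 and 1, informed by every modality.
        gates[name] = [sigmoid(z) for z in matvec(w_excite[name], bottleneck)]
        fused.extend(g * f for g, f in zip(gates[name], feats))
    return fused, gates


# Hypothetical feature sizes for the three modalities named in the abstract.
rng = random.Random(0)
branches = {
    "face": [rng.gauss(0, 1) for _ in range(8)],
    "eeg": [rng.gauss(0, 1) for _ in range(6)],
    "peripheral": [rng.gauss(0, 1) for _ in range(4)],
}
total = sum(len(v) for v in branches.values())
bottleneck_size = 4
w_squeeze = random_matrix(bottleneck_size, total, rng)
w_excite = {
    name: random_matrix(len(v), bottleneck_size, rng)
    for name, v in branches.items()
}

fused, gates = se_fuse(branches, w_squeeze, w_excite)
print(len(fused), {name: len(g) for name, g in gates.items()})
```

Because every gate is computed from the joint bottleneck, each modality's features are rescaled according to what the other modalities contain, in contrast to the plain addition or concatenation that the abstract criticizes. A trained model would learn these weights rather than draw them at random.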


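Source journal

Digital Signal Processing (Engineering and Technology: Engineering, Electrical & Electronic)

CiteScore: 5.30
Self-citation rate: 17.20%
Articles published: 435
Review time: 66 days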

About the journal: Digital Signal Processing: A Review Journal is one of the oldest and most established journals in the field of signal processing, yet it aims to be the most innovative. The Journal invites top-quality research articles at the frontiers of research in all aspects of signal processing. Our objective is to provide a platform for the publication of ground-breaking research in signal processing with both academic and industrial appeal. The journal has a special emphasis on statistical signal processing methodology such as Bayesian signal processing, and encourages articles on emerging applications of signal processing such as:
• big data
• machine learning
• internet of things
• information security
• systems biology and computational biology,
• financial time series analysis,
• autonomous vehicles,
• quantum computing,
• neuromorphic engineering,
• human-computer interaction and intelligent user interfaces,
• environmental signal processing,
• geophysical signal processing including seismic signal processing,
• chemioinformatics and bioinformatics,
• audio, visual and performance arts,
• disaster management and prevention,
• renewable energy,