{"title":"多模态情感识别的混合注意卷积和压缩融合网络","authors":"Lixun Xie , Weiqing Sun , Jingyi Zhang , Xiaohu Zhao","doi":"10.1016/j.dsp.2025.105261","DOIUrl":null,"url":null,"abstract":"<div><div>The rapid development of human-computer interaction systems can help computers better understand human intentions. How to comprehensively and accurately recognize emotions is a key link in human-computer interaction. However, the current research on multimodal sentiment analysis tends to extract the information of a single modal separately and then simply combine the features of each mode. In this process, there is a lack of effective interaction between the features of each modal extraction, and the close relationship between the multi-modal data and the recognition task is ignored. In addition, the multimodal features are simply added and connected, ignoring the differences in the degree of correlation and contribution of different modal information to the final recognition result. In view of this, this paper proposes the Hybrid Attention Convolution and Compression Fusion Unit for Multimodal Emotion Recognition, namely AC2Net. We have designed a hybrid attention convolution model, which focuses on the interactive extraction of multi-modal features and can accurately capture key emotional perception features among three signal features, namely facial image sequence, EEG, and peripheral physiological signals. A compression fusion unit is designed to aggregate the multi-modal features, and the deep fusion of multi-modal heterogeneous information is realized by compressing and exciting the dense layers in the multi-modal branches. Finally, the proposed model was verified on the DEAP dataset, and the accuracy of valence and arousal dimension two-classification recognition reached 99.39%, 99.37%, and four-classification recognition accuracy reached 99.08%. Compared with the existing single-mode and multi-mode emotion recognition methods, the performance of this model is excellent.</div></div>","PeriodicalId":51011,"journal":{"name":"Digital Signal Processing","volume":"164 ","pages":"Article 105261"},"PeriodicalIF":2.9000,"publicationDate":"2025-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"AC2Net: Hybrid attention convolution and compression fusion network for multimodal emotion recognition\",\"authors\":\"Lixun Xie , Weiqing Sun , Jingyi Zhang , Xiaohu Zhao\",\"doi\":\"10.1016/j.dsp.2025.105261\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>The rapid development of human-computer interaction systems can help computers better understand human intentions. How to comprehensively and accurately recognize emotions is a key link in human-computer interaction. However, the current research on multimodal sentiment analysis tends to extract the information of a single modal separately and then simply combine the features of each mode. In this process, there is a lack of effective interaction between the features of each modal extraction, and the close relationship between the multi-modal data and the recognition task is ignored. In addition, the multimodal features are simply added and connected, ignoring the differences in the degree of correlation and contribution of different modal information to the final recognition result. In view of this, this paper proposes the Hybrid Attention Convolution and Compression Fusion Unit for Multimodal Emotion Recognition, namely AC2Net. 
We have designed a hybrid attention convolution model, which focuses on the interactive extraction of multi-modal features and can accurately capture key emotional perception features among three signal features, namely facial image sequence, EEG, and peripheral physiological signals. A compression fusion unit is designed to aggregate the multi-modal features, and the deep fusion of multi-modal heterogeneous information is realized by compressing and exciting the dense layers in the multi-modal branches. Finally, the proposed model was verified on the DEAP dataset, and the accuracy of valence and arousal dimension two-classification recognition reached 99.39%, 99.37%, and four-classification recognition accuracy reached 99.08%. Compared with the existing single-mode and multi-mode emotion recognition methods, the performance of this model is excellent.</div></div>\",\"PeriodicalId\":51011,\"journal\":{\"name\":\"Digital Signal Processing\",\"volume\":\"164 \",\"pages\":\"Article 105261\"},\"PeriodicalIF\":2.9000,\"publicationDate\":\"2025-04-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Digital Signal Processing\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1051200425002830\",\"RegionNum\":3,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Digital Signal Processing","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1051200425002830","RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
AC2Net: Hybrid attention convolution and compression fusion network for multimodal emotion recognition
The rapid development of human-computer interaction systems can help computers better understand human intentions, and recognizing emotions comprehensively and accurately is a key link in that interaction. However, current research on multimodal sentiment analysis tends to extract the information of each modality separately and then simply combine the per-modality features. This process lacks effective interaction between the features extracted from each modality and ignores the close relationship between the multimodal data and the recognition task. In addition, the multimodal features are simply added or concatenated, ignoring differences in how strongly each modality's information correlates with and contributes to the final recognition result. In view of this, this paper proposes AC2Net, a hybrid attention convolution and compression fusion network for multimodal emotion recognition. We design a hybrid attention convolution model that focuses on the interactive extraction of multimodal features and can accurately capture key emotion-perception features across three signal types: facial image sequences, EEG, and peripheral physiological signals. A compression fusion unit is designed to aggregate the multimodal features; deep fusion of the multimodal heterogeneous information is realized by compressing and exciting the dense layers in the multimodal branches. Finally, the proposed model was verified on the DEAP dataset: binary classification accuracy on the valence and arousal dimensions reached 99.39% and 99.37%, respectively, and four-class recognition accuracy reached 99.08%. Compared with existing single-modal and multimodal emotion recognition methods, the proposed model performs excellently.
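To make the compress-and-excite fusion idea concrete, below is a minimal PyTorch sketch of a squeeze-and-excitation style fusion unit over three modality branches. It assumes each branch (facial image sequence, EEG, peripheral physiological signals) has already produced a fixed-length feature vector; the class name, feature dimensions, and reduction ratio are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class CompressionFusionUnit(nn.Module):
    """Fuse per-modality features by squeezing the concatenated descriptor
    and re-exciting each channel with a learned gate (SE-style weighting)."""

    def __init__(self, dims=(128, 128, 128), reduction=4):
        super().__init__()
        total = sum(dims)
        # "Compression": project the concatenated multimodal descriptor down.
        self.squeeze = nn.Linear(total, total // reduction)
        # "Excitation": recover one gate per feature channel.
        self.excite = nn.Linear(total // reduction, total)
        self.act = nn.ReLU()
        self.gate = nn.Sigmoid()

    def forward(self, face_feat, eeg_feat, periph_feat):
        # Concatenate the three modality embeddings: shape (B, sum(dims)).
        z = torch.cat([face_feat, eeg_feat, periph_feat], dim=1)
        # Channel weights reflect each modality's learned contribution,
        # rather than treating all concatenated features equally.
        w = self.gate(self.excite(self.act(self.squeeze(z))))
        return z * w

# Example usage with hypothetical batch size and feature dimensions:
fusion = CompressionFusionUnit(dims=(128, 128, 128))
b = 8
fused = fusion(torch.randn(b, 128), torch.randn(b, 128), torch.randn(b, 128))
head = nn.Linear(fused.size(1), 2)  # e.g. a binary valence classification head
logits = head(fused)

The point of the gating step is that a plain concatenation weights every channel identically, whereas the learned sigmoid gate lets the network down-weight channels from a less informative modality on a per-sample basis.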
Journal introduction:
Digital Signal Processing: A Review Journal is one of the oldest and most established journals in the field of signal processing, yet it aims to be the most innovative. The Journal invites top-quality research articles at the frontiers of research in all aspects of signal processing. Our objective is to provide a platform for the publication of ground-breaking research in signal processing with both academic and industrial appeal.
The journal has a special emphasis on statistical signal processing methodology such as Bayesian signal processing, and encourages articles on emerging applications of signal processing such as:
• big data
• machine learning
• internet of things
• information security
• systems biology and computational biology
• financial time series analysis
• autonomous vehicles
• quantum computing
• neuromorphic engineering
• human-computer interaction and intelligent user interfaces
• environmental signal processing
• geophysical signal processing including seismic signal processing
• chemioinformatics and bioinformatics
• audio, visual and performance arts
• disaster management and prevention
• renewable energy