A dual transfer learning method based on 3D-CNN and vision transformer for emotion recognition

IF 3.4 · Zone 2, Computer Science · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Applied Intelligence Pub Date : 2024-12-21 DOI:10.1007/s10489-024-05976-z

Zhifen Guo, Jiao Wang, Bin Zhang, Yating Ku, Fengbin Ma

Applied Intelligence, vol. 55, no. 2. Publication type: Journal Article. Open access: no. Full text: https://link.springer.com/article/10.1007/s10489-024-05976-z

Citations: 0

Abstract

In medical science, emotion recognition based on the electroencephalogram (EEG) has been widely used in emotion computing. Although deep learning is prevalent in EEG signal analysis, standard convolutional and recurrent neural networks fall short in processing EEG data effectively because of their inherent limitations in capturing global dependencies and in handling the non-linear and unstable characteristics of EEG signals. We propose a dual transfer learning method that combines a 3D Convolutional Neural Network (3D-CNN) with a Vision Transformer (ViT) to enhance emotion recognition. The 3D-CNN captures the spatial characteristics of EEG signals and reduces data covariance, extracting shallow features. The ViT improves the model’s ability to capture long-range dependencies, facilitating deep feature extraction. The methodology is a two-stage process. In the first stage, the front end of a pre-trained 3D-CNN serves as a shallow feature extractor that mitigates EEG data covariance and transformer biases and focuses on low-level feature detection. In the second stage, the ViT serves as a deep feature extractor that models the global aspects of EEG signals and uses attention mechanisms for precise classification. We also present an innovative algorithm for data mapping in transfer learning, which ensures consistent feature representation across both the spatial and temporal dimensions. This approach significantly improves global feature processing and long-range dependency detection, and the integration of color channels increases the model’s sensitivity to signal variations.
In a 10-fold cross-validation experiment on the DEAP dataset, the proposed method achieves classification accuracies of 92.44% and 92.85% for the valence and arousal dimensions, respectively. The accuracies of four-class classification across valence and arousal are HVHA: 88.01%, HVLA: 88.27%, LVHA: 90.89%, and LVLA: 78.84%. Similarly, the method achieves an accuracy of 98.69% on the SEED dataset. Overall, this methodology not only holds substantial potential for advancing emotion recognition tasks but also contributes to the broader field of affective computing.
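The evaluation setup named in the abstract (four valence/arousal classes and 10-fold cross-validation) can be illustrated with a minimal sketch. This is not the authors' code. The function names `quadrant_label` and `k_fold_indices` are hypothetical, the rating threshold of 5.0 is an assumption that the abstract does not state, and the shuffling seed is arbitrary.

```python
import random


def quadrant_label(valence, arousal, threshold=5.0):
    """Map a (valence, arousal) rating pair to HVHA, HVLA, LVHA or LVLA.

    The default threshold of 5.0 is an assumption for illustration;
    the abstract does not say how high and low ratings are separated.
    """
    v = "HV" if valence > threshold else "LV"
    a = "HA" if arousal > threshold else "LA"
    return v + a


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal indices round-robin so that fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        # Training set is every index that is not in the held-out fold.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


if __name__ == "__main__":
    print(quadrant_label(7.2, 3.1))
    for train_idx, test_idx in k_fold_indices(25, k=10):
        print(len(train_idx), len(test_idx))
```

Under this scheme each sample is held out exactly once, and a reported accuracy would be the mean over the ten held-out folds.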




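Source journal

Applied Intelligence (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore: 6.60

Self-citation rate: 20.80%

Articles published: 1361

Review time: 5.9 months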

About the journal: With a focus on research in artificial intelligence and neural networks, this journal addresses issues involving solutions of real-life manufacturing, defense, management, government and industrial problems which are too complex to be solved through conventional approaches and require the simulation of intelligent thought processes, heuristics, applications of knowledge, and distributed and parallel processing. The integration of these multiple approaches in solving complex problems is of particular importance. The journal presents new and original research and technological developments, addressing real and complex issues applicable to difficult problems. It provides a medium for exchanging scientific research and technological achievements accomplished by the international community.