In the domain of medical science, emotion recognition based on electroencephalogram (EEG) has been widely used in emotion computing. Despite the prevalence of deep learning in EEG signals analysis, standard convolutional and recurrent neural networks fall short in effectively processing EEG data due to their inherent limitations in capturing global dependencies and addressing the non-linear and unstable characteristics of EEG signals. We propose a dual transfer learning method based on 3D Convolutional Neural Networks (3D-CNN) with a Vision Transformer (ViT) to enhance emotion recognition. This paper aims to utilize 3D-CNN effectively to capture the spatial characteristics of EEG signals and reduce data covariance, extracting shallow features. Additionally, ViT is incorporated to improve the model’s ability to capture long-range dependencies, facilitating deep feature extraction. The methodology involves a two-stage process: initially, the front end of a pre-trained 3D-CNN is employed as a shallow feature extractor to mitigate EEG data covariance and transformer biases, focusing on low-level feature detection. The subsequent stage utilizes ViT as a deep feature extractor, adept at modeling the global aspects of EEG signals and employing attention mechanisms for precise classification. We also present an innovative algorithm for data mapping in transfer learning, ensuring consistent feature representation across both spatio-temporal dimensions. This approach significantly improves global feature processing and long-range dependency detection, with the integration of color channels augmenting the model’s sensitivity to signal variations. In a 10-fold cross-validation experiment on the DEAP, experimental results demonstrate that the proposed method achieves classification accuracies of 92.44\(\%\) and 92.85\(\%\) for the valence and arousal dimensions, and the accuracies of four-class classification across valence and arousal are HVHA: 88.01\(\%\), HVLA: 88.27\(\%\), LVHA: 90.89\(\%\), LVLA: 78.84\(\%\). Similarly, it achieves an accuracy of 98.69\(\%\) on the SEED. Overall, this methodology not only holds substantial potential in advancing emotion recognition tasks but also contributes to the broader field of affective computing.