{"title":"评论文章:用于说话人独立语音情感识别的变压器性能基准评价。","authors":"Francisco Portal, Javier De Lope, Manuel Graña","doi":"10.1142/S0129065725300013","DOIUrl":null,"url":null,"abstract":"<p><p>Speech Emotion Recognition (SER) is becoming a key element of speech-based human-computer interfaces, endowing them with some form of empathy towards the emotional status of the human. Transformers have become a central Deep Learning (DL) architecture in natural language processing and signal processing, recently including audio signals for Automatic Speech Recognition (ASR) and SER. A central question addressed in this paper is the achievement of speaker-independent SER systems, i.e. systems that perform independently of a specific training set, enabling their deployment in real-world situations by overcoming the typical limitations of laboratory environments. This paper presents a comprehensive performance evaluation review of transformer architectures that have been proposed to deal with the SER task, carrying out an independent validation at different levels over the most relevant publicly available datasets for validation of SER models. The comprehensive experimental design implemented in this paper provides an accurate picture of the performance achieved by current state-of-the-art transformer models in speaker-independent SER. We have found that most experimental instances reach accuracies below 40% when a model is trained on a dataset and tested on a different one. A speaker-independent evaluation combining up to five datasets and testing on a different one achieves up to 58.85% accuracy. In conclusion, the SER results improved with the aggregation of datasets, indicating that model generalization can be enhanced by extracting data from diverse datasets.</p>","PeriodicalId":94052,"journal":{"name":"International journal of neural systems","volume":" ","pages":"2530001"},"PeriodicalIF":6.4000,"publicationDate":"2025-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Performance Benchmarking Review of Transformers for Speaker-Independent Speech Emotion Recognition.\",\"authors\":\"Francisco Portal, Javier De Lope, Manuel Graña\",\"doi\":\"10.1142/S0129065725300013\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Speech Emotion Recognition (SER) is becoming a key element of speech-based human-computer interfaces, endowing them with some form of empathy towards the emotional status of the human. Transformers have become a central Deep Learning (DL) architecture in natural language processing and signal processing, recently including audio signals for Automatic Speech Recognition (ASR) and SER. A central question addressed in this paper is the achievement of speaker-independent SER systems, i.e. systems that perform independently of a specific training set, enabling their deployment in real-world situations by overcoming the typical limitations of laboratory environments. This paper presents a comprehensive performance evaluation review of transformer architectures that have been proposed to deal with the SER task, carrying out an independent validation at different levels over the most relevant publicly available datasets for validation of SER models. The comprehensive experimental design implemented in this paper provides an accurate picture of the performance achieved by current state-of-the-art transformer models in speaker-independent SER. 
We have found that most experimental instances reach accuracies below 40% when a model is trained on a dataset and tested on a different one. A speaker-independent evaluation combining up to five datasets and testing on a different one achieves up to 58.85% accuracy. In conclusion, the SER results improved with the aggregation of datasets, indicating that model generalization can be enhanced by extracting data from diverse datasets.</p>\",\"PeriodicalId\":94052,\"journal\":{\"name\":\"International journal of neural systems\",\"volume\":\" \",\"pages\":\"2530001\"},\"PeriodicalIF\":6.4000,\"publicationDate\":\"2025-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International journal of neural systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1142/S0129065725300013\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/7/29 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International journal of neural systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1142/S0129065725300013","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/7/29 0:00:00","PubModel":"Epub","JCR":"","JCRName":"","Score":null,"Total":0}
A Performance Benchmarking Review of Transformers for Speaker-Independent Speech Emotion Recognition.
Speech Emotion Recognition (SER) is becoming a key element of speech-based human-computer interfaces, endowing them with a form of empathy toward the emotional state of the human user. Transformers have become a central Deep Learning (DL) architecture in natural language processing and signal processing, and have recently been applied to audio signals for Automatic Speech Recognition (ASR) and SER. A central question addressed in this paper is the achievement of speaker-independent SER systems, i.e., systems whose performance does not depend on a specific training set, enabling their deployment in real-world situations by overcoming the typical limitations of laboratory environments. This paper presents a comprehensive performance evaluation review of transformer architectures that have been proposed for the SER task, carrying out an independent validation at different levels over the most relevant publicly available datasets for SER model evaluation. The comprehensive experimental design implemented in this paper provides an accurate picture of the performance achieved by current state-of-the-art transformer models in speaker-independent SER. We have found that most experimental instances reach accuracies below 40% when a model is trained on one dataset and tested on a different one. A speaker-independent evaluation that trains on a combination of up to five datasets and tests on a different one achieves up to 58.85% accuracy. In conclusion, SER results improve as datasets are aggregated, indicating that model generalization can be enhanced by training on data drawn from diverse datasets.
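The abstract describes a cross-corpus, leave-one-dataset-out protocol: train on one corpus (or an aggregation of up to five) and test on a held-out corpus. The sketch below illustrates that evaluation loop only; the corpus names, feature dimensions, random dummy features, and the scikit-learn classifier standing in for the paper's transformer models are all illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch of a leave-one-corpus-out (cross-dataset) evaluation loop.
# Corpus names, feature sizes, and the LogisticRegression stand-in for a
# transformer SER model are assumptions for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
N_CLASSES = 4   # e.g. neutral, happy, sad, angry (assumed label set)
FEAT_DIM = 768  # typical transformer embedding size (assumption)

# Dummy per-corpus features and labels standing in for real SER corpora.
corpora = {
    name: (rng.normal(size=(200, FEAT_DIM)),
           rng.integers(0, N_CLASSES, size=200))
    for name in ["corpus_A", "corpus_B", "corpus_C",
                 "corpus_D", "corpus_E", "corpus_F"]
}

for held_out in corpora:
    # Aggregate every corpus except the held-out one for training.
    train_names = [n for n in corpora if n != held_out]
    X_train = np.vstack([corpora[n][0] for n in train_names])
    y_train = np.concatenate([corpora[n][1] for n in train_names])
    X_test, y_test = corpora[held_out]

    # Placeholder classifier; the paper benchmarks transformer models instead.
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"train on {len(train_names)} corpora, test on {held_out}: acc={acc:.2%}")
```

With random dummy features the printed accuracies hover around chance level; the point of the sketch is the protocol structure (hold out an entire corpus, train on the aggregation of the rest), which is what makes the reported 58.85% a speaker- and corpus-independent figure rather than a within-dataset one.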