{"title":"T_SRNET: A multimodal model based on convolutional neural network for emotional speech enhancement","authors":"Shaoqiang Wang , Lei Feng , Li Zhang","doi":"10.1016/j.aej.2025.03.071","DOIUrl":null,"url":null,"abstract":"<div><div>Speech classification is a technology that can determine the emotional state conveyed by speech. It can support emotion-related applications and improve the human–computer interaction experience. However, the lack of high-quality speech annotation datasets makes it difficult for many models to provide sufficient data for training, resulting in poor model generalization performance. It is necessary to obtain more high-quality speech annotation datasets through the high-precision model. For example, there are many human emotional data in the image dataset that can be utilized to assist in speech emotional information recognition. In this study, a multimodal algorithm T_SRNET is proposed, which can assist speech emotion recognition by extracting image emotion features and converting them into spectrograms. Firstly, the face image data with emotions such as joy and sadness are transformed into the corresponding phonograms by the diffusion model. Secondly, the features can be extracted by using the speech feature extraction network SRNET based on the improved transform structure. Finally, the speech signal features are extracted, and the two features are fused before the decision is made to output the results. After ablation and contrast experiments, the accuracy of CREMA-D and IEMOCAP was improved by 2% and 1% respectively. 
Also it can be evaluated that the proposed model in this study can correlate image data with speech data, improve the quality of speech data tagging and enhance the performance of speech recognition.</div></div>","PeriodicalId":7484,"journal":{"name":"alexandria engineering journal","volume":"124 ","pages":"Pages 573-581"},"PeriodicalIF":6.2000,"publicationDate":"2025-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"alexandria engineering journal","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1110016825003795","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, MULTIDISCIPLINARY","Score":null,"Total":0}
Citations: 0
Abstract
Speech emotion classification determines the emotional state conveyed by speech; it supports emotion-aware applications and improves the human–computer interaction experience. However, the scarcity of high-quality annotated speech datasets leaves many models without sufficient training data, resulting in poor generalization. One remedy is to leverage other modalities: image datasets, for example, contain abundant human emotional data that can assist speech emotion recognition. This study proposes a multimodal algorithm, T_SRNET, which assists speech emotion recognition by extracting image emotion features and converting them into spectrograms. First, face images labeled with emotions such as joy and sadness are transformed into corresponding spectrograms by a diffusion model. Second, features are extracted from these spectrograms by SRNET, a speech feature extraction network based on an improved Transformer structure. Finally, features are extracted from the speech signal itself, and the two feature sets are fused before a decision is made and the result is output. In ablation and comparison experiments, accuracy on CREMA-D and IEMOCAP improved by 2% and 1%, respectively. The results also show that the proposed model can correlate image data with speech data, improve the quality of speech data annotation, and enhance speech emotion recognition performance.
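The fusion pipeline described above (an image-derived spectrogram branch, an acoustic branch, feature concatenation, then a decision layer over emotion classes) can be sketched in miniature. This is a hedged illustration, not the paper's T_SRNET: the feature extractors below are trivial mean/variance placeholders standing in for SRNET and the acoustic front end, and the function names, frame size, and emotion set are assumptions for demonstration only.

```python
import math
import random

EMOTIONS = ["joy", "sadness", "anger", "neutral"]

def image_branch_features(spectrogram):
    # Placeholder for SRNET features from the image-derived spectrogram:
    # mean and standard deviation over all time-frequency bins.
    flat = [v for row in spectrogram for v in row]
    mean = sum(flat) / len(flat)
    var = sum((v - mean) ** 2 for v in flat) / len(flat)
    return [mean, math.sqrt(var)]

def speech_branch_features(waveform, frame=10):
    # Placeholder acoustic features: mean and spread of per-frame energy.
    energies = [
        sum(s * s for s in waveform[i:i + frame]) / frame
        for i in range(0, len(waveform) - frame + 1, frame)
    ]
    mean = sum(energies) / len(energies)
    var = sum((e - mean) ** 2 for e in energies) / len(energies)
    return [mean, math.sqrt(var)]

def fuse_and_decide(img_feat, sp_feat, weights, bias):
    # Feature-level fusion (concatenation), then a linear decision
    # layer with a softmax over the emotion classes.
    fused = img_feat + sp_feat
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return EMOTIONS[probs.index(max(probs))], probs

# Toy inputs standing in for a diffusion-generated spectrogram and a
# raw speech signal; the decision weights here are random, untrained.
random.seed(0)
spectrogram = [[random.random() for _ in range(64)] for _ in range(64)]
waveform = [random.gauss(0, 1) for _ in range(1000)]
weights = [[random.gauss(0, 1) for _ in range(4)] for _ in EMOTIONS]
bias = [0.0] * len(EMOTIONS)

label, probs = fuse_and_decide(
    image_branch_features(spectrogram),
    speech_branch_features(waveform), weights, bias)
```

The key design point the sketch mirrors is that fusion happens on extracted features before the decision layer, so either branch can be improved or replaced independently of the classifier.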
About the journal:
Alexandria Engineering Journal is an international journal devoted to publishing high-quality papers in engineering and applied science. It is indexed in Engineering Information Services (EIS) and Chemical Abstracts (CA). Papers published in Alexandria Engineering Journal are grouped into five sections, according to the following classification:
• Mechanical, Production, Marine and Textile Engineering
• Electrical Engineering, Computer Science and Nuclear Engineering
• Civil and Architecture Engineering
• Chemical Engineering and Applied Sciences
• Environmental Engineering