Analysis and Research on Spectrogram-Based Emotional Speech Signal Augmentation Algorithm.

IF 2.1 | Zone 3, Physics and Astrophysics | Q2, PHYSICS, MULTIDISCIPLINARY

Entropy Pub Date : 2025-06-15 DOI:10.3390/e27060640

Huawei Tao, Sixian Li, Xuemei Wang, Binkun Liu, Shuailong Zheng


Citations: 0

Abstract

Data augmentation techniques are widely applied in speech emotion recognition to increase the diversity of data and enhance the performance of models. However, existing research has not deeply explored the impact of these data augmentation techniques on emotional data. Inappropriate augmentation algorithms may distort emotional labels, thereby reducing the performance of models. To address this issue, in this paper we systematically evaluate the influence of common data augmentation algorithms on emotion recognition from three dimensions: (1) we design subjective auditory experiments to intuitively demonstrate the impact of augmentation algorithms on the emotional expression of speech; (2) we jointly extract multi-dimensional features from spectrograms based on the Librosa library and analyze the impact of data augmentation algorithms on the spectral features of speech signals through heatmap visualization; and (3) we objectively evaluate the recognition performance of the model by means of indicators such as cross-entropy loss and introduce statistical significance analysis to verify the effectiveness of the augmentation algorithms. The experimental results show that "time stretching" may distort speech features, affect the attribution of emotional labels, and significantly reduce the model's accuracy. In contrast, "reverberation" (RIR) and "resampling" within a limited range have the least impact on emotional data, enhancing the diversity of samples. Moreover, their combination can increase accuracy by up to 7.1%, providing a basis for optimizing data augmentation strategies.
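The abstract names "reverberation" (RIR) and "resampling" as the least harmful augmentations and cross-entropy loss as one evaluation indicator, but it does not give implementation details. The following is only a minimal sketch under stated assumptions, not the authors' method: `resample_linear`, `synthetic_rir`, `add_reverb`, and `cross_entropy` are hypothetical helper names, the impulse response is a synthetic decaying-noise tail rather than a measured room response, and the resampling uses plain linear interpolation instead of a library resampler such as Librosa's.

```python
import math
import random


def resample_linear(signal, factor):
    """Resample by linear interpolation; factor > 1 yields more samples.

    Assumption: played back at the original rate, the result sounds
    slower/lower (factor > 1) or faster/higher (factor < 1).
    """
    if len(signal) < 2:
        return list(signal)
    n_out = max(2, round(len(signal) * factor))
    step = (len(signal) - 1) / (n_out - 1)
    out = []
    for i in range(n_out):
        pos = i * step
        lo = min(int(pos), len(signal) - 2)  # left neighbour index
        frac = pos - lo                      # distance from left neighbour
        out.append(signal[lo] * (1 - frac) + signal[lo + 1] * frac)
    return out


def synthetic_rir(length, decay, seed=0):
    """Hypothetical impulse response: direct path plus a decaying noise tail."""
    rng = random.Random(seed)
    rir = [1.0]  # direct sound
    for n in range(1, length):
        rir.append(0.5 * rng.uniform(-1.0, 1.0) * math.exp(-decay * n))
    return rir


def add_reverb(signal, rir):
    """Convolve the signal with the impulse response, truncated to input length."""
    out = [0.0] * len(signal)
    for i, x in enumerate(signal):
        for j, h in enumerate(rir):
            if i + j >= len(signal):
                break
            out[i + j] += x * h
    return out


def cross_entropy(probs, label):
    """Cross-entropy of one prediction: -log of the true-class probability."""
    return -math.log(max(probs[label], 1e-12))


# Usage on a short synthetic tone (a stand-in for a speech clip).
sr = 16000
tone = [math.sin(2 * math.pi * 220 * n / sr) for n in range(800)]
faster = resample_linear(tone, 0.9)  # mild resampling, kept in a limited range
reverbed = add_reverb(tone, synthetic_rir(length=64, decay=0.1))
loss = cross_entropy([0.25, 0.75], label=1)
```

Keeping the resampling factor close to 1 mirrors the abstract's "within a limited range" qualifier; how wide that range can be before emotional labels are affected is a question the paper itself addresses and this sketch does not.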



Source journal

Entropy (PHYSICS, MULTIDISCIPLINARY)

CiteScore: 4.90
Self-citation rate: 11.10%
Articles published: 1580
Review time: 21.05 days

Journal description: Entropy (ISSN 1099-4300), an international and interdisciplinary journal of entropy and information studies, publishes reviews, regular research papers, and short notes. Its aim is to encourage scientists to publish their theoretical and experimental work in as much detail as possible. There is no restriction on the length of papers. Where a paper involves computations or experiments, the details must be provided so that the results can be reproduced.