Bangla Speech Emotion Recognition Using Deep Learning-Based Ensemble Learning and Feature Fusion.

Impact Factor 2.7 | Q3 | Imaging Science & Photographic Technology
Md Shahid Ahammed Shakil, Fahmid Al Farid, Nitun Kumar Podder, S M Hasan Sazzad Iqbal, Abu Saleh Musa Miah, Md Abdur Rahim, Hezerul Abdul Karim
{"title":"Bangla Speech Emotion Recognition Using Deep Learning-Based Ensemble Learning and Feature Fusion.","authors":"Md Shahid Ahammed Shakil, Fahmid Al Farid, Nitun Kumar Podder, S M Hasan Sazzad Iqbal, Abu Saleh Musa Miah, Md Abdur Rahim, Hezerul Abdul Karim","doi":"10.3390/jimaging11080273","DOIUrl":null,"url":null,"abstract":"<p><p>Emotion recognition in speech is essential for enhancing human-computer interaction (HCI) systems. Despite progress in Bangla speech emotion recognition, challenges remain, including low accuracy, speaker dependency, and poor generalization across emotional expressions. Previous approaches often rely on traditional machine learning or basic deep learning models, struggling with robustness and accuracy in noisy or varied data. In this study, we propose a novel multi-stream deep learning feature fusion approach for Bangla speech emotion recognition, addressing the limitations of existing methods. Our approach begins with various data augmentation techniques applied to the training dataset, enhancing the model's robustness and generalization. We then extract a comprehensive set of handcrafted features, including Zero-Crossing Rate (ZCR), chromagram, spectral centroid, spectral roll-off, spectral contrast, spectral flatness, Mel-Frequency Cepstral Coefficients (MFCCs), Root Mean Square (RMS) energy, and Mel-spectrogram. Although these features are used as 1D numerical vectors, some of them are computed from time-frequency representations (e.g., chromagram, Mel-spectrogram) that can themselves be depicted as images, which is conceptually close to imaging-based analysis. These features capture key characteristics of the speech signal, providing valuable insights into the emotional content. Sequentially, we utilize a multi-stream deep learning architecture to automatically learn complex, hierarchical representations of the speech signal. This architecture consists of three distinct streams: the first stream uses 1D convolutional neural networks (1D CNNs), the second integrates 1D CNN with Long Short-Term Memory (LSTM), and the third combines 1D CNNs with bidirectional LSTM (Bi-LSTM). These models capture intricate emotional nuances that handcrafted features alone may not fully represent. For each of these models, we generate predicted scores and then employ ensemble learning with a soft voting technique to produce the final prediction. This fusion of handcrafted features, deep learning-derived features, and ensemble voting enhances the accuracy and robustness of emotion identification across multiple datasets. Our method demonstrates the effectiveness of combining various learning models to improve emotion recognition in Bangla speech, providing a more comprehensive solution compared with existing methods. We utilize three primary datasets-SUBESCO, BanglaSER, and a merged version of both-as well as two external datasets, RAVDESS and EMODB, to assess the performance of our models. Our method achieves impressive results with accuracies of 92.90%, 85.20%, 90.63%, 67.71%, and 69.25% for the SUBESCO, BanglaSER, merged SUBESCO and BanglaSER, RAVDESS, and EMODB datasets, respectively. 
These results demonstrate the effectiveness of combining handcrafted features with deep learning-based features through ensemble learning for robust emotion recognition in Bangla speech.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"11 8","pages":""},"PeriodicalIF":2.7000,"publicationDate":"2025-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12387467/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Imaging","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3390/jimaging11080273","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"IMAGING SCIENCE & PHOTOGRAPHIC TECHNOLOGY","Score":null,"Total":0}
Citations: 0

Abstract

Emotion recognition in speech is essential for enhancing human-computer interaction (HCI) systems. Despite progress in Bangla speech emotion recognition, challenges remain, including low accuracy, speaker dependency, and poor generalization across emotional expressions. Previous approaches often rely on traditional machine learning or basic deep learning models, which struggle with robustness and accuracy on noisy or varied data. In this study, we propose a novel multi-stream deep learning feature fusion approach for Bangla speech emotion recognition, addressing the limitations of existing methods. Our approach begins with various data augmentation techniques applied to the training dataset, enhancing the model's robustness and generalization. We then extract a comprehensive set of handcrafted features, including Zero-Crossing Rate (ZCR), chromagram, spectral centroid, spectral roll-off, spectral contrast, spectral flatness, Mel-Frequency Cepstral Coefficients (MFCCs), Root Mean Square (RMS) energy, and Mel-spectrogram. Although these features are used as 1D numerical vectors, some of them are computed from time-frequency representations (e.g., chromagram, Mel-spectrogram) that can themselves be depicted as images, which is conceptually close to imaging-based analysis. These features capture key characteristics of the speech signal, providing valuable insights into the emotional content. Next, we use a multi-stream deep learning architecture to automatically learn complex, hierarchical representations of the speech signal. This architecture consists of three distinct streams: the first stream uses 1D convolutional neural networks (1D CNNs), the second integrates a 1D CNN with Long Short-Term Memory (LSTM), and the third combines a 1D CNN with bidirectional LSTM (Bi-LSTM). These models capture intricate emotional nuances that handcrafted features alone may not fully represent. For each of these models, we generate predicted scores and then employ ensemble learning with a soft voting technique to produce the final prediction. This fusion of handcrafted features, deep learning-derived features, and ensemble voting enhances the accuracy and robustness of emotion identification across multiple datasets. Our method demonstrates the effectiveness of combining various learning models to improve emotion recognition in Bangla speech, providing a more comprehensive solution compared with existing methods. We use three primary datasets (SUBESCO, BanglaSER, and a merged version of both) as well as two external datasets, RAVDESS and EMODB, to assess the performance of our models. Our method achieves accuracies of 92.90%, 85.20%, 90.63%, 67.71%, and 69.25% on the SUBESCO, BanglaSER, merged SUBESCO-BanglaSER, RAVDESS, and EMODB datasets, respectively. These results demonstrate the effectiveness of combining handcrafted features with deep learning-based features through ensemble learning for robust emotion recognition in Bangla speech.
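The handcrafted features listed in the abstract correspond closely to standard audio descriptors available in librosa. The snippet below is a minimal sketch of how such a 1D feature vector could be assembled; the choice of librosa, the time-averaging of each feature matrix, and the parameter values are illustrative assumptions rather than the authors' exact configuration.

```python
import numpy as np
import librosa

def extract_handcrafted_features(path, n_mfcc=40):
    """Build a 1D feature vector from the descriptors named in the abstract.

    Averaging each feature over time frames is an assumed simplification;
    the paper's exact aggregation and parameters may differ.
    """
    y, sr = librosa.load(path, sr=22050)
    feats = [
        librosa.feature.zero_crossing_rate(y),             # ZCR
        librosa.feature.chroma_stft(y=y, sr=sr),            # chromagram
        librosa.feature.spectral_centroid(y=y, sr=sr),      # spectral centroid
        librosa.feature.spectral_rolloff(y=y, sr=sr),       # spectral roll-off
        librosa.feature.spectral_contrast(y=y, sr=sr),      # spectral contrast
        librosa.feature.spectral_flatness(y=y),             # spectral flatness
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc),    # MFCCs
        librosa.feature.rms(y=y),                           # RMS energy
        librosa.feature.melspectrogram(y=y, sr=sr),         # Mel-spectrogram
    ]
    # Collapse each (bands, frames) matrix to its per-band mean, then concatenate
    # everything into a single 1D numerical vector.
    return np.concatenate([f.mean(axis=1) for f in feats])
```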
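The three streams described in the abstract (1D CNN, 1D CNN + LSTM, and 1D CNN + Bi-LSTM) can be sketched in Keras roughly as follows; the layer sizes, kernel widths, and reshaping of the fused feature vector into a (length, 1) sequence are assumptions for illustration, not the published architecture.

```python
from tensorflow.keras import layers, models

def cnn_stream(input_len, n_classes):
    # Stream 1: plain 1D CNN over the fused feature vector.
    return models.Sequential([
        layers.Input(shape=(input_len, 1)),
        layers.Conv1D(64, 5, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 5, padding="same", activation="relu"),
        layers.GlobalAveragePooling1D(),
        layers.Dense(n_classes, activation="softmax"),
    ])

def cnn_rnn_stream(input_len, n_classes, bidirectional=False):
    # Streams 2 and 3: 1D CNN front end followed by LSTM or Bi-LSTM.
    rnn = layers.Bidirectional(layers.LSTM(64)) if bidirectional else layers.LSTM(64)
    return models.Sequential([
        layers.Input(shape=(input_len, 1)),
        layers.Conv1D(64, 5, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        rnn,
        layers.Dense(n_classes, activation="softmax"),
    ])
```

Each stream is trained independently on the augmented training set and produces a softmax probability vector per utterance, which the ensemble stage then combines.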
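Soft voting over the three streams amounts to averaging their predicted class-probability vectors and taking the highest-scoring class. A minimal sketch, assuming each trained stream exposes a Keras-style predict method returning softmax scores:

```python
import numpy as np

def soft_vote(streams, x):
    """Ensemble prediction by soft voting.

    `streams` is the list of trained models; `x` is a batch of inputs shaped
    (batch, length, 1). Hypothetical helper for illustration only.
    """
    probs = np.mean([m.predict(x, verbose=0) for m in streams], axis=0)
    return probs.argmax(axis=-1)
```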

Source journal
Journal of Imaging
Subject category: Medicine (Radiology, Nuclear Medicine and Imaging)
CiteScore: 5.90
Self-citation rate: 6.20%
Articles published: 303
Review time: 7 weeks