RoutingConvNet: A Light-weight Speech Emotion Recognition Model Based on Bidirectional MFCC

Korean Institute of Smart Media. Pub Date: 2023-06-30. DOI: 10.30693/smj.2023.12.5.28

Hyun Taek Lim, Soo-Hyung Kim, Gueesang Lee, Hyung-Jeong Yang


Abstract

In this study, we propose RoutingConvNet, a new light-weight model with fewer parameters, to improve the applicability and practicality of speech emotion recognition. To reduce the number of learnable parameters, the proposed model connects bidirectional MFCCs on a channel-by-channel basis to learn long-term emotion dependencies and extract contextual features. A light-weight deep CNN is constructed for low-level feature extraction, and self-attention is used to obtain channel and spatial information from the speech signal. In addition, we apply dynamic routing to improve accuracy and to make the model robust to feature variations. Across experiments on three speech emotion datasets (EMO-DB, RAVDESS, and IEMOCAP), the proposed model reduces the parameter count while improving accuracy, achieving 87.86%, 83.44%, and 66.06% accuracy, respectively, with about 156,000 parameters. We also propose a metric that quantifies the trade-off between the number of parameters and accuracy, for evaluating the performance of light-weight models.
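As a rough illustration of connecting bidirectional MFCCs channel by channel, the sketch below stacks an MFCC matrix with its time-reversed copy along a new channel axis. This is only one plausible reading of the abstract, not the authors' implementation: the function name `bidirectional_mfcc`, the `(n_mfcc, n_frames)` input layout, and the use of random numbers in place of real MFCCs are all assumptions made for the example.

```python
import numpy as np


def bidirectional_mfcc(mfcc: np.ndarray) -> np.ndarray:
    """Stack an MFCC matrix with its time-reversed copy as two channels.

    Hypothetical helper (not from the paper).
    mfcc: precomputed array of shape (n_mfcc, n_frames).
    Returns an array of shape (2, n_mfcc, n_frames).
    """
    if mfcc.ndim != 2:
        raise ValueError("expected a 2-D (n_mfcc, n_frames) array")
    reversed_mfcc = mfcc[:, ::-1]  # flip along the time (frame) axis
    # Channel 0: forward-time MFCC, channel 1: backward-time MFCC.
    return np.stack([mfcc, reversed_mfcc], axis=0)


# Toy input standing in for real MFCCs: 13 coefficients, 100 frames.
rng = np.random.default_rng(0)
features = rng.standard_normal((13, 100))
stacked = bidirectional_mfcc(features)
print(stacked.shape)  # (2, 13, 100)
```

A two-channel array of this shape could be passed to a 2-D CNN in the same way as a two-channel image, which is presumably how both time directions reach the network without a recurrent layer.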

