RoutingConvNet: A Light-weight Speech Emotion Recognition Model Based on Bidirectional MFCC

Hyun Taek Lim, Soo-Hyung Kim, Gueesang Lee, Hyung-Jeong Yang
{"title":"路由卷积神经网络:基于双向MFCC的轻量级语音情感识别模型","authors":"Hyun Taek Lim, Soo-Hyung Kim, Gueesang Lee, Hyung-Jeong Yang","doi":"10.30693/smj.2023.12.5.28","DOIUrl":null,"url":null,"abstract":"In this study, we propose a new light-weight model RoutingConvNet with fewer parameters to improve the applicability and practicality of speech emotion recognition. To reduce the number of learnable parameters, the proposed model connects bidirectional MFCCs on a channel-by-channel basis to learn long-term emotion dependence and extract contextual features. A light-weight deep CNN is constructed for low-level feature extraction, and self-attention is used to obtain information about channel and spatial signals in speech signals. In addition, we apply dynamic routing to improve the accuracy and construct a model that is robust to feature variations. The proposed model shows parameter reduction and accuracy improvement in the overall experiments of speech emotion datasets (EMO-DB, RAVDESS, and IEMOCAP), achieving 87.86%, 83.44%, and 66.06% accuracy respectively with about 156,000 parameters. In this study, we proposed a metric to calculate the trade-off between the number of parameters and accuracy for performance evaluation against light-weight.","PeriodicalId":249252,"journal":{"name":"Korean Institute of Smart Media","volume":"157 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"RoutingConvNet: A Light-weight Speech Emotion Recognition Model Based on Bidirectional MFCC\",\"authors\":\"Hyun Taek Lim, Soo-Hyung Kim, Gueesang Lee, Hyung-Jeong Yang\",\"doi\":\"10.30693/smj.2023.12.5.28\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this study, we propose a new light-weight model RoutingConvNet with fewer parameters to improve the applicability and practicality of speech emotion recognition. To reduce the number of learnable parameters, the proposed model connects bidirectional MFCCs on a channel-by-channel basis to learn long-term emotion dependence and extract contextual features. A light-weight deep CNN is constructed for low-level feature extraction, and self-attention is used to obtain information about channel and spatial signals in speech signals. In addition, we apply dynamic routing to improve the accuracy and construct a model that is robust to feature variations. The proposed model shows parameter reduction and accuracy improvement in the overall experiments of speech emotion datasets (EMO-DB, RAVDESS, and IEMOCAP), achieving 87.86%, 83.44%, and 66.06% accuracy respectively with about 156,000 parameters. 
In this study, we proposed a metric to calculate the trade-off between the number of parameters and accuracy for performance evaluation against light-weight.\",\"PeriodicalId\":249252,\"journal\":{\"name\":\"Korean Institute of Smart Media\",\"volume\":\"157 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-06-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Korean Institute of Smart Media\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.30693/smj.2023.12.5.28\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Korean Institute of Smart Media","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.30693/smj.2023.12.5.28","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

In this study, we propose RoutingConvNet, a new light-weight model with fewer parameters, to improve the applicability and practicality of speech emotion recognition. To reduce the number of learnable parameters, the proposed model concatenates bidirectional MFCCs channel by channel to learn long-term emotion dependencies and extract contextual features. A light-weight deep CNN is constructed for low-level feature extraction, and self-attention is used to capture channel-wise and spatial information in the speech signal. In addition, we apply dynamic routing to improve accuracy and to build a model that is robust to feature variations. Across experiments on the speech emotion datasets EMO-DB, RAVDESS, and IEMOCAP, the proposed model reduces the parameter count while improving accuracy, achieving 87.86%, 83.44%, and 66.06% accuracy, respectively, with about 156,000 parameters. We also propose a metric that quantifies the trade-off between the number of parameters and accuracy for evaluating light-weight models.
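The abstract does not spell out how the bidirectional MFCC input is constructed, so the following is only a minimal sketch under one plausible reading: MFCCs are computed on the waveform and on its time-reversed copy, and the two are stacked channel by channel. The function and parameter names (`bidirectional_mfcc`, `n_mfcc`) are illustrative, not taken from the paper.

```python
# Hypothetical sketch of a bidirectional MFCC input, assuming "bidirectional"
# means MFCCs of the waveform and of its time-reversed copy stacked channel-wise.
# The paper's exact recipe may differ.
import numpy as np
import librosa

def bidirectional_mfcc(path, sr=16000, n_mfcc=40):
    y, sr = librosa.load(path, sr=sr)
    forward = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)            # (n_mfcc, T)
    y_rev = np.ascontiguousarray(y[::-1])                                # reversed waveform
    backward = librosa.feature.mfcc(y=y_rev, sr=sr, n_mfcc=n_mfcc)       # (n_mfcc, T)
    # Flip the backward features back in time so both channels are frame-aligned,
    # then stack channel by channel: output shape (2, n_mfcc, T).
    return np.stack([forward, backward[:, ::-1]], axis=0)
```

A two-channel tensor of this form could then be fed to the light-weight CNN front end described in the abstract.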
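Likewise, the dynamic routing step is assumed here to be capsule-style routing-by-agreement; the sketch below shows that generic mechanism in PyTorch with illustrative shapes and a fixed number of routing iterations, and may differ from the paper's exact formulation.

```python
# Generic routing-by-agreement sketch (assumed mechanism, not the paper's code).
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    # Non-linearity that shrinks short vectors toward zero and long vectors toward unit length.
    norm_sq = (s ** 2).sum(dim=dim, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * s / torch.sqrt(norm_sq + eps)

def dynamic_routing(u_hat, num_iters=3):
    # u_hat: prediction vectors of shape (batch, n_in, n_out, dim).
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)    # routing logits
    for _ in range(num_iters):
        c = F.softmax(b, dim=2)                               # coupling coefficients over outputs
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)              # weighted sum: (batch, n_out, dim)
        v = squash(s)                                         # output capsules
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)          # agreement update
    return v
```

Iterating the agreement update lets outputs that consistently agree with their inputs dominate, which is the robustness-to-feature-variation property the abstract attributes to routing.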