Modeling Short-Term and Long-Term Dependencies of the Speech Signal for Paralinguistic Emotion Classification

Authors: Oxana Verkholyak, Heysem Kaya, Alexey Karpov
Journal: SPIIRAS Proceedings (Q3, Mathematics)
DOI: 10.15622/SP.18.1.30-56
Published: 2019-01-01
Citations: 15

Abstract

Recently, Speech Emotion Recognition (SER) has become an important research topic in affective computing. It is a difficult problem, and some of the greatest challenges lie in feature selection and representation. A good feature representation should reflect both the global trends and the temporal structure of the signal, since emotions naturally evolve in time; this has become possible with the advent of Recurrent Neural Networks (RNN), which are actively used today for various sequence modeling tasks. This paper proposes a hybrid approach to feature representation, which combines traditionally engineered statistical features with a Long Short-Term Memory (LSTM) sequence representation in order to take advantage of both short-term and long-term acoustic characteristics of the signal, thereby capturing not only the general trends but also the temporal structure of the signal. The proposed method is evaluated on three publicly available acted emotional speech corpora in three different languages: RUSLANA (Russian speech), BUEMODB (Turkish speech), and EMODB (German speech). Compared to the traditional approach, our experiments show an absolute improvement of 2.3% and 2.8% on two of the three databases, and comparable performance on the third. Therefore, provided enough training data, the proposed method proves effective in modelling the emotional content of speech utterances.
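The hybrid representation described above can be sketched in a few lines: per-frame acoustic features are summarized both by statistical functionals (short-term statistics over the whole utterance) and by the final hidden state of an LSTM (long-term temporal structure), and the two vectors are concatenated. This is a minimal illustrative sketch, not the authors' exact configuration; the weight initialization, feature dimensionality, and choice of mean/std functionals are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_last_hidden(frames, Wx, Wh, b, hidden):
    """Run a single-layer LSTM over a (T, d) frame sequence; return the final hidden state."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in frames:
        z = Wx @ x + Wh @ h + b           # all four gates stacked: (4*hidden,)
        i, f, g, o = np.split(z, 4)       # input, forget, cell-candidate, output gates
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)        # update cell state
        h = o * np.tanh(c)                # update hidden state
    return h

def hybrid_representation(frames, Wx, Wh, b, hidden):
    # Short-term view: statistical functionals (mean, std) over all frames
    stats = np.concatenate([frames.mean(axis=0), frames.std(axis=0)])
    # Long-term view: temporal summary from the LSTM's final hidden state
    temporal = lstm_last_hidden(frames, Wx, Wh, b, hidden)
    # Fused utterance-level feature vector, e.g. input to an emotion classifier
    return np.concatenate([stats, temporal])

# Toy example: 50 frames of 13-dimensional features (e.g. MFCCs), 8 hidden units
rng = np.random.default_rng(0)
T, d, hidden = 50, 13, 8
frames = rng.standard_normal((T, d))
Wx = rng.standard_normal((4 * hidden, d)) * 0.1
Wh = rng.standard_normal((4 * hidden, hidden)) * 0.1
b = np.zeros(4 * hidden)

rep = hybrid_representation(frames, Wx, Wh, b, hidden)
print(rep.shape)  # (2*d + hidden,) = (34,)
```

In practice the LSTM weights would be trained jointly with the classifier rather than fixed, but the fusion step itself is just this concatenation of functional and sequential summaries.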
Source journal: SPIIRAS Proceedings (Mathematics — Applied Mathematics)
CiteScore: 1.90
Self-citation rate: 0.00%
Articles published: 0
Review time: 14 weeks
About the journal: The SPIIRAS Proceedings journal publishes scientific, scientific-educational, and popular-science papers relating to computer science, automation, applied mathematics, and interdisciplinary research, as well as information technology, the theoretical foundations of computer science (mathematical and related to other scientific disciplines), information security and information protection, decision making and artificial intelligence, mathematical modeling, and informatization.