Dimensional Speech Emotion Recognition from Acoustic and Text Features using Recurrent Neural Networks

International Journal of Informatics, Information System and Computer Engineering (INJIISCOM) Pub Date: 1900-01-01 DOI: 10.34010/injiiscom.v1i1.4023

Bagus Tris Atmaja, Reda Elbarougy, M. Akagi


Citations: 1

Abstract

Emotion can be inferred from both tonal and verbal information, and both kinds of features can be extracted from speech. While most studies address categorical emotion recognition from a single modality, this research presents dimensional emotion recognition that combines acoustic and text features. A set of 31 acoustic features is extracted from speech, and word vectors are used as text features. The initial single-modality results serve as a cue for how to combine the two feature sets to improve recognition. The combined results show that using acoustic and text features together decreases the error of dimensional emotion score prediction by about 5% compared with the acoustic system and by about 1% compared with the text system. The smallest error is achieved by modeling the text features with Long Short-Term Memory (LSTM) networks and the acoustic features with bidirectional LSTM networks, then concatenating the two systems with dense networks.
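The fusion described in the abstract (an LSTM over word vectors, a bidirectional LSTM over acoustic frames, and dense layers on the concatenated outputs) can be sketched as a forward pass in NumPy. This is an illustrative sketch only, not the authors' implementation. Only the 31 acoustic features per frame come from the abstract. The embedding size, hidden sizes, the number of emotion dimensions (three here), the use of the final hidden state as the sequence summary, and all function names are assumptions, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_lstm(rng, n_in, n_hidden):
    """Random, untrained LSTM weights; enough to check shapes and data flow."""
    return {
        "W": rng.normal(0.0, 0.1, (4 * n_hidden, n_in)),      # input weights
        "U": rng.normal(0.0, 0.1, (4 * n_hidden, n_hidden)),  # recurrent weights
        "b": np.zeros(4 * n_hidden),
    }

def lstm_last_state(params, seq):
    """Run an LSTM over seq of shape (time, n_in); return the final hidden state."""
    n_hidden = params["U"].shape[1]
    h = np.zeros(n_hidden)
    c = np.zeros(n_hidden)
    for x in seq:
        z = params["W"] @ x + params["U"] @ h + params["b"]
        i, f, g, o = np.split(z, 4)  # input, forget, candidate, output gates
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

def bilstm_last_state(fwd, bwd, seq):
    """Bidirectional LSTM: concatenate final states of forward and reversed passes."""
    return np.concatenate([lstm_last_state(fwd, seq),
                           lstm_last_state(bwd, seq[::-1])])

def dense(W, b, x, activation=None):
    y = W @ x + b
    return activation(y) if activation is not None else y

N_ACOUSTIC = 31  # acoustic features per frame (from the abstract)
EMBED_DIM = 50   # word-vector size (assumption)
HIDDEN = 16      # LSTM hidden size (assumption)
N_DIMS = 3       # emotion dimensions, e.g. valence/arousal/dominance (assumption)

rng = np.random.default_rng(0)
text_lstm = init_lstm(rng, EMBED_DIM, HIDDEN)
ac_fwd = init_lstm(rng, N_ACOUSTIC, HIDDEN)
ac_bwd = init_lstm(rng, N_ACOUSTIC, HIDDEN)
W1 = rng.normal(0.0, 0.1, (32, 3 * HIDDEN))  # fusion layer: 2*HIDDEN + HIDDEN inputs
b1 = np.zeros(32)
W2 = rng.normal(0.0, 0.1, (N_DIMS, 32))      # linear regression head
b2 = np.zeros(N_DIMS)

def predict(acoustic_seq, text_seq):
    """Fuse both modalities and regress one score per emotion dimension."""
    text_vec = lstm_last_state(text_lstm, text_seq)                  # (HIDDEN,)
    acoustic_vec = bilstm_last_state(ac_fwd, ac_bwd, acoustic_seq)   # (2*HIDDEN,)
    fused = np.concatenate([acoustic_vec, text_vec])
    hidden = dense(W1, b1, fused, np.tanh)
    return dense(W2, b2, hidden)

# Dummy utterance: 100 acoustic frames and 12 word vectors (random placeholders).
acoustic = rng.normal(size=(100, N_ACOUSTIC))
words = rng.normal(size=(12, EMBED_DIM))
scores = predict(acoustic, words)
print(scores.shape)  # (3,)
```

Concatenating the two branch outputs before the dense layers lets each modality keep its own sequence length and frame rate, which is one plausible reason to fuse at this level rather than at the input.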


Source journal

International Journal of Informatics, Information System and Computer Engineering (INJIISCOM)

Self-citation rate

0.00%

Number of publications