CNN+LSTM Architecture for Speech Emotion Recognition with Data Augmentation

Workshop on Speech, Music and Mind (SMM 2018). Pub Date: 2018-02-15. DOI: 10.21437/SMM.2018-5

Caroline Etienne, Guillaume Fidanza, Andrei Petrovskii, L. Devillers, B. Schmauch


Citations: 78

Abstract

In this work, we design a neural network for recognizing emotions in speech, using the IEMOCAP dataset. Following recent advances in audio analysis, we use an architecture that combines convolutional layers, which extract high-level features from raw spectrograms, with recurrent layers, which aggregate long-term dependencies. We examine data augmentation with vocal tract length perturbation, layer-wise optimizer adjustment, and batch normalization of recurrent layers, and obtain highly competitive results of 64.5% weighted accuracy and 61.7% unweighted accuracy on four emotions.
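The abstract reports two figures without defining them. A minimal sketch of how they are usually computed in speech emotion recognition follows, assuming the common convention that weighted accuracy is the overall fraction of correctly classified utterances and unweighted accuracy is the mean of the per-class recalls. The function names, the four emotion labels, and the toy predictions are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: fraction of all utterances classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every emotion counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy labels invented for illustration only (not IEMOCAP data).
y_true = ["neutral"] * 6 + ["angry"] * 2 + ["happy"] + ["sad"]
y_pred = ["neutral"] * 6 + ["angry", "neutral"] + ["neutral"] + ["sad"]

wa = weighted_accuracy(y_true, y_pred)    # 8 of 10 correct
ua = unweighted_accuracy(y_true, y_pred)  # mean of recalls 1.0, 0.5, 0.0, 1.0
print(f"WA = {wa:.3f}, UA = {ua:.3f}")
```

On an imbalanced set such as this toy one, the two metrics diverge: a model that favors the majority class scores well on weighted accuracy but is penalized on unweighted accuracy, which is presumably why both are reported.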
