End-to-end speech emotion recognition using a novel context-stacking dilated convolution neural network.

IF 1.9 · CAS Zone 3 (Computer Science) · JCR Q2 (Acoustics)

Eurasip Journal on Audio Speech and Music Processing Pub Date : 2021-01-01 Epub Date: 2021-05-12 DOI:10.1186/s13636-021-00208-5

Duowei Tang, Peter Kuppens, Luc Geurts, Toon van Waterschoot

Volume 2021, issue 1, article no. 18. Publication type: Journal Article.

Citations: 15

Abstract

Among the various characteristics of a speech signal, the expression of emotion is one of those with the slowest temporal dynamics. Hence, a performant speech emotion recognition (SER) system requires a predictive model that is capable of learning sufficiently long temporal dependencies in the analysed speech signal. In this work, we therefore propose a novel end-to-end neural network architecture based on the concept of dilated causal convolution with context stacking. Firstly, the proposed model consists only of parallelisable layers and is hence suitable for parallel processing, avoiding the inherent lack of parallelisability of recurrent neural network (RNN) layers. Secondly, the design of a dedicated dilated causal convolution block allows the model to have a receptive field as large as the input sequence length, while maintaining a reasonably low computational cost. Thirdly, by introducing a context stacking structure, the proposed model is capable of exploiting long-term temporal dependencies, hence providing an alternative to the use of RNN layers. We evaluate the proposed model in SER regression and classification tasks and provide a comparison with a state-of-the-art end-to-end SER model. Experimental results indicate that the proposed model requires only 1/3 of the number of model parameters used in the state-of-the-art model, while also significantly improving SER performance. Further experiments are reported to understand the impact of using various types of input representations (i.e., raw audio samples vs. log mel-spectrograms) and to illustrate the benefits of an end-to-end approach over the use of hand-crafted audio features. Moreover, we show that the proposed model can efficiently learn intermediate embeddings preserving speech emotion information.
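The abstract does not give the block's kernel sizes, dilation schedule, or the context-stacking wiring, so the following is only a generic sketch of a dilated causal convolution stack, not the authors' architecture. It assumes a kernel size of 2 and dilations that double per layer (a common WaveNet-style choice); the function names `dilated_causal_conv1d`, `apply_stack`, and `receptive_field` are hypothetical. It illustrates how the receptive field grows exponentially with depth while each layer stays cheap and free of recurrence.

```python
import numpy as np


def dilated_causal_conv1d(x, w, dilation):
    """Compute y[t] = sum_k w[k] * x[t - k * dilation].

    Samples before t = 0 are treated as zero, so y[t] never depends
    on future inputs (causality).
    """
    k = len(w)
    pad = (k - 1) * dilation
    # Left-pad with zeros so that xp[pad + t] == x[t].
    xp = np.concatenate([np.zeros(pad), x])
    y = np.zeros(len(x))
    for i in range(k):
        # Tap i looks i * dilation samples into the past.
        start = pad - i * dilation
        y += w[i] * xp[start:start + len(x)]
    return y


def apply_stack(x, w, dilations):
    """Apply one dilated causal layer per entry in `dilations` (no nonlinearity)."""
    y = np.asarray(x, dtype=float)
    for d in dilations:
        y = dilated_causal_conv1d(y, w, d)
    return y


def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output sample."""
    return 1 + (kernel_size - 1) * sum(dilations)


if __name__ == "__main__":
    # Assumed illustrative setup: kernel size 2, dilations 1, 2, 4, 8.
    dilations = [1, 2, 4, 8]
    w = np.array([1.0, 1.0])

    impulse = np.zeros(20)
    impulse[0] = 1.0

    response = apply_stack(impulse, w, dilations)
    print("receptive field:", receptive_field(len(w), dilations))
    print("impulse response:", response)
```

With these assumed settings, four layers cover 16 input samples: the impulse response is nonzero for exactly the first 16 output positions. Adding one more layer with dilation 16 would double the coverage to 32, which is the mechanism that lets a modest number of layers span a long input sequence.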

Source journal

Eurasip Journal on Audio Speech and Music Processing (Acoustics; Engineering, Electrical & Electronic)

CiteScore: 4.10

Self-citation rate: 4.20%

Review time: 12 months

About the journal: The aim of “EURASIP Journal on Audio, Speech, and Music Processing” is to bring together researchers, scientists and engineers working on the theory and applications of the processing of various audio signals, with a specific focus on speech and music. It is intended as an interdisciplinary journal for the dissemination of all basic and applied aspects of speech communication and audio processes.