{"title":"Combining Gated Convolutional Networks and Self-Attention Mechanism for Speech Emotion Recognition","authors":"C. Li, Jinlong Jiao, Yiqin Zhao, Ziping Zhao","doi":"10.1109/ACIIW.2019.8925283","DOIUrl":null,"url":null,"abstract":"Discrete speech emotion recognition (SER), the assignment of a single emotion label to an entire speech utterance, is typically performed as a sequence-to-label task. The predominant approach to SER to date is based on recurrent neural networks. Their success on this task is often linked to their ability to capture unbounded context. In this paper we introduce new gated convolutional networks and apply them to SER, which can be more efficient since they allow parallelization over sequential tokens. We present a novel model architecture that incorporates a gated convolutional neural network and a temporal attention-based localization method for speech emotion recognition. To the best of the authors' knowledge, this is the first time that such a hybrid architecture is employed for SER. We demonstrate the effectiveness of our approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus. The experimental results demonstrate that our proposed model outperforms current state-of-the-art approaches.","PeriodicalId":193568,"journal":{"name":"2019 8th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 8th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ACIIW.2019.8925283","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2
Abstract
Discrete speech emotion recognition (SER), the assignment of a single emotion label to an entire speech utterance, is typically performed as a sequence-to-label task. The predominant approach to SER to date is based on recurrent neural networks, whose success on this task is often attributed to their ability to capture unbounded context. In this paper, we introduce new gated convolutional networks and apply them to SER; these networks can be more efficient than recurrent models because they allow parallel computation over the tokens of a sequence. We present a novel model architecture that combines a gated convolutional neural network with a temporal attention-based localization method for speech emotion recognition. To the best of the authors' knowledge, this is the first time that such a hybrid architecture has been employed for SER. We evaluate our approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus. The experimental results show that our proposed model outperforms current state-of-the-art approaches.
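The abstract names the two building blocks without detailing them: gated convolutional layers (in the style of Dauphin et al.'s gated linear units) and a temporal attention mechanism that localizes emotionally salient frames before the utterance-level decision. The PyTorch sketch below is an illustrative reconstruction under those assumptions, not the authors' implementation; all hyperparameters (feature dimension, channel width, kernel size, the four IEMOCAP emotion classes) are placeholder choices.

```python
# Minimal sketch: gated conv blocks + temporal attention pooling for
# utterance-level SER. Shapes and hyperparameters are assumptions.
import torch
import torch.nn as nn


class GatedConvBlock(nn.Module):
    """1-D convolution with a gated linear unit: the conv produces two
    halves, one of which gates the other through a sigmoid."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # Emit 2*channels so F.glu can split them into value and gate.
        self.conv = nn.Conv1d(channels, 2 * channels,
                              kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) -> (batch, channels, time)
        return nn.functional.glu(self.conv(x), dim=1)


class TemporalAttentionPool(nn.Module):
    """Scores each frame, softmaxes over time, and returns the weighted
    sum, so emotionally salient frames dominate the utterance vector."""

    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Linear(channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x.transpose(1, 2)                    # (batch, time, channels)
        w = torch.softmax(self.score(h), dim=1)  # attention over time
        return (w * h).sum(dim=1)                # (batch, channels)


class GatedConvSER(nn.Module):
    """Stack of gated conv blocks, attention pooling, emotion classifier."""

    def __init__(self, feat_dim: int = 40, channels: int = 128,
                 num_layers: int = 4, num_classes: int = 4):
        super().__init__()
        self.proj = nn.Conv1d(feat_dim, channels, kernel_size=1)
        self.blocks = nn.Sequential(
            *[GatedConvBlock(channels) for _ in range(num_layers)])
        self.pool = TemporalAttentionPool(channels)
        self.classifier = nn.Linear(channels, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, feat_dim, time) acoustic features, e.g. log-Mel frames
        h = self.blocks(self.proj(x))
        return self.classifier(self.pool(h))


# Usage: a batch of 8 utterances, 40-dim features, 300 frames each.
logits = GatedConvSER()(torch.randn(8, 40, 300))
print(logits.shape)  # torch.Size([8, 4])
```

Because every gated conv block processes all frames at once (unlike an RNN's step-by-step recurrence), the whole stack parallelizes over the time axis, which is the efficiency argument the abstract makes.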