Dilated Residual Network with Multi-head Self-attention for Speech Emotion Recognition

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6675-6679. Pub Date: 2019-05-12. DOI: 10.1109/ICASSP.2019.8682154

Runnan Li, Zhiyong Wu, Jia Jia, Sheng Zhao, H. Meng


Citations: 45

Abstract

Speech emotion recognition (SER) plays an important role in intelligent speech interaction. One vital challenge in SER is to extract emotion-relevant features from speech signals. In state-of-the-art SER techniques, deep learning methods, e.g., Convolutional Neural Networks (CNNs), are widely employed for feature learning and have achieved significant performance. However, two performance limitations have arisen in CNN-oriented methods: 1) the loss of the temporal structure of speech through progressive resolution reduction; 2) the neglect of relative dependencies between elements in the suprasegmental feature sequence. In this paper, we propose the combined use of a Dilated Residual Network (DRN) and Multi-head Self-attention to alleviate these limitations. By employing the DRN, the network can retain a high-resolution temporal structure during feature learning, with a receptive field of similar size to that of CNN-based approaches. By employing Multi-head Self-attention, the network can model the inner dependencies between elements at different positions in the learned suprasegmental feature sequence, which enhances the incorporation of emotion-salient information. Experiments on the emotion benchmark dataset IEMOCAP demonstrate the effectiveness of the proposed framework, with 11.7% to 18.6% relative improvement over state-of-the-art approaches.
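
The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract gives no layer counts, kernel sizes, or head counts, so every configuration value and helper name below (`stack_stats`, `multi_head_self_attention`, the three-layer stacks, 4 heads, 16 feature dimensions) is an assumption chosen for illustration. The first part compares a strided convolution stack with a dilated one, showing that both can reach the same receptive field while only the dilated stack keeps the full temporal resolution. The second part applies standard scaled dot-product multi-head self-attention to a feature sequence.

```python
import numpy as np

def stack_stats(layers, input_len):
    """Return (receptive_field, output_len) for a stack of 1-D convolutions.

    Each layer is (kernel_size, stride, dilation); 'same' padding is assumed.
    """
    rf, jump, length = 1, 1, input_len
    for k, s, d in layers:
        rf += (k - 1) * d * jump   # dilation widens the span of one kernel
        jump *= s                  # stride widens the step between outputs
        length = -(-length // s)   # ceil division: 'same' padding, stride s
    return rf, length

# Hypothetical configurations (not taken from the paper).
strided = [(3, 2, 1)] * 3                      # downsampling CNN
dilated = [(3, 1, 1), (3, 1, 2), (3, 1, 4)]    # dilation replaces stride

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (T, d_model); each w_*: (d_model, d_model).

    Returns the attended sequence (T, d_model) and the per-head
    attention weights (num_heads, T, T).
    """
    T, d_model = x.shape
    d_head = d_model // num_heads

    def split(m):  # (T, d_model) -> (num_heads, T, d_head)
        return m.reshape(T, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    heads = weights @ v                        # (num_heads, T, d_head)
    merged = heads.transpose(1, 0, 2).reshape(T, d_model)
    return merged @ w_o, weights

# Usage with random stand-in features (assumed sizes).
rng = np.random.default_rng(0)
T, d_model, num_heads = 100, 16, 4
x = rng.standard_normal((T, d_model))
w_q, w_k, w_v, w_o = (rng.standard_normal((d_model, d_model)) * 0.1
                      for _ in range(4))
out, attn = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)

print(stack_stats(strided, T))   # (15, 13): 100 frames reduced to 13
print(stack_stats(dilated, T))   # (15, 100): same receptive field, all frames kept
print(out.shape, attn.shape)     # (100, 16) (4, 100, 100)
```

Under these assumed settings both stacks see 15 input frames per output position, but the strided stack leaves only 13 positions for attention to relate, whereas the dilated stack leaves all 100.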



