An autoencoder-based feature level fusion for speech emotion recognition

IF 7.5 | CAS Tier 2 (Computer Science) | JCR Q1 (Telecommunications)
Peng Shixin, Chen Kai, Tian Tian, Chen Jingying
{"title":"An autoencoder-based feature level fusion for speech emotion recognition","authors":"Peng Shixin,&nbsp;Chen Kai,&nbsp;Tian Tian,&nbsp;Chen Jingying","doi":"10.1016/j.dcan.2022.10.018","DOIUrl":null,"url":null,"abstract":"<div><div>Although speech emotion recognition is challenging, it has broad application prospects in human-computer interaction. Building a system that can accurately and stably recognize emotions from human languages can provide a better user experience. However, the current unimodal emotion feature representations are not distinctive enough to accomplish the recognition, and they do not effectively simulate the inter-modality dynamics in speech emotion recognition tasks. This paper proposes a multimodal method that utilizes both audio and semantic content for speech emotion recognition. The proposed method consists of three parts: two high-level feature extractors for text and audio modalities, and an autoencoder-based feature fusion. For audio modality, we propose a structure called Temporal Global Feature Extractor (TGFE) to extract the high-level features of the time-frequency domain relationship from the original speech signal. Considering that text lacks frequency information, we use only a Bidirectional Long Short-Term Memory network (BLSTM) and attention mechanism to simulate an intra-modal dynamic. Once these steps have been accomplished, the high-level text and audio features are sent to the autoencoder in parallel to learn their shared representation for final emotion classification. We conducted extensive experiments on three public benchmark datasets to evaluate our method. The results on Interactive Emotional Motion Capture (IEMOCAP) and Multimodal EmotionLines Dataset (MELD) outperform the existing method. Additionally, the results of CMU Multi-modal Opinion-level Sentiment Intensity (CMU-MOSI) are competitive. Furthermore, experimental results show that compared to unimodal information and autoencoder-based feature level fusion, the joint multimodal information (audio and text) improves the overall performance and can achieve greater accuracy than simple feature concatenation.</div></div>","PeriodicalId":48631,"journal":{"name":"Digital Communications and Networks","volume":"10 5","pages":"Pages 1341-1351"},"PeriodicalIF":7.5000,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Digital Communications and Networks","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2352864822002279","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"TELECOMMUNICATIONS","Score":null,"Total":0}
引用次数: 0

Abstract

Although speech emotion recognition is challenging, it has broad application prospects in human-computer interaction. A system that can accurately and stably recognize emotions from human language can provide a better user experience. However, current unimodal emotion feature representations are not distinctive enough for reliable recognition, and they do not effectively model the inter-modality dynamics in speech emotion recognition tasks. This paper proposes a multimodal method that utilizes both audio and semantic content for speech emotion recognition. The proposed method consists of three parts: two high-level feature extractors for the text and audio modalities, and an autoencoder-based feature fusion. For the audio modality, we propose a structure called the Temporal Global Feature Extractor (TGFE) to extract high-level features of the time-frequency domain relationship from the original speech signal. Because text lacks frequency information, we use only a Bidirectional Long Short-Term Memory (BLSTM) network and an attention mechanism to model the intra-modal dynamics. The high-level text and audio features are then fed to the autoencoder in parallel to learn a shared representation for final emotion classification. We conducted extensive experiments on three public benchmark datasets to evaluate our method. The results on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database and the Multimodal EmotionLines Dataset (MELD) outperform existing methods, and the results on the CMU Multimodal Opinion-level Sentiment Intensity (CMU-MOSI) dataset are competitive. Furthermore, the experiments show that, compared to unimodal information, the joint multimodal information (audio and text) improves overall performance, and the autoencoder-based feature-level fusion achieves greater accuracy than simple feature concatenation.
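
The abstract describes the architecture only at a high level, so the following PyTorch sketch is an illustrative reconstruction rather than the authors' implementation: the TGFE internals are not specified, so a placeholder convolution-plus-BLSTM audio encoder stands in for it, and all class names, dimensions, and the reconstruction/classification loss weighting are assumptions.

```python
import torch
import torch.nn as nn


class TextEncoder(nn.Module):
    """BLSTM + additive attention over word embeddings (intra-modal dynamics)."""
    def __init__(self, embed_dim=300, hidden=128):
        super().__init__()
        self.blstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)

    def forward(self, x):                       # x: (B, T_words, embed_dim)
        h, _ = self.blstm(x)                    # (B, T_words, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over time
        return (w * h).sum(dim=1)               # (B, 2*hidden) utterance-level feature


class AudioEncoder(nn.Module):
    """Placeholder for the TGFE: Conv1d over spectrogram frames followed by a BLSTM."""
    def __init__(self, n_mels=80, hidden=128):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2)
        self.blstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)

    def forward(self, spec):                    # spec: (B, T_frames, n_mels)
        z = torch.relu(self.conv(spec.transpose(1, 2))).transpose(1, 2)
        h, _ = self.blstm(z)                    # (B, T_frames, 2*hidden)
        return h.mean(dim=1)                    # (B, 2*hidden) global audio feature


class FusionAutoencoder(nn.Module):
    """Autoencoder over the concatenated modality features; classify from the latent code."""
    def __init__(self, feat_dim=512, latent=128, n_classes=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(), nn.Linear(256, feat_dim))
        self.classifier = nn.Linear(latent, n_classes)

    def forward(self, text_feat, audio_feat):
        fused = torch.cat([text_feat, audio_feat], dim=-1)  # feature-level fusion
        z = self.encoder(fused)                             # shared representation
        return self.classifier(z), self.decoder(z), fused   # logits, reconstruction, target


# Illustrative forward/backward pass with random tensors standing in for real data.
text_enc, audio_enc, fusion = TextEncoder(), AudioEncoder(), FusionAutoencoder()
words = torch.randn(8, 30, 300)   # 8 utterances, 30 word embeddings each
spec = torch.randn(8, 200, 80)    # 8 utterances, 200 mel-spectrogram frames
logits, recon, fused = fusion(text_enc(words), audio_enc(spec))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 4, (8,))) \
       + 0.5 * nn.MSELoss()(recon, fused.detach())  # classification + reconstruction (weight assumed)
loss.backward()
```

The key point of the feature-level fusion, as described in the abstract, is that classification is performed on the autoencoder's shared latent representation rather than directly on the raw concatenation, so the fused representation is shaped jointly by the reconstruction and classification objectives.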
Source Journal
Digital Communications and Networks
Category: Computer Science - Hardware and Architecture
CiteScore: 12.80
Self-citation rate: 5.10%
Articles published: 915
Review time: 30 weeks
Journal introduction: Digital Communications and Networks is a journal focused on communication systems and networks. It publishes original articles and authoritative reviews that undergo rigorous peer review, and all articles are fully Open Access on ScienceDirect. The journal is indexed in the Science Citation Index Expanded (SCIE) and Scopus. In addition to regular articles, significantly expanded versions of exceptional conference papers may be considered, and special issues focusing on specific aspects of the field are published periodically.