An autoencoder-based feature level fusion for speech emotion recognition

IF 7.5 | CAS Tier 2 (Computer Science) | JCR Q1 (Telecommunications)
Peng Shixin, Chen Kai, Tian Tian, Chen Jingying
{"title":"An autoencoder-based feature level fusion for speech emotion recognition","authors":"Peng Shixin,&nbsp;Chen Kai,&nbsp;Tian Tian,&nbsp;Chen Jingying","doi":"10.1016/j.dcan.2022.10.018","DOIUrl":null,"url":null,"abstract":"<div><div>Although speech emotion recognition is challenging, it has broad application prospects in human-computer interaction. Building a system that can accurately and stably recognize emotions from human languages can provide a better user experience. However, the current unimodal emotion feature representations are not distinctive enough to accomplish the recognition, and they do not effectively simulate the inter-modality dynamics in speech emotion recognition tasks. This paper proposes a multimodal method that utilizes both audio and semantic content for speech emotion recognition. The proposed method consists of three parts: two high-level feature extractors for text and audio modalities, and an autoencoder-based feature fusion. For audio modality, we propose a structure called Temporal Global Feature Extractor (TGFE) to extract the high-level features of the time-frequency domain relationship from the original speech signal. Considering that text lacks frequency information, we use only a Bidirectional Long Short-Term Memory network (BLSTM) and attention mechanism to simulate an intra-modal dynamic. Once these steps have been accomplished, the high-level text and audio features are sent to the autoencoder in parallel to learn their shared representation for final emotion classification. We conducted extensive experiments on three public benchmark datasets to evaluate our method. The results on Interactive Emotional Motion Capture (IEMOCAP) and Multimodal EmotionLines Dataset (MELD) outperform the existing method. Additionally, the results of CMU Multi-modal Opinion-level Sentiment Intensity (CMU-MOSI) are competitive. Furthermore, experimental results show that compared to unimodal information and autoencoder-based feature level fusion, the joint multimodal information (audio and text) improves the overall performance and can achieve greater accuracy than simple feature concatenation.</div></div>","PeriodicalId":48631,"journal":{"name":"Digital Communications and Networks","volume":"10 5","pages":"Pages 1341-1351"},"PeriodicalIF":7.5000,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Digital Communications and Networks","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2352864822002279","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"TELECOMMUNICATIONS","Score":null,"Total":0}
引用次数: 0

Abstract

Although speech emotion recognition is challenging, it has broad application prospects in human-computer interaction. A system that can accurately and stably recognize emotions from human language can provide a better user experience. However, current unimodal emotion feature representations are not distinctive enough for reliable recognition, and they do not effectively model the inter-modality dynamics in speech emotion recognition tasks. This paper proposes a multimodal method that utilizes both audio and semantic content for speech emotion recognition. The proposed method consists of three parts: two high-level feature extractors for the text and audio modalities, and an autoencoder-based feature fusion. For the audio modality, we propose a structure called the Temporal Global Feature Extractor (TGFE) to extract high-level features of the time-frequency domain relationship from the original speech signal. Because text lacks frequency information, we use only a Bidirectional Long Short-Term Memory (BLSTM) network and an attention mechanism to model the intra-modal dynamics. The high-level text and audio features are then fed to the autoencoder in parallel to learn a shared representation for final emotion classification. We conducted extensive experiments on three public benchmark datasets to evaluate our method. The results on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database and the Multimodal EmotionLines Dataset (MELD) outperform existing methods, and the results on the CMU Multimodal Opinion-level Sentiment Intensity (CMU-MOSI) dataset are competitive. Furthermore, the experiments show that, compared to unimodal information, the joint multimodal information (audio and text) improves overall performance, and the autoencoder-based feature-level fusion achieves greater accuracy than simple feature concatenation.
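
The abstract describes the architecture only at a high level, so the following PyTorch sketch is an illustrative reconstruction rather than the authors' implementation: the TGFE internals are not specified, so a placeholder convolution-plus-BLSTM audio encoder stands in for it, and all class names, dimensions, and the reconstruction/classification loss weighting are assumptions.

```python
import torch
import torch.nn as nn


class TextEncoder(nn.Module):
    """BLSTM + additive attention over word embeddings (intra-modal dynamics)."""
    def __init__(self, embed_dim=300, hidden=128):
        super().__init__()
        self.blstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)

    def forward(self, x):                       # x: (B, T_words, embed_dim)
        h, _ = self.blstm(x)                    # (B, T_words, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over time
        return (w * h).sum(dim=1)               # (B, 2*hidden) utterance-level feature


class AudioEncoder(nn.Module):
    """Placeholder for the TGFE: Conv1d over spectrogram frames followed by a BLSTM."""
    def __init__(self, n_mels=80, hidden=128):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2)
        self.blstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)

    def forward(self, spec):                    # spec: (B, T_frames, n_mels)
        z = torch.relu(self.conv(spec.transpose(1, 2))).transpose(1, 2)
        h, _ = self.blstm(z)                    # (B, T_frames, 2*hidden)
        return h.mean(dim=1)                    # (B, 2*hidden) global audio feature


class FusionAutoencoder(nn.Module):
    """Autoencoder over the concatenated modality features; classify from the latent code."""
    def __init__(self, feat_dim=512, latent=128, n_classes=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(), nn.Linear(256, feat_dim))
        self.classifier = nn.Linear(latent, n_classes)

    def forward(self, text_feat, audio_feat):
        fused = torch.cat([text_feat, audio_feat], dim=-1)  # feature-level fusion
        z = self.encoder(fused)                             # shared representation
        return self.classifier(z), self.decoder(z), fused   # logits, reconstruction, target


# Illustrative forward/backward pass with random tensors standing in for real data.
text_enc, audio_enc, fusion = TextEncoder(), AudioEncoder(), FusionAutoencoder()
words = torch.randn(8, 30, 300)   # 8 utterances, 30 word embeddings each
spec = torch.randn(8, 200, 80)    # 8 utterances, 200 mel-spectrogram frames
logits, recon, fused = fusion(text_enc(words), audio_enc(spec))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 4, (8,))) \
       + 0.5 * nn.MSELoss()(recon, fused.detach())  # classification + reconstruction (weight assumed)
loss.backward()
```

The key point of the feature-level fusion, as described in the abstract, is that classification is performed on the autoencoder's shared latent representation rather than directly on the raw concatenation, so the fused representation is shaped jointly by the reconstruction and classification objectives.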
Source Journal
Digital Communications and Networks
Category: Computer Science - Hardware and Architecture
CiteScore: 12.80
Self-citation rate: 5.10%
Articles published: 915
Review time: 30 weeks
Journal introduction: Digital Communications and Networks is a journal focused on communication systems and networks. It publishes original articles and authoritative reviews that undergo rigorous peer review, and all articles are fully Open Access on ScienceDirect. The journal is indexed in the Science Citation Index Expanded (SCIE) and Scopus. In addition to regular articles, significantly expanded versions of exceptional conference papers may be considered, and special issues focusing on specific aspects of the field are published periodically.