Bimodal Speech Emotion Recognition using Fused Intra and Cross Modality Features

Samuel Kakuba, Dong Seog Han
{"title":"基于融合内模态和交叉模态特征的双峰语音情感识别","authors":"Samuel Kakuba, Dong Seog Han","doi":"10.1109/ICUFN57995.2023.10199790","DOIUrl":null,"url":null,"abstract":"The interactive speech between two or more inter locutors involves the text and acoustic modalities. These modalities consist of intra and cross-modality relationships at different time intervals which if modeled well, can avail emotionally rich cues for robust and accurate prediction of emotion states. This necessitates models that take into consideration long short-term dependency between the current, previous, and future time steps using multimodal approaches. Moreover, it is important to contextualize the interactive speech in order to accurately infer the emotional state. A combination of recurrent and/or convolutional neural networks with attention mechanisms is often used by researchers. In this paper, we propose a deep learning-based bimodal speech emotion recognition (DLBER) model that uses multi-level fusion to learn intra and cross-modality feature representations. The proposed DLBER model uses the transformer encoder to model the intra-modality features that are combined at the first level fusion in the local feature learning block (LFLB). We also use self-attentive bidirectional LSTM layers to further extract intramodality features before the second level fusion for further progressive learning of the cross-modality features. The resultant feature representation is fed into another self-attentive bidirectional LSTM layer in the global feature learning block (GFLB). The interactive emotional dyadic motion capture (IEMOCAP) dataset was used to evaluate the performance of the proposed DLBER model. The proposed DLBER model achieves 72.93% and 74.05% of F1 score and accuracy respectively.","PeriodicalId":341881,"journal":{"name":"2023 Fourteenth International Conference on Ubiquitous and Future Networks (ICUFN)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Bimodal Speech Emotion Recognition using Fused Intra and Cross Modality Features\",\"authors\":\"Samuel Kakuba, Dong Seog Han\",\"doi\":\"10.1109/ICUFN57995.2023.10199790\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The interactive speech between two or more inter locutors involves the text and acoustic modalities. These modalities consist of intra and cross-modality relationships at different time intervals which if modeled well, can avail emotionally rich cues for robust and accurate prediction of emotion states. This necessitates models that take into consideration long short-term dependency between the current, previous, and future time steps using multimodal approaches. Moreover, it is important to contextualize the interactive speech in order to accurately infer the emotional state. A combination of recurrent and/or convolutional neural networks with attention mechanisms is often used by researchers. In this paper, we propose a deep learning-based bimodal speech emotion recognition (DLBER) model that uses multi-level fusion to learn intra and cross-modality feature representations. The proposed DLBER model uses the transformer encoder to model the intra-modality features that are combined at the first level fusion in the local feature learning block (LFLB). 
We also use self-attentive bidirectional LSTM layers to further extract intramodality features before the second level fusion for further progressive learning of the cross-modality features. The resultant feature representation is fed into another self-attentive bidirectional LSTM layer in the global feature learning block (GFLB). The interactive emotional dyadic motion capture (IEMOCAP) dataset was used to evaluate the performance of the proposed DLBER model. The proposed DLBER model achieves 72.93% and 74.05% of F1 score and accuracy respectively.\",\"PeriodicalId\":341881,\"journal\":{\"name\":\"2023 Fourteenth International Conference on Ubiquitous and Future Networks (ICUFN)\",\"volume\":\"66 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-07-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 Fourteenth International Conference on Ubiquitous and Future Networks (ICUFN)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICUFN57995.2023.10199790\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 Fourteenth International Conference on Ubiquitous and Future Networks (ICUFN)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICUFN57995.2023.10199790","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

The interactive speech between two or more interlocutors involves the text and acoustic modalities. These modalities exhibit intra- and cross-modality relationships at different time intervals which, if modeled well, can provide emotionally rich cues for robust and accurate prediction of emotional states. This calls for models that account for the long- and short-term dependencies between the current, previous, and future time steps using multimodal approaches. Moreover, it is important to contextualize the interactive speech in order to accurately infer the emotional state. Researchers often combine recurrent and/or convolutional neural networks with attention mechanisms. In this paper, we propose a deep learning-based bimodal speech emotion recognition (DLBER) model that uses multi-level fusion to learn intra- and cross-modality feature representations. The proposed DLBER model uses a transformer encoder to model the intra-modality features, which are combined at the first-level fusion in the local feature learning block (LFLB). We also use self-attentive bidirectional LSTM layers to further extract intra-modality features before the second-level fusion, for progressive learning of the cross-modality features. The resulting feature representation is fed into another self-attentive bidirectional LSTM layer in the global feature learning block (GFLB). The interactive emotional dyadic motion capture (IEMOCAP) dataset was used to evaluate the performance of the proposed DLBER model, which achieves an F1 score of 72.93% and an accuracy of 74.05%.
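To make the two-level fusion pipeline described in the abstract concrete, the following is a minimal PyTorch sketch of a DLBER-style architecture. The layer sizes, the concatenation fusion operator, the input feature dimensions (text_dim, audio_dim), the assumption that both modalities are aligned to the same number of time steps, and the four-class output are illustrative assumptions for this sketch; the paper's actual hyperparameters, feature extractors, and attention formulation are not specified here.

```python
import torch
import torch.nn as nn


class SelfAttentiveBiLSTM(nn.Module):
    """Bidirectional LSTM whose outputs are refined by multi-head self-attention."""

    def __init__(self, in_dim: int, hidden: int, heads: int = 4):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)

    def forward(self, x):                              # x: (batch, T, in_dim)
        h, _ = self.lstm(x)                            # (batch, T, 2*hidden)
        out, _ = self.attn(h, h, h)                    # self-attention: Q = K = V = h
        return out


class DLBERSketch(nn.Module):
    """Illustrative two-level fusion of intra- and cross-modality features."""

    def __init__(self, text_dim=768, audio_dim=128, d_model=128, n_classes=4):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        # LFLB: one transformer encoder per modality for intra-modality features.
        self.text_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.audio_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        # Per-modality self-attentive BiLSTMs applied before the second-level fusion.
        self.text_lstm = SelfAttentiveBiLSTM(d_model, d_model)
        self.audio_lstm = SelfAttentiveBiLSTM(d_model, d_model)
        # GFLB: another self-attentive BiLSTM over the fused representation.
        self.gflb = SelfAttentiveBiLSTM(6 * d_model, d_model)
        self.classifier = nn.Linear(2 * d_model, n_classes)

    def forward(self, text, audio):
        # Assumes both modalities are already aligned to T time steps:
        # text: (batch, T, text_dim), audio: (batch, T, audio_dim).
        t = self.text_enc(self.text_proj(text))        # (batch, T, d_model)
        a = self.audio_enc(self.audio_proj(audio))     # (batch, T, d_model)
        fused1 = torch.cat([t, a], dim=-1)             # first-level fusion: 2*d_model
        t2 = self.text_lstm(t)                         # (batch, T, 2*d_model)
        a2 = self.audio_lstm(a)                        # (batch, T, 2*d_model)
        fused2 = torch.cat([fused1, t2, a2], dim=-1)   # second-level fusion: 6*d_model
        g = self.gflb(fused2)                          # (batch, T, 2*d_model)
        return self.classifier(g.mean(dim=1))          # utterance-level emotion logits
```

Here SelfAttentiveBiLSTM stands in for the self-attentive bidirectional LSTM layers named in the abstract, and n_classes=4 reflects the four-class IEMOCAP setup common in speech emotion recognition work; the abstract does not state the exact class set or fusion operator, so these are assumptions of the sketch rather than the authors' reported configuration.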