Mutual Correlation Attentive Factors in Dyadic Fusion Networks for Speech Emotion Recognition

Yue Gu, Xinyu Lyu, Weijia Sun, Weitian Li, Shuhong Chen, Xinyu Li, Marsic Ivan
{"title":"用于语音情感识别的双向融合网络中的相互关联注意因素","authors":"Yue Gu, Xinyu Lyu, Weijia Sun, Weitian Li, Shuhong Chen, Xinyu Li, Marsic Ivan","doi":"10.1145/3343031.3351039","DOIUrl":null,"url":null,"abstract":"<p><p>Emotion recognition in dyadic communication is challenging because: 1. Extracting informative modality-specific representations requires disparate feature extractor designs due to the heterogenous input data formats. 2. How to effectively and efficiently fuse unimodal features and learn associations between dyadic utterances are critical to the model generalization in actual scenario. 3. Disagreeing annotations prevent previous approaches from precisely predicting emotions in context. To address the above issues, we propose an efficient dyadic fusion network that only relies on an attention mechanism to select representative vectors, fuse modality-specific features, and learn the sequence information. Our approach has three distinct characteristics: 1. Instead of using a recurrent neural network to extract temporal associations as in most previous research, we introduce multiple sub-view attention layers to compute the relevant dependencies among sequential utterances; this significantly improves model efficiency. 2. To improve fusion performance, we design a learnable mutual correlation factor inside each attention layer to compute associations across different modalities. 3. To overcome the label disagreement issue, we embed the labels from all annotators into a k-dimensional vector and transform the categorical problem into a regression problem; this method provides more accurate annotation information and fully uses the entire dataset. We evaluate the proposed model on two published multimodal emotion recognition datasets: IEMOCAP and MELD. Our model significantly outperforms previous state-of-the-art research by 3.8%-7.5% accuracy, using a more efficient model.</p>","PeriodicalId":90687,"journal":{"name":"Proceedings of the ... ACM International Conference on Multimedia, with co-located Symposium & Workshops. ACM International Conference on Multimedia","volume":"2019 ","pages":"157-166"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7085887/pdf/nihms-1571671.pdf","citationCount":"0","resultStr":"{\"title\":\"Mutual Correlation Attentive Factors in Dyadic Fusion Networks for Speech Emotion Recognition.\",\"authors\":\"Yue Gu, Xinyu Lyu, Weijia Sun, Weitian Li, Shuhong Chen, Xinyu Li, Marsic Ivan\",\"doi\":\"10.1145/3343031.3351039\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Emotion recognition in dyadic communication is challenging because: 1. Extracting informative modality-specific representations requires disparate feature extractor designs due to the heterogenous input data formats. 2. How to effectively and efficiently fuse unimodal features and learn associations between dyadic utterances are critical to the model generalization in actual scenario. 3. Disagreeing annotations prevent previous approaches from precisely predicting emotions in context. To address the above issues, we propose an efficient dyadic fusion network that only relies on an attention mechanism to select representative vectors, fuse modality-specific features, and learn the sequence information. Our approach has three distinct characteristics: 1. 
Instead of using a recurrent neural network to extract temporal associations as in most previous research, we introduce multiple sub-view attention layers to compute the relevant dependencies among sequential utterances; this significantly improves model efficiency. 2. To improve fusion performance, we design a learnable mutual correlation factor inside each attention layer to compute associations across different modalities. 3. To overcome the label disagreement issue, we embed the labels from all annotators into a k-dimensional vector and transform the categorical problem into a regression problem; this method provides more accurate annotation information and fully uses the entire dataset. We evaluate the proposed model on two published multimodal emotion recognition datasets: IEMOCAP and MELD. Our model significantly outperforms previous state-of-the-art research by 3.8%-7.5% accuracy, using a more efficient model.</p>\",\"PeriodicalId\":90687,\"journal\":{\"name\":\"Proceedings of the ... ACM International Conference on Multimedia, with co-located Symposium & Workshops. ACM International Conference on Multimedia\",\"volume\":\"2019 \",\"pages\":\"157-166\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7085887/pdf/nihms-1571671.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the ... ACM International Conference on Multimedia, with co-located Symposium & Workshops. ACM International Conference on Multimedia\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3343031.3351039\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ... ACM International Conference on Multimedia, with co-located Symposium & Workshops. ACM International Conference on Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3343031.3351039","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Emotion recognition in dyadic communication is challenging because: 1. Extracting informative modality-specific representations requires disparate feature extractor designs due to the heterogeneous input data formats. 2. Effectively and efficiently fusing unimodal features and learning associations between dyadic utterances are critical to model generalization in actual scenarios. 3. Disagreeing annotations prevent previous approaches from precisely predicting emotions in context. To address these issues, we propose an efficient dyadic fusion network that relies only on an attention mechanism to select representative vectors, fuse modality-specific features, and learn sequence information. Our approach has three distinct characteristics: 1. Instead of using a recurrent neural network to extract temporal associations as in most previous research, we introduce multiple sub-view attention layers to compute the relevant dependencies among sequential utterances; this significantly improves model efficiency. 2. To improve fusion performance, we design a learnable mutual correlation factor inside each attention layer to compute associations across different modalities. 3. To overcome the label disagreement issue, we embed the labels from all annotators into a k-dimensional vector and transform the categorical problem into a regression problem; this provides more accurate annotation information and fully uses the entire dataset. We evaluate the proposed model on two published multimodal emotion recognition datasets, IEMOCAP and MELD. Our model significantly outperforms previous state-of-the-art approaches by 3.8%-7.5% in accuracy while using a more efficient model.
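
The mutual correlation factor is described only at a high level in the abstract. As a rough illustration, the sketch below shows one plausible way a learnable correlation matrix could couple queries from one modality with keys from another inside a scaled dot-product attention layer; the class name `CrossModalAttention`, the bilinear scoring form, and all dimensions are assumptions made for illustration, not the authors' exact formulation.

```python
# Minimal sketch (not the paper's exact formulation): cross-modal attention in
# which a learnable "mutual correlation" matrix couples queries from one
# modality (e.g., audio) with keys/values from another (e.g., text).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    def __init__(self, dim_a: int, dim_b: int, dim_attn: int = 128):
        super().__init__()
        self.query = nn.Linear(dim_a, dim_attn)   # projects modality-A features
        self.key = nn.Linear(dim_b, dim_attn)     # projects modality-B features
        self.value = nn.Linear(dim_b, dim_attn)
        # Learnable mutual correlation factor: a dim_attn x dim_attn matrix that
        # reweights how strongly each query dimension attends to each key dimension.
        self.corr = nn.Parameter(torch.eye(dim_attn))

    def forward(self, feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
        # feats_a: (batch, seq_a, dim_a); feats_b: (batch, seq_b, dim_b)
        q = self.query(feats_a)                    # (batch, seq_a, dim_attn)
        k = self.key(feats_b)                      # (batch, seq_b, dim_attn)
        v = self.value(feats_b)                    # (batch, seq_b, dim_attn)
        # Bilinear scores q @ W_c @ k^T, scaled as in standard dot-product attention.
        scores = q @ self.corr @ k.transpose(1, 2) / (q.size(-1) ** 0.5)
        weights = F.softmax(scores, dim=-1)        # (batch, seq_a, seq_b)
        return weights @ v                         # modality-A queries fused with B
```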

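The label-embedding idea can be pictured as turning the annotators' possibly disagreeing categorical votes into a k-dimensional soft target and training against it with a regression loss instead of a single majority-vote class. The sketch below assumes a simple normalized vote count and mean squared error; the paper's actual embedding and loss may differ, and the `EMOTIONS` category set is illustrative.

```python
# Illustrative sketch of soft-label regression over annotator votes; the exact
# embedding used in the paper may differ.
from typing import List
import numpy as np

EMOTIONS = ["angry", "happy", "neutral", "sad"]   # illustrative k = 4 category set

def embed_annotations(annotations: List[str]) -> np.ndarray:
    """Embed the labels from all annotators as a k-dimensional vector of
    normalized vote counts (a soft target instead of a single hard label)."""
    votes = np.zeros(len(EMOTIONS))
    for label in annotations:
        votes[EMOTIONS.index(label)] += 1.0
    return votes / votes.sum()

def regression_loss(prediction: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error between the model output and the soft label vector,
    replacing cross-entropy over a single majority-vote class."""
    return float(np.mean((prediction - target) ** 2))

# An utterance on which three annotators disagree still yields a usable target:
target = embed_annotations(["happy", "happy", "neutral"])   # [0.0, 0.667, 0.333, 0.0]
```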