Audio–Visual Fusion for Emotion Recognition in the Valence–Arousal Space Using Joint Cross-Attention

IEEE Transactions on Biometrics, Behavior, and Identity Science. Pub Date: 2023-01-04. DOI: 10.1109/TBIOM.2022.3233083

R. Gnana Praveen; Patrick Cardinal; Eric Granger

Volume 5, Issue 3, pp. 360-373. Full text: https://ieeexplore.ieee.org/document/10005783/

Citations: 9

Abstract

Automatic emotion recognition (ER) has recently gained much interest due to its potential in many real-world applications. In this context, multimodal approaches have been shown to improve performance (over unimodal approaches) by combining diverse and complementary sources of information, providing some robustness to noisy and missing modalities. In this paper, we focus on dimensional ER based on the fusion of facial and vocal modalities extracted from videos, where complementary audio-visual (A-V) relationships are explored to predict an individual’s emotional states in valence-arousal space. Most state-of-the-art fusion techniques rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of A-V modalities. To address this problem, we introduce a joint cross-attentional model for A-V fusion that extracts the salient features across A-V modalities and effectively leverages the inter-modal relationships while retaining the intra-modal relationships. In particular, it computes the cross-attention weights based on the correlation between the joint feature representation and that of the individual modalities. Deploying the joint A-V feature representation in the cross-attention module helps to leverage the intra- and inter-modal relationships simultaneously, thereby significantly improving the performance of the system over the vanilla cross-attention module. The effectiveness of our proposed approach is validated experimentally on challenging videos from the RECOLA and AffWild2 datasets. Results indicate that our joint cross-attentional A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches, even when the modalities are noisy or absent. Code is available at https://github.com/praveena2j/Joint-Cross-Attention-for-Audio-Visual-Fusion.
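The core idea in the abstract, attention weights derived from the correlation between the joint A-V representation and each individual modality, can be illustrated with a small NumPy sketch. This is a simplified illustration under stated assumptions, not the authors' implementation (see the linked repository for that). The function name joint_cross_attention, the weight matrices w_a and w_v, the tanh and 1/sqrt(d) scaling, the softmax over time steps, and the residual connection are all choices made here for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_cross_attention(x_a, x_v, w_a, w_v):
    """Hypothetical joint cross-attention over L time steps.

    x_a: (L, d_a) audio features, x_v: (L, d_v) visual features.
    w_a: (d_a, d) and w_v: (d_v, d) are assumed learnable projections,
    with d = d_a + d_v (the size of the joint representation).
    """
    # Joint representation: concatenate both modalities per time step.
    joint = np.concatenate([x_a, x_v], axis=1)            # (L, d)
    d = joint.shape[1]

    # Correlation between each modality and the joint representation.
    # Entry (i, j) relates modality features at step i to joint features at step j.
    c_a = np.tanh(x_a @ w_a @ joint.T / np.sqrt(d))       # (L, L)
    c_v = np.tanh(x_v @ w_v @ joint.T / np.sqrt(d))       # (L, L)

    # Assumption: normalize correlations into attention weights with a softmax,
    # re-weight each modality, and keep the original features via a residual.
    att_a = softmax(c_a, axis=-1) @ x_a + x_a             # (L, d_a)
    att_v = softmax(c_v, axis=-1) @ x_v + x_v             # (L, d_v)

    # Fused A-V features, e.g. as input to a valence/arousal regressor.
    return np.concatenate([att_a, att_v], axis=1)         # (L, d)


# Toy usage with random features (shapes are illustrative only).
rng = np.random.default_rng(0)
audio = rng.standard_normal((8, 4))
visual = rng.standard_normal((8, 6))
fused = joint_cross_attention(
    audio, visual,
    rng.standard_normal((4, 10)),
    rng.standard_normal((6, 10)),
)
print(fused.shape)
```

Because the joint representation contains both modalities, each correlation matrix mixes an intra-modal term (audio with audio) and an inter-modal term (audio with visual). A vanilla cross-attention module would instead correlate the audio features with the visual features only.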

Source journal

IEEE Transactions on Biometrics, Behavior, and Identity Science

CiteScore

10.90

Self-citation rate

0.00%

Publication count