{"title":"意识到不协调性的跨模态注意力用于维度情感识别中的视听融合","authors":"R. Gnana Praveen;Jahangir Alam","doi":"10.1109/JSTSP.2024.3422823","DOIUrl":null,"url":null,"abstract":"Multimodal emotion recognition has immense potential for the comprehensive assessment of human emotions, utilizing multiple modalities that often exhibit complementary relationships. In video-based emotion recognition, audio and visual modalities have emerged as prominent contact-free channels, widely explored in existing literature. Current approaches typically employ cross-modal attention mechanisms between audio and visual modalities, assuming a constant state of complementarity. However, this assumption may not always hold true, as non-complementary relationships can also manifest, undermining the efficacy of cross-modal feature integration and thereby diminishing the quality of audio-visual feature representations. To tackle this problem, we introduce a novel Incongruity-Aware Cross-Attention (IACA) model, capable of harnessing the benefits of robust complementary relationships while efficiently managing non-complementary scenarios. Specifically, our approach incorporates a two-stage gating mechanism designed to adaptively select semantic features, thereby effectively capturing the inter-modal associations. Additionally, the proposed model demonstrates an ability to mitigate the adverse effects of severely corrupted or missing modalities. We rigorously evaluate the performance of the proposed model through extensive experiments conducted on the challenging RECOLA and Aff-Wild2 datasets. The results underscore the efficacy of our approach, as it outperforms state-of-the-art methods by adeptly capturing inter-modal relationships and minimizing the influence of missing or heavily corrupted modalities. Furthermore, we show that the proposed model is compatible with various cross-modal attention variants, consistently improving performance on both datasets.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"18 3","pages":"444-458"},"PeriodicalIF":8.7000,"publicationDate":"2024-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Incongruity-Aware Cross-Modal Attention for Audio-Visual Fusion in Dimensional Emotion Recognition\",\"authors\":\"R. Gnana Praveen;Jahangir Alam\",\"doi\":\"10.1109/JSTSP.2024.3422823\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Multimodal emotion recognition has immense potential for the comprehensive assessment of human emotions, utilizing multiple modalities that often exhibit complementary relationships. In video-based emotion recognition, audio and visual modalities have emerged as prominent contact-free channels, widely explored in existing literature. Current approaches typically employ cross-modal attention mechanisms between audio and visual modalities, assuming a constant state of complementarity. However, this assumption may not always hold true, as non-complementary relationships can also manifest, undermining the efficacy of cross-modal feature integration and thereby diminishing the quality of audio-visual feature representations. To tackle this problem, we introduce a novel Incongruity-Aware Cross-Attention (IACA) model, capable of harnessing the benefits of robust complementary relationships while efficiently managing non-complementary scenarios. 
Specifically, our approach incorporates a two-stage gating mechanism designed to adaptively select semantic features, thereby effectively capturing the inter-modal associations. Additionally, the proposed model demonstrates an ability to mitigate the adverse effects of severely corrupted or missing modalities. We rigorously evaluate the performance of the proposed model through extensive experiments conducted on the challenging RECOLA and Aff-Wild2 datasets. The results underscore the efficacy of our approach, as it outperforms state-of-the-art methods by adeptly capturing inter-modal relationships and minimizing the influence of missing or heavily corrupted modalities. Furthermore, we show that the proposed model is compatible with various cross-modal attention variants, consistently improving performance on both datasets.\",\"PeriodicalId\":13038,\"journal\":{\"name\":\"IEEE Journal of Selected Topics in Signal Processing\",\"volume\":\"18 3\",\"pages\":\"444-458\"},\"PeriodicalIF\":8.7000,\"publicationDate\":\"2024-07-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Journal of Selected Topics in Signal Processing\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10584250/\",\"RegionNum\":1,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Journal of Selected Topics in Signal Processing","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10584250/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Incongruity-Aware Cross-Modal Attention for Audio-Visual Fusion in Dimensional Emotion Recognition
Multimodal emotion recognition has immense potential for the comprehensive assessment of human emotions, utilizing multiple modalities that often exhibit complementary relationships. In video-based emotion recognition, audio and visual modalities have emerged as prominent contact-free channels, widely explored in existing literature. Current approaches typically employ cross-modal attention mechanisms between audio and visual modalities, assuming a constant state of complementarity. However, this assumption may not always hold true, as non-complementary relationships can also manifest, undermining the efficacy of cross-modal feature integration and thereby diminishing the quality of audio-visual feature representations. To tackle this problem, we introduce a novel Incongruity-Aware Cross-Attention (IACA) model, capable of harnessing the benefits of robust complementary relationships while efficiently managing non-complementary scenarios. Specifically, our approach incorporates a two-stage gating mechanism designed to adaptively select semantic features, thereby effectively capturing the inter-modal associations. Additionally, the proposed model demonstrates an ability to mitigate the adverse effects of severely corrupted or missing modalities. We rigorously evaluate the performance of the proposed model through extensive experiments conducted on the challenging RECOLA and Aff-Wild2 datasets. The results underscore the efficacy of our approach, as it outperforms state-of-the-art methods by adeptly capturing inter-modal relationships and minimizing the influence of missing or heavily corrupted modalities. Furthermore, we show that the proposed model is compatible with various cross-modal attention variants, consistently improving performance on both datasets.
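The abstract describes the architecture only at a high level: bidirectional cross-attention between the audio and visual streams, followed by a two-stage gating mechanism that adaptively selects features and degrades gracefully when one modality is incongruent, heavily corrupted, or missing. As a rough illustration of that kind of mechanism, and not the authors' published IACA code, the PyTorch sketch below shows one plausible way to combine cross-attention with such gates; all module names, dimensions, and the exact gating form are assumptions made for illustration.

```python
# Illustrative sketch only: a generic audio-visual cross-attention block with a
# two-stage gate that can fall back to the unimodal features when the two
# streams are not complementary. Dimensions, names, and the gating form are
# assumptions, not the authors' IACA implementation.
import torch
import torch.nn as nn


class GatedCrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        # Cross-attention in both directions: audio queries attend to visual
        # keys/values and vice versa.
        self.a2v_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.v2a_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Stage 1: per-modality gates deciding how much cross-attended
        # information to mix into each unimodal stream.
        self.gate_a = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.gate_v = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        # Stage 2: a fusion gate weighting the two gated streams against each
        # other before the final regression head.
        self.fusion_gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.head = nn.Linear(dim, 2)  # dimensional emotion: valence, arousal

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio, visual: (batch, seq_len, dim) clip-level feature sequences.
        attended_a, _ = self.a2v_attn(query=audio, key=visual, value=visual)
        attended_v, _ = self.v2a_attn(query=visual, key=audio, value=audio)

        # Stage 1: blend each unimodal feature with its cross-attended
        # counterpart, so an incongruent partner modality can be down-weighted.
        g_a = self.gate_a(torch.cat([audio, attended_a], dim=-1))
        g_v = self.gate_v(torch.cat([visual, attended_v], dim=-1))
        audio_out = g_a * attended_a + (1.0 - g_a) * audio
        visual_out = g_v * attended_v + (1.0 - g_v) * visual

        # Stage 2: weight the two gated streams when forming the joint
        # representation, which also softens a missing or corrupted modality.
        g_f = self.fusion_gate(torch.cat([audio_out, visual_out], dim=-1))
        fused = g_f * audio_out + (1.0 - g_f) * visual_out
        return self.head(fused.mean(dim=1))


if __name__ == "__main__":
    model = GatedCrossAttentionFusion(dim=128, num_heads=4)
    a = torch.randn(2, 16, 128)   # e.g. 16 audio frames per clip
    v = torch.randn(2, 16, 128)   # e.g. 16 visual frames per clip
    print(model(a, v).shape)      # torch.Size([2, 2])
```

In this sketch, the first gating stage decides, per modality, how much of the cross-attended signal to trust over the original unimodal feature, and the second stage weights the two gated streams against each other, which is one simple way an incongruent or degraded modality could be suppressed at fusion time.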
About the journal:
The IEEE Journal of Selected Topics in Signal Processing (JSTSP) focuses on the Field of Interest of the IEEE Signal Processing Society, which encompasses the theory and application of various signal processing techniques. These techniques include filtering, coding, transmitting, estimating, detecting, analyzing, recognizing, synthesizing, recording, and reproducing signals using digital or analog devices. The term "signal" covers a wide range of data types, including audio, video, speech, image, communication, geophysical, sonar, radar, medical, musical, and others.
The journal format allows for in-depth exploration of signal processing topics, enabling the Society to cover both established and emerging areas. This includes interdisciplinary fields such as biomedical engineering and language processing, as well as areas not traditionally associated with engineering.