Gradient and Structure Consistency in Multimodal Emotion Recognition.

IF 13.7 · Zone 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Image Processing · Pub Date: 2025-09-18 · DOI: 10.1109/tip.2025.3608664

QingHongYa Shi, Mang Ye, Wenke Huang, Bo Du, Xiaofen Zong

Citations: 0

Abstract

Multimodal emotion recognition integrates text, visual, and audio data to infer an individual's emotional state holistically. Existing research predominantly focuses on exploiting modality-specific cues for joint learning, often ignoring the differences between modalities when they are trained toward a common objective. Because the modalities are heterogeneous, learning toward a common objective inadvertently introduces optimization bias and interaction noise. To address these challenges, we propose a novel approach named Gradient and Structure Consistency (GSCon). Our strategy operates at both the overall and the individual level, targeting balanced optimization and effective interaction, respectively. At the overall level, to prevent one modality from suppressing the optimization of the others, we construct a balanced gradient direction that aligns each modality's optimization direction, ensuring unbiased convergence. At the individual level, to avoid the interaction noise caused by multimodal alignment, we align the spatial structure of samples across modalities, so that this structure does not differ because of modal heterogeneity, which enables effective inter-modal interaction. Extensive experiments on multimodal emotion recognition and multimodal intention understanding datasets demonstrate the effectiveness of the proposed method. Code is available at https://github.com/ShiQingHongYa/GSCon.
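
The abstract names two ideas without giving formulas: a balanced gradient direction across modalities, and alignment of the sample structure across modalities. The sketch below is only an illustration of what such ideas can look like, not the GSCon method itself. It assumes a PCGrad-style conflict projection as a stand-in for the balanced direction, and a mean squared gap between within-modality cosine-similarity matrices as a stand-in for structure consistency. The function names `balanced_direction` and `structure_gap` are hypothetical; see the authors' repository for the actual implementation.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors given as lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def balanced_direction(grads):
    """Combine per-modality gradients into one shared update direction.

    Assumption: a PCGrad-style projection, used as a stand-in. Whenever a
    modality's gradient conflicts with another's (negative inner product),
    the conflicting component is removed before the gradients are averaged.
    """
    adjusted = []
    for i, grad in enumerate(grads):
        g = list(grad)
        for j, other in enumerate(grads):
            if i == j:
                continue
            d = dot(g, other)
            if d < 0:  # conflict implies `other` is non-zero, so the division is safe
                scale = d / dot(other, other)
                g = [a - scale * b for a, b in zip(g, other)]
        adjusted.append(g)
    count = len(adjusted)
    return [sum(column) / count for column in zip(*adjusted)]


def cosine_similarity_matrix(features):
    """Pairwise cosine similarity between the samples of a single modality."""
    normed = []
    for f in features:
        norm = math.sqrt(dot(f, f)) or 1.0  # leave all-zero vectors unchanged
        normed.append([a / norm for a in f])
    return [[dot(u, v) for v in normed] for u in normed]


def structure_gap(feats_a, feats_b):
    """Mean squared difference between two modalities' similarity matrices.

    Assumption: "spatial structure of samples" is read here as the matrix of
    sample-to-sample similarities inside one modality. Row k of `feats_a`
    and row k of `feats_b` must describe the same sample; the feature
    dimensions of the two modalities may differ.
    """
    sim_a = cosine_similarity_matrix(feats_a)
    sim_b = cosine_similarity_matrix(feats_b)
    n = len(sim_a)
    total = sum((sim_a[i][j] - sim_b[i][j]) ** 2 for i in range(n) for j in range(n))
    return total / (n * n)


if __name__ == "__main__":
    # Two modalities whose gradients point in conflicting directions.
    print(balanced_direction([[1.0, 0.0], [-1.0, 1.0]]))
    # Two samples seen through a 2-dimensional and a 3-dimensional modality.
    print(structure_gap([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]))
```

For the two conflicting gradients in the demo, the combined direction is `[0.25, 0.75]`, which has a non-negative inner product with both inputs, so neither modality is pushed backwards. The structure gap in the demo is zero because both modalities treat the two samples as orthogonal, even though their feature dimensions differ. In a training loop, a term like this would be minimized alongside the task loss.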

Source journal

IEEE Transactions on Image Processing (Engineering & Technology - Engineering: Electrical & Electronic)

CiteScore

20.90

Self-citation rate

6.60%

Articles published

774

Review time

7.6 months

Journal description: The IEEE Transactions on Image Processing covers novel theory, algorithms, and architectures for the formation, capture, processing, communication, analysis, and display of images, video, and multidimensional signals across a wide range of applications. Topics span mathematical, statistical, and perceptual aspects, encompassing modeling, representation, formation, coding, filtering, enhancement, restoration, rendering, halftoning, search, and analysis of images, video, and multidimensional signals. Applications range from image and video communications to electronic imaging, biomedical imaging, image and video systems, and remote sensing.