ROSA: A Robust Self-Adaptive Model for Multimodal Emotion Recognition With Uncertain Missing Modalities

IF 9.7. Zone 1 (Computer Science). JCR Q1: COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia. Pub Date: 2025-07-22. DOI: 10.1109/TMM.2025.3590929

Ziming Li;Yaxin Liu;Chuanpeng Yang;Yan Zhou;Songlin Hu

IEEE Transactions on Multimedia, vol. 27, pp. 6766-6779. https://ieeexplore.ieee.org/document/11086418/

Citations: 0

Abstract

The rapid development of online media has heightened the importance of multimodal emotion recognition (MER) in video analysis. However, practical applications often face missing modalities caused by various kinds of interference, and it is difficult to predict the specific missing situation, such as the number and types of missing modalities. Current approaches to missing modalities typically apply one uniform method to all missing cases and are therefore insufficiently adaptive to dynamic conditions. For example, translation-based methods can efficiently complete missing text from audio, but generating audio or video features that retain the original emotional information from other modalities is challenging and may introduce additional noise. In this paper, we introduce ROSA, a novel robust self-adaptive model designed to address various missing cases with tailored approaches, leveraging available modalities effectively and reducing the introduction of additional noise. Specifically, the A-T Completion module, based on an encoder-decoder architecture, enables ROSA to generate the missing raw text from audio rather than mere embedding representations, capturing more nuanced modal features. Additionally, we design the T-V Fusion module, based on a large vision-language model, for deep extraction and fusion of textual and visual features. Comprehensive experiments conducted on three widely used public datasets demonstrate the superiority and effectiveness of our model. ROSA outperforms other models in both the fixed-missing-rate and fixed-missing-modality settings. Ablation studies further highlight the contribution of each designed module.
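The "tailored approach per missing case" idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify ROSA's actual routing logic, so everything here is an assumption made for illustration: the `Sample` type, the `plan_steps` function, and the step labels are hypothetical names, and the rule set only mirrors what the abstract states (text is completed from audio when possible, text and vision are fused when both exist, and missing audio or video is not regenerated).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    """One utterance; a modality set to None is treated as missing (hypothetical type)."""
    audio: Optional[str] = None
    text: Optional[str] = None
    vision: Optional[str] = None


def plan_steps(sample: Sample) -> List[str]:
    """Return illustrative processing steps for whichever modalities are present."""
    has_audio = sample.audio is not None
    has_text = sample.text is not None
    has_vision = sample.vision is not None
    if not (has_audio or has_text or has_vision):
        raise ValueError("at least one modality must be available")

    steps: List[str] = []
    # Assumption: missing text is recovered as raw text from audio (A-T Completion).
    if not has_text and has_audio:
        steps.append("a_t_completion")
        has_text = True
    # Assumption: text and vision are fused jointly (T-V Fusion) when both exist;
    # otherwise the single available one is encoded on its own.
    if has_text and has_vision:
        steps.append("t_v_fusion")
    elif has_text:
        steps.append("encode_text")
    elif has_vision:
        steps.append("encode_vision")
    # Missing audio or vision is never generated here, to avoid adding noise.
    if has_audio:
        steps.append("encode_audio")
    steps.append("classify")
    return steps


print(plan_steps(Sample(audio="clip.wav", vision="clip.mp4")))
```

In this sketch a sample with only vision goes straight to a vision encoder, while a sample with only audio first has its text completed and then uses both the text and audio paths.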




Source journal: IEEE Transactions on Multimedia (Engineering & Technology: Telecommunications)

CiteScore: 11.70

Self-citation rate: 11.00%

Articles published: 576

Review time: 5.5 months

Journal description: The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.