RAFT: Robust Adversarial Fusion Transformer for multimodal sentiment analysis
Rui Wang, Duyun Xu, Lucia Cascone, Yaoyang Wang, Hui Chen, Jianbo Zheng, Xianxun Zhu
Array, Volume 27, Article 100445. Published 2025-07-14. DOI: 10.1016/j.array.2025.100445
Multimodal sentiment analysis (MSA) has emerged as a key technology for understanding human emotions by jointly processing text, audio, and visual cues. Despite significant progress, existing fusion models remain vulnerable to real-world challenges such as modality noise, missing channels, and weak inter-modal coupling. This paper addresses these limitations by introducing RAFT (Robust Adversarial Fusion Transformer), which integrates cross-modal and self-attention mechanisms with noise-imitation adversarial training to strengthen feature interactions and resilience under imperfect inputs. We first formalize the problem of noisy and incomplete data in MSA and demonstrate how adversarial noise simulation can bridge the gap between clean and corrupted modalities. RAFT is evaluated on two benchmark datasets, MOSI and MOSEI, where it achieves competitive binary classification accuracy (greater than 80%) and fine-grained sentiment performance (5-class accuracy 57%), while reducing mean absolute error and improving Pearson correlation by up to 2% over state-of-the-art baselines. Ablation studies confirm that both adversarial training and context-aware modules contribute substantially to robustness gains. Looking ahead, we plan to refine noise-generation strategies, explore more expressive fusion architectures, and extend RAFT to handle long-form dialogues and culturally diverse expressions. Our results suggest that RAFT lays a solid foundation for reliable, real-world sentiment analysis in noisy environments.
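The abstract describes RAFT only at a high level: cross-modal and self-attention fusion combined with noise-imitation adversarial training that bridges clean and corrupted modalities. The sketch below is a minimal, hypothetical PyTorch illustration of those two ingredients, not the authors' implementation; the module names, dimensions, and the Gaussian-noise-plus-modality-dropout corruption model are all assumptions made for illustration.

```python
# Illustrative sketch only: the paper's abstract does not specify RAFT's exact
# layers or training objective, so everything below is an assumed stand-in.
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """One modality queries another (cross-modal attention),
    then refines the result with self-attention."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, query_mod: torch.Tensor, context_mod: torch.Tensor) -> torch.Tensor:
        # query_mod attends to context_mod (e.g., text attends to audio).
        fused, _ = self.cross_attn(query_mod, context_mod, context_mod)
        x = self.norm1(query_mod + fused)
        refined, _ = self.self_attn(x, x, x)
        return self.norm2(x + refined)

def imitate_noise(features: torch.Tensor, sigma: float = 0.1, drop_p: float = 0.2) -> torch.Tensor:
    """Hypothetical noise imitation: additive Gaussian noise plus random
    per-sample dropout to mimic corrupted or missing modality channels."""
    noisy = features + sigma * torch.randn_like(features)
    keep = (torch.rand(noisy.size(0), 1, 1, device=noisy.device) > drop_p).float()
    return noisy * keep  # a zeroed sample simulates a missing modality

# Toy usage: fuse text and audio streams projected to a shared dimension.
batch, seq, dim = 8, 20, 64
text = torch.randn(batch, seq, dim)
audio = torch.randn(batch, seq, dim)
block = CrossModalBlock(dim)
clean_out = block(text, audio)
noisy_out = block(imitate_noise(text), imitate_noise(audio))
# One common way to "bridge" clean and corrupted inputs is a consistency
# loss that pulls noisy-input features toward the clean-input features.
consistency_loss = nn.functional.mse_loss(noisy_out, clean_out.detach())
print(clean_out.shape, consistency_loss.item())
```

The consistency loss here is only one plausible reading of "adversarial noise simulation can bridge the gap between clean and corrupted modalities"; the paper may instead train a discriminator or generate noise adversarially, which this sketch does not attempt to reproduce.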