RAFT: Robust Adversarial Fusion Transformer for multimodal sentiment analysis
Rui Wang, Duyun Xu, Lucia Cascone, Yaoyang Wang, Hui Chen, Jianbo Zheng, Xianxun Zhu
Array, Volume 27, Article 100445. Published 2025-07-14. DOI: 10.1016/j.array.2025.100445
Multimodal sentiment analysis (MSA) has emerged as a key technology for understanding human emotions by jointly processing text, audio, and visual cues. Despite significant progress, existing fusion models remain vulnerable to real-world challenges such as modality noise, missing channels, and weak inter-modal coupling. This paper addresses these limitations by introducing RAFT (Robust Adversarial Fusion Transformer), which integrates cross-modal and self-attention mechanisms with noise-imitation adversarial training to strengthen feature interactions and resilience under imperfect inputs. We first formalize the problem of noisy and incomplete data in MSA and demonstrate how adversarial noise simulation can bridge the gap between clean and corrupted modalities. RAFT is evaluated on two benchmark datasets, MOSI and MOSEI, where it achieves competitive binary classification accuracy (greater than 80%) and fine-grained sentiment performance (5-class accuracy 57%), while reducing mean absolute error and improving Pearson correlation by up to 2% over state-of-the-art baselines. Ablation studies confirm that both adversarial training and context-aware modules contribute substantially to robustness gains. Looking ahead, we plan to refine noise-generation strategies, explore more expressive fusion architectures, and extend RAFT to handle long-form dialogues and culturally diverse expressions. Our results suggest that RAFT lays a solid foundation for reliable, real-world sentiment analysis in noisy environments.
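The abstract describes RAFT only at a high level: cross-modal and self-attention fusion combined with noise-imitation adversarial training that bridges clean and corrupted modalities. The sketch below is a minimal, hypothetical PyTorch illustration of those two ingredients, not the authors' implementation; the module names, dimensions, and the Gaussian-noise-plus-modality-dropout corruption model are all assumptions made for illustration.

```python
# Illustrative sketch only: the paper's abstract does not specify RAFT's exact
# layers or training objective, so everything below is an assumed stand-in.
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """One modality queries another (cross-modal attention),
    then refines the result with self-attention."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, query_mod: torch.Tensor, context_mod: torch.Tensor) -> torch.Tensor:
        # query_mod attends to context_mod (e.g., text attends to audio).
        fused, _ = self.cross_attn(query_mod, context_mod, context_mod)
        x = self.norm1(query_mod + fused)
        refined, _ = self.self_attn(x, x, x)
        return self.norm2(x + refined)

def imitate_noise(features: torch.Tensor, sigma: float = 0.1, drop_p: float = 0.2) -> torch.Tensor:
    """Hypothetical noise imitation: additive Gaussian noise plus random
    per-sample dropout to mimic corrupted or missing modality channels."""
    noisy = features + sigma * torch.randn_like(features)
    keep = (torch.rand(noisy.size(0), 1, 1, device=noisy.device) > drop_p).float()
    return noisy * keep  # a zeroed sample simulates a missing modality

# Toy usage: fuse text and audio streams projected to a shared dimension.
batch, seq, dim = 8, 20, 64
text = torch.randn(batch, seq, dim)
audio = torch.randn(batch, seq, dim)
block = CrossModalBlock(dim)
clean_out = block(text, audio)
noisy_out = block(imitate_noise(text), imitate_noise(audio))
# One common way to "bridge" clean and corrupted inputs is a consistency
# loss that pulls noisy-input features toward the clean-input features.
consistency_loss = nn.functional.mse_loss(noisy_out, clean_out.detach())
print(clean_out.shape, consistency_loss.item())
```

The consistency loss here is only one plausible reading of "adversarial noise simulation can bridge the gap between clean and corrupted modalities"; the paper may instead train a discriminator or generate noise adversarially, which this sketch does not attempt to reproduce.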