TsAFN: A two-stage adaptive fusion network for multimodal sentiment analysis
Authors: Jiaqi Liu, Yong Wang, Jing Yang, Xu Yu, Meng Zhao
Journal: Applied Intelligence, volume 55, issue 10 (published 2025-05-07); impact factor 3.4; JCR Q2, Computer Science, Artificial Intelligence
DOI: 10.1007/s10489-025-06577-0
URL: https://link.springer.com/article/10.1007/s10489-025-06577-0
Multimodal sentiment analysis (MSA) provides a more accurate understanding of human emotional states than unimodal analysis. However, each modality is limited in how much emotion it can express semantically, so the modalities contribute unequally to the sentiment polarity of the fused representation, and interactions among modalities introduce sentiment polarity biases. Both effects can make MSA less accurate. To address this problem, we propose a two-stage adaptive fusion network (TsAFN). The first stage is an adaptive fusion network based on the joint representation of modal features. Features are extracted with BERT and LSTM networks. An importance-metric adaptive benchmark is presented, on which a feature planning method is built that jointly represents the multimodal features as fused modal features and automatically equalizes the importance of each modality's influence on the fused sentiment polarity. The second stage is an adaptive fusion network based on modal interaction. A distance-metric adaptive benchmark is defined, on which a representation reconstruction method is built to account for inter-modal interactions. The relationships and sentiment polarity biases among the modalities are adjusted to reconstruct the unimodal sentiment polarities and to obtain a more accurate representation of the fused modality. Finally, the loss function is defined and the model is trained on three datasets: MOSI, MOSEI, and CH-SIMS. Comparative experiments show that TsAFN can achieve better accuracy in MSA.
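The abstract does not give the formulas behind the two stages, so the following is only a rough sketch of the general idea and not the authors' method. It assumes that stage one can be approximated by a softmax-weighted sum of unimodal feature vectors, and that stage two can be approximated by pulling each unimodal vector toward the fused vector by a step that grows with their Euclidean distance. The function names `fuse` and `reconstruct`, the `rate` parameter, and the toy feature values are all hypothetical.

```python
import math

def softmax(scores):
    """Turn raw importance scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(features, scores):
    """Stage 1 (illustrative assumption): importance-weighted sum of unimodal vectors."""
    names = sorted(features)
    weights = softmax([scores[n] for n in names])
    dim = len(features[names[0]])
    fused = [
        sum(w * features[n][i] for w, n in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))

def reconstruct(features, fused, rate=0.5):
    """Stage 2 (illustrative assumption): move each unimodal vector toward the fused one.

    The step grows with the Euclidean distance to the fused vector, so the
    modalities that disagree most with the fusion are adjusted most.
    """
    adjusted = {}
    for name, vec in features.items():
        dist = math.dist(vec, fused)
        step = rate * dist / (1.0 + dist)  # bounded in [0, rate)
        adjusted[name] = [v + step * (f - v) for v, f in zip(vec, fused)]
    return adjusted

# Hypothetical toy features, not taken from MOSI, MOSEI, or CH-SIMS.
features = {"text": [1.0, 0.0], "audio": [0.0, 1.0], "visual": [0.0, 0.0]}
scores = {"text": 0.0, "audio": 0.0, "visual": 0.0}

fused, weights = fuse(features, scores)
adjusted = reconstruct(features, fused)
```

In a trained model the importance scores would presumably be learned rather than fixed; equal scores are used here only so that the weighting is easy to follow by hand.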
About the journal:
With a focus on research in artificial intelligence and neural networks, this journal addresses issues involving solutions of real-life manufacturing, defense, management, government and industrial problems which are too complex to be solved through conventional approaches and require the simulation of intelligent thought processes, heuristics, applications of knowledge, and distributed and parallel processing. The integration of these multiple approaches in solving complex problems is of particular importance.
The journal presents new and original research and technological developments, addressing real and complex issues applicable to difficult problems. It provides a medium for exchanging scientific research and technological achievements accomplished by the international community.