{"title":"CMFF:用于多模态情感分析的跨模态多层特征融合网络","authors":"Shuting Zheng , Jingling Zhang , Yuanzhao Deng , Lanxiang Chen","doi":"10.1016/j.asoc.2025.113868","DOIUrl":null,"url":null,"abstract":"<div><div>Multimodal sentiment analysis seeks to interpret speaker sentiment by integrating information from multiple modalities, typically text and audio. While existing methods often focus on fusing deep-layer features extracted from the final stages of unimodal encoders, they may overlook crucial fine-grained information present in shallow-layer features (e.g., subtle phonetic variations or basic syntactic structures) relevant for nuanced sentiment understanding. Furthermore, effectively fusing features from different modalities presents the dual challenges of dynamically weighting each modality’s contribution and accommodating their inherent data heterogeneity. To address these limitations, we propose a novel Cross-modal Multi-layer Feature Fusion (CMFF) network. CMFF explicitly leverages the hierarchical information contained in both shallow-layer and deep-layer features from text and audio modalities. It employs multi-head cross-modal attention mechanisms within its fusion layers to facilitate interaction across feature layers and modalities. Crucially, CMFF incorporates a Mixture of Gated Experts (MoGE) network within these fusion layers. The MoGE utilizes modality-specific expert sub-networks, each tailored to process the distinct characteristics of text or audio data, thereby directly addressing data heterogeneity. Concurrently, each expert employs an internal gated feed-forward mechanism. This allows the model to dynamically control the information flow for each feature vector, effectively learning to weigh the importance of different feature dimensions from each layer and modality based on the input context. Extensive experiments conducted on the benchmark CMU-MOSI and CMU-MOSEI datasets demonstrate that the proposed CMFF model achieves competitive or superior performance compared to state-of-the-art methods across various standard evaluation metrics.</div></div>","PeriodicalId":50737,"journal":{"name":"Applied Soft Computing","volume":"184 ","pages":"Article 113868"},"PeriodicalIF":6.6000,"publicationDate":"2025-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CMFF: A cross-modal multi-layer feature fusion network for multimodal sentiment analysis\",\"authors\":\"Shuting Zheng , Jingling Zhang , Yuanzhao Deng , Lanxiang Chen\",\"doi\":\"10.1016/j.asoc.2025.113868\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Multimodal sentiment analysis seeks to interpret speaker sentiment by integrating information from multiple modalities, typically text and audio. While existing methods often focus on fusing deep-layer features extracted from the final stages of unimodal encoders, they may overlook crucial fine-grained information present in shallow-layer features (e.g., subtle phonetic variations or basic syntactic structures) relevant for nuanced sentiment understanding. Furthermore, effectively fusing features from different modalities presents the dual challenges of dynamically weighting each modality’s contribution and accommodating their inherent data heterogeneity. To address these limitations, we propose a novel Cross-modal Multi-layer Feature Fusion (CMFF) network. 
CMFF explicitly leverages the hierarchical information contained in both shallow-layer and deep-layer features from text and audio modalities. It employs multi-head cross-modal attention mechanisms within its fusion layers to facilitate interaction across feature layers and modalities. Crucially, CMFF incorporates a Mixture of Gated Experts (MoGE) network within these fusion layers. The MoGE utilizes modality-specific expert sub-networks, each tailored to process the distinct characteristics of text or audio data, thereby directly addressing data heterogeneity. Concurrently, each expert employs an internal gated feed-forward mechanism. This allows the model to dynamically control the information flow for each feature vector, effectively learning to weigh the importance of different feature dimensions from each layer and modality based on the input context. Extensive experiments conducted on the benchmark CMU-MOSI and CMU-MOSEI datasets demonstrate that the proposed CMFF model achieves competitive or superior performance compared to state-of-the-art methods across various standard evaluation metrics.</div></div>\",\"PeriodicalId\":50737,\"journal\":{\"name\":\"Applied Soft Computing\",\"volume\":\"184 \",\"pages\":\"Article 113868\"},\"PeriodicalIF\":6.6000,\"publicationDate\":\"2025-09-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Applied Soft Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1568494625011810\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Soft Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1568494625011810","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
CMFF: A cross-modal multi-layer feature fusion network for multimodal sentiment analysis
Multimodal sentiment analysis seeks to interpret speaker sentiment by integrating information from multiple modalities, typically text and audio. While existing methods often focus on fusing deep-layer features extracted from the final stages of unimodal encoders, they may overlook crucial fine-grained information present in shallow-layer features (e.g., subtle phonetic variations or basic syntactic structures) relevant for nuanced sentiment understanding. Furthermore, effectively fusing features from different modalities presents the dual challenges of dynamically weighting each modality’s contribution and accommodating their inherent data heterogeneity. To address these limitations, we propose a novel Cross-modal Multi-layer Feature Fusion (CMFF) network. CMFF explicitly leverages the hierarchical information contained in both shallow-layer and deep-layer features from text and audio modalities. It employs multi-head cross-modal attention mechanisms within its fusion layers to facilitate interaction across feature layers and modalities. Crucially, CMFF incorporates a Mixture of Gated Experts (MoGE) network within these fusion layers. The MoGE utilizes modality-specific expert sub-networks, each tailored to process the distinct characteristics of text or audio data, thereby directly addressing data heterogeneity. Concurrently, each expert employs an internal gated feed-forward mechanism. This allows the model to dynamically control the information flow for each feature vector, effectively learning to weigh the importance of different feature dimensions from each layer and modality based on the input context. Extensive experiments conducted on the benchmark CMU-MOSI and CMU-MOSEI datasets demonstrate that the proposed CMFF model achieves competitive or superior performance compared to state-of-the-art methods across various standard evaluation metrics.
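To make the architecture described above more concrete, the following is a minimal, illustrative sketch of one fusion layer: multi-head cross-modal attention between text and audio streams, followed by modality-specific experts that use an internal gated feed-forward mechanism. This is not the authors' implementation; the layer sizes, the use of exactly one expert per modality, the sigmoid gating form, and all class and variable names are assumptions made only to illustrate the ideas in the abstract.

```python
# Illustrative sketch only -- not the authors' code. Dimensions, expert
# structure, and gating details are assumptions based on the abstract.
import torch
import torch.nn as nn


class GatedExpert(nn.Module):
    """Modality-specific expert: a feed-forward block whose output is scaled
    element-wise by an internal gate, letting the model weight feature
    dimensions dynamically for each input vector."""

    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gate values in (0, 1) control the information flow per dimension.
        return self.gate(x) * self.value(x)


class CrossModalFusionLayer(nn.Module):
    """One fusion layer: each modality attends to the other via multi-head
    cross-modal attention, then is refined by its own gated expert."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.text_from_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_from_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_expert = GatedExpert(dim)
        self.audio_expert = GatedExpert(dim)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, audio: torch.Tensor):
        # Cross-modal attention: queries from one modality, keys/values from the other.
        t_attn, _ = self.text_from_audio(text, audio, audio)
        a_attn, _ = self.audio_from_text(audio, text, text)
        # Residual connection plus the modality-specific gated expert.
        text = self.norm_t(text + t_attn + self.text_expert(t_attn))
        audio = self.norm_a(audio + a_attn + self.audio_expert(a_attn))
        return text, audio


if __name__ == "__main__":
    layer = CrossModalFusionLayer()
    text_feats = torch.randn(2, 20, 128)   # (batch, text tokens, dim)
    audio_feats = torch.randn(2, 50, 128)  # (batch, audio frames, dim)
    t_out, a_out = layer(text_feats, audio_feats)
    print(t_out.shape, a_out.shape)        # [2, 20, 128], [2, 50, 128]
```

In the full model as described, such fusion layers would presumably be applied to both shallow-layer and deep-layer encoder outputs and combined with a mixture of several gated experts per modality; the sketch above shows only the core attention-plus-gated-expert pattern.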
Journal description:
Applied Soft Computing is an international journal promoting an integrated view of soft computing to solve real-life problems. The focus is to publish the highest-quality research on the application and convergence of Fuzzy Logic, Neural Networks, Evolutionary Computing, Rough Sets, and other similar techniques to address real-world complexities.
Applied Soft Computing is a rolling publication: articles are published as soon as the editor-in-chief has accepted them. The website is therefore continuously updated with new articles, and publication times are short.