{"title":"CMFF:用于多模态情感分析的跨模态多层特征融合网络","authors":"Shuting Zheng , Jingling Zhang , Yuanzhao Deng , Lanxiang Chen","doi":"10.1016/j.asoc.2025.113868","DOIUrl":null,"url":null,"abstract":"<div><div>Multimodal sentiment analysis seeks to interpret speaker sentiment by integrating information from multiple modalities, typically text and audio. While existing methods often focus on fusing deep-layer features extracted from the final stages of unimodal encoders, they may overlook crucial fine-grained information present in shallow-layer features (e.g., subtle phonetic variations or basic syntactic structures) relevant for nuanced sentiment understanding. Furthermore, effectively fusing features from different modalities presents the dual challenges of dynamically weighting each modality’s contribution and accommodating their inherent data heterogeneity. To address these limitations, we propose a novel Cross-modal Multi-layer Feature Fusion (CMFF) network. CMFF explicitly leverages the hierarchical information contained in both shallow-layer and deep-layer features from text and audio modalities. It employs multi-head cross-modal attention mechanisms within its fusion layers to facilitate interaction across feature layers and modalities. Crucially, CMFF incorporates a Mixture of Gated Experts (MoGE) network within these fusion layers. The MoGE utilizes modality-specific expert sub-networks, each tailored to process the distinct characteristics of text or audio data, thereby directly addressing data heterogeneity. Concurrently, each expert employs an internal gated feed-forward mechanism. This allows the model to dynamically control the information flow for each feature vector, effectively learning to weigh the importance of different feature dimensions from each layer and modality based on the input context. Extensive experiments conducted on the benchmark CMU-MOSI and CMU-MOSEI datasets demonstrate that the proposed CMFF model achieves competitive or superior performance compared to state-of-the-art methods across various standard evaluation metrics.</div></div>","PeriodicalId":50737,"journal":{"name":"Applied Soft Computing","volume":"184 ","pages":"Article 113868"},"PeriodicalIF":6.6000,"publicationDate":"2025-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CMFF: A cross-modal multi-layer feature fusion network for multimodal sentiment analysis\",\"authors\":\"Shuting Zheng , Jingling Zhang , Yuanzhao Deng , Lanxiang Chen\",\"doi\":\"10.1016/j.asoc.2025.113868\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Multimodal sentiment analysis seeks to interpret speaker sentiment by integrating information from multiple modalities, typically text and audio. While existing methods often focus on fusing deep-layer features extracted from the final stages of unimodal encoders, they may overlook crucial fine-grained information present in shallow-layer features (e.g., subtle phonetic variations or basic syntactic structures) relevant for nuanced sentiment understanding. Furthermore, effectively fusing features from different modalities presents the dual challenges of dynamically weighting each modality’s contribution and accommodating their inherent data heterogeneity. To address these limitations, we propose a novel Cross-modal Multi-layer Feature Fusion (CMFF) network. 
CMFF explicitly leverages the hierarchical information contained in both shallow-layer and deep-layer features from text and audio modalities. It employs multi-head cross-modal attention mechanisms within its fusion layers to facilitate interaction across feature layers and modalities. Crucially, CMFF incorporates a Mixture of Gated Experts (MoGE) network within these fusion layers. The MoGE utilizes modality-specific expert sub-networks, each tailored to process the distinct characteristics of text or audio data, thereby directly addressing data heterogeneity. Concurrently, each expert employs an internal gated feed-forward mechanism. This allows the model to dynamically control the information flow for each feature vector, effectively learning to weigh the importance of different feature dimensions from each layer and modality based on the input context. Extensive experiments conducted on the benchmark CMU-MOSI and CMU-MOSEI datasets demonstrate that the proposed CMFF model achieves competitive or superior performance compared to state-of-the-art methods across various standard evaluation metrics.</div></div>\",\"PeriodicalId\":50737,\"journal\":{\"name\":\"Applied Soft Computing\",\"volume\":\"184 \",\"pages\":\"Article 113868\"},\"PeriodicalIF\":6.6000,\"publicationDate\":\"2025-09-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Applied Soft Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1568494625011810\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Soft Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1568494625011810","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
CMFF: A cross-modal multi-layer feature fusion network for multimodal sentiment analysis
Multimodal sentiment analysis seeks to interpret speaker sentiment by integrating information from multiple modalities, typically text and audio. While existing methods often focus on fusing deep-layer features extracted from the final stages of unimodal encoders, they may overlook crucial fine-grained information present in shallow-layer features (e.g., subtle phonetic variations or basic syntactic structures) relevant for nuanced sentiment understanding. Furthermore, effectively fusing features from different modalities presents the dual challenges of dynamically weighting each modality’s contribution and accommodating their inherent data heterogeneity. To address these limitations, we propose a novel Cross-modal Multi-layer Feature Fusion (CMFF) network. CMFF explicitly leverages the hierarchical information contained in both shallow-layer and deep-layer features from text and audio modalities. It employs multi-head cross-modal attention mechanisms within its fusion layers to facilitate interaction across feature layers and modalities. Crucially, CMFF incorporates a Mixture of Gated Experts (MoGE) network within these fusion layers. The MoGE utilizes modality-specific expert sub-networks, each tailored to process the distinct characteristics of text or audio data, thereby directly addressing data heterogeneity. Concurrently, each expert employs an internal gated feed-forward mechanism. This allows the model to dynamically control the information flow for each feature vector, effectively learning to weigh the importance of different feature dimensions from each layer and modality based on the input context. Extensive experiments conducted on the benchmark CMU-MOSI and CMU-MOSEI datasets demonstrate that the proposed CMFF model achieves competitive or superior performance compared to state-of-the-art methods across various standard evaluation metrics.
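To make the architecture described above more concrete, the following is a minimal, illustrative sketch of one fusion layer: multi-head cross-modal attention between text and audio streams, followed by modality-specific experts that use an internal gated feed-forward mechanism. This is not the authors' implementation; the layer sizes, the use of exactly one expert per modality, the sigmoid gating form, and all class and variable names are assumptions made only to illustrate the ideas in the abstract.

```python
# Illustrative sketch only -- not the authors' code. Dimensions, expert
# structure, and gating details are assumptions based on the abstract.
import torch
import torch.nn as nn


class GatedExpert(nn.Module):
    """Modality-specific expert: a feed-forward block whose output is scaled
    element-wise by an internal gate, letting the model weight feature
    dimensions dynamically for each input vector."""

    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gate values in (0, 1) control the information flow per dimension.
        return self.gate(x) * self.value(x)


class CrossModalFusionLayer(nn.Module):
    """One fusion layer: each modality attends to the other via multi-head
    cross-modal attention, then is refined by its own gated expert."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.text_from_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_from_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_expert = GatedExpert(dim)
        self.audio_expert = GatedExpert(dim)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, audio: torch.Tensor):
        # Cross-modal attention: queries from one modality, keys/values from the other.
        t_attn, _ = self.text_from_audio(text, audio, audio)
        a_attn, _ = self.audio_from_text(audio, text, text)
        # Residual connection plus the modality-specific gated expert.
        text = self.norm_t(text + t_attn + self.text_expert(t_attn))
        audio = self.norm_a(audio + a_attn + self.audio_expert(a_attn))
        return text, audio


if __name__ == "__main__":
    layer = CrossModalFusionLayer()
    text_feats = torch.randn(2, 20, 128)   # (batch, text tokens, dim)
    audio_feats = torch.randn(2, 50, 128)  # (batch, audio frames, dim)
    t_out, a_out = layer(text_feats, audio_feats)
    print(t_out.shape, a_out.shape)        # [2, 20, 128], [2, 50, 128]
```

In the full model as described, such fusion layers would presumably be applied to both shallow-layer and deep-layer encoder outputs and combined with a mixture of several gated experts per modality; the sketch above shows only the core attention-plus-gated-expert pattern.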
Journal description:
Applied Soft Computing is an international journal promoting an integrated view of soft computing to solve real-life problems. The focus is to publish the highest-quality research on the application and convergence of Fuzzy Logic, Neural Networks, Evolutionary Computing, Rough Sets, and other similar techniques to address real-world complexities.
Applied Soft Computing is a rolling publication: articles are published as soon as the editor-in-chief has accepted them. The website is therefore continuously updated with new articles, and publication times are short.