An effective Multi-Modality Feature Synergy and Feature Enhancer for multimodal intent recognition

IF 4 · Zone 3 (Computer Science) · Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

Computers & Electrical Engineering Pub Date : 2025-04-01 DOI:10.1016/j.compeleceng.2025.110301

Yichao Xia, Jinmiao Song, Shenwei Tian, Qimeng Yang, Xin Fan, Zhezhe Zhu

Volume 123, Article 110301. Full text: https://www.sciencedirect.com/science/article/pii/S0045790625002447

Citations: 0

Abstract

Multimodal intent recognition is a critical task that aims to capture and interpret a user’s true intentions by integrating sensory inputs such as facial expressions, body language, and vocal emotions. In complex and dynamic real-world multimodal interaction scenarios, a deeper understanding of human language and behavior is essential. Although multimodal data is rich in information, enhancing feature representations and efficiently integrating multimodal information to improve intent recognition performance remains a significant technical challenge. To address this challenge, a Video Feature Enhancer (VFE) module, combined with a Multi-Modality Feature Synergy (MFS) method, is proposed. The VFE module employs a feature-weighting strategy based on energy optimization, along with an attention mechanism across channel spaces, to enhance the representational capability of video features. The MFS method uses multi-level textual feature guidance and multimodal association learning to integrate and optimize the feature representations of the video and audio modalities. It also suppresses non-essential information and facilitates the fusion of complementary information across modalities, ultimately improving multimodal intent recognition performance. In experiments on two benchmark datasets, the approach shows significant improvements over existing state-of-the-art methods. On the MIntRec dataset, accuracy (ACC) improves by 0.6%, weighted F1 score (WF1) by 1.21%, weighted precision (WP) by 1.7%, and recall (R) by 1.8%. On the MELD-DA dataset, ACC improves by 0.9%, WF1 by 1.15%, WP by 1.34%, and R by 0.21%. Ablation studies further validate the substantial contributions of both the VFE module and the MFS method to enhancing modality-specific feature representations and improving intent recognition accuracy.
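The abstract reports ACC, WF1, WP, and R without defining them. As a rough illustration only, the sketch below computes these four quantities for a single-label classifier using their standard definitions. It is not the authors' evaluation code. The function name `intent_metrics` and the example labels are hypothetical. The abstract does not say how recall is averaged; the sketch assumes macro averaging, because support-weighted recall is always identical to accuracy in single-label classification and the abstract reports different gains for ACC and R.

```python
from collections import Counter


def intent_metrics(y_true, y_pred):
    """Return ACC, support-weighted precision and F1, and macro recall.

    Hypothetical helper: standard metric definitions, not the paper's code.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    n = len(y_true)
    support = Counter(y_true)    # true samples per class
    predicted = Counter(y_pred)  # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    acc = sum(correct.values()) / n
    wp = wf1 = recall_sum = 0.0
    for label, count in support.items():
        tp = correct[label]
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / count
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weight = count / n       # class share of the true labels
        wp += weight * precision
        wf1 += weight * f1
        recall_sum += recall     # assumption: macro (unweighted) recall

    return {"ACC": acc, "WF1": wf1, "WP": wp, "R": recall_sum / len(support)}


if __name__ == "__main__":
    # Made-up intent labels, for illustration only.
    truth = ["inform", "inform", "ask", "thank"]
    preds = ["inform", "ask", "ask", "thank"]
    print(intent_metrics(truth, preds))
```

Classes that appear only in the predictions contribute nothing to the weighted sums, since their support is zero; they still lower the precision of no class other than their own, which matches the usual support-weighted convention.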

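Source journal

Computers & Electrical Engineering (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore

9.20

Self-citation rate

7.00%

Number of publications

661

Review time

47 days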

Journal description: The impact of computers has nowhere been more revolutionary than in electrical engineering. The design, analysis, and operation of electrical and electronic systems are now dominated by computers, a transformation that has been motivated by the natural ease of interface between computers and electrical systems, and the promise of spectacular improvements in speed and efficiency. Published since 1973, Computers & Electrical Engineering provides rapid publication of topical research into the integration of computer technology and computational techniques with electrical and electronic systems. The journal publishes papers featuring novel implementations of computers and computational techniques in areas like signal and image processing, high-performance computing, parallel processing, and communications. Special attention will be paid to papers describing innovative architectures, algorithms, and software tools.