An effective Multi-Modality Feature Synergy and Feature Enhancer for multimodal intent recognition

IF 4.0 | Zone 3 (Computer Science) | JCR Q1: COMPUTER SCIENCE, HARDWARE & ARCHITECTURE
Yichao Xia, Jinmiao Song, Shenwei Tian, Qimeng Yang, Xin Fan, Zhezhe Zhu
DOI: 10.1016/j.compeleceng.2025.110301
Journal: Computers & Electrical Engineering, Volume 123, Article 110301
Published: 2025-04-01
Publication type: Journal Article
Full text: https://www.sciencedirect.com/science/article/pii/S0045790625002447
Citations: 0

Abstract

Multimodal intent recognition aims to accurately capture and interpret a user’s true intentions by integrating sensory inputs such as facial expressions, body language, and vocal emotions. In complex and dynamic real-world multimodal interaction scenarios, a deeper understanding of human language and behavior is essential. Although multimodal data is rich in information, enhancing feature representations and efficiently integrating multimodal information to improve intent recognition performance remains a significant technical challenge. To address this challenge, a Video Feature Enhancer (VFE) module, combined with a Multi-Modality Feature Synergy (MFS) method, is proposed. The VFE module employs a feature-weighting strategy based on energy optimization, along with an attention mechanism across channel spaces, to enhance the representational capability of video features. The MFS method uses multi-level textual feature guidance and multimodal association learning to integrate and optimize the feature representations of the video and audio modalities. It also suppresses non-essential information, facilitating the fusion of complementary information across modalities and ultimately improving multimodal intent recognition performance. In the experimental evaluation, the approach outperforms existing state-of-the-art methods on two benchmark datasets. On the MIntRec dataset, accuracy (ACC) improves by 0.6%, weighted F1 score (WF1) by 1.21%, weighted precision (WP) by 1.7%, and recall (R) by 1.8%. On the MELD-DA dataset, ACC improves by 0.9%, WF1 by 1.15%, WP by 1.34%, and R by 0.21%. Furthermore, ablation studies validate the substantial contributions of both the VFE module and the MFS method to enhancing modality-specific feature representations and improving intent recognition accuracy.
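For readers unfamiliar with the reported metrics, the following is a minimal sketch of how ACC, support-weighted precision (WP), and support-weighted F1 (WF1) are conventionally computed for a multi-class task. It is not the authors' evaluation code. The function name `weighted_metrics` and the toy labels are hypothetical, and because the abstract does not say how recall (R) is averaged, macro averaging is assumed here purely for illustration.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Return ACC, WP, WF1 and macro recall for two equal-length label lists.

    Hypothetical helper for illustration; the averaging used for recall
    is an assumption (the abstract does not specify it).
    """
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true samples per class
    n = len(y_true)

    acc = sum(t == p for t, p in zip(y_true, y_pred)) / n
    wp = wf1 = macro_r = 0.0
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        predicted = sum(p == c for p in y_pred)
        actual = support[c]
        prec = tp / predicted if predicted else 0.0
        rec = tp / actual if actual else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        # "Weighted" means each class contributes in proportion to its support.
        wp += prec * actual / n
        wf1 += f1 * actual / n
        # Macro recall: every class counts equally (assumed averaging).
        macro_r += rec / len(labels)
    return {"ACC": acc, "WP": wp, "WF1": wf1, "R": macro_r}


if __name__ == "__main__":
    # Toy intent labels, invented for this example only.
    y_true = ["a", "a", "b", "b", "c"]
    y_pred = ["a", "b", "b", "b", "c"]
    print(weighted_metrics(y_true, y_pred))
```

In practice the same support-weighted averages are available from scikit-learn's `precision_recall_fscore_support` with `average="weighted"`; the pure-Python version above only makes the weighting explicit.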
Source journal
Computers & Electrical Engineering (Engineering & Technology: Engineering, Electrical & Electronic)
CiteScore: 9.20
Self-citation rate: 7.00%
Articles published: 661
Review time: 47 days
Journal description: The impact of computers has nowhere been more revolutionary than in electrical engineering. The design, analysis, and operation of electrical and electronic systems are now dominated by computers, a transformation that has been motivated by the natural ease of interface between computers and electrical systems, and the promise of spectacular improvements in speed and efficiency. Published since 1973, Computers & Electrical Engineering provides rapid publication of topical research into the integration of computer technology and computational techniques with electrical and electronic systems. The journal publishes papers featuring novel implementations of computers and computational techniques in areas like signal and image processing, high-performance computing, parallel processing, and communications. Special attention will be paid to papers describing innovative architectures, algorithms, and software tools.