{"title":"Multimodal intent recognition based on text-guided cross-modal attention","authors":"Zhengyi Li, Junjie Peng, Xuanchao Lin, Zesu Cai","doi":"10.1007/s10489-025-06583-2","DOIUrl":null,"url":null,"abstract":"<div><p>In natural language understanding, intent recognition stands out as a crucial task that has drawn significant attention. While previous research focuses on intent recognition using task-specific unimodal data, real-world scenarios often involve human intents expressed through various ways, including speech, tone of voice, facial expressions, and actions. This prompts research into integrating multimodal information to more accurately identify human intent. However, existing intent recognition studies often fuse textual and non-textual modalities without considering their quality gap. The gap in feature quality across different modalities hinders the improvement of the model’s performance. To address this challenge, we propose a multimodal intent recognition model to enhance non-textual modality features. Specifically, we enrich the semantics of non-textual modalities by replacing redundant information through text-guided cross-modal attention. Additionally, we introduce a text-centric adaptive fusion gating mechanism to capitalize on the primary role of text modality in intent recognition. Extensive experiments on two multimodal task datasets show that our proposed model performs better in all metrics than state-of-the-art multimodal models. 
The results demonstrate that our model efficiently enhances non-textual modality features and fuses multimodal information, showing promising potential for intent recognition.</p></div>","PeriodicalId":8041,"journal":{"name":"Applied Intelligence","volume":"55 7","pages":""},"PeriodicalIF":3.4000,"publicationDate":"2025-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Intelligence","FirstCategoryId":"94","ListUrlMain":"https://link.springer.com/article/10.1007/s10489-025-06583-2","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
In natural language understanding, intent recognition stands out as a crucial task that has drawn significant attention. While previous research focuses on intent recognition using task-specific unimodal data, in real-world scenarios humans often express intent in various ways, including speech, tone of voice, facial expressions, and actions. This prompts research into integrating multimodal information to identify human intent more accurately. However, existing intent recognition studies often fuse textual and non-textual modalities without considering the quality gap between them. This gap in feature quality across modalities limits model performance. To address this challenge, we propose a multimodal intent recognition model that enhances non-textual modality features. Specifically, we enrich the semantics of non-textual modalities by replacing redundant information through text-guided cross-modal attention. Additionally, we introduce a text-centric adaptive fusion gating mechanism to capitalize on the primary role of the text modality in intent recognition. Extensive experiments on two multimodal task datasets show that our proposed model outperforms state-of-the-art multimodal models on all metrics. The results demonstrate that our model efficiently enhances non-textual modality features and fuses multimodal information, showing promising potential for intent recognition.
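The two mechanisms named in the abstract can be illustrated with a minimal numerical sketch. This is not the paper's implementation: all shapes, weight names (`Wg`), and the single-head formulation are assumptions made here for illustration. The sketch shows the general pattern of (1) cross-modal attention in which text features act as queries over a non-textual sequence (e.g. audio frames), producing text-aligned non-textual features, and (2) a text-centric gate that adaptively controls how much of the enhanced non-textual signal is mixed into the text representation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def text_guided_attention(text, nontext, d):
    # text: (Lt, d) used as queries; nontext: (Ln, d) used as keys/values.
    # Scaled dot-product attention aligns non-textual features to text tokens.
    scores = text @ nontext.T / np.sqrt(d)        # (Lt, Ln)
    return softmax(scores) @ nontext              # (Lt, d) text-aligned features

def text_centric_gate(text, enhanced, Wg):
    # Hypothetical gate: a sigmoid over the concatenated features decides,
    # per dimension, how much enhanced non-textual signal to add to the text.
    z = np.concatenate([text, enhanced], axis=-1) @ Wg   # (Lt, d)
    g = 1.0 / (1.0 + np.exp(-z))                         # gate in (0, 1)
    return text + g * enhanced                           # text stays primary

rng = np.random.default_rng(0)
d = 8
text = rng.standard_normal((4, d))      # 4 text tokens (assumed length)
audio = rng.standard_normal((6, d))     # 6 audio frames (assumed length)
Wg = rng.standard_normal((2 * d, d)) * 0.1

enhanced = text_guided_attention(text, audio, d)
fused = text_centric_gate(text, enhanced, Wg)
print(fused.shape)  # (4, 8): one fused vector per text token
```

Note the design choice implied by the abstract: text is the anchor modality, so the fused output keeps the text features intact and only adds gated non-textual information on top, rather than averaging the modalities symmetrically.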
Journal description:
With a focus on research in artificial intelligence and neural networks, this journal addresses real-life manufacturing, defense, management, government, and industrial problems that are too complex to be solved through conventional approaches and require the simulation of intelligent thought processes, heuristics, applications of knowledge, and distributed and parallel processing. The integration of these multiple approaches in solving complex problems is of particular importance.
The journal presents new and original research and technological developments, addressing real and complex issues applicable to difficult problems. It provides a medium for exchanging scientific research and technological achievements accomplished by the international community.