MLLMs-MR: Multi-modal recognition based on multi-modal large language models
Shengwei Fu, Mingyang Yu, Kaichen OuYang, Qingsong Fan, Haisong Huang
Knowledge-Based Systems, Volume 330, Article 114717 (published 2025-10-20). DOI: 10.1016/j.knosys.2025.114717
Abstract
To address the challenges of recognizing data from different modalities, this study proposes a multi-modal recognition method based on Multi-modal Large Language Models (MLLMs-MR) that can process six distinct data types: images, videos, audio, thermal, point cloud, and event data. Existing methods such as UniBind treat language as the central modality and construct a text-centric representation space, effectively reducing the representation imbalance among modalities and improving recognition accuracy. However, in these methods the descriptions generated by Multi-modal Large Language Models (MLLMs) are used directly for contrastive learning of text embeddings, which loses semantic information from the original input. Furthermore, relying solely on Large Language Models (LLMs) to provide the embedding centers can lead to category misclassification. To address these issues, we propose three key improvements over UniBind: (1) constructing a category-based knowledge base with MLLMs, effectively reducing irrelevant descriptions; (2) designing fusion embedding center localization, which uses LLMs, MLLMs, and basic prompts to improve the robustness of the embedding centers; and (3) introducing a cross-modal attention mechanism that incorporates MLLM-generated descriptions during training, enabling the model to better learn semantic information from multi-modal data and strengthen feature representation. The MLLM-enhanced embeddings are then aligned with class labels by contrastive learning to enable recognition of multi-modal data. Experimental results show that MLLMs-MR outperforms existing models in multi-modal zero-shot recognition, with a 6.42% accuracy gain on MSR-VTT, and achieves an 8.19% improvement when fine-tuned on the ESC audio dataset under 5-fold evaluation.
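For readers unfamiliar with this family of text-centric methods, the following is a minimal, self-contained PyTorch sketch of the alignment idea summarized above: per-class embedding centers are fused from several text sources (LLM descriptions, MLLM descriptions, and basic prompts), modality features are enriched by cross-modal attention over MLLM-generated description tokens, and recognition assigns each sample to its nearest center after contrastive alignment. All names, dimensions, the mean-based fusion, and the random tensors standing in for real encoder outputs are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 512
NUM_CLASSES = 10

class CrossModalAttention(nn.Module):
    """Toy cross-modal attention: modality tokens attend to the tokens of an
    MLLM-generated description to enrich the pooled embedding."""
    def __init__(self, dim=EMBED_DIM, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, modal_tokens, desc_tokens):
        out, _ = self.attn(query=modal_tokens, key=desc_tokens, value=desc_tokens)
        return F.normalize(out.mean(dim=1), dim=-1)

def fusion_embedding_centers(llm_emb, mllm_emb, prompt_emb):
    """Fuse per-class text embeddings from three sources (LLM descriptions,
    MLLM descriptions, basic prompts) into one center per class. A simple mean
    is used here; the paper's localization step is presumably more involved."""
    return F.normalize((llm_emb + mllm_emb + prompt_emb) / 3.0, dim=-1)

def alignment_loss(modal_emb, centers, labels, temperature=0.07):
    """InfoNCE-style contrastive loss aligning samples with their class centers."""
    logits = modal_emb @ centers.t() / temperature  # (batch, num_classes)
    return F.cross_entropy(logits, labels)

if __name__ == "__main__":
    torch.manual_seed(0)

    # Per-class text embeddings from the three text sources (assumed to be
    # precomputed by a frozen text encoder; random tensors stand in for them).
    llm_txt = F.normalize(torch.randn(NUM_CLASSES, EMBED_DIM), dim=-1)
    mllm_txt = F.normalize(torch.randn(NUM_CLASSES, EMBED_DIM), dim=-1)
    prompt_txt = F.normalize(torch.randn(NUM_CLASSES, EMBED_DIM), dim=-1)
    centers = fusion_embedding_centers(llm_txt, mllm_txt, prompt_txt)

    # A batch of modality tokens (e.g. patch/frame tokens) and the token
    # embeddings of their MLLM-generated descriptions (random stand-ins).
    xattn = CrossModalAttention()
    modal_tokens = torch.randn(8, 16, EMBED_DIM)
    desc_tokens = torch.randn(8, 32, EMBED_DIM)
    feats = xattn(modal_tokens, desc_tokens)        # (8, EMBED_DIM)
    labels = torch.randint(0, NUM_CLASSES, (8,))

    loss = alignment_loss(feats, centers, labels)
    print(f"alignment loss: {loss.item():.4f}")

    # Zero-shot-style recognition: assign each sample to the nearest center.
    preds = (feats @ centers.t()).argmax(dim=-1)
    print("predicted classes:", preds.tolist())
```

In a real system the stand-in tensors would come from pre-trained modality encoders and a text encoder over the generated descriptions, and the attention module and encoders would be trained jointly with the contrastive objective.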
Journal overview:
Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems built with knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, provide balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.