MLLMs-MR: Multi-modal recognition based on multi-modal large language models

IF 7.6, CAS Tier 1 (Computer Science), JCR Q1 (Computer Science, Artificial Intelligence)
Shengwei Fu, Mingyang Yu, Kaichen OuYang, Qingsong Fan, Haisong Huang
{"title":"mlms - mr:基于多模态大语言模型的多模态识别","authors":"Shengwei Fu ,&nbsp;Mingyang Yu ,&nbsp;Kaichen OuYang ,&nbsp;Qingsong Fan ,&nbsp;Haisong Huang","doi":"10.1016/j.knosys.2025.114717","DOIUrl":null,"url":null,"abstract":"<div><div>To address the challenges of recognizing data from different modalities, this study proposes a multi-modal recognition method based on Multi-modal Large Language Models (MLLMs-MR), which can process six distinct data types: images, videos, audio, thermal, point cloud, and event data. Existing methods, such as UniBind, treat language as the central modality and construct a text-centric representation space, effectively reducing the representation imbalance among different modalities and improving recognition accuracy. However, descriptions generated by Multi-modal Large Language Models (MLLMs) are directly used for contrastive learning of text embeddings, which would result in a loss of semantic information from the original input. Furthermore, sole reliance on Large Language Models (LLMs) to be used as embedding centers can lead to category misclassification. To address these issues, we have proposed three key improvements based on UniBind: (1) constructing a category-based knowledge base using MLLMs, effectively reducing irrelevant descriptions; (2) designing fusion embedding center localization, which utilizes LLMs, MLLMs, and basic prompts to enhance the robustness of embedding centers; and (3) proposing a cross-modal attention mechanism that incorporates MLLMs-generated descriptions during training, enabling the model to better learn semantic information from multi-modal data and enhance feature representation. Subsequently, MLLMs-enhanced embeddings are aligned with class labels by contrastive learning to enable the recognition of multi-modal data. Experimental results demonstrate that MLLMs-MR outperforms existing models in multi-modal zero-shot recognition, with a 6.42 % accuracy gain on MSR-VTT. It shows an improvement of 8.19 % during fine-tuning on the ESC 5-fold audio dataset.</div></div>","PeriodicalId":49939,"journal":{"name":"Knowledge-Based Systems","volume":"330 ","pages":"Article 114717"},"PeriodicalIF":7.6000,"publicationDate":"2025-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"MLLMs-MR: Multi-modal recognition based on multi-modal large language models\",\"authors\":\"Shengwei Fu ,&nbsp;Mingyang Yu ,&nbsp;Kaichen OuYang ,&nbsp;Qingsong Fan ,&nbsp;Haisong Huang\",\"doi\":\"10.1016/j.knosys.2025.114717\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>To address the challenges of recognizing data from different modalities, this study proposes a multi-modal recognition method based on Multi-modal Large Language Models (MLLMs-MR), which can process six distinct data types: images, videos, audio, thermal, point cloud, and event data. Existing methods, such as UniBind, treat language as the central modality and construct a text-centric representation space, effectively reducing the representation imbalance among different modalities and improving recognition accuracy. However, descriptions generated by Multi-modal Large Language Models (MLLMs) are directly used for contrastive learning of text embeddings, which would result in a loss of semantic information from the original input. Furthermore, sole reliance on Large Language Models (LLMs) to be used as embedding centers can lead to category misclassification. 
To address these issues, we have proposed three key improvements based on UniBind: (1) constructing a category-based knowledge base using MLLMs, effectively reducing irrelevant descriptions; (2) designing fusion embedding center localization, which utilizes LLMs, MLLMs, and basic prompts to enhance the robustness of embedding centers; and (3) proposing a cross-modal attention mechanism that incorporates MLLMs-generated descriptions during training, enabling the model to better learn semantic information from multi-modal data and enhance feature representation. Subsequently, MLLMs-enhanced embeddings are aligned with class labels by contrastive learning to enable the recognition of multi-modal data. Experimental results demonstrate that MLLMs-MR outperforms existing models in multi-modal zero-shot recognition, with a 6.42 % accuracy gain on MSR-VTT. It shows an improvement of 8.19 % during fine-tuning on the ESC 5-fold audio dataset.</div></div>\",\"PeriodicalId\":49939,\"journal\":{\"name\":\"Knowledge-Based Systems\",\"volume\":\"330 \",\"pages\":\"Article 114717\"},\"PeriodicalIF\":7.6000,\"publicationDate\":\"2025-10-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Knowledge-Based Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0950705125017563\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Knowledge-Based Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0950705125017563","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

To address the challenges of recognizing data from different modalities, this study proposes a multi-modal recognition method based on Multi-modal Large Language Models (MLLMs-MR), which can process six distinct data types: images, videos, audio, thermal, point cloud, and event data. Existing methods such as UniBind treat language as the central modality and construct a text-centric representation space, effectively reducing the representation imbalance among modalities and improving recognition accuracy. However, they use descriptions generated by Multi-modal Large Language Models (MLLMs) directly for contrastive learning of text embeddings, which loses semantic information from the original input. Furthermore, relying solely on Large Language Models (LLMs) as embedding centers can lead to category misclassification. To address these issues, we propose three key improvements over UniBind: (1) constructing a category-based knowledge base using MLLMs, which effectively reduces irrelevant descriptions; (2) designing fusion embedding center localization, which combines LLMs, MLLMs, and basic prompts to enhance the robustness of embedding centers; and (3) introducing a cross-modal attention mechanism that incorporates MLLM-generated descriptions during training, enabling the model to better learn semantic information from multi-modal data and enhance feature representation. The MLLM-enhanced embeddings are then aligned with class labels through contrastive learning to recognize multi-modal data. Experimental results demonstrate that MLLMs-MR outperforms existing models in multi-modal zero-shot recognition, with a 6.42% accuracy gain on MSR-VTT, and achieves an 8.19% improvement when fine-tuned on the ESC 5-fold audio dataset.
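The abstract gives no implementation details for improvement (1). As a rough illustration only, a category-based knowledge base could be built by querying an MLLM once per class name with a prompt constrained to the class itself, which is one plausible way to "reduce irrelevant descriptions". The `query_mllm` callable and the prompt wording below are hypothetical placeholders, not an API from the paper.

```python
def build_knowledge_base(class_names, query_mllm):
    """Build a {class_name: description} knowledge base, one MLLM query per class.

    query_mllm is a caller-supplied callable (e.g. wrapping an MLLM chat
    endpoint); it is a placeholder, not an interface from the paper.
    """
    kb = {}
    for name in class_names:
        # Constrain the prompt to the class itself so the returned text
        # avoids irrelevant context (the abstract's stated goal).
        prompt = (f"Describe the distinguishing characteristics of '{name}'. "
                  f"Mention only attributes of '{name}' itself.")
        kb[name] = query_mllm(prompt)
    return kb
```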
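For improvement (2), one plausible reading of "fusion embedding center localization" is that each class center fuses text embeddings from three sources: LLM descriptions, MLLM descriptions, and a basic template prompt. Below is a minimal PyTorch sketch under that assumption; the averaging scheme, function name, and template prompt are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def fusion_embedding_center(llm_emb: torch.Tensor,
                            mllm_emb: torch.Tensor,
                            basic_emb: torch.Tensor) -> torch.Tensor:
    # Each input: (num_classes, dim) text embeddings per class, produced from
    # LLM descriptions, MLLM descriptions, and a basic template prompt
    # (e.g. "a photo of a {class}"), respectively.
    sources = [F.normalize(e, dim=-1) for e in (llm_emb, mllm_emb, basic_emb)]
    # Average the normalized sources and re-normalize so the fused class
    # centers lie on the unit hypersphere, as is standard for cosine-based
    # contrastive objectives.
    center = torch.stack(sources, dim=0).mean(dim=0)
    return F.normalize(center, dim=-1)
```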
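For improvement (3) and the final alignment step, a common realization of cross-modal attention is to let modality tokens attend over the MLLM description embeddings, after which pooled features are contrastively aligned with the fused class centers. The sketch below assumes that standard design (multi-head attention with a residual connection, plus an InfoNCE-style loss); the paper's actual architecture and loss may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    """Modality tokens (queries) attend over MLLM description tokens (keys/values)."""
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, modal_feats: torch.Tensor, desc_feats: torch.Tensor) -> torch.Tensor:
        # modal_feats: (B, N, dim) encoder features of the input modality
        # desc_feats:  (B, M, dim) token embeddings of the MLLM-generated description
        fused, _ = self.attn(modal_feats, desc_feats, desc_feats)
        return self.norm(modal_feats + fused)  # residual + LayerNorm

def contrastive_alignment_loss(features: torch.Tensor,
                               centers: torch.Tensor,
                               labels: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    # features: (B, dim) pooled, L2-normalized multi-modal embeddings
    # centers:  (C, dim) L2-normalized fused class centers (see sketch above)
    # labels:   (B,) integer class indices
    logits = features @ centers.t() / temperature  # cosine-similarity logits
    return F.cross_entropy(logits, labels)
```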
Source journal: Knowledge-Based Systems (CAS category: Engineering Technology, Computer Science: Artificial Intelligence)
CiteScore: 14.80
Self-citation rate: 12.50%
Publication volume: 1245
Review time: 7.8 months
Journal description: Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems based on knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, provide balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.