MALM-CLIP: A generative multi-agent framework for multimodal fusion in few-shot industrial anomaly detection

IF 15.5 | CAS Tier 1, Computer Science | JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Hanzhi Chen, Jingbin Que, Kexin Zhu, Zhide Chen, Fei Zhu, Wencheng Yang, Xu Yang, Xuechao Yang
{"title":"MALM-CLIP:一种生成式多智能体框架,用于工业异常检测中的多模态融合","authors":"Hanzhi Chen ,&nbsp;Jingbin Que ,&nbsp;Kexin Zhu ,&nbsp;Zhide Chen ,&nbsp;Fei Zhu ,&nbsp;Wencheng Yang ,&nbsp;Xu Yang ,&nbsp;Xuechao Yang","doi":"10.1016/j.inffus.2025.103765","DOIUrl":null,"url":null,"abstract":"<div><div>The Contrastive Language-Image Pre-training (CLIP) model has significantly improved few-shot industrial anomaly detection. However, existing approaches often rely on manually crafted visual description texts, which lack robustness and generalizability in real-world production settings. This limitation is evident as these methods struggle to adapt to new or evolving anomalies, where original prompts fail to generalize beyond their initial design. This paper proposes a novel method, Multi-agent Language Models with CLIP (MALM-CLIP), which integrates the generative capabilities of large language models (LLMs) with CLIP within a multi-agent framework. In this system, specialized agents handle different subtasks such as prompt generation and model evaluation, enabling automated and context-aware multimodal information fusion. By eliminating manual prompt engineering, MALM-CLIP enhances both the accuracy and efficiency of anomaly detection. Experimental results on standard datasets such as MVTec and VisA demonstrate that our approach outperforms existing methods in detecting image-level anomalies with minimal training data. This work highlights the potential of combining Generative Artificial Intelligence (GenAI) and multi-agent systems for robust few-shot industrial anomaly detection.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"127 ","pages":"Article 103765"},"PeriodicalIF":15.5000,"publicationDate":"2025-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"MALM-CLIP: A generative multi-agent framework for multimodal fusion in few-shot industrial anomaly detection\",\"authors\":\"Hanzhi Chen ,&nbsp;Jingbin Que ,&nbsp;Kexin Zhu ,&nbsp;Zhide Chen ,&nbsp;Fei Zhu ,&nbsp;Wencheng Yang ,&nbsp;Xu Yang ,&nbsp;Xuechao Yang\",\"doi\":\"10.1016/j.inffus.2025.103765\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>The Contrastive Language-Image Pre-training (CLIP) model has significantly improved few-shot industrial anomaly detection. However, existing approaches often rely on manually crafted visual description texts, which lack robustness and generalizability in real-world production settings. This limitation is evident as these methods struggle to adapt to new or evolving anomalies, where original prompts fail to generalize beyond their initial design. This paper proposes a novel method, Multi-agent Language Models with CLIP (MALM-CLIP), which integrates the generative capabilities of large language models (LLMs) with CLIP within a multi-agent framework. In this system, specialized agents handle different subtasks such as prompt generation and model evaluation, enabling automated and context-aware multimodal information fusion. By eliminating manual prompt engineering, MALM-CLIP enhances both the accuracy and efficiency of anomaly detection. Experimental results on standard datasets such as MVTec and VisA demonstrate that our approach outperforms existing methods in detecting image-level anomalies with minimal training data. 
This work highlights the potential of combining Generative Artificial Intelligence (GenAI) and multi-agent systems for robust few-shot industrial anomaly detection.</div></div>\",\"PeriodicalId\":50367,\"journal\":{\"name\":\"Information Fusion\",\"volume\":\"127 \",\"pages\":\"Article 103765\"},\"PeriodicalIF\":15.5000,\"publicationDate\":\"2025-09-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Fusion\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1566253525008279\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1566253525008279","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

The Contrastive Language-Image Pre-training (CLIP) model has significantly improved few-shot industrial anomaly detection. However, existing approaches often rely on manually crafted visual description texts, which lack robustness and generalizability in real-world production settings. This limitation is evident as these methods struggle to adapt to new or evolving anomalies, where original prompts fail to generalize beyond their initial design. This paper proposes a novel method, Multi-agent Language Models with CLIP (MALM-CLIP), which integrates the generative capabilities of large language models (LLMs) with CLIP within a multi-agent framework. In this system, specialized agents handle different subtasks such as prompt generation and model evaluation, enabling automated and context-aware multimodal information fusion. By eliminating manual prompt engineering, MALM-CLIP enhances both the accuracy and efficiency of anomaly detection. Experimental results on standard datasets such as MVTec and VisA demonstrate that our approach outperforms existing methods in detecting image-level anomalies with minimal training data. This work highlights the potential of combining Generative Artificial Intelligence (GenAI) and multi-agent systems for robust few-shot industrial anomaly detection.
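To make the prompt-driven scoring idea concrete, the following is a minimal sketch of how a CLIP model can score an image against "normal" versus "anomalous" text descriptions; in MALM-CLIP such descriptions would be produced automatically by an LLM prompt-generation agent rather than written by hand. The checkpoint, prompt texts, and scoring rule below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of prompt-driven CLIP anomaly scoring (illustrative only).
# The prompts, checkpoint, and scoring rule are assumptions for demonstration;
# they do not reproduce the MALM-CLIP agent pipeline.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# In MALM-CLIP these descriptions would come from an LLM agent; here they
# are hand-written placeholders for a single object category.
normal_prompts = [
    "a photo of a flawless metal screw",
    "a close-up of an intact industrial part",
]
anomaly_prompts = [
    "a photo of a scratched metal screw",
    "a close-up of a damaged industrial part with defects",
]
prompts = normal_prompts + anomaly_prompts

def anomaly_score(image_path: str) -> float:
    """Return a score in [0, 1]; higher means more likely anomalous."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image holds the image's similarity to every prompt.
    probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
    # Probability mass assigned to the anomalous prompt group.
    return probs[len(normal_prompts):].sum().item()

if __name__ == "__main__":
    print(anomaly_score("sample_screw.png"))  # hypothetical test image
```

A multi-agent setup like the one described in the abstract would wrap this scoring step: one agent generates and refines the prompt lists per product category, while another evaluates detection performance and feeds the results back, removing the need for manual prompt engineering.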
Source journal
Information Fusion (Engineering & Technology: Computer Science, Theory & Methods)
CiteScore: 33.20
Self-citation rate: 4.30%
Annual articles: 161
Review time: 7.9 months
Journal description: Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses as well as those demonstrating their application to real-world problems will be welcome.