MALM-CLIP: A generative multi-agent framework for multimodal fusion in few-shot industrial anomaly detection

Hanzhi Chen, Jingbin Que, Kexin Zhu, Zhide Chen, Fei Zhu, Wencheng Yang, Xu Yang, Xuechao Yang

Information Fusion, Volume 127, Article 103765. Published 2025-09-22. DOI: 10.1016/j.inffus.2025.103765
Citations: 0
Abstract
The Contrastive Language-Image Pre-training (CLIP) model has significantly improved few-shot industrial anomaly detection. However, existing approaches often rely on manually crafted visual description texts, which lack robustness and generalizability in real-world production settings: such methods struggle to adapt to new or evolving anomalies because hand-written prompts fail to generalize beyond their initial design. This paper proposes Multi-agent Language Models with CLIP (MALM-CLIP), a novel method that integrates the generative capabilities of large language models (LLMs) with CLIP within a multi-agent framework. In this system, specialized agents handle distinct subtasks such as prompt generation and model evaluation, enabling automated, context-aware multimodal information fusion. By eliminating manual prompt engineering, MALM-CLIP improves both the accuracy and efficiency of anomaly detection. Experimental results on standard benchmarks such as MVTec AD and VisA show that the approach outperforms existing methods at image-level anomaly detection with minimal training data. This work highlights the potential of combining Generative Artificial Intelligence (GenAI) and multi-agent systems for robust few-shot industrial anomaly detection.
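To make the core idea concrete, the sketch below shows a minimal CLIP-based anomaly-scoring step of the kind the abstract describes. It is not the authors' MALM-CLIP implementation: the hard-coded prompt lists, the `anomaly_score` helper, and the checkpoint name are illustrative assumptions standing in for the text that the paper's LLM prompt-generation agent would produce automatically.

```python
# Hypothetical sketch of CLIP-based anomaly scoring, not the paper's actual
# MALM-CLIP pipeline. In MALM-CLIP, the prompt lists below would be produced
# and refined by LLM agents; here they are hard-coded placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder descriptions of normal and anomalous states of one object class.
normal_prompts = ["a photo of a flawless metal nut",
                  "a photo of an intact metal nut"]
anomaly_prompts = ["a photo of a scratched metal nut",
                   "a photo of a damaged metal nut"]

def anomaly_score(image: Image.Image) -> float:
    """Probability mass that CLIP assigns to the anomalous descriptions."""
    texts = normal_prompts + anomaly_prompts
    inputs = processor(text=texts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        # logits_per_image: (num_images, num_texts); take the single image row.
        logits = model(**inputs).logits_per_image[0]
    probs = logits.softmax(dim=-1)
    return probs[len(normal_prompts):].sum().item()

score = anomaly_score(Image.open("nut.png").convert("RGB"))
print(f"anomaly score: {score:.3f}")
```

Under this reading, the framework's contribution is to replace the hand-written prompt lists with agent-generated ones and to add an evaluation agent that scores candidate prompts, closing the loop without manual prompt engineering.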
Journal Introduction:
Information Fusion serves as a central platform for showcasing advances in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for research and development in this field, focusing on architectures, algorithms, and applications. Papers presenting fundamental theoretical analyses, as well as those demonstrating applications to real-world problems, are welcome.