AnDR-BLIP2: enhanced semantic understanding framework for industrial image anomaly detection and report generation

Impact Factor 4.2 · CAS Tier 3 (Computer Science) · JCR Q2 (Automation & Control Systems)
Ze Gao, Jing Guo, Liming Chen, Kai Wang, Yang Chen, Yongzhen Ke, Shuai Yang
{"title":"AnDR-BLIP2: enhanced semantic understanding framework for industrial image anomaly detection and report generation","authors":"Ze Gao ,&nbsp;Jing Guo ,&nbsp;Liming Chen ,&nbsp;Kai Wang ,&nbsp;Yang Chen ,&nbsp;Yongzhen Ke ,&nbsp;Shuai Yang","doi":"10.1016/j.jfranklin.2025.107816","DOIUrl":null,"url":null,"abstract":"<div><div>Nowadays, the rapid development of Large Multimodal Models (LMM) has demonstrated its powerful ability in image understanding. However, when applied to downstream tasks such as industrial anomaly detection, it often lacks competence due to limitations in image parsing ability, pre-training data, and training strategy. Specifically, it struggles with understanding the detailed semantics of abnormal parts of images. As LLM performance continues to improve, the Industrial Image Anomaly Detection Report Generation (IADRG) task may emerge as a new challenge in the future. In this paper, we define the IADRG task as a deeper image understanding task and propose a solution for it. We propose AnDR-BLIP2, a dual-branch multi-modal large model based on the BLIP2 model combined with the SAM visual understanding branch to enhance detailed feature extraction from images. Additionally, we utilize mixed semantic pre-training of general and industrial image data to strengthen the model's ability to understand abnormal content in industrial anomaly detection tasks. Furthermore, our model leverages SAM's pixel-level feature parsing ability to integrate a prompt zero-shot industrial anomaly segmentation method into report generation. Experimental results on Mvtec-AD and VisA datasets demonstrate that our model accurately understands industrial image anomalies and achieves considerable performance in zero-shot anomaly segmentation.</div></div>","PeriodicalId":17283,"journal":{"name":"Journal of The Franklin Institute-engineering and Applied Mathematics","volume":"362 12","pages":"Article 107816"},"PeriodicalIF":4.2000,"publicationDate":"2025-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of The Franklin Institute-engineering and Applied Mathematics","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0016003225003096","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

The rapid development of Large Multimodal Models (LMMs) has demonstrated powerful image understanding capabilities. However, when applied to downstream tasks such as industrial anomaly detection, these models often fall short due to limitations in image parsing ability, pre-training data, and training strategy; in particular, they struggle to understand the detailed semantics of the abnormal regions of an image. As LMM performance continues to improve, Industrial Image Anomaly Detection Report Generation (IADRG) may emerge as a new challenge. In this paper, we define IADRG as a deeper image understanding task and propose a solution for it. We propose AnDR-BLIP2, a dual-branch large multimodal model that combines the BLIP-2 backbone with a SAM visual understanding branch to enhance the extraction of detailed image features. Additionally, we apply mixed semantic pre-training on general and industrial image data to strengthen the model's understanding of abnormal content in industrial anomaly detection tasks. Furthermore, the model leverages SAM's pixel-level feature parsing to integrate a prompt-based zero-shot industrial anomaly segmentation method into report generation. Experimental results on the MVTec-AD and VisA datasets demonstrate that our model accurately understands industrial image anomalies and achieves competitive performance in zero-shot anomaly segmentation.
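The abstract describes a dual-branch design: a BLIP-2-style branch for global image semantics and a SAM-style branch for pixel-level detail, whose features are fused before the language model that generates the anomaly report. Since only the abstract is available here, the following is a minimal, hypothetical PyTorch sketch of how such a fusion could be wired; the module names (GlobalBranch, DetailBranch, AnDRFusion), the toy encoders, the dimensions, and the concatenate-and-project fusion are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch, NOT the authors' code: a BLIP-2-like global branch and a
# SAM-like detail branch each produce token sequences, which are concatenated
# and projected into the embedding space of the report-generating LLM.
import torch
import torch.nn as nn


class GlobalBranch(nn.Module):
    """Stand-in for the BLIP-2 ViT + Q-Former: image -> N_q query tokens."""
    def __init__(self, num_queries=32, dim=768):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # toy patch embedding
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, image):                                   # image: (B, 3, H, W)
        patches = self.patch_embed(image).flatten(2).transpose(1, 2)     # (B, P, dim)
        q = self.queries.unsqueeze(0).expand(image.size(0), -1, -1)
        tokens, _ = self.attn(q, patches, patches)                       # (B, N_q, dim)
        return tokens


class DetailBranch(nn.Module):
    """Stand-in for a SAM-style image encoder: image -> dense pixel-level tokens."""
    def __init__(self, dim=256):
        super().__init__()
        self.encoder = nn.Conv2d(3, dim, kernel_size=8, stride=8)        # toy dense encoder

    def forward(self, image):
        feat = self.encoder(image)                                       # (B, dim, H/8, W/8)
        return feat.flatten(2).transpose(1, 2)                           # (B, M, dim)


class AnDRFusion(nn.Module):
    """Concatenate both token streams and project them to the LLM hidden width."""
    def __init__(self, llm_dim=4096, global_dim=768, detail_dim=256):
        super().__init__()
        self.global_branch = GlobalBranch(dim=global_dim)
        self.detail_branch = DetailBranch(dim=detail_dim)
        self.proj_global = nn.Linear(global_dim, llm_dim)
        self.proj_detail = nn.Linear(detail_dim, llm_dim)

    def forward(self, image):
        g = self.proj_global(self.global_branch(image))                  # (B, N_q, llm_dim)
        d = self.proj_detail(self.detail_branch(image))                  # (B, M,   llm_dim)
        # These fused tokens would serve as the visual prefix for the report LLM.
        return torch.cat([g, d], dim=1)


if __name__ == "__main__":
    tokens = AnDRFusion()(torch.randn(1, 3, 224, 224))
    print(tokens.shape)   # torch.Size([1, 816, 4096]) for a 224x224 input (32 + 784 tokens)
```

In this sketch the detail branch contributes many more tokens than the query branch, which mirrors the idea of adding pixel-level context; how the actual model balances, pools, or cross-attends between the two streams is not specified in the abstract.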
Source journal metrics: CiteScore 7.30 · Self-citation rate 14.60% · Annual publications 586 · Review time 6.9 months
About the journal: The Journal of The Franklin Institute has an established reputation for publishing high-quality papers in the field of engineering and applied mathematics. Its current focus is on control systems, complex networks and dynamic systems, signal processing and communications, and their applications. All submitted papers are peer-reviewed. The Journal publishes original research papers and substantive review papers. Papers and special focus issues are judged on their possible lasting value, which has been and continues to be the strength of the Journal of The Franklin Institute.