{"title":"FocusCLIP:通过视觉文本差异关注异常区域","authors":"Yuan Zhao;Jiayu Sun;Lihe Zhang;Huchuan Lu","doi":"10.1109/TCSVT.2024.3524784","DOIUrl":null,"url":null,"abstract":"Few-shot anomaly detection aims to detect defects with only a limited number of normal samples for training. Recent few-shot methods typically focus on object-level features rather than subtle defects within objects, as pretrained models are generally trained on classification or image-text matching datasets. However, object-level features are often insufficient to detect defects, which are characterized by fine-grained texture variations. To address this, we propose FocusCLIP, which consists of a vision-guided branch and a language-guided branch. FocusCLIP leverages the complementary relationship between visual and text modalities to jointly emphasize discrepancies in fine-grained textures of defect regions. Specifically, we design three modules to mine these discrepancies. In the vision-guided branch, we propose the Bidirectional Self-knowledge Distillation (BSD) structure, which identifies anomaly regions through inconsistent representations and accumulates these discrepancies. Within this structure, the Anomaly Capture Module (ACM) is designed to refine features and detect more comprehensive anomalies by leveraging semantic cues from multi-head self-attention. In the language-guided branch, Multi-level Adversarial Class Activation Mapping (MACAM) utilizes foreground-invariant responses to adversarial text prompts, reducing interference from object regions and further focusing on defect regions. Our approach outperforms the state-of-the-art methods in few-shot anomaly detection. 
Additionally, the language-guided branch within FocusCLIP also demonstrates competitive performance in zero-shot anomaly detection, further validating the effectiveness of our proposed method.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4883-4895"},"PeriodicalIF":8.3000,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"FocusCLIP: Focusing on Anomaly Regions by Visual-Text Discrepancies\",\"authors\":\"Yuan Zhao;Jiayu Sun;Lihe Zhang;Huchuan Lu\",\"doi\":\"10.1109/TCSVT.2024.3524784\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Few-shot anomaly detection aims to detect defects with only a limited number of normal samples for training. Recent few-shot methods typically focus on object-level features rather than subtle defects within objects, as pretrained models are generally trained on classification or image-text matching datasets. However, object-level features are often insufficient to detect defects, which are characterized by fine-grained texture variations. To address this, we propose FocusCLIP, which consists of a vision-guided branch and a language-guided branch. FocusCLIP leverages the complementary relationship between visual and text modalities to jointly emphasize discrepancies in fine-grained textures of defect regions. Specifically, we design three modules to mine these discrepancies. In the vision-guided branch, we propose the Bidirectional Self-knowledge Distillation (BSD) structure, which identifies anomaly regions through inconsistent representations and accumulates these discrepancies. Within this structure, the Anomaly Capture Module (ACM) is designed to refine features and detect more comprehensive anomalies by leveraging semantic cues from multi-head self-attention. 
In the language-guided branch, Multi-level Adversarial Class Activation Mapping (MACAM) utilizes foreground-invariant responses to adversarial text prompts, reducing interference from object regions and further focusing on defect regions. Our approach outperforms the state-of-the-art methods in few-shot anomaly detection. Additionally, the language-guided branch within FocusCLIP also demonstrates competitive performance in zero-shot anomaly detection, further validating the effectiveness of our proposed method.\",\"PeriodicalId\":13082,\"journal\":{\"name\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"volume\":\"35 5\",\"pages\":\"4883-4895\"},\"PeriodicalIF\":8.3000,\"publicationDate\":\"2024-12-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10819451/\",\"RegionNum\":1,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10819451/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
FocusCLIP: Focusing on Anomaly Regions by Visual-Text Discrepancies
Abstract:
Few-shot anomaly detection aims to detect defects with only a limited number of normal samples for training. Recent few-shot methods typically focus on object-level features rather than subtle defects within objects, as pretrained models are generally trained on classification or image-text matching datasets. However, object-level features are often insufficient to detect defects, which are characterized by fine-grained texture variations. To address this, we propose FocusCLIP, which consists of a vision-guided branch and a language-guided branch. FocusCLIP leverages the complementary relationship between visual and text modalities to jointly emphasize discrepancies in fine-grained textures of defect regions. Specifically, we design three modules to mine these discrepancies. In the vision-guided branch, we propose the Bidirectional Self-knowledge Distillation (BSD) structure, which identifies anomaly regions through inconsistent representations and accumulates these discrepancies. Within this structure, the Anomaly Capture Module (ACM) is designed to refine features and detect more comprehensive anomalies by leveraging semantic cues from multi-head self-attention. In the language-guided branch, Multi-level Adversarial Class Activation Mapping (MACAM) utilizes foreground-invariant responses to adversarial text prompts, reducing interference from object regions and further focusing on defect regions. Our approach outperforms the state-of-the-art methods in few-shot anomaly detection. Additionally, the language-guided branch within FocusCLIP also demonstrates competitive performance in zero-shot anomaly detection, further validating the effectiveness of our proposed method.
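The abstract does not include code, but the general language-guided idea it describes (scoring image regions by their similarity to "normal" versus "defect" text embeddings in a CLIP-style joint space) can be illustrated with a minimal sketch. This is only a rough illustration of that common pattern, not the authors' FocusCLIP implementation; all function names, the toy embeddings, and the temperature value are assumptions for demonstration.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Project vectors onto the unit sphere, as CLIP does before matching."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def language_guided_anomaly_map(patch_feats, normal_text, defect_text, temperature=0.07):
    """Toy per-patch anomaly score (NOT the paper's method): softmax over
    cosine similarities between each patch embedding and two text embeddings,
    returning the probability mass assigned to the 'defect' prompt."""
    p = l2_normalize(patch_feats)                            # (N, D) patches
    t = l2_normalize(np.stack([normal_text, defect_text]))   # (2, D) prompts
    logits = p @ t.T / temperature                           # (N, 2) similarities
    logits -= logits.max(axis=1, keepdims=True)              # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs[:, 1]                                       # P(defect) per patch


# Toy demo: 4 patch embeddings in a 3-D space (hypothetical values).
normal_t = np.array([1.0, 0.0, 0.0])
defect_t = np.array([0.0, 1.0, 0.0])
patches = np.array([
    [0.9, 0.1, 0.0],   # normal-looking patch
    [0.1, 0.9, 0.0],   # defect-looking patch
    [0.8, 0.2, 0.0],
    [0.2, 0.8, 0.1],
])
scores = language_guided_anomaly_map(patches, normal_t, defect_t)
```

In practice the patch features would come from a frozen vision-language encoder, and the paper's MACAM module additionally uses adversarial text prompts to suppress responses on object (foreground) regions, which this sketch does not attempt to reproduce.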
Journal Introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.