{"title":"FocusCLIP:通过视觉文本差异关注异常区域","authors":"Yuan Zhao;Jiayu Sun;Lihe Zhang;Huchuan Lu","doi":"10.1109/TCSVT.2024.3524784","DOIUrl":null,"url":null,"abstract":"Few-shot anomaly detection aims to detect defects with only a limited number of normal samples for training. Recent few-shot methods typically focus on object-level features rather than subtle defects within objects, as pretrained models are generally trained on classification or image-text matching datasets. However, object-level features are often insufficient to detect defects, which are characterized by fine-grained texture variations. To address this, we propose FocusCLIP, which consists of a vision-guided branch and a language-guided branch. FocusCLIP leverages the complementary relationship between visual and text modalities to jointly emphasize discrepancies in fine-grained textures of defect regions. Specifically, we design three modules to mine these discrepancies. In the vision-guided branch, we propose the Bidirectional Self-knowledge Distillation (BSD) structure, which identifies anomaly regions through inconsistent representations and accumulates these discrepancies. Within this structure, the Anomaly Capture Module (ACM) is designed to refine features and detect more comprehensive anomalies by leveraging semantic cues from multi-head self-attention. In the language-guided branch, Multi-level Adversarial Class Activation Mapping (MACAM) utilizes foreground-invariant responses to adversarial text prompts, reducing interference from object regions and further focusing on defect regions. Our approach outperforms the state-of-the-art methods in few-shot anomaly detection. 
Additionally, the language-guided branch within FocusCLIP also demonstrates competitive performance in zero-shot anomaly detection, further validating the effectiveness of our proposed method.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4883-4895"},"PeriodicalIF":8.3000,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"FocusCLIP: Focusing on Anomaly Regions by Visual-Text Discrepancies\",\"authors\":\"Yuan Zhao;Jiayu Sun;Lihe Zhang;Huchuan Lu\",\"doi\":\"10.1109/TCSVT.2024.3524784\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Few-shot anomaly detection aims to detect defects with only a limited number of normal samples for training. Recent few-shot methods typically focus on object-level features rather than subtle defects within objects, as pretrained models are generally trained on classification or image-text matching datasets. However, object-level features are often insufficient to detect defects, which are characterized by fine-grained texture variations. To address this, we propose FocusCLIP, which consists of a vision-guided branch and a language-guided branch. FocusCLIP leverages the complementary relationship between visual and text modalities to jointly emphasize discrepancies in fine-grained textures of defect regions. Specifically, we design three modules to mine these discrepancies. In the vision-guided branch, we propose the Bidirectional Self-knowledge Distillation (BSD) structure, which identifies anomaly regions through inconsistent representations and accumulates these discrepancies. Within this structure, the Anomaly Capture Module (ACM) is designed to refine features and detect more comprehensive anomalies by leveraging semantic cues from multi-head self-attention. 
In the language-guided branch, Multi-level Adversarial Class Activation Mapping (MACAM) utilizes foreground-invariant responses to adversarial text prompts, reducing interference from object regions and further focusing on defect regions. Our approach outperforms the state-of-the-art methods in few-shot anomaly detection. Additionally, the language-guided branch within FocusCLIP also demonstrates competitive performance in zero-shot anomaly detection, further validating the effectiveness of our proposed method.\",\"PeriodicalId\":13082,\"journal\":{\"name\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"volume\":\"35 5\",\"pages\":\"4883-4895\"},\"PeriodicalIF\":8.3000,\"publicationDate\":\"2024-12-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10819451/\",\"RegionNum\":1,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10819451/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
FocusCLIP: Focusing on Anomaly Regions by Visual-Text Discrepancies
Abstract:
Few-shot anomaly detection aims to detect defects with only a limited number of normal samples for training. Recent few-shot methods typically focus on object-level features rather than subtle defects within objects, as pretrained models are generally trained on classification or image-text matching datasets. However, object-level features are often insufficient to detect defects, which are characterized by fine-grained texture variations. To address this, we propose FocusCLIP, which consists of a vision-guided branch and a language-guided branch. FocusCLIP leverages the complementary relationship between visual and text modalities to jointly emphasize discrepancies in fine-grained textures of defect regions. Specifically, we design three modules to mine these discrepancies. In the vision-guided branch, we propose the Bidirectional Self-knowledge Distillation (BSD) structure, which identifies anomaly regions through inconsistent representations and accumulates these discrepancies. Within this structure, the Anomaly Capture Module (ACM) is designed to refine features and detect more comprehensive anomalies by leveraging semantic cues from multi-head self-attention. In the language-guided branch, Multi-level Adversarial Class Activation Mapping (MACAM) utilizes foreground-invariant responses to adversarial text prompts, reducing interference from object regions and further focusing on defect regions. Our approach outperforms the state-of-the-art methods in few-shot anomaly detection. Additionally, the language-guided branch within FocusCLIP also demonstrates competitive performance in zero-shot anomaly detection, further validating the effectiveness of our proposed method.
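The abstract does not include code, but the general language-guided idea it describes (scoring image regions by their similarity to "normal" versus "defect" text embeddings in a CLIP-style joint space) can be illustrated with a minimal sketch. This is only a rough illustration of that common pattern, not the authors' FocusCLIP implementation; all function names, the toy embeddings, and the temperature value are assumptions for demonstration.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Project vectors onto the unit sphere, as CLIP does before matching."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def language_guided_anomaly_map(patch_feats, normal_text, defect_text, temperature=0.07):
    """Toy per-patch anomaly score (NOT the paper's method): softmax over
    cosine similarities between each patch embedding and two text embeddings,
    returning the probability mass assigned to the 'defect' prompt."""
    p = l2_normalize(patch_feats)                            # (N, D) patches
    t = l2_normalize(np.stack([normal_text, defect_text]))   # (2, D) prompts
    logits = p @ t.T / temperature                           # (N, 2) similarities
    logits -= logits.max(axis=1, keepdims=True)              # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs[:, 1]                                       # P(defect) per patch


# Toy demo: 4 patch embeddings in a 3-D space (hypothetical values).
normal_t = np.array([1.0, 0.0, 0.0])
defect_t = np.array([0.0, 1.0, 0.0])
patches = np.array([
    [0.9, 0.1, 0.0],   # normal-looking patch
    [0.1, 0.9, 0.0],   # defect-looking patch
    [0.8, 0.2, 0.0],
    [0.2, 0.8, 0.1],
])
scores = language_guided_anomaly_map(patches, normal_t, defect_t)
```

In practice the patch features would come from a frozen vision-language encoder, and the paper's MACAM module additionally uses adversarial text prompts to suppress responses on object (foreground) regions, which this sketch does not attempt to reproduce.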
Journal Introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.