Open-Vocabulary Prohibited Item Detection for Real-World X-Ray Security Inspection

Shuyang Lin; Tong Jia; Hao Wang; Bowen Ma; Mingyuan Li

IEEE Transactions on Information Forensics and Security, vol. 20, pp. 7469-7481. DOI: 10.1109/TIFS.2025.3586492. Published 2025-07-23.
https://ieeexplore.ieee.org/document/11095302/
Citations: 0
Abstract
Computer-aided prohibited item detection is applied in X-ray security inspection to maintain public safety. However, existing prohibited item detectors are limited to the small set of categories covered by current X-ray datasets, posing potential risks to public security. Since constructing larger datasets and annotating hundreds of categories is time-consuming and labor-intensive, scaling detectors to more categories with minimal supervision is of great importance. To this end, in this paper we adopt an open-vocabulary object detection (OVOD) approach to detect arbitrary unlabeled novel categories of prohibited items. OVOD methods typically rely on datasets with caption annotations, which are lacking in the domain of prohibited item detection. To support research on OVOD in X-ray security inspection scenarios, we contribute the PIXray Caption dataset, the first X-ray dataset with image-caption pair annotations, which can benchmark and facilitate research in the community. Further, we propose a novel Open-Vocabulary Prohibited Item Detection (OVPID) network that leverages the textual information in captions. OVPID contains two core modules, i.e., the Interference Resistant Module (IRM) and the Prediction Module (PM). Specifically, IRM includes two submodules, namely Edge Perception (EP) and Foreground Activation (FA), which are designed to address the interference caused by overlapping items and complex backgrounds in X-ray images. PM consists of two branches, for classification and localization. In the classification branch, PM generates more accurate prompts for the X-ray dataset via a large multimodal model (LMM). In the localization branch, PM aligns the student embeddings with both the teacher and caption embeddings. Extensive experiments on the PIXray Caption dataset demonstrate that OVPID outperforms other OVOD methods, delivering higher accuracy on novel categories.
About the journal:
The IEEE Transactions on Information Forensics and Security covers the sciences, technologies, and applications relating to information forensics, information security, biometrics, and surveillance, as well as systems applications that incorporate these features.