Accurate industrial anomaly detection with efficient multimodal fusion

Dinh-Cuong Hoang, Phan Xuan Tan, Anh-Nhat Nguyen, Ta Huu Anh Duong, Tuan-Minh Huynh, Duc-Manh Nguyen, Minh-Duc Cao, Duc-Huy Ngo, Thu-Uyen Nguyen, Khanh-Toan Phan, Minh-Quang Do, Xuan-Tung Dinh, Van-Hiep Duong, Ngoc-Anh Hoang, Van-Thiep Nguyen

Array, Volume 28, Article 100512. Published 2025-09-19. DOI: 10.1016/j.array.2025.100512
Citations: 0
Abstract
Industrial anomaly detection is critical for ensuring quality and efficiency in modern manufacturing. However, existing deep learning models that rely solely on red-green-blue (RGB) images often fail to detect subtle structural defects, while most RGB-depth (RGBD) methods are computationally heavy and fragile in the presence of missing or noisy depth data. In this work, we propose a lightweight and real-time RGBD anomaly detection framework that not only refines per-modality features but also performs robust hierarchical fusion and tolerates missing inputs. Our approach employs a shared ResNet-50 backbone with a Modality-Specific Feature Enhancement (MSFE) module to amplify texture and geometric cues, followed by a Hierarchical Multi-Modal Fusion (HMM) encoder for cross-scale integration. We further introduce a curriculum-based anomalous feature generator to produce context-aware perturbations, training a compact two-layer discriminator to yield precise pixel-level normality scores. Extensive experiments on the MVTec Anomaly Detection (MVTec-AD) dataset, the Visual Anomaly (VisA) dataset, and a newly collected RealSense D435i RGBD dataset demonstrate up to 99.0% Pixel-level Area Under the Receiver Operating Characteristic Curve (P-AUROC), 99.6% Image-level AUROC (I-AUROC), 82.6% Area Under the Per-Region Overlap (AUPRO), and 45 frames per second (FPS) inference speed. These results validate the effectiveness and deployability of our approach in high-throughput industrial inspection scenarios.
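The pixel-level AUROC (P-AUROC) reported above treats every pixel as an independent binary classification: anomaly scores from all test images are pooled against the ground-truth defect masks. A minimal NumPy sketch of that metric (via the equivalent Mann-Whitney U statistic, not the authors' evaluation code) is:

```python
import numpy as np

def pixel_auroc(score_maps, gt_masks):
    """Pixel-level AUROC via the Mann-Whitney U statistic:
    the probability that a randomly chosen anomalous pixel
    receives a higher score than a randomly chosen normal
    pixel (ties count as half)."""
    scores = np.concatenate([m.ravel() for m in score_maps])
    labels = np.concatenate([m.ravel() for m in gt_masks]).astype(bool)
    pos, neg = scores[labels], scores[~labels]
    # Pairwise comparison is O(len(pos) * len(neg)); fine for a toy
    # example, but real score maps call for a ranking formulation.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Toy example: one 2x2 map where the single defective pixel
# (mask value 1) gets the highest anomaly score.
gt = [np.array([[0, 0], [0, 1]])]
pred = [np.array([[0.1, 0.2], [0.3, 0.9]])]
print(pixel_auroc(pred, gt))  # → 1.0 (perfect separation)
```

Image-level AUROC (I-AUROC) is computed the same way, but with one score per image (typically the maximum of its anomaly map) against an image-level normal/defective label.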