Dinh-Cuong Hoang , Phan Xuan Tan , Anh-Nhat Nguyen , Ta Huu Anh Duong , Tuan-Minh Huynh , Duc-Manh Nguyen , Minh-Duc Cao , Duc-Huy Ngo , Thu-Uyen Nguyen , Khanh-Toan Phan , Minh-Quang Do , Xuan-Tung Dinh , Van-Hiep Duong , Ngoc-Anh Hoang , Van-Thiep Nguyen
{"title":"精确的工业异常检测与高效的多模态融合","authors":"Dinh-Cuong Hoang , Phan Xuan Tan , Anh-Nhat Nguyen , Ta Huu Anh Duong , Tuan-Minh Huynh , Duc-Manh Nguyen , Minh-Duc Cao , Duc-Huy Ngo , Thu-Uyen Nguyen , Khanh-Toan Phan , Minh-Quang Do , Xuan-Tung Dinh , Van-Hiep Duong , Ngoc-Anh Hoang , Van-Thiep Nguyen","doi":"10.1016/j.array.2025.100512","DOIUrl":null,"url":null,"abstract":"<div><div>Industrial anomaly detection is critical for ensuring quality and efficiency in modern manufacturing. However, existing deep learning models that rely solely on red-green-blue (RGB) images often fail to detect subtle structural defects, while most RGB-depth (RGBD) methods are computationally heavy and fragile in the presence of missing or noisy depth data. In this work, we propose a lightweight and real-time RGBD anomaly detection framework that not only refines per-modality features but also performs robust hierarchical fusion and tolerates missing inputs. Our approach employs a shared ResNet-50 backbone with a Modality-Specific Feature Enhancement (MSFE) module to amplify texture and geometric cues, followed by a Hierarchical Multi-Modal Fusion (HMM) encoder for cross-scale integration. We further introduce a curriculum-based anomalous feature generator to produce context-aware perturbations, training a compact two-layer discriminator to yield precise pixel-level normality scores. Extensive experiments on the MVTec Anomaly Detection (MVTec-AD) dataset, the Visual Anomaly (VisA) dataset, and a newly collected RealSense D435i RGBD dataset demonstrate up to 99.0% Pixel-level Area Under the Receiver Operating Characteristic Curve (P-AUROC), 99.6% Image-level AUROC (I-AUROC), 82.6% Area Under the Per-Region Overlap (AUPRO), and 45 frames per second (FPS) inference speed. These results validate the effectiveness and deployability of our approach in high-throughput industrial inspection scenarios.</div></div>","PeriodicalId":8417,"journal":{"name":"Array","volume":"28 ","pages":"Article 100512"},"PeriodicalIF":4.5000,"publicationDate":"2025-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Accurate industrial anomaly detection with efficient multimodal fusion\",\"authors\":\"Dinh-Cuong Hoang , Phan Xuan Tan , Anh-Nhat Nguyen , Ta Huu Anh Duong , Tuan-Minh Huynh , Duc-Manh Nguyen , Minh-Duc Cao , Duc-Huy Ngo , Thu-Uyen Nguyen , Khanh-Toan Phan , Minh-Quang Do , Xuan-Tung Dinh , Van-Hiep Duong , Ngoc-Anh Hoang , Van-Thiep Nguyen\",\"doi\":\"10.1016/j.array.2025.100512\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Industrial anomaly detection is critical for ensuring quality and efficiency in modern manufacturing. However, existing deep learning models that rely solely on red-green-blue (RGB) images often fail to detect subtle structural defects, while most RGB-depth (RGBD) methods are computationally heavy and fragile in the presence of missing or noisy depth data. In this work, we propose a lightweight and real-time RGBD anomaly detection framework that not only refines per-modality features but also performs robust hierarchical fusion and tolerates missing inputs. Our approach employs a shared ResNet-50 backbone with a Modality-Specific Feature Enhancement (MSFE) module to amplify texture and geometric cues, followed by a Hierarchical Multi-Modal Fusion (HMM) encoder for cross-scale integration. 
We further introduce a curriculum-based anomalous feature generator to produce context-aware perturbations, training a compact two-layer discriminator to yield precise pixel-level normality scores. Extensive experiments on the MVTec Anomaly Detection (MVTec-AD) dataset, the Visual Anomaly (VisA) dataset, and a newly collected RealSense D435i RGBD dataset demonstrate up to 99.0% Pixel-level Area Under the Receiver Operating Characteristic Curve (P-AUROC), 99.6% Image-level AUROC (I-AUROC), 82.6% Area Under the Per-Region Overlap (AUPRO), and 45 frames per second (FPS) inference speed. These results validate the effectiveness and deployability of our approach in high-throughput industrial inspection scenarios.</div></div>\",\"PeriodicalId\":8417,\"journal\":{\"name\":\"Array\",\"volume\":\"28 \",\"pages\":\"Article 100512\"},\"PeriodicalIF\":4.5000,\"publicationDate\":\"2025-09-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Array\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2590005625001390\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, THEORY & METHODS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Array","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2590005625001390","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
Accurate industrial anomaly detection with efficient multimodal fusion
Industrial anomaly detection is critical for ensuring quality and efficiency in modern manufacturing. However, existing deep learning models that rely solely on red-green-blue (RGB) images often fail to detect subtle structural defects, while most RGB-depth (RGBD) methods are computationally heavy and fragile in the presence of missing or noisy depth data. In this work, we propose a lightweight and real-time RGBD anomaly detection framework that not only refines per-modality features but also performs robust hierarchical fusion and tolerates missing inputs. Our approach employs a shared ResNet-50 backbone with a Modality-Specific Feature Enhancement (MSFE) module to amplify texture and geometric cues, followed by a Hierarchical Multi-Modal Fusion (HMM) encoder for cross-scale integration. We further introduce a curriculum-based anomalous feature generator to produce context-aware perturbations, training a compact two-layer discriminator to yield precise pixel-level normality scores. Extensive experiments on the MVTec Anomaly Detection (MVTec-AD) dataset, the Visual Anomaly (VisA) dataset, and a newly collected RealSense D435i RGBD dataset demonstrate up to 99.0% Pixel-level Area Under the Receiver Operating Characteristic Curve (P-AUROC), 99.6% Image-level AUROC (I-AUROC), 82.6% Area Under the Per-Region Overlap (AUPRO), and 45 frames per second (FPS) inference speed. These results validate the effectiveness and deployability of our approach in high-throughput industrial inspection scenarios.
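As a rough illustration of the pipeline described in the abstract, the sketch below wires a shared ResNet-50 backbone, a per-modality enhancement block, a simple fusion step, and a compact two-layer discriminator into a pixel-level scorer. All module names, layer widths, and the SE-style enhancement are assumptions made for illustration; the paper's actual MSFE, HMM encoder, and curriculum-based anomalous feature generator are not reproduced here.

# Minimal, hypothetical sketch of the described RGBD pipeline: a shared
# ResNet-50 backbone, per-modality feature enhancement, a fusion step, and a
# compact two-layer discriminator producing pixel-level normality scores.
# Layer sizes and module designs are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50


class ModalityFeatureEnhancement(nn.Module):
    """Channel re-weighting to amplify texture/geometric cues (assumed SE-style design)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))           # global context -> channel weights
        return x * w[:, :, None, None]


class RGBDAnomalyDetector(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet50(weights=None)
        # Shared backbone applied to both RGB and depth (depth replicated to 3 channels).
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                  backbone.maxpool, backbone.layer1, backbone.layer2)
        self.enhance_rgb = ModalityFeatureEnhancement(512)
        self.enhance_depth = ModalityFeatureEnhancement(512)
        # Stand-in for hierarchical fusion: 1x1 conv over concatenated modalities.
        self.fuse = nn.Conv2d(1024, 512, kernel_size=1)
        # Compact two-layer discriminator -> per-pixel anomaly logits.
        self.discriminator = nn.Sequential(
            nn.Conv2d(512, 128, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 1, kernel_size=1),
        )

    def forward(self, rgb, depth):
        f_rgb = self.enhance_rgb(self.stem(rgb))
        f_depth = self.enhance_depth(self.stem(depth.repeat(1, 3, 1, 1)))
        fused = self.fuse(torch.cat([f_rgb, f_depth], dim=1))
        scores = self.discriminator(fused)        # low-resolution anomaly logits
        return F.interpolate(scores, size=rgb.shape[-2:], mode="bilinear",
                             align_corners=False)


# Usage: pixel-level score map for a 256x256 RGBD frame.
model = RGBDAnomalyDetector().eval()
with torch.no_grad():
    score_map = model(torch.randn(1, 3, 256, 256), torch.randn(1, 1, 256, 256))
print(score_map.shape)  # torch.Size([1, 1, 256, 256])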