Accurate industrial anomaly detection with efficient multimodal fusion

Dinh-Cuong Hoang, Phan Xuan Tan, Anh-Nhat Nguyen, Ta Huu Anh Duong, Tuan-Minh Huynh, Duc-Manh Nguyen, Minh-Duc Cao, Duc-Huy Ngo, Thu-Uyen Nguyen, Khanh-Toan Phan, Minh-Quang Do, Xuan-Tung Dinh, Van-Hiep Duong, Ngoc-Anh Hoang, Van-Thiep Nguyen

Array, Volume 28, Article 100512. Published 2025-09-19. DOI: 10.1016/j.array.2025.100512
Citations: 0
Abstract
Industrial anomaly detection is critical for ensuring quality and efficiency in modern manufacturing. However, existing deep learning models that rely solely on red-green-blue (RGB) images often fail to detect subtle structural defects, while most RGB-depth (RGBD) methods are computationally heavy and fragile in the presence of missing or noisy depth data. In this work, we propose a lightweight and real-time RGBD anomaly detection framework that not only refines per-modality features but also performs robust hierarchical fusion and tolerates missing inputs. Our approach employs a shared ResNet-50 backbone with a Modality-Specific Feature Enhancement (MSFE) module to amplify texture and geometric cues, followed by a Hierarchical Multi-Modal Fusion (HMM) encoder for cross-scale integration. We further introduce a curriculum-based anomalous feature generator to produce context-aware perturbations, training a compact two-layer discriminator to yield precise pixel-level normality scores. Extensive experiments on the MVTec Anomaly Detection (MVTec-AD) dataset, the Visual Anomaly (VisA) dataset, and a newly collected RealSense D435i RGBD dataset demonstrate up to 99.0% Pixel-level Area Under the Receiver Operating Characteristic Curve (P-AUROC), 99.6% Image-level AUROC (I-AUROC), 82.6% Area Under the Per-Region Overlap (AUPRO), and 45 frames per second (FPS) inference speed. These results validate the effectiveness and deployability of our approach in high-throughput industrial inspection scenarios.
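The pixel-level AUROC (P-AUROC) reported above treats every pixel as an independent binary classification: anomaly scores from all test images are pooled against the ground-truth defect masks. A minimal NumPy sketch of that metric (via the equivalent Mann-Whitney U statistic, not the authors' evaluation code) is:

```python
import numpy as np

def pixel_auroc(score_maps, gt_masks):
    """Pixel-level AUROC via the Mann-Whitney U statistic:
    the probability that a randomly chosen anomalous pixel
    receives a higher score than a randomly chosen normal
    pixel (ties count as half)."""
    scores = np.concatenate([m.ravel() for m in score_maps])
    labels = np.concatenate([m.ravel() for m in gt_masks]).astype(bool)
    pos, neg = scores[labels], scores[~labels]
    # Pairwise comparison is O(len(pos) * len(neg)); fine for a toy
    # example, but real score maps call for a ranking formulation.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Toy example: one 2x2 map where the single defective pixel
# (mask value 1) gets the highest anomaly score.
gt = [np.array([[0, 0], [0, 1]])]
pred = [np.array([[0.1, 0.2], [0.3, 0.9]])]
print(pixel_auroc(pred, gt))  # → 1.0 (perfect separation)
```

Image-level AUROC (I-AUROC) is computed the same way, but with one score per image (typically the maximum of its anomaly map) against an image-level normal/defective label.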