{"title":"RGBX-DiffusionDet:一个使用DiffusionDet进行多模态RGB-X目标检测的框架","authors":"Eliraz Orfaig, Inna Stainvas, Igal Bilik","doi":"10.1016/j.patcog.2025.112460","DOIUrl":null,"url":null,"abstract":"<div><div>This work addresses the challenge of object detection using multimodal heterogeneous sensors by extending the recently proposed DiffusionDet framework, initially designed for RGB-only detection. We propose RGBX-DiffusionDet, a generalized diffusion-based object detection framework that enables seamless fusion of heterogeneous 2D modalities (denoted as “X”, e.g., depth, infrared, and polarimetric data) with RGB imagery. The proposed approach adopts a mid-level feature fusion strategy to address the heterogeneous nature of multimodal data, characterized by varying spatial resolutions, noise profiles, and semantic content. Instead of commonly used brute-force feature concatenation, we introduce two novel architectural components: (1) a dynamic channel reduction convolutional block attention module (DCR-CBAM), which enhances cross-modal fusion by emphasizing salient channel features while reducing the dimensionality of merged RGB-X features, and (2) a dynamic multi-level aggregation block (DMLAB), which addresses a limitation of the baseline DiffusionDet decoder by adaptively fusing spatial features to improve object localization. Additionally, we incorporate novel regularization losses that promote channel saliency and spatial selectivity, resulting in compact and discriminative feature embeddings. Extensive experiments on RGB-depth (KITTI), a newly annotated RGB-polarimetric (RGB-P) dataset, and RGB-infrared (M3FD) benchmarks demonstrate consistent superiority of the proposed approach over RGB-only baselines, while maintaining decoding efficiency. We further show that RGBX-DiffusionDet exhibits improved robustness and generalization capability in visually-corrupted conditions, demonstrating its practical efficiency for robust multimodal object detection.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"172 ","pages":"Article 112460"},"PeriodicalIF":7.6000,"publicationDate":"2025-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"RGBX-DiffusionDet: a framework for multi-modal RGB-X object detection using DiffusionDet\",\"authors\":\"Eliraz Orfaig, Inna Stainvas, Igal Bilik\",\"doi\":\"10.1016/j.patcog.2025.112460\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>This work addresses the challenge of object detection using multimodal heterogeneous sensors by extending the recently proposed DiffusionDet framework, initially designed for RGB-only detection. We propose RGBX-DiffusionDet, a generalized diffusion-based object detection framework that enables seamless fusion of heterogeneous 2D modalities (denoted as “X”, e.g., depth, infrared, and polarimetric data) with RGB imagery. The proposed approach adopts a mid-level feature fusion strategy to address the heterogeneous nature of multimodal data, characterized by varying spatial resolutions, noise profiles, and semantic content. Instead of commonly used brute-force feature concatenation, we introduce two novel architectural components: (1) a dynamic channel reduction convolutional block attention module (DCR-CBAM), which enhances cross-modal fusion by emphasizing salient channel features while reducing the dimensionality of merged RGB-X features, and (2) a dynamic multi-level aggregation block (DMLAB), which addresses a limitation of the baseline DiffusionDet decoder by adaptively fusing spatial features to improve object localization. Additionally, we incorporate novel regularization losses that promote channel saliency and spatial selectivity, resulting in compact and discriminative feature embeddings. Extensive experiments on RGB-depth (KITTI), a newly annotated RGB-polarimetric (RGB-P) dataset, and RGB-infrared (M3FD) benchmarks demonstrate consistent superiority of the proposed approach over RGB-only baselines, while maintaining decoding efficiency. We further show that RGBX-DiffusionDet exhibits improved robustness and generalization capability in visually-corrupted conditions, demonstrating its practical efficiency for robust multimodal object detection.</div></div>\",\"PeriodicalId\":49713,\"journal\":{\"name\":\"Pattern Recognition\",\"volume\":\"172 \",\"pages\":\"Article 112460\"},\"PeriodicalIF\":7.6000,\"publicationDate\":\"2025-09-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Pattern Recognition\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0031320325011239\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0031320325011239","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
RGBX-DiffusionDet: a framework for multi-modal RGB-X object detection using DiffusionDet
This work addresses the challenge of object detection using multimodal heterogeneous sensors by extending the recently proposed DiffusionDet framework, initially designed for RGB-only detection. We propose RGBX-DiffusionDet, a generalized diffusion-based object detection framework that enables seamless fusion of heterogeneous 2D modalities (denoted as “X”, e.g., depth, infrared, and polarimetric data) with RGB imagery. The proposed approach adopts a mid-level feature fusion strategy to address the heterogeneous nature of multimodal data, characterized by varying spatial resolutions, noise profiles, and semantic content. Instead of commonly used brute-force feature concatenation, we introduce two novel architectural components: (1) a dynamic channel reduction convolutional block attention module (DCR-CBAM), which enhances cross-modal fusion by emphasizing salient channel features while reducing the dimensionality of merged RGB-X features, and (2) a dynamic multi-level aggregation block (DMLAB), which addresses a limitation of the baseline DiffusionDet decoder by adaptively fusing spatial features to improve object localization. Additionally, we incorporate novel regularization losses that promote channel saliency and spatial selectivity, resulting in compact and discriminative feature embeddings. Extensive experiments on RGB-depth (KITTI), a newly annotated RGB-polarimetric (RGB-P) dataset, and RGB-infrared (M3FD) benchmarks demonstrate consistent superiority of the proposed approach over RGB-only baselines, while maintaining decoding efficiency. We further show that RGBX-DiffusionDet exhibits improved robustness and generalization capability in visually-corrupted conditions, demonstrating its practical efficiency for robust multimodal object detection.
期刊介绍:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.