D3PD: Dual distillation and dynamic fusion for camera-radar 3D perception
Junyin Wang, Chenghu Du, Tongao Ge, Bingyi Liu, Shengwu Xiong
Pattern Recognition, Volume 172, Article 112350. DOI: 10.1016/j.patcog.2025.112350. Published 2025-09-05.
Citations: 0
Abstract
Autonomous driving perception is driving rapid advancements in Bird’s-Eye-View (BEV) technology. The synergy of surround-view imagery and radar is seen as a cost-friendly approach that enhances the understanding of driving scenarios. However, current methods for fusing radar and camera features lack effective environmental perception guidance and dynamic adjustment capabilities, which restricts their performance in real-world scenarios. In this paper, we introduce the D3PD framework, which combines fusion techniques with knowledge distillation to tackle the dynamic guidance deficit in existing radar-camera fusion methods. Our method includes two key modules: Radar-Camera Feature Enhancement (RCFE) and Dual Distillation Knowledge Transfer. The RCFE module enhances the areas of interest in BEV, addressing the poor object perception performance of single-modal features. The Dual Distillation Knowledge Transfer comprises four distinct modules: Camera Radar Sparse Distillation (CRSD) for sparse feature knowledge transfer and teacher-student network feature alignment; Position-guided Sampling Distillation (SamD) for refining the knowledge transfer of fused features through dynamic sampling; Detection Constraint Result Distillation (DcRD) for strengthening the positional correlation between teacher and student network outputs in forward propagation, achieving more precise detection perception; and Self-learning Mask Focused Distillation (SMFD) for focusing knowledge transfer on perception detection results through self-learning, concentrating on the reinforcement of local key areas. The D3PD framework outperforms existing methods on the nuScenes benchmark, achieving 49.6% mAP and 59.2% NDS. Moreover, in the occupancy prediction task, D3PD-Occ achieves an advanced performance of 37.94% mIoU. This provides insights for the design and model training of camera- and radar-based 3D object detection and occupancy network prediction methods. The code will be available at https://github.com/no-Name128/D3PD.
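To make the distillation design described above more concrete, below is a minimal, hypothetical PyTorch sketch of two generic loss terms in the spirit of the named modules: a masked BEV feature-alignment loss (a CRSD/SMFD-style idea) and a softened output-alignment loss (a DcRD-style idea). All function names, tensor shapes, and weightings here are illustrative assumptions, not the authors' released implementation; see the repository linked above for the actual code.

# A minimal, hypothetical sketch of feature- and result-level distillation losses.
# Module names, shapes, and weights are assumptions for illustration only.
import torch
import torch.nn.functional as F


def masked_feature_distillation(student_bev: torch.Tensor,
                                teacher_bev: torch.Tensor,
                                mask: torch.Tensor) -> torch.Tensor:
    """Align student BEV features with teacher BEV features inside masked key regions.

    student_bev, teacher_bev: (B, C, H, W) BEV feature maps.
    mask: (B, 1, H, W) soft or binary weights highlighting regions of interest.
    """
    diff = (student_bev - teacher_bev) ** 2          # per-cell squared error
    weighted = diff * mask                           # focus the loss on key BEV cells
    return weighted.sum() / mask.sum().clamp(min=1.0)


def result_distillation(student_logits: torch.Tensor,
                        teacher_logits: torch.Tensor,
                        temperature: float = 2.0) -> torch.Tensor:
    """Soften and align detection/classification outputs via KL divergence."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)


if __name__ == "__main__":
    # Toy tensors standing in for teacher and student BEV features and detection logits.
    s_feat = torch.randn(2, 64, 128, 128)
    t_feat = torch.randn(2, 64, 128, 128)
    mask = (torch.rand(2, 1, 128, 128) > 0.9).float()   # sparse key-region mask
    s_out = torch.randn(2, 900, 10)                      # per-query class logits
    t_out = torch.randn(2, 900, 10)

    loss = masked_feature_distillation(s_feat, t_feat, mask) + result_distillation(s_out, t_out)
    print(f"total distillation loss: {loss.item():.4f}")

In a full training pipeline, such terms would typically be added to the standard detection loss with tunable weights; how D3PD actually combines and guides them (e.g., position-guided sampling in SamD) is detailed in the paper itself.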
About the journal:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.