RVFormer: Keypoint-based fusion of 4D radar and vision for 3D object detection in autonomous driving
Xin Bi, Caien Weng, Panpan Tong, Arno Eichberger, Lu Xiong
Expert Systems with Applications, Volume 312, Article 131497. DOI: 10.1016/j.eswa.2026.131497
Published 25 May 2026 (online 3 February 2026). https://www.sciencedirect.com/science/article/pii/S0957417426004100
Citations: 0
Abstract
Multi-modal fusion is crucial in autonomous driving perception: it enhances reliability, completeness, and accuracy, extending the performance limits of perception systems. In particular, large-scale perception through 4D radar and vision fusion has become a key research focus aimed at improving driving safety, enhancing complex-scene understanding, and supporting fine-grained local planning and control. However, existing 3D object detection methods typically rely on fixed-voxel representations to maintain detection accuracy, and as the perception range increases, these methods incur considerable computational overhead. While transformer-based query methods show strong potential for capturing dependencies over large receptive fields in image-domain tasks, their application to radar-vision fusion is limited by radar point cloud sparsity and cross-modal alignment challenges. To address these limitations, we propose RVFormer, a dual-branch feature-level fusion network that uses a sparse keypoint-based query strategy to integrate features from both modalities, thereby mitigating the impact of large-scale scenes on inference speed. Additionally, we introduce clustered voxel query initialization (CVQI) to accelerate convergence and improve object localization. By incorporating the radar voxel painter (RVP), radar-image cross-attention (RICA), and gated adaptive fusion (GAF) modules, our framework enables deep, adaptive fusion of radar and visual features, effectively mitigating issues caused by point cloud sparsity and modality inconsistency. Compared with existing radar-vision fusion models, RVFormer achieves competitive performance, with an inference speed of approximately 15.2 frames per second. It delivers accuracy comparable to CNN-based approaches while outperforming baseline methods by at least 4.72% in 3D mean average precision and 5.82% in bird's-eye-view mean average precision.
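The abstract only names the RICA and GAF modules without implementation details, so the following is a minimal, hypothetical sketch of what keypoint-query cross-attention followed by gated adaptive fusion could look like in PyTorch. All class names, tensor shapes, and hyperparameters below are illustrative assumptions, not the authors' actual code.

```python
# Illustrative sketch only: the paper's RICA/GAF designs are not specified in the
# abstract, so the module structure and dimensions here are assumptions.
import torch
import torch.nn as nn


class RadarImageCrossAttention(nn.Module):
    """Hypothetical RICA sketch: radar keypoint queries attend to image tokens."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, radar_queries: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        # radar_queries: (B, Nq, dim) sparse queries seeded from radar keypoints
        # image_tokens:  (B, Ni, dim) flattened image features
        attended, _ = self.attn(radar_queries, image_tokens, image_tokens)
        return self.norm(radar_queries + attended)


class GatedAdaptiveFusion(nn.Module):
    """Hypothetical GAF sketch: per-query sigmoid gate blends the two branches."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, radar_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([radar_feat, image_feat], dim=-1))
        return g * radar_feat + (1.0 - g) * image_feat


if __name__ == "__main__":
    B, Nq, Ni, dim = 2, 100, 1000, 256
    rica = RadarImageCrossAttention(dim)
    gaf = GatedAdaptiveFusion(dim)
    radar_q = torch.randn(B, Nq, dim)
    img_tok = torch.randn(B, Ni, dim)
    fused = gaf(radar_q, rica(radar_q, img_tok))
    print(fused.shape)  # torch.Size([2, 100, 256])
```

The sketch reflects the general idea stated in the abstract (sparse radar-derived queries gathering image context, then an adaptive per-feature blend to cope with sparsity and modality inconsistency); the actual RVFormer modules may differ substantially.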
About the journal:
Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.