RVFormer: Keypoint-based fusion of 4D radar and vision for 3D object detection in autonomous driving
Xin Bi, Caien Weng, Panpan Tong, Arno Eichberger, Lu Xiong
Expert Systems with Applications, Volume 312, Article 131497. DOI: 10.1016/j.eswa.2026.131497
Published 25 May 2026 (online 3 February 2026). https://www.sciencedirect.com/science/article/pii/S0957417426004100
Citations: 0
Abstract
Multi-modal fusion is crucial in autonomous driving perception: it enhances reliability, completeness, and accuracy, extending the performance limits of perception systems. In particular, large-scale perception through 4D radar and vision fusion has become a key research focus aimed at improving driving safety, enhancing complex-scene understanding, and supporting fine-grained local planning and control. However, existing 3D object detection methods typically rely on fixed-voxel representations to maintain detection accuracy, and as the perception range increases, these methods incur considerable computational overhead. While transformer-based query methods show strong potential for capturing dependencies over large receptive fields in image-domain tasks, their application to radar-vision fusion is limited by radar point cloud sparsity and cross-modal alignment challenges. To address these limitations, we propose RVFormer, a dual-branch feature-level fusion network that uses a sparse keypoint-based query strategy to integrate features from both modalities, thereby mitigating the impact of large-scale scenes on inference speed. Additionally, we introduce clustered voxel query initialization (CVQI) to accelerate convergence and improve object localization. By incorporating the radar voxel painter (RVP), radar-image cross-attention (RICA), and gated adaptive fusion (GAF) modules, our framework enables deep, adaptive fusion of radar and visual features, effectively mitigating issues caused by point cloud sparsity and modality inconsistency. Compared with existing radar-vision fusion models, RVFormer achieves competitive performance, with an inference speed of approximately 15.2 frames per second. It delivers accuracy comparable to CNN-based approaches while outperforming baseline methods by at least 4.72% in 3D mean average precision and 5.82% in bird's-eye-view mean average precision.
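The abstract only names the RICA and GAF modules without implementation details, so the following is a minimal, hypothetical sketch of what keypoint-query cross-attention followed by gated adaptive fusion could look like in PyTorch. All class names, tensor shapes, and hyperparameters below are illustrative assumptions, not the authors' actual code.

```python
# Illustrative sketch only: the paper's RICA/GAF designs are not specified in the
# abstract, so the module structure and dimensions here are assumptions.
import torch
import torch.nn as nn


class RadarImageCrossAttention(nn.Module):
    """Hypothetical RICA sketch: radar keypoint queries attend to image tokens."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, radar_queries: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        # radar_queries: (B, Nq, dim) sparse queries seeded from radar keypoints
        # image_tokens:  (B, Ni, dim) flattened image features
        attended, _ = self.attn(radar_queries, image_tokens, image_tokens)
        return self.norm(radar_queries + attended)


class GatedAdaptiveFusion(nn.Module):
    """Hypothetical GAF sketch: per-query sigmoid gate blends the two branches."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, radar_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([radar_feat, image_feat], dim=-1))
        return g * radar_feat + (1.0 - g) * image_feat


if __name__ == "__main__":
    B, Nq, Ni, dim = 2, 100, 1000, 256
    rica = RadarImageCrossAttention(dim)
    gaf = GatedAdaptiveFusion(dim)
    radar_q = torch.randn(B, Nq, dim)
    img_tok = torch.randn(B, Ni, dim)
    fused = gaf(radar_q, rica(radar_q, img_tok))
    print(fused.shape)  # torch.Size([2, 100, 256])
```

The sketch reflects the general idea stated in the abstract (sparse radar-derived queries gathering image context, then an adaptive per-feature blend to cope with sparsity and modality inconsistency); the actual RVFormer modules may differ substantially.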
About the journal:
Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.