RevFB-BEV: Memory-Efficient Network With Reversible Swin Transformer for 3D BEV Object Detection

IF 1.2 | JCR Q3 | Automation & Control Systems
Leilei Pan, Yingnan Guo, Yu Zhang
{"title":"RevFB-BEV: Memory-Efficient Network With Reversible Swin Transformer for 3D BEV Object Detection","authors":"Leilei Pan,&nbsp;Yingnan Guo,&nbsp;Yu Zhang","doi":"10.1049/csy2.70021","DOIUrl":null,"url":null,"abstract":"<p>The perception of Bird's Eye View (BEV) has become a widely adopted approach in 3D object detection due to its spatial and dimensional consistency. However, the increasing complexity of neural network architectures has resulted in higher training memory, thereby limiting the scalability of model training. To address these challenges, we propose a novel model, RevFB-BEV, which is based on the Reversible Swin Transformer (RevSwin) with Forward-Backward View Transformation (FBVT) and LiDAR Guided Back Projection (LGBP). This approach includes the RevSwin backbone network, which employs a reversible architecture to minimise training memory by recomputing intermediate parameters. Moreover, we introduce the FBVT module that refines BEV features extracted from forward projection, yielding denser and more precise camera BEV representations. The LGBP module further utilises LiDAR BEV guidance for back projection to achieve more accurate camera BEV features. Extensive experiments on the nuScenes dataset demonstrate notable performance improvements, with our model achieving over a <span></span><math>\n <semantics>\n <mrow>\n <mn>4</mn>\n <mo>×</mo>\n </mrow>\n <annotation> $4\\times $</annotation>\n </semantics></math> reduction in training memory and a more than <span></span><math>\n <semantics>\n <mrow>\n <mn>12</mn>\n <mo>×</mo>\n </mrow>\n <annotation> $12\\times $</annotation>\n </semantics></math> decrease in single-backbone training memory. These efficiency gains become even more pronounced with deeper network architectures. Additionally, RevFB-BEV achieves 68.1 mAP (mean Average Precision) on the validation set and 68.9 mAP on the test set, which is nearly on par with the baseline BEVFusion, underscoring its effectiveness in resource-constrained scenarios.</p>","PeriodicalId":34110,"journal":{"name":"IET Cybersystems and Robotics","volume":"7 1","pages":""},"PeriodicalIF":1.2000,"publicationDate":"2025-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/csy2.70021","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IET Cybersystems and Robotics","FirstCategoryId":"1085","ListUrlMain":"https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/csy2.70021","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
引用次数: 0

Abstract

The perception of Bird's Eye View (BEV) has become a widely adopted approach in 3D object detection due to its spatial and dimensional consistency. However, the increasing complexity of neural network architectures has resulted in higher training-memory consumption, thereby limiting the scalability of model training. To address these challenges, we propose a novel model, RevFB-BEV, which is based on the Reversible Swin Transformer (RevSwin) with Forward-Backward View Transformation (FBVT) and LiDAR-Guided Back Projection (LGBP). This approach includes the RevSwin backbone network, which employs a reversible architecture to minimise training memory by recomputing intermediate activations instead of storing them. Moreover, we introduce the FBVT module, which refines BEV features extracted from forward projection, yielding denser and more precise camera BEV representations. The LGBP module further utilises LiDAR BEV guidance for back projection to achieve more accurate camera BEV features. Extensive experiments on the nuScenes dataset demonstrate notable efficiency improvements, with our model achieving over a 4× reduction in training memory and a more than 12× decrease in single-backbone training memory. These gains become even more pronounced with deeper network architectures. Additionally, RevFB-BEV achieves 68.1 mAP (mean Average Precision) on the validation set and 68.9 mAP on the test set, nearly on par with the baseline BEVFusion, underscoring its effectiveness in resource-constrained scenarios.
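
The memory saving described above rests on reversible residual coupling: each block's inputs can be reconstructed exactly from its outputs, so intermediate activations can be recomputed on the fly during the backward pass instead of being cached for every layer. The sketch below illustrates a RevNet-style reversible block of the kind such backbones build on; the class name, the channel-split convention, and the `f`/`g` sub-functions are illustrative assumptions (a real RevSwin block would use windowed self-attention and an MLP there), not the authors' implementation.

```python
# Minimal sketch of a RevNet-style reversible block (illustrative, not the
# authors' RevSwin code). Because inputs are recoverable from outputs, a
# training framework can recompute activations in the backward pass instead
# of caching them, which is the source of the reported memory savings.
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Placeholder sub-functions; RevSwin would use windowed
        # self-attention for f and an MLP for g.
        self.f = nn.Sequential(nn.LayerNorm(channels),
                               nn.Linear(channels, channels), nn.GELU())
        self.g = nn.Sequential(nn.LayerNorm(channels),
                               nn.Linear(channels, channels), nn.GELU())

    def forward(self, x1: torch.Tensor, x2: torch.Tensor):
        # Additive coupling: y1 = x1 + F(x2), y2 = x2 + G(y1).
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    @torch.no_grad()
    def inverse(self, y1: torch.Tensor, y2: torch.Tensor):
        # Exact inversion: recover the block inputs from its outputs alone.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

if __name__ == "__main__":
    blk = ReversibleBlock(64)
    x1, x2 = torch.randn(2, 128, 64), torch.randn(2, 128, 64)
    y1, y2 = blk(x1, x2)
    r1, r2 = blk.inverse(y1, y2)
    # Round trip is exact up to floating-point error.
    print(torch.allclose(x1, r1, atol=1e-5), torch.allclose(x2, r2, atol=1e-5))
```

In practice, a feature tensor of shape (B, N, 2C) would be split channel-wise into (x1, x2), for example via x.chunk(2, dim=-1), and such blocks stacked; since every block is invertible, only the final pair of activations needs to be kept, which is why the training-memory footprint of a reversible backbone stays nearly depth-independent.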


Source journal: IET Cybersystems and Robotics (Computer Science - Information Systems)
CiteScore: 3.70
Self-citation rate: 0.00%
Articles published per year: 31
Review time: 34 weeks