RevFB-BEV: Memory-Efficient Network With Reversible Swin Transformer for 3D BEV Object Detection

IF 1.2 | JCR Q3 | Automation & Control Systems
Leilei Pan, Yingnan Guo, Yu Zhang
{"title":"RevFB-BEV: Memory-Efficient Network With Reversible Swin Transformer for 3D BEV Object Detection","authors":"Leilei Pan,&nbsp;Yingnan Guo,&nbsp;Yu Zhang","doi":"10.1049/csy2.70021","DOIUrl":null,"url":null,"abstract":"<p>The perception of Bird's Eye View (BEV) has become a widely adopted approach in 3D object detection due to its spatial and dimensional consistency. However, the increasing complexity of neural network architectures has resulted in higher training memory, thereby limiting the scalability of model training. To address these challenges, we propose a novel model, RevFB-BEV, which is based on the Reversible Swin Transformer (RevSwin) with Forward-Backward View Transformation (FBVT) and LiDAR Guided Back Projection (LGBP). This approach includes the RevSwin backbone network, which employs a reversible architecture to minimise training memory by recomputing intermediate parameters. Moreover, we introduce the FBVT module that refines BEV features extracted from forward projection, yielding denser and more precise camera BEV representations. The LGBP module further utilises LiDAR BEV guidance for back projection to achieve more accurate camera BEV features. Extensive experiments on the nuScenes dataset demonstrate notable performance improvements, with our model achieving over a <span></span><math>\n <semantics>\n <mrow>\n <mn>4</mn>\n <mo>×</mo>\n </mrow>\n <annotation> $4\\times $</annotation>\n </semantics></math> reduction in training memory and a more than <span></span><math>\n <semantics>\n <mrow>\n <mn>12</mn>\n <mo>×</mo>\n </mrow>\n <annotation> $12\\times $</annotation>\n </semantics></math> decrease in single-backbone training memory. These efficiency gains become even more pronounced with deeper network architectures. Additionally, RevFB-BEV achieves 68.1 mAP (mean Average Precision) on the validation set and 68.9 mAP on the test set, which is nearly on par with the baseline BEVFusion, underscoring its effectiveness in resource-constrained scenarios.</p>","PeriodicalId":34110,"journal":{"name":"IET Cybersystems and Robotics","volume":"7 1","pages":""},"PeriodicalIF":1.2000,"publicationDate":"2025-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/csy2.70021","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IET Cybersystems and Robotics","FirstCategoryId":"1085","ListUrlMain":"https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/csy2.70021","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
引用次数: 0

Abstract

The perception of Bird's Eye View (BEV) has become a widely adopted approach in 3D object detection due to its spatial and dimensional consistency. However, the increasing complexity of neural network architectures has resulted in higher training-memory consumption, thereby limiting the scalability of model training. To address these challenges, we propose a novel model, RevFB-BEV, which is based on the Reversible Swin Transformer (RevSwin) with Forward-Backward View Transformation (FBVT) and LiDAR-Guided Back Projection (LGBP). This approach includes the RevSwin backbone network, which employs a reversible architecture to minimise training memory by recomputing intermediate activations instead of storing them. Moreover, we introduce the FBVT module, which refines BEV features extracted from forward projection, yielding denser and more precise camera BEV representations. The LGBP module further utilises LiDAR BEV guidance for back projection to achieve more accurate camera BEV features. Extensive experiments on the nuScenes dataset demonstrate notable efficiency improvements, with our model achieving over a 4× reduction in training memory and a more than 12× decrease in single-backbone training memory. These gains become even more pronounced with deeper network architectures. Additionally, RevFB-BEV achieves 68.1 mAP (mean Average Precision) on the validation set and 68.9 mAP on the test set, nearly on par with the baseline BEVFusion, underscoring its effectiveness in resource-constrained scenarios.
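
The memory saving described above rests on reversible residual coupling: each block's inputs can be reconstructed exactly from its outputs, so intermediate activations can be recomputed on the fly during the backward pass instead of being cached for every layer. The sketch below illustrates a RevNet-style reversible block of the kind such backbones build on; the class name, the channel-split convention, and the `f`/`g` sub-functions are illustrative assumptions (a real RevSwin block would use windowed self-attention and an MLP there), not the authors' implementation.

```python
# Minimal sketch of a RevNet-style reversible block (illustrative, not the
# authors' RevSwin code). Because inputs are recoverable from outputs, a
# training framework can recompute activations in the backward pass instead
# of caching them, which is the source of the reported memory savings.
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Placeholder sub-functions; RevSwin would use windowed
        # self-attention for f and an MLP for g.
        self.f = nn.Sequential(nn.LayerNorm(channels),
                               nn.Linear(channels, channels), nn.GELU())
        self.g = nn.Sequential(nn.LayerNorm(channels),
                               nn.Linear(channels, channels), nn.GELU())

    def forward(self, x1: torch.Tensor, x2: torch.Tensor):
        # Additive coupling: y1 = x1 + F(x2), y2 = x2 + G(y1).
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    @torch.no_grad()
    def inverse(self, y1: torch.Tensor, y2: torch.Tensor):
        # Exact inversion: recover the block inputs from its outputs alone.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

if __name__ == "__main__":
    blk = ReversibleBlock(64)
    x1, x2 = torch.randn(2, 128, 64), torch.randn(2, 128, 64)
    y1, y2 = blk(x1, x2)
    r1, r2 = blk.inverse(y1, y2)
    # Round trip is exact up to floating-point error.
    print(torch.allclose(x1, r1, atol=1e-5), torch.allclose(x2, r2, atol=1e-5))
```

In practice, a feature tensor of shape (B, N, 2C) would be split channel-wise into (x1, x2), for example via x.chunk(2, dim=-1), and such blocks stacked; since every block is invertible, only the final pair of activations needs to be kept, which is why the training-memory footprint of a reversible backbone stays nearly depth-independent.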


Source journal: IET Cybersystems and Robotics (Computer Science - Information Systems)
CiteScore: 3.70
Self-citation rate: 0.00%
Articles published per year: 31
Review time: 34 weeks