{"title":"RevFB-BEV:具有可逆Swin变压器的内存高效网络,用于3D BEV目标检测","authors":"Leilei Pan, Yingnan Guo, Yu Zhang","doi":"10.1049/csy2.70021","DOIUrl":null,"url":null,"abstract":"<p>The perception of Bird's Eye View (BEV) has become a widely adopted approach in 3D object detection due to its spatial and dimensional consistency. However, the increasing complexity of neural network architectures has resulted in higher training memory, thereby limiting the scalability of model training. To address these challenges, we propose a novel model, RevFB-BEV, which is based on the Reversible Swin Transformer (RevSwin) with Forward-Backward View Transformation (FBVT) and LiDAR Guided Back Projection (LGBP). This approach includes the RevSwin backbone network, which employs a reversible architecture to minimise training memory by recomputing intermediate parameters. Moreover, we introduce the FBVT module that refines BEV features extracted from forward projection, yielding denser and more precise camera BEV representations. The LGBP module further utilises LiDAR BEV guidance for back projection to achieve more accurate camera BEV features. Extensive experiments on the nuScenes dataset demonstrate notable performance improvements, with our model achieving over a <span></span><math>\n <semantics>\n <mrow>\n <mn>4</mn>\n <mo>×</mo>\n </mrow>\n <annotation> $4\\times $</annotation>\n </semantics></math> reduction in training memory and a more than <span></span><math>\n <semantics>\n <mrow>\n <mn>12</mn>\n <mo>×</mo>\n </mrow>\n <annotation> $12\\times $</annotation>\n </semantics></math> decrease in single-backbone training memory. These efficiency gains become even more pronounced with deeper network architectures. Additionally, RevFB-BEV achieves 68.1 mAP (mean Average Precision) on the validation set and 68.9 mAP on the test set, which is nearly on par with the baseline BEVFusion, underscoring its effectiveness in resource-constrained scenarios.</p>","PeriodicalId":34110,"journal":{"name":"IET Cybersystems and Robotics","volume":"7 1","pages":""},"PeriodicalIF":1.2000,"publicationDate":"2025-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/csy2.70021","citationCount":"0","resultStr":"{\"title\":\"RevFB-BEV: Memory-Efficient Network With Reversible Swin Transformer for 3D BEV Object Detection\",\"authors\":\"Leilei Pan, Yingnan Guo, Yu Zhang\",\"doi\":\"10.1049/csy2.70021\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>The perception of Bird's Eye View (BEV) has become a widely adopted approach in 3D object detection due to its spatial and dimensional consistency. However, the increasing complexity of neural network architectures has resulted in higher training memory, thereby limiting the scalability of model training. To address these challenges, we propose a novel model, RevFB-BEV, which is based on the Reversible Swin Transformer (RevSwin) with Forward-Backward View Transformation (FBVT) and LiDAR Guided Back Projection (LGBP). This approach includes the RevSwin backbone network, which employs a reversible architecture to minimise training memory by recomputing intermediate parameters. Moreover, we introduce the FBVT module that refines BEV features extracted from forward projection, yielding denser and more precise camera BEV representations. The LGBP module further utilises LiDAR BEV guidance for back projection to achieve more accurate camera BEV features. 
Extensive experiments on the nuScenes dataset demonstrate notable performance improvements, with our model achieving over a <span></span><math>\\n <semantics>\\n <mrow>\\n <mn>4</mn>\\n <mo>×</mo>\\n </mrow>\\n <annotation> $4\\\\times $</annotation>\\n </semantics></math> reduction in training memory and a more than <span></span><math>\\n <semantics>\\n <mrow>\\n <mn>12</mn>\\n <mo>×</mo>\\n </mrow>\\n <annotation> $12\\\\times $</annotation>\\n </semantics></math> decrease in single-backbone training memory. These efficiency gains become even more pronounced with deeper network architectures. Additionally, RevFB-BEV achieves 68.1 mAP (mean Average Precision) on the validation set and 68.9 mAP on the test set, which is nearly on par with the baseline BEVFusion, underscoring its effectiveness in resource-constrained scenarios.</p>\",\"PeriodicalId\":34110,\"journal\":{\"name\":\"IET Cybersystems and Robotics\",\"volume\":\"7 1\",\"pages\":\"\"},\"PeriodicalIF\":1.2000,\"publicationDate\":\"2025-09-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/csy2.70021\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IET Cybersystems and Robotics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/csy2.70021\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"AUTOMATION & CONTROL SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IET Cybersystems and Robotics","FirstCategoryId":"1085","ListUrlMain":"https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/csy2.70021","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
RevFB-BEV: Memory-Efficient Network With Reversible Swin Transformer for 3D BEV Object Detection
Bird's Eye View (BEV) perception has become a widely adopted approach in 3D object detection due to its spatial and dimensional consistency. However, the increasing complexity of neural network architectures has resulted in higher training memory consumption, thereby limiting the scalability of model training. To address these challenges, we propose a novel model, RevFB-BEV, which is based on the Reversible Swin Transformer (RevSwin) with Forward-Backward View Transformation (FBVT) and LiDAR-Guided Back Projection (LGBP). This approach includes the RevSwin backbone network, which employs a reversible architecture to minimise training memory by recomputing intermediate activations rather than storing them. Moreover, we introduce the FBVT module, which refines BEV features extracted from forward projection, yielding denser and more precise camera BEV representations. The LGBP module further utilises LiDAR BEV guidance for back projection to achieve more accurate camera BEV features. Extensive experiments on the nuScenes dataset demonstrate notable performance improvements, with our model achieving over a 4× reduction in training memory and a more than 12× decrease in single-backbone training memory. These efficiency gains become even more pronounced with deeper network architectures. Additionally, RevFB-BEV achieves 68.1 mAP (mean Average Precision) on the validation set and 68.9 mAP on the test set, nearly on par with the baseline BEVFusion, underscoring its effectiveness in resource-constrained scenarios.
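The training-memory saving described above rests on reversible residual coupling: each block's inputs can be reconstructed exactly from its outputs, so intermediate activations can be discarded during the forward pass and recomputed on demand during backpropagation. Below is a minimal PyTorch sketch of the standard RevNet-style coupling this family of architectures relies on; it illustrates the general technique only, not the paper's RevSwin implementation, and the names `ReversibleBlock`, `f`, and `g` are hypothetical.

```python
import torch

class ReversibleBlock(torch.nn.Module):
    """Minimal RevNet-style reversible block (illustrative sketch).

    Forward:  y1 = x1 + F(x2),  y2 = x2 + G(y1)
    Inverse:  x2 = y2 - G(y1),  x1 = y1 - F(x2)

    Because the inputs can be recovered exactly from the outputs,
    intermediate activations need not be stored for backpropagation.
    """

    def __init__(self, f: torch.nn.Module, g: torch.nn.Module):
        super().__init__()
        self.f = f
        self.g = g

    def forward(self, x1: torch.Tensor, x2: torch.Tensor):
        # Additive coupling: each half of the input is updated in turn.
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1: torch.Tensor, y2: torch.Tensor):
        # Recompute the inputs from the outputs; this is what lets a
        # reversible backbone discard activations during training.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2


if __name__ == "__main__":
    torch.manual_seed(0)
    # f and g stand in for the sub-networks of a real block
    # (e.g., attention and MLP in a transformer stage).
    block = ReversibleBlock(torch.nn.Linear(16, 16), torch.nn.Linear(16, 16))

    x1, x2 = torch.randn(2, 16), torch.randn(2, 16)
    y1, y2 = block(x1, x2)
    r1, r2 = block.inverse(y1, y2)
    # Inputs are recovered (up to floating-point error) without storing them.
    print(torch.allclose(x1, r1, atol=1e-6), torch.allclose(x2, r2, atol=1e-6))
```

In a full reversible backbone, a custom autograd function would call `inverse()` during the backward pass to regenerate activations block by block, trading roughly one extra forward computation per block for activation storage that no longer grows with network depth, which is consistent with the abstract's observation that the savings become more pronounced in deeper architectures.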