Leveraging Frozen Foundation Models and Multimodal Fusion for BEV Segmentation and Occupancy Prediction

Impact Factor: 4.8; JCR Q1 (Engineering, Electrical & Electronic)

IEEE Open Journal of Vehicular Technology, Pub Date: 2025-04-23, DOI: 10.1109/OJVT.2025.3563677

Seamie Hayes, Ganesh Sistu, Ciarán Eising

Volume 6, pages 1241-1261. Full text: https://ieeexplore.ieee.org/document/10974666/

Citations: 0

Abstract

In Bird's Eye View (BEV) perception, significant emphasis is placed on deploying well-performing, complex model architectures and leveraging as many sensor modalities as possible to reach maximal performance. This paper investigates whether foundation models and multi-sensor deployments are essential for enhancing BEV perception. We examine the relative importance of advanced feature extraction versus the number of sensor modalities, and assess whether foundation models can address feature extraction limitations and reduce the need for extensive training data. Specifically, incorporating the self-supervised DINOv2 for feature extraction and Metric3Dv2 for depth estimation into the Lift-Splat-Shoot framework yields a 7.4-point IoU increase in vehicle segmentation, a relative improvement of 22.4%, while requiring only half the training data and iterations of the original model. Furthermore, using Metric3Dv2’s depth maps as a pseudo-LiDAR point cloud within the Simple-BEV model improves IoU by 2.9 points, a 6.1% relative increase over the camera-only setup. Finally, we extend the well-known Gaussian Splatting BEV perception models, GaussianFormer and GaussianOcc, through multimodal deployment. Adding LiDAR information to GaussianFormer results in a 9.4-point increase in mIoU, a 48.7% improvement over the camera-only model, nearing state-of-the-art multimodal performance even with limited LiDAR scans. In the self-supervised GaussianOcc model, incorporating LiDAR leads to a 0.36-point increase in mIoU, a 3.6% improvement over the camera-only model. This limited gain can be attributed to the absence of LiDAR encoding and the self-supervised nature of the model. Overall, our findings highlight the critical role of foundation models and multi-sensor integration in advancing BEV perception. By leveraging sophisticated foundation models and multi-sensor deployment, we can further improve model performance and reduce data requirements, addressing key challenges in BEV perception.
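
The abstract mentions treating Metric3Dv2 depth maps as a pseudo-LiDAR point cloud. The general idea can be illustrated with a minimal sketch that back-projects a metric depth map into camera-frame 3D points using a pinhole camera model. The pinhole model, the function name `depth_to_points`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the `max_depth` cutoff are all assumptions made for illustration; the abstract does not describe the authors' actual pipeline.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def depth_to_points(
    depth: List[List[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    max_depth: float = 60.0,  # hypothetical range cutoff in metres
) -> List[Point]:
    """Back-project a metric depth map into camera-frame 3D points.

    Assumes a pinhole camera: pixel (u, v) with depth z maps to
    x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            # Skip invalid (non-positive) or out-of-range depth values.
            if z <= 0.0 or z > max_depth:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 depth map (metres) with one invalid and one out-of-range pixel.
depth = [[10.0, 0.0], [20.0, 100.0]]
points = depth_to_points(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
print(points)
```

In a real setting the resulting points would still need to be transformed from the camera frame into the ego or BEV frame with the camera extrinsics before they could stand in for LiDAR returns; that step is omitted here.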


