RenderWorld: World Model with Self-Supervised 3D Label
Ziyang Yan, Wenzhen Dong, Yihua Shao, Yuhang Lu, Liu Haiyang, Jingwen Liu, Haozhe Wang, Zhe Wang, Yan Wang, Fabio Remondino, Yuexin Ma
arXiv:2409.11356 (2024-09-17)
Abstract
Vision-only end-to-end autonomous driving is not only more cost-effective than LiDAR-vision fusion but also more reliable than traditional methods. To achieve an economical and robust purely visual autonomous driving system, we propose RenderWorld, a vision-only end-to-end autonomous driving framework that generates 3D occupancy labels with a self-supervised Gaussian-based Img2Occ module, encodes the labels with AM-VAE, and uses a world model for forecasting and planning. RenderWorld employs Gaussian Splatting to represent 3D scenes and render 2D images, which greatly improves segmentation accuracy and reduces GPU memory consumption compared with NeRF-based methods. By applying AM-VAE to encode air and non-air voxels separately, RenderWorld achieves a more fine-grained representation of scene elements, leading to state-of-the-art performance in both 4D occupancy forecasting and motion planning with its autoregressive world model.
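The abstract describes a three-stage pipeline: the Img2Occ module produces self-supervised 3D occupancy labels, AM-VAE compresses them while encoding air and non-air voxels separately, and an autoregressive world model forecasts future occupancy for planning. The sketch below illustrates how such a pipeline could be wired together; it is not the authors' released code, and every name, shape, and detail here (the `AMVAE` class, the class count, air as class 0, the callable stage interfaces) is an assumption for illustration only.

```python
# Hypothetical sketch of the pipeline described in the abstract.
# All module names, signatures, and shapes are assumptions, not the
# authors' released API.
import torch
import torch.nn as nn

AIR, NUM_CLASSES = 0, 18  # assumed: class 0 marks empty ("air") voxels

class AMVAE(nn.Module):
    """Toy AM-VAE: encodes air and non-air voxels through separate
    branches, as the abstract describes, so that sparse occupied
    structure is not averaged away by the dominant empty space."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.enc_air = nn.Conv3d(1, latent_dim, 3, stride=2, padding=1)
        self.enc_obj = nn.Conv3d(NUM_CLASSES, latent_dim, 3, stride=2, padding=1)
        self.dec = nn.ConvTranspose3d(2 * latent_dim, NUM_CLASSES, 4,
                                      stride=2, padding=1)

    def encode(self, occ_logits):  # occ_logits: (B, C, X, Y, Z)
        air_mask = (occ_logits.argmax(1, keepdim=True) == AIR).float()
        z_air = self.enc_air(air_mask)                      # empty space
        z_obj = self.enc_obj(occ_logits * (1 - air_mask))   # occupied space
        return torch.cat([z_air, z_obj], dim=1)

    def decode(self, z):
        return self.dec(z)

def forecast_and_plan(img2occ, amvae, world_model, planner, frames):
    """One rollout step: images -> occupancy -> latents -> future -> plan."""
    occ = img2occ(frames)      # self-supervised 3D occupancy labels
    z = amvae.encode(occ)      # compact air/non-air latent tokens
    z_future = world_model(z)  # autoregressive 4D occupancy forecast
    return planner(z_future)   # motion plan conditioned on the forecast
```

Under these assumptions, the separate air branch reduces to a cheap binary mask encoder, while the non-air branch keeps full class resolution for occupied voxels, which is one plausible reading of why the split yields a more fine-grained scene representation.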