Deep Height Decoupling for Precise Vision-based 3D Occupancy Prediction

Yuan Wu, Zhiqiang Yan, Zhengxue Wang, Xiang Li, Le Hui, Jian Yang

arXiv:2409.07972 (CS - Computer Vision and Pattern Recognition), published 2024-09-12
Abstract
The task of vision-based 3D occupancy prediction aims to reconstruct 3D
geometry and estimate its semantic classes from 2D color images, where the
2D-to-3D view transformation is an indispensable step. Most previous methods
conduct forward projection, such as BEVPooling and VoxelPooling, both of which
map the 2D image features into 3D grids. However, each grid cell, which should
represent features within a certain height range, often aggregates confusing
features that actually belong to other height ranges. To address this challenge, we
present Deep Height Decoupling (DHD), a novel framework that incorporates an
explicit height prior to filter out the confusing features. Specifically, DHD
first predicts height maps via explicit supervision. Based on the height
distribution statistics, DHD designs Mask Guided Height Sampling (MGHS) to
adaptively decouple the height map into multiple binary masks. MGHS projects
the 2D image features into multiple subspaces, where each grid contains
features within reasonable height ranges. Finally, a Synergistic Feature
Aggregation (SFA) module is deployed to enhance the feature representation
through channel and spatial affinities, enabling further occupancy refinement.
On the popular Occ3D-nuScenes benchmark, our method achieves state-of-the-art
performance even with minimal input frames. Code is available at
https://github.com/yanzq95/DHD.
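The core decoupling idea can be sketched as follows: a predicted height map is split into binary masks, one per height interval, and each mask gates the 2D image features before projection into its own subspace. This is a minimal illustrative sketch, not the paper's implementation; the interval boundaries, array shapes, and function names here are assumptions for demonstration.

```python
import numpy as np

def decouple_height_map(height_map, bins):
    """Split a (H, W) height map into len(bins)-1 binary masks.

    Each mask marks the pixels whose predicted height falls inside
    one interval [bins[i], bins[i+1]) -- the decoupling step of MGHS.
    """
    masks = []
    for lo, hi in zip(bins[:-1], bins[1:]):
        masks.append(((height_map >= lo) & (height_map < hi)).astype(np.float32))
    return masks

def mask_guided_features(features, height_map, bins):
    """Gate (C, H, W) image features with each height mask.

    Returns one feature map per height subspace, so that features are
    only projected into 3D grids of a plausible height range.
    """
    return [features * m[None] for m in decouple_height_map(height_map, bins)]

# Toy example: 2x2 height map, 2-channel features, two height intervals.
heights = np.array([[0.2, 1.5],
                    [3.0, 4.5]])            # predicted heights in meters
feats = np.ones((2, 2, 2))                  # dummy (C=2, H=2, W=2) features
subspaces = mask_guided_features(feats, heights, bins=[0.0, 2.0, 5.0])
# subspaces[0] keeps only pixels with height in [0, 2),
# subspaces[1] keeps only pixels with height in [2, 5).
```

With hard thresholds as above, the masks partition the pixels, so every feature lands in exactly one subspace; the paper's adaptive variant is derived from height distribution statistics rather than fixed boundaries.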