Camera-Based 3D Semantic Scene Completion With Sparse Guidance Network

Jianbiao Mei;Yu Yang;Mengmeng Wang;Junyu Zhu;Jongwon Ra;Yukai Ma;Laijian Li;Yong Liu
{"title":"Camera-Based 3D Semantic Scene Completion With Sparse Guidance Network","authors":"Jianbiao Mei;Yu Yang;Mengmeng Wang;Junyu Zhu;Jongwon Ra;Yukai Ma;Laijian Li;Yong Liu","doi":"10.1109/TIP.2024.3461989","DOIUrl":null,"url":null,"abstract":"Semantic scene completion (SSC) aims to predict the semantic occupancy of each voxel in the entire 3D scene from limited observations, which is an emerging and critical task for autonomous driving. Recently, many studies have turned to camera-based SSC solutions due to the richer visual cues and cost-effectiveness of cameras. However, existing methods usually rely on sophisticated and heavy 3D models to process the lifted 3D features directly, which are not discriminative enough for clear segmentation boundaries. In this paper, we adopt the dense-sparse-dense design and propose a one-stage camera-based SSC framework, termed SGN, to propagate semantics from the semantic-aware seed voxels to the whole scene based on spatial geometry cues. Firstly, to exploit depth-aware context and dynamically select sparse seed voxels, we redesign the sparse voxel proposal network to process points generated by depth prediction directly with the coarse-to-fine paradigm. Furthermore, by designing hybrid guidance (sparse semantic and geometry guidance) and effective voxel aggregation for spatial geometry cues, we enhance the feature separation between different categories and expedite the convergence of semantic propagation. Finally, we devise the multi-scale semantic propagation module for flexible receptive fields while reducing the computation resources. Extensive experimental results on the SemanticKITTI and SSCBench-KITTI-360 datasets demonstrate the superiority of our SGN over existing state-of-the-art methods. And even our lightweight version SGN-L achieves notable scores of 14.80% mIoU and 45.45% IoU on SeamnticKITTI validation with only 12.5 M parameters and 7.16 G training memory. Code is available at \n<uri>https://github.com/Jieqianyu/SGN</uri>\n.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"5468-5481"},"PeriodicalIF":0.0000,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10694710/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Semantic scene completion (SSC) aims to predict the semantic occupancy of each voxel in an entire 3D scene from limited observations, an emerging and critical task for autonomous driving. Recently, many studies have turned to camera-based SSC solutions because of the richer visual cues and cost-effectiveness of cameras. However, existing methods usually rely on sophisticated and heavy 3D models to process the lifted 3D features directly, and these features are not discriminative enough to yield clear segmentation boundaries. In this paper, we adopt a dense-sparse-dense design and propose a one-stage camera-based SSC framework, termed SGN, which propagates semantics from semantic-aware seed voxels to the whole scene based on spatial geometry cues. First, to exploit depth-aware context and dynamically select sparse seed voxels, we redesign the sparse voxel proposal network to process points generated by depth prediction directly, following a coarse-to-fine paradigm. Furthermore, by designing hybrid guidance (sparse semantic and geometry guidance) and effective voxel aggregation for spatial geometry cues, we enhance the feature separation between different categories and expedite the convergence of semantic propagation. Finally, we devise a multi-scale semantic propagation module that provides flexible receptive fields while reducing computational cost. Extensive experimental results on the SemanticKITTI and SSCBench-KITTI-360 datasets demonstrate the superiority of SGN over existing state-of-the-art methods. Even our lightweight version, SGN-L, achieves notable scores of 14.80% mIoU and 45.45% IoU on the SemanticKITTI validation set with only 12.5 M parameters and 7.16 G of training memory. Code is available at https://github.com/Jieqianyu/SGN.
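To make the dense-sparse-dense idea concrete, the sketch below illustrates one plausible reading of the pipeline described in the abstract: image features lifted to a dense voxel grid, sparse semantic-aware seed voxels selected from a coarse occupancy (geometry) prediction, and semantics propagated back to the whole scene with multi-scale 3D convolutions. This is not the authors' implementation; all class, parameter, and tensor-shape choices (e.g. `SemanticPropagationSketch`, `top_k`, the dilation rates) are illustrative assumptions, and the official code at https://github.com/Jieqianyu/SGN should be consulted for the actual design.

```python
# Minimal sketch of a dense-sparse-dense SSC head (assumed, not the official SGN code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticPropagationSketch(nn.Module):
    def __init__(self, feat_dim: int = 64, num_classes: int = 20, top_k: int = 4096):
        super().__init__()
        self.top_k = top_k                                               # number of sparse seed voxels to keep
        self.occupancy_head = nn.Conv3d(feat_dim, 1, kernel_size=1)      # coarse geometry guidance (occupancy)
        self.seed_classifier = nn.Linear(feat_dim, num_classes)          # sparse semantic guidance on seed voxels
        # Multi-scale propagation: dilated 3D convs give flexible receptive fields at modest cost.
        self.propagate = nn.ModuleList([
            nn.Conv3d(feat_dim, feat_dim, 3, padding=d, dilation=d) for d in (1, 2, 3)
        ])
        self.out_head = nn.Conv3d(feat_dim, num_classes, kernel_size=1)  # dense per-voxel semantics

    def forward(self, voxel_feats: torch.Tensor):
        # voxel_feats: (B, C, X, Y, Z) features lifted from the image via depth prediction.
        B, C, X, Y, Z = voxel_feats.shape

        # 1) Dense stage: coarse occupancy scores over the full grid.
        occ = self.occupancy_head(voxel_feats).flatten(2)                # (B, 1, X*Y*Z)

        # 2) Sparse stage: keep the top-k most likely occupied voxels as semantic-aware seeds.
        flat = voxel_feats.flatten(2)                                    # (B, C, X*Y*Z)
        idx = occ.squeeze(1).topk(self.top_k, dim=1).indices             # (B, K)
        seeds = torch.gather(flat, 2, idx.unsqueeze(1).expand(-1, C, -1))  # (B, C, K)
        seed_logits = self.seed_classifier(seeds.transpose(1, 2))        # (B, K, num_classes)

        # 3) Dense stage: propagate semantics to the whole scene with multi-scale 3D convs.
        x = voxel_feats
        for conv in self.propagate:
            x = x + F.relu(conv(x))
        sem_logits = self.out_head(x)                                    # (B, num_classes, X, Y, Z)
        return sem_logits, seed_logits, occ.view(B, 1, X, Y, Z)


if __name__ == "__main__":
    feats = torch.randn(1, 64, 32, 32, 8)        # toy lifted voxel features
    sem, seeds, occ = SemanticPropagationSketch()(feats)
    print(sem.shape, seeds.shape, occ.shape)
```

In a full system, the seed logits and occupancy map would typically receive auxiliary supervision so that the sparse seeds carry reliable semantics before propagation; here they are simply returned alongside the dense prediction.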