Cross-modal semantic transfer for point cloud semantic segmentation

IF 10.6, Zone 1 (Earth Science), Q1 GEOGRAPHY, PHYSICAL

ISPRS Journal of Photogrammetry and Remote Sensing. Pub Date: 2025-02-14. DOI: 10.1016/j.isprsjprs.2025.01.024

Zhen Cao, Xiaoxin Mi, Bo Qiu, Zhipeng Cao, Chen Long, Xinrui Yan, Chao Zheng, Zhen Dong, Bisheng Yang

Volume 221, Pages 265-279. Article page: https://www.sciencedirect.com/science/article/pii/S0924271625000243

Citations: 0

Abstract

3D street scene semantic segmentation is essential for urban understanding. However, supervised point cloud semantic segmentation networks heavily rely on expensive manual annotations and demonstrate limited generalization capabilities across datasets, which poses limitations in a range of downstream tasks. In contrast, image segmentation networks exhibit stronger generalization. Fortunately, mobile laser scanning systems can collect images and point clouds simultaneously, offering a potential solution for 2D-3D semantic transfer. In this paper, we introduce a cross-modal label transfer framework for point cloud semantic segmentation, without the supervision of 3D semantic annotation. Specifically, the proposed method takes point clouds and the associated posed images of a scene as inputs, and accomplishes the pointwise semantic segmentation for point clouds. We first get the image semantic pseudo-labels through a pre-trained image semantic segmentation model. Building on this, we construct implicit neural radiance fields (NeRF) to achieve multi-view consistent label mapping by jointly constructing color and semantic fields. Then, we design a superpoint semantic module to capture the local geometric features on point clouds, which contributes a lot to correcting semantic errors in the implicit field. Moreover, we introduce a dynamic object filter and a pose adjustment module to address the spatio-temporal misalignment between point clouds and images, further enhancing the consistency of the transferred semantic labels. The proposed approach has shown promising outcomes on two street scene datasets, namely KITTI-360 and WHU-Urban3D, highlighting the effectiveness and reliability of our method. Compared to the SoTA point cloud semantic segmentation method, namely SPT, the proposed method improves mIoU by approximately 15% on the WHU-Urban3D dataset. Our code and data are available at https://github.com/a4152684/StreetSeg.
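The abstract says that a superpoint semantic module uses local geometry to correct semantic errors in the implicit field, but it does not spell out the mechanism. One common way to use superpoints for this purpose is a per-superpoint majority vote; the sketch below illustrates that idea only and is not the authors' implementation. The function name `smooth_labels_by_superpoint`, the `ignore_label` convention, and the assumption that superpoint IDs are precomputed by some geometric over-segmentation are all hypothetical.

```python
from collections import Counter, defaultdict


def smooth_labels_by_superpoint(point_labels, superpoint_ids, ignore_label=-1):
    """Give every point the majority label of the superpoint it belongs to.

    point_labels   : per-point labels transferred from images (may be noisy);
                     points without a label carry `ignore_label`.
    superpoint_ids : per-point superpoint index, assumed to be precomputed
                     by a geometric over-segmentation (hypothetical input).
    """
    if len(point_labels) != len(superpoint_ids):
        raise ValueError("point_labels and superpoint_ids must have equal length")

    # Collect label votes inside each superpoint, skipping unlabeled points.
    votes = defaultdict(Counter)
    for label, sp in zip(point_labels, superpoint_ids):
        if label != ignore_label:
            votes[sp][label] += 1

    # The most frequent label wins; superpoints with no votes stay unlabeled.
    majority = {sp: counter.most_common(1)[0][0] for sp, counter in votes.items()}
    return [majority.get(sp, ignore_label) for sp in superpoint_ids]


if __name__ == "__main__":
    noisy = [0, 0, 1, 2, 2, 2, -1]
    superpoints = [0, 0, 0, 1, 1, 1, 2]
    print(smooth_labels_by_superpoint(noisy, superpoints))
```

In this toy input the single outlier label in superpoint 0 is overwritten by its neighbors' label, while the point in superpoint 2, which has no labeled members, remains unlabeled. A real pipeline would likely weight votes by confidence rather than count them equally.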


Source journal

ISPRS Journal of Photogrammetry and Remote Sensing (Engineering Technology: Imaging Science and Photographic Technology)

CiteScore

21.00

Self-citation rate

6.30%

Articles published

273

Review time

40 days

About the journal: The ISPRS Journal of Photogrammetry and Remote Sensing (P&RS) serves as the official journal of the International Society for Photogrammetry and Remote Sensing (ISPRS). It acts as a platform for scientists and professionals worldwide who are involved in various disciplines that utilize photogrammetry, remote sensing, spatial information systems, computer vision, and related fields. The journal aims to facilitate communication and dissemination of advancements in these disciplines, while also acting as a comprehensive source of reference and archive. P&RS endeavors to publish high-quality, peer-reviewed research papers that are preferably original and have not been published before. These papers can cover scientific/research, technological development, or application/practical aspects. Additionally, the journal welcomes papers that are based on presentations from ISPRS meetings, as long as they are considered significant contributions to the aforementioned fields. In particular, P&RS encourages the submission of papers that are of broad scientific interest, showcase innovative applications (especially in emerging fields), have an interdisciplinary focus, discuss topics that have received limited attention in P&RS or related journals, or explore new directions in scientific or professional realms. It is preferred that theoretical papers include practical applications, while papers focusing on systems and applications should include a theoretical background.