SPOT: Scalable 3D Pre-training via Occupancy Prediction for Learning Transferable 3D Representations

IF 20.8 · CAS Region 1 (Computer Science) · JCR Q1 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)
Xiangchao Yan, Runjian Chen, Bo Zhang, Hancheng Ye, Renqiu Xia, Jiakang Yuan, Hongbin Zhou, Xinyu Cai, Botian Shi, Wenqi Shao, Ping Luo, Yu Qiao, Tao Chen, Junchi Yan
DOI: 10.1109/tpami.2025.3586961
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Publication date: 2025-07-08 (Journal Article)
Citations: 0

Abstract

Annotating 3D LiDAR point clouds for perception tasks is fundamental to many applications, e.g. autonomous driving, yet it remains notoriously labor-intensive. The pretraining-finetuning approach can alleviate the labeling burden by fine-tuning a pre-trained backbone across various downstream datasets and tasks. In this paper, we propose SPOT, namely Scalable Pre-training via Occupancy prediction for learning Transferable 3D representations, under such a label-efficient fine-tuning paradigm. SPOT achieves effectiveness on various public datasets with different downstream tasks, showcasing its general representation power, cross-domain robustness, and data scalability, three key factors for real-world application. Specifically, we show, both theoretically and empirically, for the first time, that general representation learning can be achieved through the task of occupancy prediction. Then, to address the domain gap caused by different LiDAR sensors and annotation methods, we develop a beam re-sampling technique for point cloud augmentation, combined with a class-balancing strategy. Furthermore, scalable pre-training is observed: downstream performance across all experiments improves with more pre-training data. Additionally, such a pre-training strategy remains compatible with unlabeled data. We hope that our findings will facilitate the understanding of LiDAR point clouds and pave the way for future advancements in LiDAR pre-training.
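The abstract names a beam re-sampling augmentation for bridging the gap between LiDAR sensors, but gives no implementation details. Below is a minimal, hypothetical sketch of one common reading of the idea: simulating a sensor with fewer beams by binning points into elevation-angle bins and keeping an evenly spaced subset. The function name, binning scheme, and parameters are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def beam_resample(points, num_beams=64, keep_beams=32):
    """Simulate a lower-beam LiDAR sweep from a higher-beam one.

    points: (N, 3) array of x, y, z coordinates in the sensor frame.
    num_beams: assumed beam count of the source sensor.
    keep_beams: beam count of the simulated target sensor.
    Returns the subset of points falling on the kept beams.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    # Elevation angle of each point relative to the sensor origin;
    # points from the same physical beam share a narrow elevation band.
    elev = np.arctan2(z, np.hypot(x, y))
    # Quantize elevation into num_beams equal-width bins ("beams").
    edges = np.linspace(elev.min(), elev.max() + 1e-6, num_beams + 1)
    beam_id = np.digitize(elev, edges) - 1
    # Keep an evenly spaced subset of beams, e.g. every other one.
    kept = np.round(np.linspace(0, num_beams - 1, keep_beams)).astype(int)
    return points[np.isin(beam_id, kept)]
```

In a pre-training pipeline, such a transform would be applied to each source-domain sweep before voxelization, so the backbone sees point densities resembling those of other sensors.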
Source journal
CiteScore: 28.40
Self-citation rate: 3.00%
Annual articles: 885
Review time: 8.5 months
About the journal: The IEEE Transactions on Pattern Analysis and Machine Intelligence publishes articles on all traditional areas of computer vision and image understanding, all traditional areas of pattern analysis and recognition, and selected areas of machine intelligence, with a particular emphasis on machine learning for pattern analysis. Areas such as techniques for visual search, document and handwriting analysis, medical image analysis, video and image sequence analysis, content-based retrieval of image and video, face and gesture recognition, and relevant specialized hardware and/or software architectures are also covered.