Qibo Qiu;Honghui Yang;Jian Jiang;Shun Zhang;Haochao Ying;Haiming Gao;Wenxiao Wang;Xiaofei He
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 8807-8818. Published 2025-03-21. DOI: 10.1109/TCSVT.2025.3553525.
M3CS: Multi-Target Masked Point Modeling With Learnable Codebook and Siamese Decoders
Masked point modeling has become a promising scheme for self-supervised pre-training on point clouds. Existing methods reconstruct either the masked points or related features as the pre-training objective. However, given the diversity of downstream tasks, the model needs both low- and high-level representation modeling capabilities during pre-training, enabling it to capture both geometric details and semantic contexts. To this end, M3CS is proposed to endow the model with both abilities. Specifically, with the masked point cloud as input, M3CS introduces two decoders to reconstruct the masked representations and the masked points simultaneously. Since an extra decoder would double the parameters of the decoding process and may lead to overfitting, we propose siamese decoders that keep the number of learnable parameters unchanged. Further, we propose an online codebook that projects continuous tokens into discrete ones before reconstructing the masked points. In this way, the decoder is compelled to operate on combinations of discrete tokens rather than memorizing each individual token. Comprehensive experiments show that M3CS achieves superior performance across both classification and segmentation tasks, outperforming existing methods that are likewise single-modality and single-scale.
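The two ideas in the abstract can be sketched minimally: a "siamese" decoder is one set of weights reused for both reconstruction targets (so the parameter count does not grow), and the online codebook snaps each continuous token to its nearest discrete entry before point reconstruction. The sketch below is an illustrative assumption, not the authors' implementation; all names, shapes, and values are hypothetical.

```python
def linear(weights, vec):
    """Apply a tiny linear layer: one row of weights per output dimension."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

# One shared weight matrix = one "siamese" decoder used for both targets.
shared_decoder = [[1.0, 0.0], [0.0, 1.0]]  # identity, for illustration only

token = [0.9, 0.1]  # a continuous token from the encoder (hypothetical)
feature_recon = linear(shared_decoder, token)  # high-level (feature) target
point_recon = linear(shared_decoder, token)    # low-level (point) target
# Both passes reuse the same parameters, so no extra learnable weights.

def quantize(codebook, vec):
    """Project a continuous token onto its nearest codebook entry."""
    def sq_dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(codebook, key=sq_dist)

codebook = [[1.0, 0.0], [0.0, 1.0]]  # two discrete entries (hypothetical)
discrete = quantize(codebook, token)  # snaps [0.9, 0.1] to [1.0, 0.0]
```

Because every token is forced through a small shared vocabulary, the decoder must learn from combinations of codebook entries rather than memorizing each continuous token, which is the regularizing effect the abstract describes.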
Journal overview:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.