Qibo Qiu;Honghui Yang;Jian Jiang;Shun Zhang;Haochao Ying;Haiming Gao;Wenxiao Wang;Xiaofei He
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 8807-8818. Published 2025-03-21. DOI: 10.1109/TCSVT.2025.3553525.
M3CS: Multi-Target Masked Point Modeling With Learnable Codebook and Siamese Decoders
Masked point modeling has become a promising scheme for self-supervised pre-training on point clouds. Existing methods reconstruct either the masked points or related features as the pre-training objective. However, given the diversity of downstream tasks, the model needs both low- and high-level representation modeling capabilities during pre-training, enabling it to capture both geometric details and semantic contexts. To this end, M3CS is proposed to endow the model with both abilities. Specifically, with the masked point cloud as input, M3CS introduces two decoders to reconstruct the masked representations and the masked points simultaneously. Since an extra decoder would double the parameters of the decoding process and may lead to overfitting, we propose siamese decoders that keep the number of learnable parameters unchanged. Further, we propose an online codebook that projects continuous tokens into discrete ones before reconstructing the masked points. In this way, the decoder is compelled to operate on combinations of discrete tokens rather than memorizing each individual token. Comprehensive experiments show that M3CS achieves superior performance across both classification and segmentation tasks, outperforming existing methods that are likewise single-modality and single-scale.
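The two ideas in the abstract can be sketched minimally: a "siamese" decoder is one set of weights reused for both reconstruction targets (so the parameter count does not grow), and the online codebook snaps each continuous token to its nearest discrete entry before point reconstruction. The sketch below is an illustrative assumption, not the authors' implementation; all names, shapes, and values are hypothetical.

```python
def linear(weights, vec):
    """Apply a tiny linear layer: one row of weights per output dimension."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

# One shared weight matrix = one "siamese" decoder used for both targets.
shared_decoder = [[1.0, 0.0], [0.0, 1.0]]  # identity, for illustration only

token = [0.9, 0.1]  # a continuous token from the encoder (hypothetical)
feature_recon = linear(shared_decoder, token)  # high-level (feature) target
point_recon = linear(shared_decoder, token)    # low-level (point) target
# Both passes reuse the same parameters, so no extra learnable weights.

def quantize(codebook, vec):
    """Project a continuous token onto its nearest codebook entry."""
    def sq_dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(codebook, key=sq_dist)

codebook = [[1.0, 0.0], [0.0, 1.0]]  # two discrete entries (hypothetical)
discrete = quantize(codebook, token)  # snaps [0.9, 0.1] to [1.0, 0.0]
```

Because every token is forced through a small shared vocabulary, the decoder must learn from combinations of codebook entries rather than memorizing each continuous token, which is the regularizing effect the abstract describes.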
Journal overview:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.