M3CS: Multi-Target Masked Point Modeling With Learnable Codebook and Siamese Decoders

IF 11.1 | CAS Zone 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC
Qibo Qiu;Honghui Yang;Jian Jiang;Shun Zhang;Haochao Ying;Haiming Gao;Wenxiao Wang;Xiaofei He
DOI: 10.1109/TCSVT.2025.3553525
Journal: IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 8807-8818
Publication date: 2025-03-21 (Journal Article)
URL: https://ieeexplore.ieee.org/document/10937188/
Citations: 0

Abstract

Masked point modeling has become a promising scheme for self-supervised pre-training on point clouds. Existing methods reconstruct either the masked points or related features as the pre-training objective. However, given the diversity of downstream tasks, the model should acquire both low- and high-level representation modeling capabilities during pre-training, enabling it to capture geometric details as well as semantic contexts. To this end, M3CS is proposed to endow the model with both abilities. Specifically, taking the masked point cloud as input, M3CS introduces two decoders that reconstruct the masked representations and the masked points simultaneously. Since an extra decoder would double the parameters of the decoding process and may lead to overfitting, we propose siamese decoders that keep the number of learnable parameters unchanged. Further, we propose an online codebook that projects continuous tokens into discrete ones before the masked points are reconstructed. In this way, the decoder is compelled to work through combinations of tokens rather than memorizing each token individually. Comprehensive experiments show that M3CS achieves superior performance on both classification and segmentation tasks, outperforming existing single-modality, single-scale methods.
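The two ideas sketched in the abstract — projecting continuous tokens onto a codebook of discrete entries, and sharing one set of decoder weights across both reconstruction branches — can be illustrated with a minimal numpy sketch. All sizes, names, and the toy linear decoder below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Codebook quantization: map continuous tokens to discrete codes ---
# Hypothetical sizes; the paper's real codebook size and token dim differ.
K, D = 8, 4                       # number of code entries, token dimension
codebook = rng.normal(size=(K, D))

def quantize(tokens, codebook):
    """Assign each continuous token to its nearest codebook entry."""
    # (N, K) squared distances between every token and every code entry
    d2 = ((tokens[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)       # discrete token ids
    return idx, codebook[idx]     # ids and their quantized vectors

# --- Siamese decoders: one weight matrix serves both targets ---
W = rng.normal(size=(D, D))       # shared decoder weights (toy linear decoder)

def decode(latent, W):
    return latent @ W

tokens = rng.normal(size=(16, D))         # continuous encoder tokens
ids, q = quantize(tokens, codebook)

points_pred   = decode(q, W)              # branch 1: masked points
features_pred = decode(tokens, W)         # branch 2: masked features
```

Because both branches call `decode` with the same `W`, the learnable parameter count stays that of a single decoder, matching the motivation for the siamese design.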
Source Journal
CiteScore: 13.80
Self-citation rate: 27.40%
Articles published: 660
Review time: 5 months
Journal introduction: The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.