Learnable Query Contrast and Spatio-temporal Prediction on Point Cloud Video Pre-training

IF 1.3 | Zone 4, Engineering & Technology | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Latin America Transactions Pub Date : 2024-10-04 DOI:10.1109/TLA.2024.10705970

Xiaoxiao Sheng;Zhiqiang Shen;Longguang Wang;Gang Xiao

Volume 22, Issue 10, pp. 821-828. Journal Article. Full text: https://ieeexplore.ieee.org/document/10705970/

Citations: 0

Abstract

Point cloud videos capture time-varying environments and are widely used for dynamic scene understanding. Existing methods develop effective networks for point cloud videos but do not fully utilize the prior information uncovered during pre-training. Furthermore, relying on a single supervised task with a large amount of manually labeled data may be insufficient to capture the foundational structures in point cloud videos. In this paper, we propose Query-CP, a pre-training framework that learns representations of point cloud videos through multiple self-supervised pretext tasks. First, a token-level contrast task is developed to predict future features under the guidance of historical information. Using a position-guided autoregressor with learnable queries, the predictions are directly contrasted with their corresponding targets in the high-level feature space to capture fine-grained semantics. Second, because contrastive learning alone fails to fully explore complementary structural and dynamic information, a decoupled spatio-temporal prediction task is designed, in which a spatial branch predicts low-level features and a temporal branch explicitly predicts the timestamps of the target sequence. Combining these self-supervised tasks captures multi-level information during the pre-training stage. Finally, the encoder is fine-tuned and evaluated for action recognition and dynamic semantic segmentation on three datasets. The results demonstrate the effectiveness of Query-CP. In particular, compared with state-of-the-art methods, the fine-tuning accuracy on action recognition improves by 3.23% for 24-frame point cloud videos, and the mean accuracy increases by 4.21%.
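The token-level contrast described in the abstract can be illustrated with a small sketch. The abstract does not give the loss formula, so this assumes an InfoNCE-style objective with cosine similarity and a temperature, which is a common choice for contrastive pre-training and not necessarily what Query-CP uses. The function name `token_contrast_loss` and the `temperature` parameter are hypothetical.

```python
import math


def _normalize(vector):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def token_contrast_loss(predictions, targets, temperature=0.1):
    """InfoNCE-style token-level contrast (an assumed form, not the paper's).

    predictions[i] is the predicted feature for token i and targets[i] is its
    target feature. Each prediction is pulled toward its own target and pushed
    away from the targets of all other tokens.
    """
    if not predictions or len(predictions) != len(targets):
        raise ValueError("predictions and targets must be non-empty and equal in length")

    preds = [_normalize(p) for p in predictions]
    tgts = [_normalize(t) for t in targets]

    total = 0.0
    for i, pred in enumerate(preds):
        # Cosine similarity to every target, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(pred, t)) / temperature for t in tgts]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching target as the positive.
        total += log_denominator - logits[i]
    return total / len(preds)


if __name__ == "__main__":
    targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    shuffled = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
    print("aligned loss: ", token_contrast_loss(aligned, targets))
    print("shuffled loss:", token_contrast_loss(shuffled, targets))
```

In this toy setting the loss is small when each prediction matches its own target and large when predictions match the wrong targets, which is the behavior a contrastive pretext task relies on. In the framework itself the predictions would come from the autoregressor with learnable queries rather than being hand-written vectors.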


Source journal

IEEE Latin America Transactions COMPUTER SCIENCE, INFORMATION SYSTEMS-ENGINEERING, ELECTRICAL & ELECTRONIC

CiteScore: 3.50

Self-citation rate: 7.70%

Articles published: 192

Review time: 3-8 weeks

About the journal: IEEE Latin America Transactions (IEEE LATAM) is an interdisciplinary journal focused on the dissemination of original, high-quality research papers and review articles in Spanish and Portuguese on emerging topics in three main areas: Computing, Electric Energy and Electronics. Sub-areas of the journal include, but are not limited to: automatic control, communications, instrumentation, artificial intelligence, power and industrial electronics, fault diagnosis and detection, transportation electrification, internet of things, electrical machines, circuits and systems, biomedicine and biomedical / haptic applications, secure communications, robotics, sensors and actuators, computer networks, and smart grids.