Generalized Robotic Vision-Language Learning Model via Linguistic Foreground-Aware Contrast

IF 11.6 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision Pub Date : 2025-01-14 DOI:10.1007/s11263-024-02340-z

Kangcheng Liu, Chaoqun Wang, Xiaodong Han, Yong-Jin Liu, Baoquan Chen

{"title":"Generalized Robotic Vision-Language Learning Model via Linguistic Foreground-Aware Contrast","authors":"Kangcheng Liu, Chaoqun Wang, Xiaodong Han, Yong-Jin Liu, Baoquan Chen","doi":"10.1007/s11263-024-02340-z","DOIUrl":null,"url":null,"abstract":"<p>Contrastive learning has recently demonstrated great potential for unsupervised pre-training in 3D scene understanding tasks. However, most existing work randomly selects point features as anchors while building contrast, leading to a clear bias toward background points that often dominate in 3D scenes. Also, object awareness and foreground-to-background discrimination are neglected, making contrastive learning less effective. To tackle these issues, we propose a general foreground-aware feature contrast FAC++ framework to learn more effective point cloud representations in pre-training. FAC++ consists of two novel contrast designs to construct more effective and informative contrast pairs. The first is building positive pairs within the same foreground segment where points tend to have the same semantics. The second is that we prevent over-discrimination between 3D segments/objects and encourage grouped foreground-to-background distinctions at the segment level with adaptive feature learning in a Siamese correspondence network, which adaptively learns feature correlations within and across point cloud views effectively. Our proposed approach enhances both the local coherence as well as the overall feature discrimination. Moreover, we have designed the linguistic foreground-aware regional point sampling to enhance more balanced foreground-aware learning, which is termed FAC++. Visualization with point activation maps shows that our contrast pairs capture clear correspondences among foreground regions during pre-training. Quantitative experiments also show that FAC++ achieves superior knowledge transfer and data efficiency in various downstream 3D semantic segmentation, instance segmentation as well as object detection tasks. All codes, data, and models are available at: (https://github.com/KangchengLiu/FAC_Foreground_Aware_Contrast).</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"22 1","pages":""},"PeriodicalIF":11.6000,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Computer Vision","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s11263-024-02340-z","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Contrastive learning has recently demonstrated great potential for unsupervised pre-training in 3D scene understanding tasks. However, most existing work randomly selects point features as anchors while building contrast, leading to a clear bias toward background points that often dominate in 3D scenes. Also, object awareness and foreground-to-background discrimination are neglected, making contrastive learning less effective. To tackle these issues, we propose a general foreground-aware feature contrast FAC++ framework to learn more effective point cloud representations in pre-training. FAC++ consists of two novel contrast designs to construct more effective and informative contrast pairs. The first is building positive pairs within the same foreground segment where points tend to have the same semantics. The second is that we prevent over-discrimination between 3D segments/objects and encourage grouped foreground-to-background distinctions at the segment level with adaptive feature learning in a Siamese correspondence network, which adaptively learns feature correlations within and across point cloud views effectively. Our proposed approach enhances both the local coherence as well as the overall feature discrimination. Moreover, we have designed the linguistic foreground-aware regional point sampling to enhance more balanced foreground-aware learning, which is termed FAC++. Visualization with point activation maps shows that our contrast pairs capture clear correspondences among foreground regions during pre-training. Quantitative experiments also show that FAC++ achieves superior knowledge transfer and data efficiency in various downstream 3D semantic segmentation, instance segmentation as well as object detection tasks. All codes, data, and models are available at: (https://github.com/KangchengLiu/FAC_Foreground_Aware_Contrast).

查看原文本刊更多论文

基于语言前景感知对比的广义机器人视觉语言学习模型

对比学习最近在3D场景理解任务的无监督预训练中显示出巨大的潜力。然而，大多数现有的工作在建立对比时随机选择点特征作为锚点，导致对背景点的明显偏见，而背景点通常在3D场景中占主导地位。此外，对象意识和前景到背景的区分被忽视，使对比学习不太有效。为了解决这些问题，我们提出了一个通用的前景感知特征对比FAC++框架，以在预训练中学习更有效的点云表示。FAC++包括两种新的对比设计，以构建更有效和信息丰富的对比对。第一种是在相同前景段内构建正对，其中点往往具有相同的语义。其次，我们防止了3D片段/对象之间的过度区分，并通过暹罗通信网络中的自适应特征学习，在片段级别鼓励分组前景到背景的区分，该网络可以有效地自适应学习点云视图内部和跨点云视图的特征相关性。我们提出的方法既增强了局部一致性，又增强了整体特征识别。此外，我们还设计了语言前景感知区域点采样，以增强更平衡的前景感知学习，称为fac++。使用点激活图进行可视化显示，我们的对比对在预训练期间捕获了前景区域之间清晰的对应关系。定量实验也表明FAC++在各种下游3D语义分割、实例分割和目标检测任务中都实现了优越的知识转移和数据效率。所有代码、数据和模型可在：（https://github.com/KangchengLiu/FAC_Foreground_Aware_Contrast）获得。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

International Journal of Computer Vision 工程技术-计算机：人工智能

CiteScore

29.80

自引率

2.10%

发文量

163

审稿时长

6 months

期刊介绍： The International Journal of Computer Vision (IJCV) serves as a platform for sharing new research findings in the rapidly growing field of computer vision. It publishes 12 issues annually and presents high-quality, original contributions to the science and engineering of computer vision. The journal encompasses various types of articles to cater to different research outputs. Regular articles, which span up to 25 journal pages, focus on significant technical advancements that are of broad interest to the field. These articles showcase substantial progress in computer vision. Short articles, limited to 10 pages, offer a swift publication path for novel research outcomes. They provide a quicker means for sharing new findings with the computer vision community. Survey articles, comprising up to 30 pages, offer critical evaluations of the current state of the art in computer vision or offer tutorial presentations of relevant topics. These articles provide comprehensive and insightful overviews of specific subject areas. In addition to technical articles, the journal also includes book reviews, position papers, and editorials by prominent scientific figures. These contributions serve to complement the technical content and provide valuable perspectives. The journal encourages authors to include supplementary material online, such as images, video sequences, data sets, and software. This additional material enhances the understanding and reproducibility of the published research. Overall, the International Journal of Computer Vision is a comprehensive publication that caters to researchers in this rapidly growing field. It covers a range of article types, offers additional online resources, and facilitates the dissemination of impactful research.