Fusing differentiable rendering and language–image contrastive learning for superior zero-shot point cloud classification

IF 3.7 · Zone 2, Engineering & Technology · JCR Q1, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

Displays · Pub Date: 2024-06-15 · DOI: 10.1016/j.displa.2024.102773

Jinlong Xie, Long Cheng, Gang Wang, Min Hu, Zaiyang Yu, Minghua Du, Xin Ning

Displays, volume 84, Article 102773. Full text: https://www.sciencedirect.com/science/article/pii/S0141938224001379

Citations: 0

Abstract

Zero-shot point cloud classification involves recognizing categories not encountered during training. Current models often exhibit reduced accuracy on unseen categories without 3D pre-training, emphasizing the need for improved precision and interoperability. We propose a novel approach integrating differentiable rendering with contrastive language–image pre-training. Initially, differentiable rendering autonomously learns representative viewpoints from the data, enabling the transformation of point clouds into multi-view images while preserving key visual information. This transformation facilitates optimized viewpoint selection during training, refining the final feature representation. Features are extracted from the multi-view images and integrated into a global multi-view feature using a cross-attention mechanism. On the textual side, a large language model (LLM) is provided with 3D heuristic prompts to generate 3D-specific text reflecting category-specific traits, from which textual features are derived. The LLM’s extensive pre-trained knowledge enables it to capture abstract notions and categorical features relevant to distinct point cloud categories. Visual and textual features are aligned in a unified embedding space, enabling zero-shot classification. Throughout training, the Structural Similarity Index (SSIM) is integrated into the loss function to encourage the model to discern more distinctive viewpoints, reduce redundancy in multi-view imagery, and enhance computational efficiency. Experimental results on the ModelNet10, ModelNet40, and ScanObjectNN datasets demonstrate classification accuracies of 75.68%, 66.42%, and 52.03%, respectively, surpassing prevailing methods in zero-shot point cloud classification accuracy.
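The classification and redundancy ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: all function names are hypothetical, a plain mean over views stands in for the cross-attention fusion, the features are toy vectors rather than encoder outputs, and SSIM is computed once over each whole flattened image instead of with the usual sliding window. The abstract does not say how the SSIM term is weighted in the loss, so the sketch only computes a pairwise redundancy score.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def fuse_views(view_feats):
    """Assumption: a plain mean over views, standing in for cross-attention fusion."""
    n = len(view_feats)
    return [sum(col) / n for col in zip(*view_feats)]

def zero_shot_classify(view_feats, text_feats):
    """Pick the label whose text feature is most cosine-similar to the fused view feature."""
    g = normalize(fuse_views(view_feats))
    scores = {
        label: sum(a * b for a, b in zip(g, normalize(t)))
        for label, t in text_feats.items()
    }
    return max(scores, key=scores.get), scores

def global_ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM over two flattened images with pixel values in [0, 1] (single global window)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx * mx + my * my + c1) * (vx + vy + c2)
    )

def redundancy_penalty(views):
    """Mean pairwise SSIM across rendered views; lower values mean more distinct views."""
    pairs = [(i, j) for i in range(len(views)) for j in range(i + 1, len(views))]
    return sum(global_ssim(views[i], views[j]) for i, j in pairs) / len(pairs)

if __name__ == "__main__":
    # Toy per-view image features and per-category text features (made-up numbers).
    views = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
    text = {"chair": [1.0, 0.0, 0.0], "lamp": [0.0, 1.0, 0.0]}
    label, scores = zero_shot_classify(views, text)
    print(label, scores)
    # Two toy "rendered views" as flattened grayscale images.
    print(redundancy_penalty([[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]]))
```

Because classification only compares the fused visual feature against one text feature per category, a new category can be added by supplying its text feature, with no retraining of the classifier head. A training loop could add `redundancy_penalty` to the contrastive loss so that near-identical viewpoints are discouraged.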

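Source journal: Displays (Engineering & Technology; Engineering: Electrical & Electronic)

CiteScore: 4.60

Self-citation rate: 25.60%

Articles published: 138

Review time: 92 days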

About the journal: Displays is the international journal covering the research and development of display technology, its effective presentation and perception of information, and applications and systems including display-human interface. Technical papers on practical developments in Displays technology provide an effective channel to promote greater understanding and cross-fertilization across the diverse disciplines of the Displays community. Original research papers solving ergonomics issues at the display-human interface advance effective presentation of information. Tutorial papers covering fundamentals intended for display technologies and human factor engineers new to the field will also occasionally be featured.