MV-CLIP: Multi-View CLIP for Zero-Shot 3D Shape Recognition

IF 11.1 · JCR Q1, ENGINEERING, ELECTRICAL & ELECTRONIC · CAS Tier 1 (Engineering & Technology)
Dan Song;Xinwei Fu;Ning Liu;Wei-Zhi Nie;Wen-Hui Li;Lan-Jun Wang;You Yang;An-An Liu
{"title":"MV-CLIP: Multi-View CLIP for Zero-Shot 3D Shape Recognition","authors":"Dan Song;Xinwei Fu;Ning Liu;Wei-Zhi Nie;Wen-Hui Li;Lan-Jun Wang;You Yang;An-An Liu","doi":"10.1109/TCSVT.2025.3551084","DOIUrl":null,"url":null,"abstract":"Large-scale pre-trained models have demonstrated impressive performance in vision and language tasks within open-world scenarios. Due to the lack of comparable pre-trained models for 3D shapes, recent methods utilize language-image pre-training to realize zero-shot 3D shape recognition. However, due to the modality gap, pretrained language-image models are not confident enough in the generalization to 3D shape recognition. Consequently, this paper aims to improve the confidence with view selection and hierarchical prompts. Building on the well-established CLIP model, we introduce view selection in the vision side that minimizes entropy to identify the most informative views for 3D shape. On the textual side, hierarchical prompts combined of hand-crafted and GPT-generated prompts are proposed to refine predictions. The first layer prompts several classification candidates with traditional class-level descriptions, while the second layer refines the prediction based on function-level descriptions or further distinctions between the candidates. Extensive experiments demonstrate the effectiveness of the proposed modules for zero-shot 3D shape recognition. Remarkably, without the need for additional training, our proposed method achieves impressive zero-shot 3D classification accuracies of 84.44%, 91.51%, and 66.17% on ModelNet40, ModelNet10, and ShapeNet Core55, respectively. Furthermore, we will make the code publicly available to facilitate reproducibility and further research in this area.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"8767-8779"},"PeriodicalIF":11.1000,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10925427/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract

Large-scale pre-trained models have demonstrated impressive performance on vision and language tasks in open-world scenarios. Because comparable pre-trained models for 3D shapes are lacking, recent methods exploit language-image pre-training to realize zero-shot 3D shape recognition. However, due to the modality gap, pre-trained language-image models are not confident enough when generalized to 3D shape recognition. This paper therefore aims to improve that confidence through view selection and hierarchical prompts. Building on the well-established CLIP model, we introduce view selection on the vision side, which minimizes entropy to identify the most informative views of a 3D shape. On the textual side, hierarchical prompts combining hand-crafted and GPT-generated prompts are proposed to refine predictions: the first layer proposes several classification candidates using traditional class-level descriptions, while the second layer refines the prediction based on function-level descriptions or further distinctions between the candidates. Extensive experiments demonstrate the effectiveness of the proposed modules for zero-shot 3D shape recognition. Remarkably, without any additional training, the proposed method achieves zero-shot 3D classification accuracies of 84.44%, 91.51%, and 66.17% on ModelNet40, ModelNet10, and ShapeNet Core55, respectively. Furthermore, we will make the code publicly available to facilitate reproducibility and further research in this area.
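To make the two mechanisms described above concrete, the following is a minimal PyTorch sketch of entropy-based view selection and a two-layer prompt refinement. It is an illustration under assumptions, not the authors' released implementation: it presumes L2-normalized CLIP view and prompt embeddings, a CLIP-style logit scale of 100, and one GPT-generated function-level prompt per class; every function and variable name here is hypothetical.

    import torch

    def select_views(image_feats, text_feats, k=4):
        # Keep the k rendered views whose class distribution has the lowest
        # entropy, i.e. the views on which CLIP is most confident.
        # image_feats: (V, D) normalized view embeddings
        # text_feats:  (C, D) normalized class-prompt embeddings
        logits = 100.0 * image_feats @ text_feats.t()        # (V, C)
        probs = logits.softmax(dim=-1)
        entropy = -(probs * (probs + 1e-12).log()).sum(dim=-1)  # (V,)
        return torch.topk(-entropy, k).indices               # lowest-entropy views

    def refine_with_second_layer(view_feats, class_feats, detail_feats, m=3):
        # First layer: shortlist m candidate classes with hand-crafted
        # class-level prompts. Second layer: re-score only the shortlist with
        # finer (e.g. GPT-generated, function-level) prompts.
        # detail_feats: (C, D), one function-level prompt embedding per class
        # (a simplifying assumption for this sketch).
        coarse = (100.0 * view_feats @ class_feats.t()).mean(dim=0)  # (C,)
        candidates = coarse.topk(m).indices                          # shortlist
        fine = (100.0 * view_feats @ detail_feats[candidates].t()).mean(dim=0)
        return candidates[fine.argmax()]                             # refined class

    # Hypothetical usage: encode rendered views and prompts with CLIP,
    # L2-normalize, select confident views, then classify hierarchically.
    # keep = select_views(view_feats, class_feats, k=4)
    # pred = refine_with_second_layer(view_feats[keep], class_feats, detail_feats)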
Source Journal
CiteScore: 13.80
Self-citation rate: 27.40%
Articles per year: 660
Review time: 5 months
Journal Introduction: The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.