3D scene generation for zero-shot learning using ChatGPT guided language prompts

IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Sahar Ahmadi, Ali Cheraghian, Townim Faisal Chowdhury, Morteza Saberi, Shafin Rahman
{"title":"3D scene generation for zero-shot learning using ChatGPT guided language prompts","authors":"Sahar Ahmadi ,&nbsp;Ali Cheraghian ,&nbsp;Townim Faisal Chowdhury ,&nbsp;Morteza Saberi ,&nbsp;Shafin Rahman","doi":"10.1016/j.cviu.2024.104211","DOIUrl":null,"url":null,"abstract":"<div><div>Zero-shot learning in the realm of 3D point cloud data remains relatively unexplored compared to its 2D image counterpart. This domain introduces fresh challenges due to the absence of robust pre-trained feature extraction models. To tackle this, we introduce a prompt-guided method for 3D scene generation and supervision, enhancing the network’s ability to comprehend the intricate relationships between seen and unseen objects. Initially, we utilize basic prompts resembling scene annotations generated from one or two point cloud objects. Recognizing the limited diversity of basic prompts, we employ ChatGPT to expand them, enriching the contextual information within the descriptions. Subsequently, leveraging these descriptions, we arrange point cloud objects’ coordinates to fabricate augmented 3D scenes. Lastly, employing contrastive learning, we train our proposed architecture end-to-end, utilizing pairs of 3D scenes and prompt-based captions. We posit that 3D scenes facilitate more efficient object relationships than individual objects, as demonstrated by the effectiveness of language models like BERT in contextual understanding. Our prompt-guided scene generation method amalgamates data augmentation and prompt-based annotation, thereby enhancing 3D ZSL performance. We present ZSL and generalized ZSL results on both synthetic (ModelNet40, ModelNet10, and ShapeNet) and real-scanned (ScanOjbectNN) 3D object datasets. Furthermore, we challenge the model by training with synthetic data and testing with real-scanned data, achieving state-of-the-art performance compared to existing 2D and 3D ZSL methods in the literature. Codes and models are available at: <span><span>https://github.com/saharahmadisohraviyeh/ChatGPT_ZSL_3D</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104211"},"PeriodicalIF":4.3000,"publicationDate":"2024-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Vision and Image Understanding","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1077314224002923","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Zero-shot learning in the realm of 3D point cloud data remains relatively unexplored compared to its 2D image counterpart. This domain introduces fresh challenges due to the absence of robust pre-trained feature extraction models. To tackle this, we introduce a prompt-guided method for 3D scene generation and supervision that enhances the network's ability to comprehend the intricate relationships between seen and unseen objects. Initially, we utilize basic prompts resembling scene annotations generated from one or two point cloud objects. Recognizing the limited diversity of basic prompts, we employ ChatGPT to expand them, enriching the contextual information within the descriptions. Subsequently, leveraging these descriptions, we arrange the coordinates of point cloud objects to fabricate augmented 3D scenes. Lastly, employing contrastive learning, we train our proposed architecture end-to-end on pairs of 3D scenes and prompt-based captions. We posit that 3D scenes convey object relationships more effectively than individual objects, much as language models like BERT derive meaning from context. Our prompt-guided scene generation method amalgamates data augmentation and prompt-based annotation, thereby enhancing 3D ZSL performance. We present ZSL and generalized ZSL results on both synthetic (ModelNet40, ModelNet10, and ShapeNet) and real-scanned (ScanObjectNN) 3D object datasets. Furthermore, we challenge the model by training with synthetic data and testing with real-scanned data, achieving state-of-the-art performance compared to existing 2D and 3D ZSL methods in the literature. Code and models are available at: https://github.com/saharahmadisohraviyeh/ChatGPT_ZSL_3D.
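To make the pipeline concrete, the sketch below illustrates the two technical steps the abstract describes: composing two point cloud objects into a single scene according to a spatial relation drawn from a prompt, and a symmetric InfoNCE-style contrastive loss over (scene, caption) embedding pairs. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names, spatial offsets, batch size, and embedding dimension are all hypothetical, and the actual scene-arrangement rules, encoders, and hyperparameters may differ.

```python
# Minimal illustrative sketch (hypothetical names and values throughout).
import torch
import torch.nn.functional as F

def compose_scene(obj_a: torch.Tensor, obj_b: torch.Tensor,
                  relation: str = "next to") -> torch.Tensor:
    """Place two normalized point clouds of shape (N, 3) into one scene (2N, 3).

    The offset applied to obj_b stands in for the prompt-driven coordinate
    arrangement (e.g. "a chair next to a table").
    """
    offsets = {
        "next to": torch.tensor([1.5, 0.0, 0.0]),
        "on top of": torch.tensor([0.0, 0.0, 1.5]),
        "behind": torch.tensor([0.0, 1.5, 0.0]),
    }
    shifted_b = obj_b + offsets.get(relation, offsets["next to"])
    return torch.cat([obj_a, shifted_b], dim=0)

def contrastive_loss(scene_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of (scene, caption) pairs.

    scene_emb, text_emb: (B, D) embeddings from the point cloud encoder
    and the language model, respectively; matched pairs share a row index.
    """
    scene_emb = F.normalize(scene_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = scene_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(scene_emb.size(0))        # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage: two random unit-cube point clouds stand in for "chair" and "table".
chair = torch.rand(1024, 3)
table = torch.rand(1024, 3)
scene = compose_scene(chair, table, relation="next to")  # (2048, 3)

# Random vectors stand in for the scene encoder and BERT caption embeddings.
scene_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
loss = contrastive_loss(scene_emb, text_emb)
```

The design point the abstract emphasizes is that the supervision signal comes from scene-level pairs rather than isolated objects, so the loss above contrasts whole fabricated scenes against their ChatGPT-expanded captions.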
Source Journal
Computer Vision and Image Understanding
Category: Engineering Technology - Engineering: Electrical & Electronic
CiteScore: 7.80
Self-citation rate: 4.40%
Articles published per year: 112
Review time: 79 days
Journal Introduction: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.

Research areas include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems