Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models

Huy Ha, Shuran Song
Conference on Robot Learning. Published 2022-07-23. DOI: 10.48550/arXiv.2207.11514 (https://doi.org/10.48550/arXiv.2207.11514)
Citations: 43

Abstract

We study open-world 3D scene understanding, a family of tasks that require agents to reason about their 3D environment with an open-set vocabulary and out-of-domain visual inputs, a critical skill for robots operating in the unstructured 3D world. To this end, we propose Semantic Abstraction (SemAbs), a framework that equips 2D Vision-Language Models (VLMs) with new 3D spatial capabilities while maintaining their zero-shot robustness. We achieve this abstraction using relevancy maps extracted from CLIP, and learn 3D spatial and geometric reasoning skills on top of those abstractions in a semantic-agnostic manner. We demonstrate the usefulness of SemAbs on two open-world 3D scene understanding tasks: 1) completing partially observed objects and 2) localizing hidden objects from language descriptions. Experiments show that SemAbs can generalize to novel vocabulary, materials/lighting, classes, and domains (i.e., real-world scans) from training on limited 3D synthetic data. Code and data are available at https://semantic-abstraction.cs.columbia.edu/
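To make the idea of a semantic-agnostic abstraction more concrete, the following is a minimal sketch of how a 2D relevancy map might be lifted into 3D. It is illustrative only and is not taken from the authors' code. It assumes a relevancy map has already been computed from CLIP for one text query, and that a depth image and pinhole camera intrinsics are available. The function name `lift_relevancy_to_voxels`, its parameters, and the max-pooling choice are all hypothetical. The point it illustrates is that the 3D stage receives only scalar relevancy scores, never the query word itself.

```python
import math
from typing import Dict, List, Tuple

Voxel = Tuple[int, int, int]


def lift_relevancy_to_voxels(
    relevancy: List[List[float]],  # assumed: per-pixel scores, precomputed from CLIP
    depth: List[List[float]],      # assumed: per-pixel depth in metres, 0 = no reading
    fx: float, fy: float,          # assumed pinhole focal lengths (pixels)
    cx: float, cy: float,          # assumed principal point (pixels)
    voxel_size: float,             # hypothetical voxel edge length in metres
) -> Dict[Voxel, float]:
    """Back-project per-pixel relevancy into a sparse voxel grid.

    Each pixel with valid depth is mapped to a 3D point in the camera frame,
    binned into a voxel, and the voxel keeps the maximum score it receives.
    """
    grid: Dict[Voxel, float] = {}
    for v, (rel_row, depth_row) in enumerate(zip(relevancy, depth)):
        for u, (score, z) in enumerate(zip(rel_row, depth_row)):
            if z <= 0.0:
                continue  # skip pixels without a depth reading
            # Pinhole back-projection from pixel (u, v) and depth z.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            key = (
                math.floor(x / voxel_size),
                math.floor(y / voxel_size),
                math.floor(z / voxel_size),
            )
            grid[key] = max(grid.get(key, 0.0), score)
    return grid


if __name__ == "__main__":
    relevancy = [[0.1, 0.9], [0.0, 0.5]]
    depth = [[1.0, 1.0], [0.0, 2.0]]
    voxels = lift_relevancy_to_voxels(relevancy, depth, 1.0, 1.0, 0.0, 0.0, 1.0)
    for key in sorted(voxels):
        print(key, voxels[key])
```

Because the resulting grid holds only geometry and scores, a 3D model trained on it could in principle be reused for any query that CLIP can produce a relevancy map for, which is one reading of how zero-shot robustness would carry over to the 3D tasks.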