Approximating Human-Level 3D Visual Inferences With Deep Neural Networks

Q1 Social Sciences
Open Mind Pub Date : 2025-02-16 eCollection Date: 2025-01-01 DOI:10.1162/opmi_a_00189
Thomas P O'Connell, Tyler Bonnen, Yoni Friedman, Ayush Tewari, Vincent Sitzmann, Joshua B Tenenbaum, Nancy Kanwisher
Open Mind, volume 9, pages 305-324. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11864798/pdf/
Citations: 0

Abstract

Approximating Human-Level 3D Visual Inferences With Deep Neural Networks.

Humans make rich inferences about the geometry of the visual world. While deep neural networks (DNNs) achieve human-level performance on some psychophysical tasks (e.g., rapid classification of object or scene categories), they often fail in tasks requiring inferences about the underlying shape of objects or scenes. Here, we ask whether and how this gap in 3D shape representation between DNNs and humans can be closed. First, we define the problem space: after generating a stimulus set to evaluate 3D shape inferences using a match-to-sample task, we confirm that standard DNNs are unable to reach human performance. Next, we construct a set of candidate 3D-aware DNNs including 3D neural field (Light Field Network), autoencoder, and convolutional architectures. We investigate the role of the learning objective and dataset by training single-view (the model only sees one viewpoint of an object per training trial) and multi-view (the model is trained to associate multiple viewpoints of each object per training trial) versions of each architecture. When the same object categories appear in the model training and match-to-sample test sets, multi-view DNNs approach human-level performance for 3D shape matching, highlighting the importance of a learning objective that enforces a common representation across viewpoints of the same object. Furthermore, the 3D Light Field Network was the model most similar to humans across all tests, suggesting that building in 3D inductive biases increases human-model alignment. Finally, we explore the generalization performance of multi-view DNNs to out-of-distribution object categories not seen during training. Overall, our work shows that multi-view learning objectives for DNNs are necessary but not sufficient to make similar 3D shape inferences as humans and reveals limitations in capturing human-like shape inferences that may be inherent to DNN modeling approaches. 
We provide a methodology for understanding human 3D shape perception within a deep learning framework and highlight out-of-domain generalization as the next challenge for learning human-like 3D representations with DNNs.
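
The match-to-sample evaluation described in the abstract can be illustrated with a small sketch. This is not the authors' code: it assumes each image has already been mapped to an embedding vector by some model, and it assumes cosine similarity as the matching rule. The function names and the toy vectors are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]
# A trial is (sample embedding, candidate embeddings, index of the true match).
Trial = Tuple[Vector, Sequence[Vector], int]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_to_sample(sample: Vector, candidates: Sequence[Vector]) -> int:
    """Return the index of the candidate most similar to the sample."""
    sims = [cosine_similarity(sample, c) for c in candidates]
    return max(range(len(sims)), key=sims.__getitem__)


def accuracy(trials: Sequence[Trial]) -> float:
    """Fraction of trials in which the chosen candidate is the true match."""
    hits = sum(match_to_sample(s, cands) == target for s, cands, target in trials)
    return hits / len(trials)


# Toy embeddings, made up for illustration (not data from the paper).
trials = [
    ([1.0, 0.0, 0.2], [[0.9, 0.1, 0.3], [0.0, 1.0, 0.0]], 0),
    ([0.0, 1.0, 0.1], [[1.0, 0.0, 0.0], [0.1, 0.8, 0.2]], 1),
]
print(accuracy(trials))
```

Under these assumptions, a model whose embeddings stay close across viewpoints of the same object would score well on such trials, which is the property the abstract attributes to multi-view training.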

Source journal: Open Mind (Social Sciences: Linguistics and Language)
CiteScore: 3.20
Self-citation rate: 0.00%
Articles published: 15
Review time: 53 weeks